STT-meta-HI — Hindi Speech Tagger

A Conformer-CTC model for Hindi automatic speech recognition with inline entity tagging, speaker attribute detection (age, gender, emotion), and intent classification — all in a single forward pass.

Model Details

Property               Value
Architecture           Conformer CTC (NeMo)
Encoder Layers         17
Hidden Size (d_model)  512
Attention Heads        8
Vocabulary Size        660
Input Features         80 mel-filterbanks
Sample Rate            16 kHz
Language               Hindi
Model Size             458.2 MB

Supported Tag Categories

  • AGE: AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60PLUS

  • EMOTION: EMOTION_ANGRY, EMOTION_FEAR, EMOTION_HAPPY, EMOTION_NEUTRAL, EMOTION_SAD, EMOTION_SURPRISE

  • ENTITY: 369 types (ENTITY_PERSON_NAME, ENTITY_CITY, ENTITY_ORGANIZATION, ...)

  • GENDER: GENDER_FEMALE, GENDER_MALE, GENDER_OTHER

  • INTENT: INTENT_ASSERT, INTENT_ASSERTION, INTENT_COMMAND, INTENT_DECLARATIVE, INTENT_EXCLAIM, INTENT_EXPLAIN, INTENT_GREETING, INTENT_INFORM, INTENT_INSTRUCT, INTENT_OFFER, INTENT_OPINION, INTENT_QUESTION, INTENT_REMEMBER, INTENT_REQUEST, INTENT_STATEMENT, INTENT_SUGGESTION, INTENT_THANK, INTENT_THANKING, INTENT_WARNING, INTENT_WISH

  • END: Delimiter token for entity span boundaries

Evaluation Results

ASR Performance

Dataset                  WER    CER   Samples
Internal validation set  18.1%  6.8%  35993
FLEURS Hindi (test)      22.0%  -     418
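
The card does not state which tool produced these numbers. As a minimal sketch, assuming the standard definition (Levenshtein distance between reference and hypothesis divided by the reference length), WER and CER could be computed as follows. The function names are illustrative, and the CER here counts Unicode code points (spaces included), which is an assumption; other tools may count Devanagari grapheme clusters or ignore whitespace.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """Edit distance normalised by the number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: tokens are whitespace-separated words.
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: tokens are code points (an assumption, see above).
    return error_rate(list(reference), list(hypothesis))


# One deleted word out of three reference words.
print(f"{wer('ने कहा कि', 'ने कहा'):.3f}")
```

If the metadata tags are part of the hypothesis, they would normally be stripped before scoring so that tag errors are not counted as word errors; whether the reported figures did so is not stated here.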

Tag Classification

AGE

Accuracy: 42.6% | Macro F1: 12.0% | Weighted F1: 25.5%

Label       Precision  Recall  F1     Support
AGE_0_18    0.000      0.000   0.000  15
AGE_18_30   1.000      0.001   0.001  1921
AGE_30_45   0.159      0.001   0.002  10930
AGE_45_60   0.426      1.000   0.598  15313
AGE_60PLUS  0.000      0.000   0.000  7814

Confusion Matrix (rows: reference label, columns: predicted label)

Reference  0_18  18_30  30_45  45_60  60PLUS
0_18       0     0      0      15     0
18_30      0     1      57     1863   0
30_45      0     0      13     10917  0
45_60      0     0      7      15306  0
60PLUS     0     0      5      7809   0
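
The per-label scores and the summary line follow directly from the confusion matrix. The sketch below recomputes them for AGE, reading rows as reference labels and columns as predictions (a reading inferred from the row sums matching the Support column). The helper name `per_class_metrics` is illustrative.

```python
# AGE confusion matrix from this section: rows = reference, columns = predicted.
labels = ["0_18", "18_30", "30_45", "45_60", "60PLUS"]
matrix = [
    [0, 0, 0, 15, 0],
    [0, 1, 57, 1863, 0],
    [0, 0, 13, 10917, 0],
    [0, 0, 7, 15306, 0],
    [0, 0, 5, 7809, 0],
]


def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for each class index."""
    n = len(matrix)
    rows = []
    for k in range(n):
        tp = matrix[k][k]
        predicted = sum(matrix[i][k] for i in range(n))  # column sum
        support = sum(matrix[k])                         # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, support))
    return rows


metrics = per_class_metrics(matrix)
total = sum(support for *_, support in metrics)

accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total
macro_f1 = sum(f1 for _, _, f1, _ in metrics) / len(metrics)          # unweighted mean
weighted_f1 = sum(f1 * s for _, _, f1, s in metrics) / total          # support-weighted mean

print(f"Accuracy: {accuracy:.1%} | Macro F1: {macro_f1:.1%} | Weighted F1: {weighted_f1:.1%}")
# Accuracy: 42.6% | Macro F1: 12.0% | Weighted F1: 25.5%
```

The gap between accuracy and macro F1 reflects that almost every prediction falls in the AGE_45_60 column, so four of the five classes contribute an F1 close to zero.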

EMOTION

Accuracy: 47.6% | Macro F1: 22.4% | Weighted F1: 42.2%

Label            Precision  Recall  F1     Support
EMOTION_ANGRY    0.000      0.000   0.000  145
EMOTION_HAPPY    0.565      0.213   0.309  19184
EMOTION_NEUTRAL  0.453      0.825   0.585  15806
EMOTION_SAD      0.000      0.000   0.000  858

Confusion Matrix (rows: reference label, columns: predicted label)

Reference  ANGRY  HAPPY  NEUTRAL  SAD
ANGRY      0      27     118      0
HAPPY      0      4080   15104    0
NEUTRAL    0      2769   13037    0
SAD        0      342    516      0

GENDER

Accuracy: 63.3% | Macro F1: 25.9% | Weighted F1: 49.1%

Label          Precision  Recall  F1     Support
GENDER_FEMALE  0.633      1.000   0.775  22778
GENDER_MALE    0.625      0.000   0.001  13184
GENDER_OTHER   0.000      0.000   0.000  31

Confusion Matrix (rows: reference label, columns: predicted label)

Reference  FEMALE  MALE  OTHER
FEMALE     22775   3     0
MALE       13179   5     0
OTHER      31      0     0

INTENT

Accuracy: 0.0% | Macro F1: 0.0% | Weighted F1: 0.0%

Label             Precision  Recall  F1     Support
INTENT_ASSERTION  0.000      0.000   0.000  1
INTENT_COMMAND    0.000      0.000   0.000  240
INTENT_EXPLAIN    0.000      0.000   0.000  5
INTENT_GREETING   0.000      0.000   0.000  2
INTENT_INFORM     0.000      0.000   0.000  35132
INTENT_QUESTION   0.000      0.000   0.000  55
INTENT_REQUEST    0.000      0.000   0.000  57
INTENT_THANK      0.000      0.000   0.000  1

Confusion Matrix (rows: reference label, columns: predicted label)

Reference  ASSERTION  COMMAND  EXPLAIN  GREETING  INFORM  QUESTION  REQUEST  THANK  NONE
ASSERTION  0          0        0        0         0       0         0        0      1
COMMAND    0          0        0        0         0       0         0        0      240
EXPLAIN    0          0        0        0         0       0         0        0      5
GREETING   0          0        0        0         0       0         0        0      2
INFORM     0          0        0        0         0       0         0        0      35132
QUESTION   0          0        0        0         0       0         0        0      55
REQUEST    0          0        0        0         0       0         0        0      57
THANK      0          0        0        0         0       0         0        0      1
NONE       0          0        0        0         0       0         0        0      0

Entity Detection

Metric Value
Precision 0.559
Recall 0.075
F1 0.132
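
The matching criterion behind these entity scores is not described in this card. The sketch below shows one common choice, exact match on (entity type, entity text) pairs, purely as an assumption; `entity_prf` is an illustrative name, not part of the model's tooling.

```python
from collections import Counter


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def entity_prf(reference, predicted):
    """Exact-match precision, recall and F1 over (entity_type, text) pairs.

    Both arguments are lists of tuples; duplicates are handled as multisets.
    """
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    true_positives = sum((ref_counts & pred_counts).values())  # multiset intersection
    precision = true_positives / sum(pred_counts.values()) if pred_counts else 0.0
    recall = true_positives / sum(ref_counts.values()) if ref_counts else 0.0
    return precision, recall, f1_from(precision, recall)


reference = [("ENTITY_PERSON_NAME", "नरेंद्र मोदी"), ("ENTITY_CITY", "दिल्ली")]
predicted = [("ENTITY_PERSON_NAME", "नरेंद्र मोदी")]
print(entity_prf(reference, predicted))  # one of two reference entities found, no false positives

# The reported F1 is the harmonic mean of the reported precision and recall.
print(round(f1_from(0.559, 0.075), 3))  # 0.132
```

Because F1 is a harmonic mean, the low recall (0.075) dominates the score even though more than half of the emitted entities are correct.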

Usage

from nemo.collections.asr.models import EncDecCTCModelBPE

# Load model
model = EncDecCTCModelBPE.restore_from("hi-vakyansh-meta-v17-223k.nemo")
model.eval()

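# Transcribe (input audio should match the model's 16 kHz sample rate)
transcriptions = model.transcribe(["audio.wav"])
# Depending on the NeMo version, transcribe() may return Hypothesis objects
# instead of plain strings; in that case use transcriptions[0].text.
print(transcriptions[0])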
# Example output: "ENTITY_PERSON_NAME नरेंद्र मोदी END ने कहा कि AGE_45_60 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM"

Extracting Tags

import re

text = transcriptions[0]

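# Extract trailing tags. Each result is a re.Match, or None if the tag is absent;
# call .group(1) on a non-None match to get the label string.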
age = re.search(r"\b(AGE_\S+)\b", text)
gender = re.search(r"\b(GENDER_\S+)\b", text)
emotion = re.search(r"\b(EMOTION_\S+)\b", text)
intent = re.search(r"\b(INTENT_\S+)\b", text)

# Extract inline entities
entities = re.findall(r"(ENTITY_\S+)\s+(.*?)\s+END", text)
# [("ENTITY_PERSON_NAME", "नरेंद्र मोदी")]

# Get clean transcript (tags removed)
clean = re.sub(r"\b(?:AGE_\S+|GENDER_\S+|EMOTION_\S+|INTENT_\S+|ENTITY_\S+|END)\b", "", text)
clean = " ".join(clean.split())
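
The regex steps above can also be folded into a single token-based pass that returns everything in one structure. This is a sketch rather than an official parser: `parse_tagged_transcript` is a hypothetical helper, and it assumes tags are whitespace-separated tokens in the format shown in the example output.

```python
ATTRIBUTE_PREFIXES = ("AGE_", "GENDER_", "EMOTION_", "INTENT_")


def parse_tagged_transcript(text: str) -> dict:
    """Split a tagged transcript into attributes, entity spans and clean text."""
    result = {"age": None, "gender": None, "emotion": None, "intent": None,
              "entities": [], "text": ""}
    words = []         # transcript words with all tags removed
    span_type = None   # entity type of the currently open span, if any
    span_words = []    # words collected inside the open span

    for token in text.split():
        if token.startswith("ENTITY_"):
            # Open a new span; an unclosed previous span is discarded.
            span_type, span_words = token, []
        elif token == "END":
            # Close the open span; a stray END with no open span is ignored.
            if span_type is not None:
                result["entities"].append((span_type, " ".join(span_words)))
            span_type = None
        elif token.startswith(ATTRIBUTE_PREFIXES):
            # "AGE_45_60" -> key "age"; a later tag of the same kind overwrites.
            result[token.split("_", 1)[0].lower()] = token
        else:
            words.append(token)
            if span_type is not None:
                span_words.append(token)

    result["text"] = " ".join(words)
    return result


example = ("ENTITY_PERSON_NAME नरेंद्र मोदी END ने कहा कि "
           "AGE_45_60 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM")
parsed = parse_tagged_transcript(example)
print(parsed["entities"])
print(parsed["text"])
```

Missing attributes stay `None`, which matters in practice given how rarely some tags are emitted in the evaluation above.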

Training

  • Base model: WhissleAI/STT-meta-1B (Conformer 600M, multilingual)
  • Fine-tuning data: 223K Hindi utterances with speech tags (entity, age, gender, emotion, intent annotations)
  • Optimizer: AdamW, cosine annealing LR schedule
  • Training: Full fine-tuning on NVIDIA A100 40GB
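
The learning rate values, step counts and any warmup used in training are not given here. For orientation only, a cosine annealing schedule without warmup follows the formula sketched below; `base_lr`, `min_lr` and `total_steps` are hypothetical placeholders, not the values used for this model.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int,
                        base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step`, decaying from base_lr to min_lr along a half cosine."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


# Hypothetical settings, chosen only to illustrate the shape of the curve.
base_lr, min_lr, total_steps = 1e-4, 1e-6, 10_000
for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, f"{cosine_annealing_lr(step, total_steps, base_lr, min_lr):.2e}")
```

The rate starts at `base_lr`, passes the midpoint between the two bounds halfway through, and ends at `min_lr`.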

Citation

@misc{whissle-stt-meta-hi,
  title={STT-meta-HI: Hindi Speech Tagger with Entity Recognition},
  author={Whissle AI},
  year={2026},
  url={https://huggingface.co/WhissleAI/STT-meta-HI}
}

License

Apache 2.0
