STT-meta-HI — Hindi Speech Tagger
A Conformer-CTC model for Hindi automatic speech recognition with inline entity tagging, speaker attribute detection (age, gender, emotion), and intent classification — all in a single forward pass.
Model Details
| Property | Value |
|---|---|
| Architecture | Conformer CTC (NeMo) |
| Encoder Layers | 17 |
| Hidden Size (d_model) | 512 |
| Attention Heads | 8 |
| Vocabulary Size | 660 |
| Input Features | 80 mel-filterbanks |
| Sample Rate | 16 kHz |
| Language | Hindi |
| Model Size | 458.2 MB |
Supported Tag Categories
AGE: AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60PLUS
EMOTION: EMOTION_ANGRY, EMOTION_FEAR, EMOTION_HAPPY, EMOTION_NEUTRAL, EMOTION_SAD, EMOTION_SURPRISE
ENTITY: 369 types (ENTITY_PERSON_NAME, ENTITY_CITY, ENTITY_ORGANIZATION, ...)
GENDER: GENDER_FEMALE, GENDER_MALE, GENDER_OTHER
INTENT: INTENT_ASSERT, INTENT_ASSERTION, INTENT_COMMAND, INTENT_DECLARATIVE, INTENT_EXCLAIM, INTENT_EXPLAIN, INTENT_GREETING, INTENT_INFORM, INTENT_INSTRUCT, INTENT_OFFER, INTENT_OPINION, INTENT_QUESTION, INTENT_REMEMBER, INTENT_REQUEST, INTENT_STATEMENT, INTENT_SUGGESTION, INTENT_THANK, INTENT_THANKING, INTENT_WARNING, INTENT_WISH
END: Delimiter token for entity span boundaries
Evaluation Results
ASR Performance
| Dataset | WER | CER | Samples |
|---|---|---|---|
| Internal validation set | 18.1% | 6.8% | 35993 |
| FLEURS Hindi (test) | 22.0% | — | 418 |
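
For readers comparing against the WER column above, here is a minimal sketch of the standard word error rate definition: word-level edit distance divided by the number of reference words. The card does not state which scoring tool produced these numbers, nor whether tag tokens were stripped before scoring, so the hypothetical `word_error_rate` helper below illustrates the metric only and is not the evaluation script behind the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of three reference words.
wer = word_error_rate("ने कहा कि", "ने कहा")
```

If tags are part of the hypothesis string, stripping them first (as in the Extracting Tags section) keeps them from being counted as insertions.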
Tag Classification
AGE
Accuracy: 42.6% | Macro F1: 12.0% | Weighted F1: 25.5%
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| AGE_0_18 | 0.000 | 0.000 | 0.000 | 15 |
| AGE_18_30 | 1.000 | 0.001 | 0.001 | 1921 |
| AGE_30_45 | 0.159 | 0.001 | 0.002 | 10930 |
| AGE_45_60 | 0.426 | 1.000 | 0.598 | 15313 |
| AGE_60PLUS | 0.000 | 0.000 | 0.000 | 7814 |
Confusion Matrix (rows: reference label, columns: predicted label)
| | 0_18 | 18_30 | 30_45 | 45_60 | 60PLUS |
|---|---|---|---|---|---|
| 0_18 | 0 | 0 | 0 | 15 | 0 |
| 18_30 | 0 | 1 | 57 | 1863 | 0 |
| 30_45 | 0 | 0 | 13 | 10917 | 0 |
| 45_60 | 0 | 0 | 7 | 15306 | 0 |
| 60PLUS | 0 | 0 | 5 | 7809 | 0 |
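
The per-label scores and the summary line for AGE can be recomputed from the confusion matrix. The sketch below assumes rows are reference labels and columns are predictions, which is the orientation that reproduces the reported precision and recall. The helper name `per_class_metrics` is illustrative and not part of any released tooling.

```python
# AGE confusion matrix copied from the table above.
# Rows: reference label, columns: predicted label.
labels = ["0_18", "18_30", "30_45", "45_60", "60PLUS"]
cm = [
    [0, 0, 0, 15, 0],
    [0, 1, 57, 1863, 0],
    [0, 0, 13, 10917, 0],
    [0, 0, 7, 15306, 0],
    [0, 0, 5, 7809, 0],
]


def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for each class."""
    n = len(matrix)
    results = []
    for i in range(n):
        true_positive = matrix[i][i]
        predicted = sum(matrix[row][i] for row in range(n))  # column sum
        support = sum(matrix[i])                             # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


metrics = per_class_metrics(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
macro_f1 = sum(m[2] for m in metrics) / len(metrics)
weighted_f1 = sum(m[2] * m[3] for m in metrics) / total

# Share of all predictions that land in the 45_60 column.
share_45_60 = sum(row[3] for row in cm) / total

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Since roughly 99.8% of predictions fall in AGE_45_60, the accuracy figure mostly reflects how common that class is in the validation set; macro F1 is the more informative number here.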
EMOTION
Accuracy: 47.6% | Macro F1: 22.4% | Weighted F1: 42.2%
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| EMOTION_ANGRY | 0.000 | 0.000 | 0.000 | 145 |
| EMOTION_HAPPY | 0.565 | 0.213 | 0.309 | 19184 |
| EMOTION_NEUTRAL | 0.453 | 0.825 | 0.585 | 15806 |
| EMOTION_SAD | 0.000 | 0.000 | 0.000 | 858 |
Confusion Matrix (rows: reference label, columns: predicted label)
| | ANGRY | HAPPY | NEUTRAL | SAD |
|---|---|---|---|---|
| ANGRY | 0 | 27 | 118 | 0 |
| HAPPY | 0 | 4080 | 15104 | 0 |
| NEUTRAL | 0 | 2769 | 13037 | 0 |
| SAD | 0 | 342 | 516 | 0 |
GENDER
Accuracy: 63.3% | Macro F1: 25.9% | Weighted F1: 49.1%
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| GENDER_FEMALE | 0.633 | 1.000 | 0.775 | 22778 |
| GENDER_MALE | 0.625 | 0.000 | 0.001 | 13184 |
| GENDER_OTHER | 0.000 | 0.000 | 0.000 | 31 |
Confusion Matrix (rows: reference label, columns: predicted label)
| | FEMALE | MALE | OTHER |
|---|---|---|---|
| FEMALE | 22775 | 3 | 0 |
| MALE | 13179 | 5 | 0 |
| OTHER | 31 | 0 | 0 |
INTENT
Accuracy: 0.0% | Macro F1: 0.0% | Weighted F1: 0.0%
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| INTENT_ASSERTION | 0.000 | 0.000 | 0.000 | 1 |
| INTENT_COMMAND | 0.000 | 0.000 | 0.000 | 240 |
| INTENT_EXPLAIN | 0.000 | 0.000 | 0.000 | 5 |
| INTENT_GREETING | 0.000 | 0.000 | 0.000 | 2 |
| INTENT_INFORM | 0.000 | 0.000 | 0.000 | 35132 |
| INTENT_QUESTION | 0.000 | 0.000 | 0.000 | 55 |
| INTENT_REQUEST | 0.000 | 0.000 | 0.000 | 57 |
| INTENT_THANK | 0.000 | 0.000 | 0.000 | 1 |
Confusion Matrix (rows: reference label, columns: predicted label; NONE means no intent tag was emitted)
| | ASSERTION | COMMAND | EXPLAIN | GREETING | INFORM | QUESTION | REQUEST | THANK | NONE |
|---|---|---|---|---|---|---|---|---|---|
| ASSERTION | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| COMMAND | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 240 |
| EXPLAIN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| GREETING | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| INFORM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 35132 |
| QUESTION | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55 |
| REQUEST | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57 |
| THANK | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| NONE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Entity Detection
| Metric | Value |
|---|---|
| Precision | 0.559 |
| Recall | 0.075 |
| F1 | 0.132 |
Usage
from nemo.collections.asr.models import EncDecCTCModelBPE
# Load model
model = EncDecCTCModelBPE.restore_from("hi-vakyansh-meta-v17-223k.nemo")
model.eval()
# Transcribe
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
# Example output: "ENTITY_PERSON_NAME नरेंद्र मोदी END ने कहा कि AGE_45_60 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM"
Extracting Tags
import re
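# Depending on the NeMo version, transcribe() returns plain strings or
# Hypothesis objects that carry the string in a .text attribute.
first = transcriptions[0]
text = first if isinstance(first, str) else first.text
# Extract trailing tags (each result is a re.Match, or None if the tag is absent)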
age = re.search(r"\b(AGE_\S+)\b", text)
gender = re.search(r"\b(GENDER_\S+)\b", text)
emotion = re.search(r"\b(EMOTION_\S+)\b", text)
intent = re.search(r"\b(INTENT_\S+)\b", text)
# Extract inline entities
entities = re.findall(r"(ENTITY_\S+)\s+(.*?)\s+END", text)
# [("ENTITY_PERSON_NAME", "नरेंद्र मोदी")]
# Get clean transcript (tags removed)
clean = re.sub(r"\b(?:AGE_\S+|GENDER_\S+|EMOTION_\S+|INTENT_\S+|ENTITY_\S+|END)\b", "", text)
clean = " ".join(clean.split())
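
The same patterns can be collected into one helper that returns plain strings and tolerates missing tags. That matters in practice, since the INTENT evaluation above shows the model often emits no intent tag at all. This is a minimal sketch: the function name `parse_tagged_transcript` and the shape of the returned dictionary are illustrative choices, not part of the model's API.

```python
import re

# Utterance-level tag families listed under "Supported Tag Categories".
TAG_PREFIXES = ("AGE", "GENDER", "EMOTION", "INTENT")


def parse_tagged_transcript(text: str) -> dict:
    """Split a tagged transcript into attributes, entity spans, and clean text."""
    result = {}
    for prefix in TAG_PREFIXES:
        # \b keeps e.g. AGE_ from matching inside a longer token, because
        # "_" counts as a word character.
        match = re.search(rf"\b({prefix}_\S+)", text)
        result[prefix.lower()] = match.group(1) if match else None

    # Inline entities look like: ENTITY_<TYPE> <span text> END
    result["entities"] = re.findall(r"(ENTITY_\S+)\s+(.*?)\s+END\b", text)

    # Drop every tag token and the END delimiter, then normalize whitespace.
    stripped = re.sub(
        r"\b(?:(?:AGE|GENDER|EMOTION|INTENT|ENTITY)_\S+|END)\b", " ", text
    )
    result["text"] = " ".join(stripped.split())
    return result


example = (
    "ENTITY_PERSON_NAME नरेंद्र मोदी END ने कहा कि "
    "AGE_45_60 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
parsed = parse_tagged_transcript(example)
```

For the example string, `parsed["text"]` keeps the entity words but drops the tags, and any attribute the model did not emit comes back as `None` rather than raising an error.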
Training
- Base model: WhissleAI/STT-meta-1B (Conformer 600M, multilingual)
- Fine-tuning data: 223K Hindi utterances with speech tags (entity, age, gender, emotion, intent annotations)
- Optimizer: AdamW, cosine annealing LR schedule
- Training: Full fine-tuning on NVIDIA A100 40GB
Citation
@misc{whissle-stt-meta-hi,
title={STT-meta-HI: Hindi Speech Tagger with Entity Recognition},
author={Whissle AI},
year={2026},
url={https://huggingface.co/WhissleAI/STT-meta-HI}
}
License
Apache 2.0