STT-meta-HI — Hindi Speech Tagger

A Conformer-CTC model for Hindi automatic speech recognition that also emits inline entity tags, speaker attribute tags (age, gender, emotion), and an intent tag as part of the transcript, all in a single forward pass.

Model Details

Property	Value
Architecture	Conformer CTC (NeMo)
Encoder Layers	17
Hidden Size (d_model)	512
Attention Heads	8
Vocabulary Size	660
Input Features	80 mel-filterbanks
Sample Rate	16 kHz
Language	Hindi
Model Size	458.2 MB
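
To illustrate how a CTC model can produce words and tags from one output sequence, here is a minimal sketch of generic greedy CTC decoding. It is not the model's actual decoding code: the toy vocabulary, the frame-level ids, the blank position, and the `greedy_ctc_decode` helper are all hypothetical, and the real model uses the 660-entry vocabulary listed above.

```python
def greedy_ctc_decode(frame_ids, vocab, blank_id):
    """Collapse consecutive repeated frame labels, then drop CTC blanks."""
    tokens = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            tokens.append(vocab[idx])
        previous = idx
    return " ".join(tokens)


# Hypothetical toy vocabulary: tags are ordinary output tokens alongside words.
vocab = ["ने", "कहा", "AGE_45_60", "GENDER_MALE"]
blank_id = len(vocab)  # assumption: the blank symbol is the last index

# Hypothetical per-frame argmax ids, as a CTC head might produce them.
frames = [4, 0, 0, 4, 1, 1, 4, 4, 2, 4, 3, 3]
decoded = greedy_ctc_decode(frames, vocab, blank_id)
print(decoded)
```

Because the tags share the output vocabulary with the text tokens, no separate classification head is needed in this picture; the tags simply appear in the decoded string.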

Supported Tag Categories

AGE: AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60PLUS
EMOTION: EMOTION_ANGRY, EMOTION_FEAR, EMOTION_HAPPY, EMOTION_NEUTRAL, EMOTION_SAD, EMOTION_SURPRISE
ENTITY: 369 types (ENTITY_PERSON_NAME, ENTITY_CITY, ENTITY_ORGANIZATION, ...)
GENDER: GENDER_FEMALE, GENDER_MALE, GENDER_OTHER
INTENT: INTENT_ASSERT, INTENT_ASSERTION, INTENT_COMMAND, INTENT_DECLARATIVE, INTENT_EXCLAIM, INTENT_EXPLAIN, INTENT_GREETING, INTENT_INFORM, INTENT_INSTRUCT, INTENT_OFFER, INTENT_OPINION, INTENT_QUESTION, INTENT_REMEMBER, INTENT_REQUEST, INTENT_STATEMENT, INTENT_SUGGESTION, INTENT_THANK, INTENT_THANKING, INTENT_WARNING, INTENT_WISH
END: Delimiter token that closes an entity span (ENTITY_<TYPE> ... END)

Evaluation Results

ASR Performance

Dataset	WER	CER	Samples
Internal validation set	18.1%	6.8%	35993
FLEURS Hindi (test)	22.0%	—	418
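
For readers who want to reproduce numbers of this kind, the following is a minimal sketch of word and character error rate based on Levenshtein distance. The exact scoring setup behind the table is not stated here, so two things are assumptions: that tags are stripped from both reference and hypothesis before scoring, and that CER is counted over Unicode code points including spaces. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a single DP row."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diagonal, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            substitution = diagonal + (r != h)
            diagonal = row[j]
            row[j] = min(row[j] + 1, row[j - 1] + 1, substitution)
    return row[len(hyp)]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points (assumption: spaces count)."""
    return edit_distance(reference, hypothesis) / len(reference)


# Assumption: both strings have already had tags removed.
reference = "ने कहा कि"
hypothesis = "ने कहा"
print(f"WER={wer(reference, hypothesis):.3f} CER={cer(reference, hypothesis):.3f}")
```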

Tag Classification

AGE

Accuracy: 42.6% | Macro F1: 12.0% | Weighted F1: 25.5%

Label	Precision	Recall	F1	Support
AGE_0_18	0.000	0.000	0.000	15
AGE_18_30	1.000	0.001	0.001	1921
AGE_30_45	0.159	0.001	0.002	10930
AGE_45_60	0.426	1.000	0.598	15313
AGE_60PLUS	0.000	0.000	0.000	7814

Confusion Matrix (rows: reference label, columns: predicted label)

	18_30	30_45	45_60
0_18	0	0	15
18_30	1	57	1863
30_45	0	13	10917
45_60	0	7	15306
60PLUS	0	5	7809
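
The per-label figures and the summary line for AGE can be recomputed from the confusion matrix. The sketch below does so under the reading that rows are reference labels and columns are predicted labels, which is consistent with the row sums matching the Support column. The helper name `per_label_metrics` is illustrative.

```python
# Columns of the AGE confusion matrix (labels the model actually predicted).
predicted = ["18_30", "30_45", "45_60"]

# Rows: reference label; values copied from the AGE confusion matrix.
matrix = {
    "0_18": [0, 0, 15],
    "18_30": [1, 57, 1863],
    "30_45": [0, 13, 10917],
    "45_60": [0, 7, 15306],
    "60PLUS": [0, 5, 7809],
}


def per_label_metrics(matrix, predicted):
    """Return {label: (precision, recall, f1, support)}."""
    metrics = {}
    for label, row in matrix.items():
        support = sum(row)
        if label in predicted:
            col = predicted.index(label)
            tp = row[col]
            predicted_total = sum(r[col] for r in matrix.values())
        else:
            tp, predicted_total = 0, 0  # label never predicted
        precision = tp / predicted_total if predicted_total else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, support)
    return metrics


metrics = per_label_metrics(matrix, predicted)
total = sum(m[3] for m in metrics.values())
correct = sum(matrix[label][predicted.index(label)] for label in predicted)
accuracy = correct / total
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)
weighted_f1 = sum(m[2] * m[3] for m in metrics.values()) / total
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
# accuracy=0.426 macro_f1=0.120 weighted_f1=0.255
```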

EMOTION

Accuracy: 47.6% | Macro F1: 22.4% | Weighted F1: 42.2%

Label	Precision	Recall	F1	Support
EMOTION_ANGRY	0.000	0.000	0.000	145
EMOTION_HAPPY	0.565	0.213	0.309	19184
EMOTION_NEUTRAL	0.453	0.825	0.585	15806
EMOTION_SAD	0.000	0.000	0.000	858

Confusion Matrix (rows: reference label, columns: predicted label)

	HAPPY	NEUTRAL
ANGRY	27	118
HAPPY	4080	15104
NEUTRAL	2769	13037
SAD	342	516

GENDER

Accuracy: 63.3% | Macro F1: 25.9% | Weighted F1: 49.1%

Label	Precision	Recall	F1	Support
GENDER_FEMALE	0.633	1.000	0.775	22778
GENDER_MALE	0.625	0.000	0.001	13184
GENDER_OTHER	0.000	0.000	0.000	31

Confusion Matrix (rows: reference label, columns: predicted label)

	FEMALE	MALE
FEMALE	22775	3
MALE	13179	5
OTHER	31	0

INTENT

Accuracy: 0.0% | Macro F1: 0.0% | Weighted F1: 0.0%

Label	Precision	Recall	F1	Support
INTENT_ASSERTION	0.000	0.000	0.000	1
INTENT_COMMAND	0.000	0.000	0.000	240
INTENT_EXPLAIN	0.000	0.000	0.000	5
INTENT_GREETING	0.000	0.000	0.000	2
INTENT_INFORM	0.000	0.000	0.000	35132
INTENT_QUESTION	0.000	0.000	0.000	55
INTENT_REQUEST	0.000	0.000	0.000	57
INTENT_THANK	0.000	0.000	0.000	1

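Confusion Matrix (rows: reference label, columns: predicted label; NONE appears to denote outputs with no intent tag)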

	ASSERTION	COMMAND	EXPLAIN	GREETING	INFORM	QUESTION	REQUEST	THANK	NONE
ASSERTION	0	0	0	0	0	0	0	0	1
COMMAND	0	0	0	0	0	0	0	0	240
EXPLAIN	0	0	0	0	0	0	0	0	5
GREETING	0	0	0	0	0	0	0	0	2
INFORM	0	0	0	0	0	0	0	0	35132
QUESTION	0	0	0	0	0	0	0	0	55
REQUEST	0	0	0	0	0	0	0	0	57
THANK	0	0	0	0	0	0	0	0	1
NONE	0	0	0	0	0	0	0	0	0

Entity Detection

Metric	Value
Precision	0.559
Recall	0.075
F1	0.132
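
The matching criterion behind these entity scores is not stated. As one plausible reading, the sketch below computes micro-averaged precision, recall, and F1 over exact (entity type, span text) matches per utterance. The exact-match rule and the helper names are assumptions, not a description of the evaluation actually used.

```python
import re
from collections import Counter

# Same span pattern as in the tag extraction example: ENTITY_<TYPE> ... END
ENTITY_SPAN = re.compile(r"(ENTITY_\S+)\s+(.*?)\s+END")


def entity_counts(tagged_text):
    """Multiset of (entity_type, span_text) pairs found in one transcript."""
    return Counter(ENTITY_SPAN.findall(tagged_text))


def entity_prf(references, hypotheses):
    """Micro-averaged precision/recall/F1 over exact span matches (assumed)."""
    tp = n_ref = n_hyp = 0
    for ref, hyp in zip(references, hypotheses):
        ref_spans, hyp_spans = entity_counts(ref), entity_counts(hyp)
        tp += sum((ref_spans & hyp_spans).values())
        n_ref += sum(ref_spans.values())
        n_hyp += sum(hyp_spans.values())
    precision = tp / n_hyp if n_hyp else 0.0
    recall = tp / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative pair: the hypothesis finds one of the two reference entities.
refs = ["ENTITY_PERSON_NAME नरेंद्र मोदी END ने ENTITY_CITY दिल्ली END में कहा"]
hyps = ["ENTITY_PERSON_NAME नरेंद्र मोदी END ने दिल्ली में कहा"]
precision, recall, f1 = entity_prf(refs, hyps)
print(precision, recall, f1)
```

A pattern of high precision with low recall, as in the table above, corresponds to a system that tags few spans but is usually right when it does.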

Usage

from nemo.collections.asr.models import EncDecCTCModelBPE

# Load model
model = EncDecCTCModelBPE.restore_from("hi-vakyansh-meta-v17-223k.nemo")
model.eval()

# Transcribe
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
# Example output: "ENTITY_PERSON_NAME नरेंद्र मोदी END ने कहा कि AGE_45_60 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM"
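
Since the model expects 16 kHz input, it can be worth inspecting a file before transcribing it. The following is a small sketch using only the standard-library `wave` module, which reads PCM WAV files. The `check_wav` helper is hypothetical, and the mono requirement is an assumption; whether the loading pipeline resamples or downmixes automatically is not covered here.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Report sample rate and channel count of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return {
        "sample_rate": rate,
        "channels": channels,
        # Assumption: mono audio at the expected rate is what the model wants.
        "ok": rate == expected_rate and channels == 1,
    }


# Example call (requires the file to exist):
# info = check_wav("audio.wav")
```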

Extracting Tags

import re

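# Some NeMo versions return Hypothesis objects rather than plain strings, so read .text when it exists
text = getattr(transcriptions[0], "text", transcriptions[0])

# Extract trailing tags; each result is a re.Match, or None when the tag is absent
age = re.search(r"\b(AGE_\S+)\b", text)
gender = re.search(r"\b(GENDER_\S+)\b", text)
emotion = re.search(r"\b(EMOTION_\S+)\b", text)
intent = re.search(r"\b(INTENT_\S+)\b", text)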

# Extract inline entities
entities = re.findall(r"(ENTITY_\S+)\s+(.*?)\s+END", text)
# [("ENTITY_PERSON_NAME", "नरेंद्र मोदी")]

# Get clean transcript (tags removed)
clean = re.sub(r"\b(?:AGE_\S+|GENDER_\S+|EMOTION_\S+|INTENT_\S+|ENTITY_\S+|END)\b", "", text)
clean = " ".join(clean.split())
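
As an alternative to separate regular expressions, the same information can be collected in one pass over whitespace-separated tokens. This is a sketch rather than an official parser: the `TaggedTranscript` dataclass and `parse_tagged_transcript` function are hypothetical names, and it assumes every tag is emitted as a standalone token, as in the example output above.

```python
from dataclasses import dataclass, field

ATTRIBUTE_PREFIXES = ("AGE_", "GENDER_", "EMOTION_", "INTENT_")


@dataclass
class TaggedTranscript:
    text: str                                        # transcript with tags removed
    attributes: dict = field(default_factory=dict)   # e.g. {"age": "AGE_45_60"}
    entities: list = field(default_factory=list)     # [(entity_type, span_text)]


def parse_tagged_transcript(raw):
    words, attributes, entities = [], {}, []
    open_type, open_words = None, []
    for token in raw.split():
        if token.startswith("ENTITY_"):
            # A new span opens; an earlier unclosed span is dropped as an entity,
            # but its words have already been kept in the clean text.
            open_type, open_words = token, []
        elif token == "END":
            if open_type is not None:
                entities.append((open_type, " ".join(open_words)))
            open_type, open_words = None, []
        elif token.startswith(ATTRIBUTE_PREFIXES):
            key = token.split("_", 1)[0].lower()
            attributes.setdefault(key, token)  # keep the first tag of each kind
        else:
            words.append(token)
            if open_type is not None:
                open_words.append(token)
    return TaggedTranscript(" ".join(words), attributes, entities)


sample = (
    "ENTITY_PERSON_NAME नरेंद्र मोदी END ने कहा कि "
    "AGE_45_60 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
result = parse_tagged_transcript(sample)
print(result.text)
print(result.attributes)
print(result.entities)
```

The token-based approach tolerates a missing END token (the words stay in the transcript and no entity is recorded), which the span regex does not make explicit.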

Training

Base model: WhissleAI/STT-meta-1B (Conformer 600M, multilingual)
Fine-tuning data: 223K Hindi utterances with speech tags (entity, age, gender, emotion, intent annotations)
Optimizer: AdamW, cosine annealing LR schedule
Training: Full fine-tuning on NVIDIA A100 40GB
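
For reference, a cosine annealing schedule of the kind named above can be written in a few lines. This is a generic sketch, not the training configuration: the peak learning rate, minimum learning rate, and step count are hypothetical placeholders, and any warmup phase is omitted.

```python
import math


def cosine_annealing_lr(step, total_steps, max_lr, min_lr=0.0):
    """Learning rate that decays from max_lr to min_lr along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical values, chosen only to show the shape of the schedule.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_annealing_lr(step, total_steps=1000, max_lr=1e-4))
```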

Citation

@misc{whissle-stt-meta-hi,
  title={STT-meta-HI: Hindi Speech Tagger with Entity Recognition},
  author={Whissle AI},
  year={2026},
  url={https://huggingface.co/WhissleAI/STT-meta-HI}
}

License

Apache 2.0
