Persona-ASR

Bilingual (Kazakh–English) target-speaker ASR for overlapping speech, from the Persona-ASR project. Given a three-speaker mixture and a short enrollment utterance, the model transcribes only the enrolled target speaker, and rejects the utterance (emitting <no_target>) when the target is absent.

Checkpoints

asr_backbone.pt — recognizer + activity-detection backbone: a frozen ECAPA-TDNN speaker embedding modulates a WavLM-Base+ encoder via FiLM, feeding two language-specific CTC heads (English/Latin, Kazakh/Cyrillic) and a frame-level VAD head.
presence_gate.pt — utterance-level target-presence gate (enrollment–mixture matching + speaker-conditioned attention + attentive-statistics pooling), applied on top of the frozen backbone.
config.json — backbone configuration (vocab sizes, hyperparameters).
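
For readers unfamiliar with FiLM, the speaker conditioning described for asr_backbone.pt can be illustrated with a minimal, dependency-free sketch. It shows only the generic FiLM operation (a per-channel scale and shift predicted from the speaker embedding). The function names, the toy dimensions, and the use of single linear projections are assumptions for illustration; the layer at which the released backbone applies FiLM and its exact parameterization are not stated here.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(weight: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Compute y = W x + b, with W stored as rows of length len(x)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(frames: Matrix, spk_emb: Sequence[float],
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Generic FiLM: scale and shift every frame with speaker-dependent vectors.

    frames  : T x D encoder features (hypothetical toy sizes)
    spk_emb : E-dimensional speaker embedding (e.g. from a frozen ECAPA-TDNN)
    """
    gamma = linear(w_gamma, b_gamma, spk_emb)  # D scale factors
    beta = linear(w_beta, b_beta, spk_emb)     # D shifts
    # The same gamma and beta are broadcast over all time frames.
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)] for frame in frames]


# Toy example: 2 frames, 2 feature channels, 3-dimensional speaker embedding.
frames = [[1.0, 2.0], [3.0, 4.0]]
spk_emb = [1.0, 0.0, -1.0]
w_gamma, b_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [1.0, 1.0]
w_beta, b_beta = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]], [0.5, -0.5]
modulated = film(frames, spk_emb, w_gamma, b_gamma, w_beta, b_beta)
```

Because the scale and shift depend only on the enrollment embedding, the same mixture yields different encoder features for different enrolled speakers, which is what lets one recognizer follow a chosen target.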

Results (three-speaker test sets)

Test set	Raw WER (%)	Gated WER (%)	Detection BAcc (%)
English (Libri3Mix-100h)	29.35	36.92	80.04
Kazakh (Kazakh3Mix-100h)	43.47	50.71	86.78
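
As a reading aid for the detection column, the sketch below computes balanced accuracy for a binary target-present / target-absent decision, assuming "BAcc" denotes the usual mean of the true-positive and true-negative rates. The function name and the toy labels are illustrative only and are not taken from the project's evaluation code.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of recall on the positive class (1 = target present)
    and recall on the negative class (0 = target absent)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return 0.5 * (true_positive_rate + true_negative_rate)


# Toy labels: 4 target-present and 2 target-absent utterances.
score = balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```

Unlike plain accuracy, this metric is not inflated when one class (for example, target-present utterances) dominates the test set.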

Presence-gate thresholds (calibrated on validation): τ_EN = 0.502, τ_KK = 0.586.
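
The thresholds can be applied as in the following minimal sketch. It assumes the gate outputs a target-presence probability in [0, 1] and that the utterance is accepted when that probability is at or above the language's threshold. The function name, the language keys, and the choice of ">=" over ">" are assumptions; the released inference code may differ.

```python
# Thresholds as reported above; the dictionary keys are illustrative.
THRESHOLDS = {"en": 0.502, "kk": 0.586}
NO_TARGET = "<no_target>"


def apply_presence_gate(transcript: str, presence_prob: float, lang: str) -> str:
    """Keep the backbone transcript if the gate accepts the utterance,
    otherwise emit the rejection token."""
    tau = THRESHOLDS[lang]
    # Assumption: scores at or above the threshold count as "target present".
    return transcript if presence_prob >= tau else NO_TARGET


# The same gate score is accepted for English but rejected for Kazakh,
# because the Kazakh threshold is higher.
english_out = apply_presence_gate("hello world", 0.55, "en")
kazakh_out = apply_presence_gate("сәлем әлем", 0.55, "kk")
```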

Training

The backbone is trained with the loss L = 0.7·CTC + 0.3·VAD. The gate is then trained with binary cross-entropy on top of the frozen backbone. Training data: LibriMix (English) and KazMix-3 (Kazakh); the model is also evaluated on PersonaMix.
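
The two training objectives can be written out as a small sketch. It takes already-computed scalar CTC and VAD losses and combines them with the stated weights, and it gives the standard per-example binary cross-entropy for the gate. How the CTC term aggregates the two language heads, and which loss the VAD head uses, is not specified above, so those parts are left as inputs; all function names are illustrative.

```python
import math


def backbone_loss(ctc_loss: float, vad_loss: float,
                  w_ctc: float = 0.7, w_vad: float = 0.3) -> float:
    """Weighted sum L = 0.7 * CTC + 0.3 * VAD of two precomputed scalar losses."""
    return w_ctc * ctc_loss + w_vad * vad_loss


def binary_cross_entropy(prob: float, label: int, eps: float = 1e-7) -> float:
    """Per-example BCE for the gate; label is 1 if the target is present, else 0."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy values: a CTC loss of 2.0 and a VAD loss of 1.0 give 0.7*2.0 + 0.3*1.0.
total = backbone_loss(2.0, 1.0)
gate_loss = binary_cross_entropy(0.5, 1)
```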

Usage

Inference and evaluation code: github.com/IS2AI/Persona_ASR.

License and citation

Released under CC BY 4.0. If you use these checkpoints, please cite Persona-ASR:

@article{persona_asr,
  title = {Persona-ASR: Bilingual Target-Speaker Speech Recognition for Kazakh--English Overlapping Speech},
  year  = {2026}
}

