Persona-ASR

Bilingual (Kazakh–English) target-speaker ASR for overlapping speech, from the Persona-ASR project. Given a three-speaker mixture and a short enrollment utterance, the model transcribes only the enrolled target speaker. When the target speaker is absent from the mixture, it rejects the utterance and emits <no_target> instead of a transcript.

Checkpoints

  • asr_backbone.pt — recognizer + activity-detection backbone: a frozen ECAPA-TDNN speaker embedding modulates a WavLM-Base+ encoder via FiLM, feeding two language-specific CTC heads (English/Latin, Kazakh/Cyrillic) and a frame-level VAD head.
  • presence_gate.pt — utterance-level target-presence gate (enrollment–mixture matching + speaker-conditioned attention + attentive-statistics pooling), applied on top of the frozen backbone.
  • config.json — backbone configuration (vocab sizes, hyperparameters).
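
The FiLM conditioning mentioned for asr_backbone.pt can be illustrated with a minimal pure-Python sketch. It only shows the generic FiLM operation (a per-channel scale and shift predicted from the speaker embedding); the layer at which the real model applies it, the projection shapes, and the names film, linear, w_gamma, and w_beta are assumptions made for illustration, not the project's code.

```python
from typing import List

Vector = List[float]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Compute W @ x + b for a small dense layer (hypothetical helper)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(frames: List[Vector], spk_emb: Vector,
         w_gamma: List[Vector], b_gamma: Vector,
         w_beta: List[Vector], b_beta: Vector) -> List[Vector]:
    """Scale and shift every encoder frame with speaker-dependent gamma, beta."""
    gamma = linear(w_gamma, b_gamma, spk_emb)  # per-channel scale
    beta = linear(w_beta, b_beta, spk_emb)     # per-channel shift
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)]
            for frame in frames]


# Toy sizes (assumed): 2-dim speaker embedding, 3-dim features, 2 frames.
spk_emb = [1.0, -1.0]
w_gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]
frames = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0]]

modulated = film(frames, spk_emb, w_gamma, b_gamma, w_beta, b_beta)
```

The same gamma and beta are reused for every frame, which is what makes the conditioning speaker-level rather than frame-level.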

Results (three-speaker test sets)

Test set                    Raw WER (%)   Gated WER (%)   Detection BAcc (%)
English (Libri3Mix-100h)    29.35         36.92           80.04
Kazakh (Kazakh3Mix-100h)    43.47         50.71           86.78
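
Assuming "Detection BAcc" stands for balanced accuracy of the target-present versus target-absent decision (the card does not spell this out), the metric could be computed as in the sketch below. The function name and the toy labels are illustrative only and are not taken from the evaluation code.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the recall on target-present (1) and target-absent (0) items."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy labels: 4 target-present and 2 target-absent utterances.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
bacc = balanced_accuracy(y_true, y_pred)  # 0.5 * (3/4 + 1/2)
```

Unlike plain accuracy, this average is not inflated when one class (for example, target-present mixtures) dominates the test set.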

Presence-gate thresholds (calibrated on validation): τ_EN = 0.502, τ_KK = 0.586.
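
A minimal sketch of how these thresholds might be applied at inference time follows. It assumes the gate outputs a presence score in [0, 1] and that a score at or above the threshold means "target present"; the comparison direction, the language keys "en" and "kk", and the function name are assumptions, not the repository's API.

```python
# Thresholds as reported on this card; the keys are hypothetical.
THRESHOLDS = {"en": 0.502, "kk": 0.586}
NO_TARGET = "<no_target>"


def apply_presence_gate(transcript: str, gate_score: float, lang: str) -> str:
    """Keep the raw transcript only if the gate judges the target present."""
    tau = THRESHOLDS[lang]
    # Assumption: score >= tau means the enrolled speaker is in the mixture.
    return transcript if gate_score >= tau else NO_TARGET


kept = apply_presence_gate("hello world", 0.55, "en")      # 0.55 >= 0.502
rejected = apply_presence_gate("hello world", 0.55, "kk")  # 0.55 < 0.586
```

The same score can therefore pass the English gate and fail the Kazakh one, which is why the language must be known before gating.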

Training

The backbone is trained with the weighted loss L = 0.7·L_CTC + 0.3·L_VAD. The presence gate is then trained with binary cross-entropy on top of the frozen backbone. Training data are LibriMix (English) and KazMix-3 (Kazakh); the model is also evaluated on PersonaMix.
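
The two training objectives can be written out as a small sketch. The weights 0.7 and 0.3 come from this card; the function names and example values are made up, and the exact form of the CTC and VAD terms used in training is not specified here.

```python
import math
from typing import Sequence


def backbone_loss(ctc_loss: float, vad_loss: float,
                  w_ctc: float = 0.7, w_vad: float = 0.3) -> float:
    """Weighted sum of the CTC and VAD losses, weights as stated on the card."""
    return w_ctc * ctc_loss + w_vad * vad_loss


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean BCE between predicted presence probabilities and 0/1 labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Example values (made up): CTC loss 2.0, VAD loss 1.0.
total_loss = backbone_loss(2.0, 1.0)  # 0.7 * 2.0 + 0.3 * 1.0

# Gate trained on two utterances: one target-present, one target-absent.
gate_loss = binary_cross_entropy([0.9, 0.2], [1, 0])
```

Because the gate is trained on the frozen backbone, only gate_loss would drive gradient updates in that second stage.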

Usage

Inference and evaluation code: github.com/IS2AI/Persona_ASR.

License and citation

Released under CC BY 4.0. Please cite Persona-ASR.

@article{persona_asr,
  title = {Persona-ASR: Bilingual Target-Speaker Speech Recognition for Kazakh--English Overlapping Speech},
  year  = {2026}
}