# LS-EEND CoreML Models
CoreML exports of LS-EEND, a long-form streaming end-to-end neural diarization model with online attractor extraction.
This repository contains non-quantized CoreML step models for four LS-EEND variants:

- AMI
- CALLHOME
- DIHARD II
- DIHARD III
These models are intended for stateful streaming inference. Each package runs one LS-EEND step at a time with explicit recurrent/cache tensors, rather than processing an entire utterance in a single call.
## Included files

Each variant directory contains:

- `*.mlpackage`: the CoreML model package
- `*.json`: metadata needed by the runtime
- `*.mlmodelc`: a compiled CoreML bundle generated locally for convenience

Variant directories:

- `AMI/`
- `CALLHOME/`
- `DIHARD II/`
- `DIHARD III/`
## Variants

| Variant | Package | Configured max speakers | Model output capacity |
|---|---|---|---|
| AMI | `AMI/ls_eend_ami_step.mlpackage` | 4 | 6 |
| CALLHOME | `CALLHOME/ls_eend_callhome_step.mlpackage` | 7 | 9 |
| DIHARD II | `DIHARD II/ls_eend_dih2_step.mlpackage` | 10 | 12 |
| DIHARD III | `DIHARD III/ls_eend_dih3_step.mlpackage` | 10 | 12 |
The metadata JSON distinguishes between:

- `max_speakers`: the dataset/config speaker setting from the LS-EEND infer YAML
- `max_nspks`: the exported model's full decode/output capacity
## Frontend and runtime assumptions
All four non-quantized exports in this repo use the same frontend settings:
- sample rate: 8000 Hz
- window length: 200 samples
- hop length: 80 samples
- FFT size: 1024
- mel bins: 23
- context receptive field: 7
- subsampling: 10
- feature type: `logmel23_cummn`
- output frame rate: 10 Hz
- compute precision: float32
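To illustrate how the `logmel23_cummn` settings above fit together, here is a rough numpy sketch of the cumulative mean normalization ("cummn"), the 7-frame context receptive field, and the 10x subsampling. The exact upstream feature code in FS-EEND may differ in padding and edge handling, and `cumulative_mean_normalize`/`splice_context` are hypothetical helper names, not part of the shipped runtimes.

```python
import numpy as np

def cumulative_mean_normalize(logmel: np.ndarray) -> np.ndarray:
    """Subtract from each frame the mean of all frames up to and including it."""
    csum = np.cumsum(logmel, axis=0)
    counts = np.arange(1, logmel.shape[0] + 1)[:, None]
    return logmel - csum / counts

def splice_context(feats: np.ndarray, context: int = 7) -> np.ndarray:
    """Stack +/- `context` frames around each frame (edge-padded)."""
    n = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i : i + n] for i in range(2 * context + 1)]
    return np.stack(windows, axis=1).reshape(n, -1)

# 100 frames of 23 log-mel bins at 100 frames/s (hop 80 @ 8 kHz)
frames = np.random.randn(100, 23)
normed = cumulative_mean_normalize(frames)
spliced = splice_context(normed, context=7)  # (100, 23 * 15)
subsampled = spliced[::10]                   # every 10th frame -> 10 Hz
```

With 23 mel bins and a ±7-frame context, each spliced frame has 23 × 15 = 345 features, and subsampling by 10 turns the 100 Hz frame stream into the 10 Hz output rate listed above.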
These are step-wise streaming models. A runtime must maintain and feed the recurrent state tensors between calls:
- `enc_ret_kv`
- `enc_ret_scale`
- `enc_conv_cache`
- `dec_ret_kv`
- `dec_ret_scale`
- `top_buffer`
The CoreML inputs and outputs follow the LS-EEND step export used by the reference Python and Swift runtimes.
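As a sketch of the driver loop such a runtime needs, the following pure-Python function threads the six state tensors through successive step calls. The `step` callable stands in for an actual CoreML prediction, and the `feats`/`logits` input and output names are assumptions for illustration; consult the variant's sidecar JSON for the real names and shapes.

```python
from typing import Callable, Dict, Iterable, List
import numpy as np

STATE_NAMES = [
    "enc_ret_kv", "enc_ret_scale", "enc_conv_cache",
    "dec_ret_kv", "dec_ret_scale", "top_buffer",
]

def run_streaming(
    step: Callable[[Dict[str, np.ndarray]], Dict[str, np.ndarray]],
    feature_chunks: Iterable[np.ndarray],
    init_state: Dict[str, np.ndarray],
) -> np.ndarray:
    """Run one LS-EEND step per feature chunk, carrying state across calls."""
    state = dict(init_state)
    logits: List[np.ndarray] = []
    for chunk in feature_chunks:
        result = step({"feats": chunk, **state})
        # Feed the returned state tensors back into the next call.
        state = {name: result[name] for name in STATE_NAMES}
        logits.append(result["logits"])
    return np.concatenate(logits, axis=0)
```

With coremltools on macOS, `step` would wrap `MLModel.predict`; the key point is that every state tensor the model emits must be copied back into the inputs of the next call.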
## Intended usage
Use these packages with a runtime that:
- Resamples audio to mono 8 kHz
- Extracts LS-EEND features with the settings above
- Preserves model state across step calls
- Uses `ingest`/`decode` control inputs to handle the encoder delay and final tail flush
- Applies postprocessing such as sigmoid, thresholding, optional median filtering, and RTTM conversion outside the CoreML graph
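A minimal postprocessing sketch, assuming the model emits raw per-frame speaker logits at 10 Hz. Median filtering is omitted for brevity, and `to_rttm` is a hypothetical helper, not part of the shipped runtimes.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def to_rttm(logits: np.ndarray, threshold: float = 0.5,
            frame_hz: float = 10.0, uri: str = "session") -> list:
    """Convert per-frame speaker logits (frames, speakers) to RTTM SPEAKER lines."""
    active = sigmoid(logits) > threshold
    lines = []
    for spk in range(active.shape[1]):
        on = None
        for t, a in enumerate(active[:, spk]):
            if a and on is None:
                on = t  # segment starts
            elif not a and on is not None:
                dur = (t - on) / frame_hz
                lines.append(f"SPEAKER {uri} 1 {on / frame_hz:.2f} {dur:.2f} "
                             f"<NA> <NA> spk{spk} <NA> <NA>")
                on = None
        if on is not None:  # flush a segment still open at the end
            dur = (active.shape[0] - on) / frame_hz
            lines.append(f"SPEAKER {uri} 1 {on / frame_hz:.2f} {dur:.2f} "
                         f"<NA> <NA> spk{spk} <NA> <NA>")
    return lines
```

Each output column is treated as one speaker slot; a real pipeline would add median filtering and speaker-slot alignment on top of this.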
This repository is not a drop-in replacement for generic Hugging Face transformers inference. It is meant for custom CoreML runtimes, such as:
- the Python LS-EEND CoreML runtime from the FS-EEND project
- the Swift/macOS runtime used for the LS-EEND CoreML microphone demo
## Minimal metadata example

Each variant ships a sidecar JSON with fields like:

```json
{
  "sample_rate": 8000,
  "win_length": 200,
  "hop_length": 80,
  "n_fft": 1024,
  "n_mels": 23,
  "context_recp": 7,
  "subsampling": 10,
  "feat_type": "logmel23_cummn",
  "frame_hz": 10.0,
  "max_speakers": 10,
  "max_nspks": 12
}
```
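A small sanity check a runtime might run against this metadata, assuming the field names shown above (the inline string here stands in for reading the actual `*.json` file):

```python
import json

# Inline copy of the example metadata; a real runtime would load the sidecar file.
meta = json.loads("""{
  "sample_rate": 8000, "win_length": 200, "hop_length": 80,
  "n_fft": 1024, "n_mels": 23, "context_recp": 7,
  "subsampling": 10, "feat_type": "logmel23_cummn",
  "frame_hz": 10.0, "max_speakers": 10, "max_nspks": 12
}""")

# The output frame rate follows from sample rate, hop, and subsampling:
# 8000 / 80 = 100 feature frames/s, divided by 10 = 10 Hz.
derived_hz = meta["sample_rate"] / meta["hop_length"] / meta["subsampling"]
assert derived_hz == meta["frame_hz"]

# The decode capacity must cover the configured speaker count.
assert meta["max_nspks"] >= meta["max_speakers"]
```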
Check the variant-specific *.json file for the exact state tensor shapes and output dimensions.
## Source project
These CoreML exports were produced from the LS-EEND code in the FS-EEND repository:
- GitHub: Audio-WestlakeU/FS-EEND
The export path is based on the LS-EEND CoreML step exporter and variant batch exporter in that project.
## Training and evaluation context
From the source project, the reported real-world diarization error rates are:
| Dataset | DER (%) |
|---|---|
| CALLHOME | 12.11 |
| DIHARD II | 27.58 |
| DIHARD III | 19.61 |
| AMI Dev | 20.97 |
| AMI Eval | 20.76 |
These numbers come from the upstream LS-EEND project README and reflect the original training/evaluation setup, not a Hugging Face evaluation pipeline.
## Limitations
- These models are exported for Apple CoreML runtimes, not for PyTorch or ONNX consumers.
- They are stateful streaming step models, so they require a custom driver loop.
- They assume an 8 kHz LS-EEND frontend and will not produce matching results if you use a different spectrogram pipeline.
- Speaker identities are output as activity tracks/slots and still require downstream diarization postprocessing and speaker-slot alignment where appropriate.
## License and dataset constraints
The upstream LS-EEND model/codebase used for these CoreML exports is MIT-licensed, and this repository is published as MIT accordingly.
The underlying evaluation and fine-tuning datasets still have their own access and usage terms:
- AMI
- CALLHOME
- DIHARD II
- DIHARD III
This repository redistributes CoreML exports of the LS-EEND model variants. Dataset licensing and access requirements remain governed by the original dataset providers.
## Citation

If you use LS-EEND, cite the original paper:

```bibtex
@ARTICLE{11122273,
  author={Liang, Di and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  title={LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction},
  year={2025},
  volume={33},
  pages={3568-3581},
  doi={10.1109/TASLPRO.2025.3597446}
}
```