# LS-EEND CoreML Models
CoreML exports of LS-EEND, a long-form streaming end-to-end neural diarization model with online attractor extraction.
This repository contains non-quantized CoreML step models for four LS-EEND variants:

- AMI
- CALLHOME
- DIHARD II
- DIHARD III
These models are intended for stateful streaming inference. Each package runs one LS-EEND step at a time with explicit recurrent/cache tensors, rather than processing an entire utterance in a single call.
## Included files

Each variant directory contains:

- `*.mlpackage`: the CoreML model package
- `*.json`: metadata needed by the runtime
- `*.mlmodelc`: a compiled CoreML bundle generated locally for convenience

Variant directories:

- `AMI/`
- `CALLHOME/`
- `DIHARD II/`
- `DIHARD III/`
## Variants

| Variant | Package | Configured max speakers | Model output capacity |
|---|---|---|---|
| AMI | `AMI/ls_eend_ami_step.mlpackage` | 4 | 6 |
| CALLHOME | `CALLHOME/ls_eend_callhome_step.mlpackage` | 7 | 9 |
| DIHARD II | `DIHARD II/ls_eend_dih2_step.mlpackage` | 10 | 12 |
| DIHARD III | `DIHARD III/ls_eend_dih3_step.mlpackage` | 10 | 12 |
The metadata JSON distinguishes between:

- `max_speakers`: the dataset/config speaker setting from the LS-EEND infer YAML
- `max_nspks`: the exported model's full decode/output capacity
## Frontend and runtime assumptions
All four non-quantized exports in this repo use the same frontend settings:
- sample rate: 8000 Hz
- window length: 200 samples
- hop length: 80 samples
- FFT size: 1024
- mel bins: 23
- context receptive field: 7
- subsampling: 10
- feature type: `logmel23_cummn`
- output frame rate: 10 Hz
- compute precision: float32
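To illustrate how the `logmel23_cummn` settings above fit together, here is a rough numpy sketch of the cumulative mean normalization ("cummn"), the 7-frame context receptive field, and the 10x subsampling. The exact upstream feature code in FS-EEND may differ in padding and edge handling, and `cumulative_mean_normalize`/`splice_context` are hypothetical helper names, not part of the shipped runtimes.

```python
import numpy as np

def cumulative_mean_normalize(logmel: np.ndarray) -> np.ndarray:
    """Subtract from each frame the mean of all frames up to and including it."""
    csum = np.cumsum(logmel, axis=0)
    counts = np.arange(1, logmel.shape[0] + 1)[:, None]
    return logmel - csum / counts

def splice_context(feats: np.ndarray, context: int = 7) -> np.ndarray:
    """Stack +/- `context` frames around each frame (edge-padded)."""
    n = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i : i + n] for i in range(2 * context + 1)]
    return np.stack(windows, axis=1).reshape(n, -1)

# 100 frames of 23 log-mel bins at 100 frames/s (hop 80 @ 8 kHz)
frames = np.random.randn(100, 23)
normed = cumulative_mean_normalize(frames)
spliced = splice_context(normed, context=7)  # (100, 23 * 15)
subsampled = spliced[::10]                   # every 10th frame -> 10 Hz
```

With 23 mel bins and a ±7-frame context, each spliced frame has 23 × 15 = 345 features, and subsampling by 10 turns the 100 Hz frame stream into the 10 Hz output rate listed above.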
These are step-wise streaming models. A runtime must maintain and feed the recurrent state tensors between calls:
- `enc_ret_kv`
- `enc_ret_scale`
- `enc_conv_cache`
- `dec_ret_kv`
- `dec_ret_scale`
- `top_buffer`
The CoreML inputs and outputs follow the LS-EEND step export used by the reference Python and Swift runtimes.
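As a sketch of the driver loop such a runtime needs, the following pure-Python function threads the six state tensors through successive step calls. The `step` callable stands in for an actual CoreML prediction, and the `feats`/`logits` input and output names are assumptions for illustration; consult the variant's sidecar JSON for the real names and shapes.

```python
from typing import Callable, Dict, Iterable, List
import numpy as np

STATE_NAMES = [
    "enc_ret_kv", "enc_ret_scale", "enc_conv_cache",
    "dec_ret_kv", "dec_ret_scale", "top_buffer",
]

def run_streaming(
    step: Callable[[Dict[str, np.ndarray]], Dict[str, np.ndarray]],
    feature_chunks: Iterable[np.ndarray],
    init_state: Dict[str, np.ndarray],
) -> np.ndarray:
    """Run one LS-EEND step per feature chunk, carrying state across calls."""
    state = dict(init_state)
    logits: List[np.ndarray] = []
    for chunk in feature_chunks:
        result = step({"feats": chunk, **state})
        # Feed the returned state tensors back into the next call.
        state = {name: result[name] for name in STATE_NAMES}
        logits.append(result["logits"])
    return np.concatenate(logits, axis=0)
```

With coremltools on macOS, `step` would wrap `MLModel.predict`; the key point is that every state tensor the model emits must be copied back into the inputs of the next call.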
## Intended usage
Use these packages with a runtime that:
- Resamples audio to mono 8 kHz
- Extracts LS-EEND features with the settings above
- Preserves model state across step calls
- Uses `ingest`/`decode` control inputs to handle the encoder delay and final tail flush
- Applies postprocessing such as sigmoid, thresholding, optional median filtering, and RTTM conversion outside the CoreML graph
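A minimal postprocessing sketch, assuming the model emits raw per-frame speaker logits at 10 Hz. Median filtering is omitted for brevity, and `to_rttm` is a hypothetical helper, not part of the shipped runtimes.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def to_rttm(logits: np.ndarray, threshold: float = 0.5,
            frame_hz: float = 10.0, uri: str = "session") -> list:
    """Convert per-frame speaker logits (frames, speakers) to RTTM SPEAKER lines."""
    active = sigmoid(logits) > threshold
    lines = []
    for spk in range(active.shape[1]):
        on = None
        for t, a in enumerate(active[:, spk]):
            if a and on is None:
                on = t  # segment starts
            elif not a and on is not None:
                dur = (t - on) / frame_hz
                lines.append(f"SPEAKER {uri} 1 {on / frame_hz:.2f} {dur:.2f} "
                             f"<NA> <NA> spk{spk} <NA> <NA>")
                on = None
        if on is not None:  # flush a segment still open at the end
            dur = (active.shape[0] - on) / frame_hz
            lines.append(f"SPEAKER {uri} 1 {on / frame_hz:.2f} {dur:.2f} "
                         f"<NA> <NA> spk{spk} <NA> <NA>")
    return lines
```

Each output column is treated as one speaker slot; a real pipeline would add median filtering and speaker-slot alignment on top of this.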
This repository is not a drop-in replacement for generic Hugging Face transformers inference. It is meant for custom CoreML runtimes, such as:
- the Python LS-EEND CoreML runtime from the FS-EEND project
- the Swift/macOS runtime used for the LS-EEND CoreML microphone demo
## Minimal metadata example

Each variant ships a sidecar JSON with fields like:

```json
{
  "sample_rate": 8000,
  "win_length": 200,
  "hop_length": 80,
  "n_fft": 1024,
  "n_mels": 23,
  "context_recp": 7,
  "subsampling": 10,
  "feat_type": "logmel23_cummn",
  "frame_hz": 10.0,
  "max_speakers": 10,
  "max_nspks": 12
}
```
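A small sanity check a runtime might run against this metadata, assuming the field names shown above (the inline string here stands in for reading the actual `*.json` file):

```python
import json

# Inline copy of the example metadata; a real runtime would load the sidecar file.
meta = json.loads("""{
  "sample_rate": 8000, "win_length": 200, "hop_length": 80,
  "n_fft": 1024, "n_mels": 23, "context_recp": 7,
  "subsampling": 10, "feat_type": "logmel23_cummn",
  "frame_hz": 10.0, "max_speakers": 10, "max_nspks": 12
}""")

# The output frame rate follows from sample rate, hop, and subsampling:
# 8000 / 80 = 100 feature frames/s, divided by 10 = 10 Hz.
derived_hz = meta["sample_rate"] / meta["hop_length"] / meta["subsampling"]
assert derived_hz == meta["frame_hz"]

# The decode capacity must cover the configured speaker count.
assert meta["max_nspks"] >= meta["max_speakers"]
```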
Check the variant-specific *.json file for the exact state tensor shapes and output dimensions.
## Source project
These CoreML exports were produced from the LS-EEND code in the FS-EEND repository:
- GitHub: Audio-WestlakeU/FS-EEND
The export path is based on the LS-EEND CoreML step exporter and variant batch exporter in that project.
## Training and evaluation context
From the source project, the reported real-world diarization error rates are:
| Dataset | DER (%) |
|---|---|
| CALLHOME | 12.11 |
| DIHARD II | 27.58 |
| DIHARD III | 19.61 |
| AMI Dev | 20.97 |
| AMI Eval | 20.76 |
These numbers come from the upstream LS-EEND project README and reflect the original training/evaluation setup, not a Hugging Face evaluation pipeline.
## Limitations
- These models are exported for Apple CoreML runtimes, not for PyTorch or ONNX consumers.
- They are stateful streaming step models, so they require a custom driver loop.
- They assume an 8 kHz LS-EEND frontend and will not produce matching results if you use a different spectrogram pipeline.
- Speaker identities are output as activity tracks/slots and still require downstream diarization postprocessing and speaker-slot alignment where appropriate.
## License and dataset constraints
The upstream LS-EEND model/codebase used for these CoreML exports is MIT-licensed, and this repository is published as MIT accordingly.
The underlying evaluation and fine-tuning datasets still have their own access and usage terms:
- AMI
- CALLHOME
- DIHARD II
- DIHARD III
This repository redistributes CoreML exports of the LS-EEND model variants. Dataset licensing and access requirements remain governed by the original dataset providers.
## Citation

If you use LS-EEND, cite the original paper:

```bibtex
@ARTICLE{11122273,
  author={Liang, Di and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  title={LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction},
  year={2025},
  volume={33},
  pages={3568-3581},
  doi={10.1109/TASLPRO.2025.3597446}
}
```