# HOT-Step CPP SuperSep: ONNX Stem Separation Models

Pre-converted ONNX models for multi-stem audio separation in HOT-Step CPP. They run natively via ONNX Runtime GPU; no Python is required.
## Models
| File | Architecture | Size | Purpose |
|---|---|---|---|
| `bs_roformer_sw.onnx` | BS-Roformer | 672 MB | Stage 1: Primary 6-stem split (vocals, drums, bass, guitar, piano, other) |
| `mel_band_roformer_karaoke.onnx` | Mel-Band RoFormer | 875 MB | Stage 2: Vocal sub-separation (lead vs. backing) |
| `mdx23c_drumsep.onnx` | MDX23C | 418 MB | Stage 3: Drum sub-separation (kick, snare, toms, hi-hat, cymbals) |
| `htdemucs_6s.onnx` | HTDemucs | 105 MB | Stage 4: "Other" stem refinement |
Total: ~2.07 GB
## Usage
These models are designed for use with the HOT-Step CPP Model Manager. In the app:
1. Open the Model Manager (click "Get More Models" in the Models dropdown).
2. Go to the Stem Separation tab.
3. Click Download on each model (or use the Stem Separation starter pack).
Models are downloaded to `models/supersep/` and loaded automatically by the SuperSep engine.
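If you script around the Model Manager, a small pre-flight check like the sketch below can report which of the four files listed in this README are still missing from the download directory. The helper name and the check itself are illustrative; they are not part of the HOT-Step CPP API.

```cpp
#include <array>
#include <filesystem>
#include <string>
#include <vector>

// The four model files this README says the SuperSep engine expects
// under models/supersep/.
static const std::array<std::string, 4> kSuperSepModels = {
    "bs_roformer_sw.onnx",
    "mel_band_roformer_karaoke.onnx",
    "mdx23c_drumsep.onnx",
    "htdemucs_6s.onnx",
};

// Return the subset of expected model files not present in `dir`.
// (Hypothetical helper, shown only to illustrate the expected layout.)
std::vector<std::string> missingModels(const std::filesystem::path& dir) {
    std::vector<std::string> missing;
    for (const auto& name : kSuperSepModels) {
        if (!std::filesystem::exists(dir / name)) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

Calling `missingModels("models/supersep")` before starting a separation job gives a clear error message instead of a mid-pipeline load failure.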
## Technical Details
- Format: ONNX (opset 18, legacy TorchScript export)
- Precision: FP32
- Input: Spectrogram representation (STFT performed in C++ engine)
- Output: Separation masks (iSTFT performed in C++ engine)
- Runtime: ONNX Runtime 1.25.1+ with CUDA Execution Provider
Only the neural network portion of each model is exported to ONNX; the STFT/iSTFT operations are handled natively in C++ for optimal performance.
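To make the split concrete, the engine-side work looks conceptually like the sketch below: window a frame, transform it to a complex spectrum, hand the spectrogram to the network, then multiply the spectrum by the predicted mask before the iSTFT. The frame size, Hann window, and naive DFT here are placeholder assumptions; HOT-Step CPP's actual STFT parameters and FFT implementation are not documented in this README.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

// Naive DFT of one Hann-windowed frame: the kind of work the C++ engine
// keeps outside the exported ONNX graph. A real engine uses an FFT.
std::vector<std::complex<double>> dftFrame(const std::vector<double>& frame) {
    const std::size_t n = frame.size();
    std::vector<std::complex<double>> bins(n / 2 + 1);
    for (std::size_t k = 0; k < bins.size(); ++k) {
        std::complex<double> acc{0.0, 0.0};
        for (std::size_t t = 0; t < n; ++t) {
            const double hann = 0.5 - 0.5 * std::cos(2.0 * kPi * t / n);
            const double phase = -2.0 * kPi * static_cast<double>(k) * t / n;
            acc += frame[t] * hann *
                   std::complex<double>{std::cos(phase), std::sin(phase)};
        }
        bins[k] = acc;  // One spectrogram column fed to the network.
    }
    return bins;
}

// Applying a model-predicted separation mask is an elementwise multiply on
// the complex spectrum; the iSTFT (not shown) then reconstructs audio.
std::vector<std::complex<double>> applyMask(std::vector<std::complex<double>> bins,
                                            const std::vector<double>& mask) {
    for (std::size_t k = 0; k < bins.size(); ++k) {
        bins[k] *= mask[k];
    }
    return bins;
}
```

Keeping this transform pair in C++ means the ONNX graphs stay purely tensor-to-tensor, which is what makes the legacy TorchScript export of these attention architectures tractable.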
## Conversion
These models were converted from PyTorch checkpoints using the MSS_ONNX_TensorRT toolset with `dynamo=False` (the legacy TorchScript exporter) for compatibility with complex attention architectures.
## Attribution & Licenses

### Training & Checkpoints
- BS-Roformer checkpoint by aufr33, trained on the Music Source Separation framework
- Mel-Band RoFormer Karaoke checkpoint by aufr33 & viperx (SDR 10.1956 on karaoke separation)
- MDX23C DrumSep checkpoint by aufr33 & jarredou, for drum sub-component isolation
- HTDemucs by Meta / Facebook AI Research (Hybrid Transformer architecture)
### Frameworks & Tools
- Music-Source-Separation-Training by ZFTurbo: training framework for the BS-Roformer, Mel-Band RoFormer, and MDX23C architectures
- MSS_ONNX_TensorRT by ZFTurbo: ONNX conversion tooling with STFT extraction and model validation
- Demucs by Meta Research: HTDemucs architecture and pre-trained weights (MIT License)
### Architecture Papers
- BS-Roformer: "Music Source Separation with Band-Split RoFormer" (arXiv:2309.02612)
- Mel-Band RoFormer: Mel-frequency variant of Band-Split RoFormer
- MDX23C: Based on TFC-TDF-UNet v3 architecture
- HTDemucs: "Hybrid Transformers for Music Source Separation" (arXiv:2211.08553)
## License
The conversion and packaging are released under the MIT License. Individual model weights remain subject to their original training licenses; see the attribution links above for details.