HOT-Step CPP SuperSep: ONNX Stem Separation Models

Pre-converted ONNX models for multi-stem audio separation in HOT-Step CPP. These run natively via ONNX Runtime GPU, with no Python required.

Models

File                            Architecture       Size    Purpose
bs_roformer_sw.onnx             BS-Roformer        672 MB  Stage 1: Primary 6-stem split (vocals, drums, bass, guitar, piano, other)
mel_band_roformer_karaoke.onnx  Mel-Band RoFormer  875 MB  Stage 2: Vocal sub-separation (lead vs backing)
mdx23c_drumsep.onnx             MDX23C             418 MB  Stage 3: Drum sub-separation (kick, snare, toms, hi-hat, cymbals)
htdemucs_6s.onnx                HTDemucs           105 MB  Stage 4: "Other" stem refinement

Total: ~2.07 GB
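
The four stages in the table can be written down as a small manifest, which also makes the total size easy to check. This is only an illustrative sketch: the `Stage` class, the `STAGES` tuple and the stem labels for stage 2 ("lead", "backing") are names chosen here, not identifiers from HOT-Step CPP. Stage 4 outputs are left empty because the table does not list them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One row of the model table (hypothetical structure for illustration)."""
    number: int
    filename: str
    architecture: str
    size_mb: int
    outputs: tuple  # stem names as listed in the table; empty if not listed


STAGES = (
    Stage(1, "bs_roformer_sw.onnx", "BS-Roformer", 672,
          ("vocals", "drums", "bass", "guitar", "piano", "other")),
    Stage(2, "mel_band_roformer_karaoke.onnx", "Mel-Band RoFormer", 875,
          ("lead", "backing")),
    Stage(3, "mdx23c_drumsep.onnx", "MDX23C", 418,
          ("kick", "snare", "toms", "hi-hat", "cymbals")),
    Stage(4, "htdemucs_6s.onnx", "HTDemucs", 105, ()),
)


def total_size_gb(stages=STAGES):
    """Sum the per-model sizes and convert MB to GB (1 GB = 1000 MB)."""
    return sum(stage.size_mb for stage in stages) / 1000


if __name__ == "__main__":
    print(f"{total_size_gb():.2f} GB")  # prints "2.07 GB"
```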

Usage

These models are designed for use with the HOT-Step CPP Model Manager. In the app:

  1. Open the Model Manager (click "Get More Models" in the Models dropdown)
  2. Go to the Stem Separation tab
  3. Click Download on each model (or use the Stem Separation starter pack)

Models are downloaded to models/supersep/ and loaded automatically by the SuperSep engine.
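
If you place the files manually rather than through the Model Manager, a quick check like the one below can confirm that all four are present. It is a minimal sketch, assuming the directory layout described above; `missing_models` is a hypothetical helper and not part of HOT-Step CPP.

```python
from pathlib import Path

# File names taken from the Models table.
EXPECTED_MODELS = (
    "bs_roformer_sw.onnx",
    "mel_band_roformer_karaoke.onnx",
    "mdx23c_drumsep.onnx",
    "htdemucs_6s.onnx",
)


def missing_models(model_dir="models/supersep"):
    """Return the expected model files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_MODELS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing models:", ", ".join(missing))
    else:
        print("All SuperSep models found.")
```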

Technical Details

  • Format: ONNX (opset 18, legacy TorchScript export)
  • Precision: FP32
  • Input: Spectrogram representation (STFT performed in C++ engine)
  • Output: Separation masks (iSTFT performed in C++ engine)
  • Runtime: ONNX Runtime 1.25.1+ with CUDA Execution Provider

The exported models contain only the neural network portion; STFT/iSTFT operations are handled natively in C++ for performance.
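
Although the app loads these models from C++, they can also be opened from Python for inspection. The sketch below shows one way to request the CUDA Execution Provider through the `onnxruntime` Python package. The CPU fallback and the helper names `choose_providers` and `open_session` are assumptions made for this example; they do not describe how the SuperSep engine itself selects a provider.

```python
def choose_providers(available):
    """Pick execution providers in preference order: CUDA first, then CPU.

    `available` is a list of provider names, e.g. the result of
    onnxruntime.get_available_providers().
    """
    preferred = ("CUDAExecutionProvider", "CPUExecutionProvider")
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("No supported ONNX Runtime execution provider found")
    return chosen


def open_session(model_path):
    """Open an ONNX model with the preferred providers (needs onnxruntime)."""
    import onnxruntime as ort  # third-party; imported lazily on purpose

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(str(model_path), providers=providers)
```

Calling `open_session("models/supersep/bs_roformer_sw.onnx").get_inputs()` would then list the input names and shapes the model expects, which is useful because the spectrogram layout is not documented here.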

Conversion

These were converted from PyTorch checkpoints using the MSS_ONNX_TensorRT toolset with dynamo=False (legacy TorchScript exporter) for compatibility with complex attention architectures.
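
For readers unfamiliar with the exporter switch mentioned above, the following sketch shows how `opset_version=18` and `dynamo=False` are passed to `torch.onnx.export`. It is a generic illustration, not the MSS_ONNX_TensorRT script: `export_kwargs` and `export_model` are hypothetical helpers, the model and example input are placeholders you would supply, and the `dynamo` keyword assumes a PyTorch release recent enough to accept it.

```python
def export_kwargs(opset=18):
    """Exporter settings matching the Technical Details section."""
    return {
        "opset_version": opset,  # ONNX opset 18
        "dynamo": False,         # use the legacy TorchScript exporter
    }


def export_model(model, example_input, out_path):
    """Export a PyTorch module to ONNX (needs torch; not run in this sketch).

    `model` is a torch.nn.Module containing only the network, without
    STFT/iSTFT, and `example_input` is a spectrogram-shaped tensor it accepts.
    """
    import torch  # third-party; imported lazily on purpose

    model.eval()
    with torch.no_grad():
        torch.onnx.export(model, (example_input,), str(out_path), **export_kwargs())
```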

Attribution & Licenses

Training & Checkpoints

  • BS-Roformer checkpoint by aufr33: trained on the Music Source Separation framework
  • Mel-Band RoFormer Karaoke checkpoint by aufr33 & viperx: SDR 10.1956 on karaoke separation
  • MDX23C DrumSep checkpoint by aufr33 & jarredou: drum sub-component isolation
  • HTDemucs by Meta / Facebook AI Research: Hybrid Transformer architecture

Frameworks & Tools

  • Music-Source-Separation-Training by ZFTurbo: training framework for BS-Roformer, Mel-Band RoFormer, and MDX23C architectures
  • MSS_ONNX_TensorRT by ZFTurbo: ONNX conversion tooling with STFT extraction and model validation
  • Demucs by Meta Research: HTDemucs architecture and pre-trained weights (MIT License)

Architecture Papers

  • BS-Roformer: "Music Source Separation with Band-Split RoFormer" (arXiv:2309.02612)
  • Mel-Band RoFormer: Mel-frequency variant of Band-Split RoFormer
  • MDX23C: Based on TFC-TDF-UNet v3 architecture
  • HTDemucs: "Hybrid Transformers for Music Source Separation" (arXiv:2211.08553)

License

The conversion and packaging are released under the MIT License. Individual model weights are subject to their original training licenses; see the attribution links above for details.