# ACE-Step 1.5 GGUF
Pre-quantized GGUF models for acestep.cpp, a portable C++17 implementation of ACE-Step 1.5 AI music generation built on GGML.
Text + lyrics in, stereo 48 kHz audio out. Runs on CPU, CUDA, Metal, and Vulkan.
## Quick start

```sh
git clone --recurse-submodules https://github.com/ServeurpersoCom/acestep.cpp
cd acestep.cpp
pip install huggingface_hub
./models.sh # downloads Q8_0 turbo essentials (~7.7 GB)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
cd ..
./build/ace-server --models ./models --host 0.0.0.0 --port 8085
```
Open http://localhost:8085 in your browser. The embedded WebUI handles everything: write a caption, set lyrics, generate, play, and download tracks.
Models are loaded on demand and swapped automatically from the UI.
## CLI tools (without the server)

```sh
# LM: generate lyrics + audio codes
./build/ace-lm \
    --request /tmp/request.json \
    --lm models/acestep-5Hz-lm-4B-Q8_0.gguf

# DiT + VAE: synthesize audio
./build/ace-synth \
    --request /tmp/request0.json \
    --embedding models/Qwen3-Embedding-0.6B-Q8_0.gguf \
    --dit models/acestep-v15-turbo-Q8_0.gguf \
    --vae models/vae-BF16.gguf
```
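The `--request` argument points at a plain JSON file. Its exact schema is not documented here, so the field names below (`caption`, `lyrics`, `duration`) are illustrative assumptions, not the tools' real schema; a minimal sketch for creating and sanity-checking such a file before handing it to `ace-lm`:

```shell
# Write a hypothetical request file; the field names are assumptions, not the real schema.
cat > /tmp/request.json <<'EOF'
{
  "caption": "dreamy synthwave with a driving bassline",
  "lyrics": "[verse]\nNeon rivers under midnight skies",
  "duration": 60
}
EOF

# Validate that the file is well-formed JSON before passing it to the CLI tools.
python3 -m json.tool /tmp/request.json > /dev/null && echo "request.json OK"
```

Validating up front is cheap insurance: a malformed request fails fast here instead of partway into model loading.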
## Available models
### Text encoder
| File | Quant | Size |
|---|---|---|
| Qwen3-Embedding-0.6B-BF16.gguf | BF16 | 1.2 GB |
| Qwen3-Embedding-0.6B-Q8_0.gguf | Q8_0 | 748 MB |
Frozen Qwen3 encoder (28 layers, 1024-dim). The DiT was trained end-to-end with this exact model. Its CondEncoder projection weights (1024 to 2048) are baked into every DiT checkpoint, so the Text-Enc is architecturally locked to 0.6B.
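That lock is easy to see in the numbers: the 1024 -> 2048 CondEncoder projection alone carries about 2.1 M weights trained against the 0.6B encoder's exact 1024-dim output space. A quick check of that count (weights only, ignoring any bias term):

```shell
# Parameters in a 1024 -> 2048 linear projection (weight matrix only, no bias).
echo $((1024 * 2048))   # 2097152, ~2.1M weights baked into every DiT checkpoint
```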
### LM (Qwen3 causal, audio code generation)
| File | Params | Quant | Size |
|---|---|---|---|
| acestep-5Hz-lm-4B-BF16.gguf | 4B | BF16 | 7.9 GB |
| acestep-5Hz-lm-4B-Q8_0.gguf | 4B | Q8_0 | 4.2 GB |
| acestep-5Hz-lm-4B-Q6_K.gguf | 4B | Q6_K | 3.3 GB |
| acestep-5Hz-lm-4B-Q5_K_M.gguf | 4B | Q5_K_M | 2.9 GB |
| acestep-5Hz-lm-1.7B-BF16.gguf | 1.7B | BF16 | 3.5 GB |
| acestep-5Hz-lm-1.7B-Q8_0.gguf | 1.7B | Q8_0 | 1.9 GB |
| acestep-5Hz-lm-0.6B-BF16.gguf | 0.6B | BF16 | 1.3 GB |
| acestep-5Hz-lm-0.6B-Q8_0.gguf | 0.6B | Q8_0 | 677 MB |
The small LMs (0.6B/1.7B) are published only in BF16 and Q8_0: they are too small to survive aggressive quantization. The 4B LM has no Q4_K_M because that quant breaks audio code generation.
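As a rough sanity check on the table (sizes and parameter counts are rounded), file size divided by parameter count gives the effective bits per weight. For the 4B Q8_0 file at 4.2 GB this lands near Q8_0's nominal ~8.5 bits/weight (8-bit values plus per-block scales):

```shell
# Effective bits per weight for acestep-5Hz-lm-4B-Q8_0.gguf (~4.2 GB, ~4B params).
awk 'BEGIN { printf "%.1f bits/weight\n", 4.2e9 * 8 / 4e9 }'   # 8.4 bits/weight
```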
### DiT (flow matching diffusion transformer)
#### Standard (2B)
Available for 7 variants: turbo, sft, sftturbo50, base, turbo-shift1, turbo-shift3, turbo-continuous.
| Quant | Size per variant |
|---|---|
| BF16 | 4.5 GB |
| Q8_0 | 2.4 GB |
| Q6_K | 1.9 GB |
| Q5_K_M | 1.6 GB |
| Q4_K_M | 1.4 GB |
#### XL (4B)
Available for 4 variants: xl-turbo, xl-sft, xl-sftturbo50, xl-base.
| Quant | Size per variant |
|---|---|
| BF16 | 9.3 GB |
| Q8_0 | 5.0 GB |
| Q6_K | 3.9 GB |
| Q5_K_M | 3.3 GB |
| Q4_K_M | 2.8 GB |
Turbo: 8 steps. SFT/Base: 32-50 steps. SftTurbo50: a weight merge that recovers some of the SFT's richness while keeping Turbo's low step count.
### VAE
| File | Size |
|---|---|
| vae-BF16.gguf | 322 MB |
Always BF16 (small, bandwidth-bound, quality-critical).
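A rough parameter count can be read off the file size: BF16 stores 2 bytes per weight, so 322 MB corresponds to about 160 M VAE parameters (ignoring GGUF metadata overhead):

```shell
# BF16 = 2 bytes per weight, so params ≈ file size / 2.
awk 'BEGIN { printf "%.0fM params\n", 322e6 / 2 / 1e6 }'   # 161M params
```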
## Pipeline
Compose (ace-lm):

```
Caption -> Qwen3 LM (0.6B/1.7B/4B) -> metadata + lyrics + audio codes (5 Hz)
```

Synthesize (ace-synth / ace-server):

```
Caption + lyrics -> Text-Enc (Qwen3-Embedding-0.6B) -> CondEncoder
Audio codes (5 Hz) -> FSQ detokenizer (neural net in the DiT GGUF) -> latents (25 Hz)
LoRA (optional) -> DiT (flow matching, Euler steps) -> latents (25 Hz) -> VAE decode -> WAV 48 kHz
```

Cover modes (ace-server):

```
Source audio -> VAE encode -> latents (25 Hz) -> DiT context
Reference audio -> timbre conditioning
```
The LM and DiT were co-trained on the same music data. The LM operates at 5Hz (each token = 200ms of music, vocabulary of 64000 learned codes) and builds the global musical structure autoregressively with creative sampling. The DiT takes over at 25Hz (one frame every 40ms) and uses flow matching to render the high-frequency details: timbre, transients, vocal articulation, stereo imaging.
Both stages support batching for parallel generation.
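The rates above pin down the conversion factors between the two stages. A quick arithmetic check of the stated 5 Hz / 25 Hz / 48 kHz figures:

```shell
# One LM token covers 1/5 s of music; one DiT latent frame covers 1/25 s.
echo "$((1000 / 5)) ms per LM token"        # 200 ms
echo "$((1000 / 25)) ms per DiT frame"      # 40 ms
echo "$((25 / 5)) latent frames per code"   # 5 frames rendered per LM token
echo "$((48000 / 25)) samples per frame"    # 1920 output samples at 48 kHz
```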
## Music guide
For tips on writing effective prompts, understanding inference parameters, and getting the best results:
- A Musician's Guide (non-technical, for music makers)
- Tutorial (design philosophy, architecture, hyperparameters)
## Acknowledgements
Independent C++/GGML implementation based on ACE-Step 1.5 by ACE Studio and StepFun. All original model weights are theirs.
SFT/Turbo merge weights (standard 2B DiT) by Aryanne. SFT/Turbo merge weights (XL 4B DiT) by jeankassio.
## Links
- acestep.cpp - source code
- ACE-Step 1.5 - original Python implementation
- ACE-Step model hub - original weights