# ACE-Step 1.5 GGUF
Pre-quantized GGUF models for acestep.cpp, a portable C++17 implementation of ACE-Step 1.5 AI music generation built on GGML.
Text + lyrics in, stereo 48 kHz audio out. Runs on CPU, CUDA, Metal, and Vulkan.
## Quick start

```sh
git clone --recurse-submodules https://github.com/ServeurpersoCom/acestep.cpp
cd acestep.cpp
pip install huggingface_hub
./models.sh # downloads Q8_0 turbo essentials (~7.7 GB)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
cd ..
./build/ace-server --models ./models --host 0.0.0.0 --port 8085
```
Open http://localhost:8085 in your browser. The embedded WebUI handles everything: write a caption, set lyrics, generate, play, and download tracks.
Models are loaded on demand and swapped automatically from the UI.
## CLI tools (without the server)

```sh
# LM: generate lyrics + audio codes
./build/ace-lm \
    --request /tmp/request.json \
    --lm models/acestep-5Hz-lm-4B-Q8_0.gguf

# DiT + VAE: synthesize audio
./build/ace-synth \
    --request /tmp/request0.json \
    --embedding models/Qwen3-Embedding-0.6B-Q8_0.gguf \
    --dit models/acestep-v15-turbo-Q8_0.gguf \
    --vae models/vae-BF16.gguf
```
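The `--request` argument points at a plain JSON file. Its exact schema is not documented here, so the field names below (`caption`, `lyrics`, `duration`) are illustrative assumptions, not the tools' real schema; a minimal sketch for creating and sanity-checking such a file before handing it to `ace-lm`:

```shell
# Write a hypothetical request file; the field names are assumptions, not the real schema.
cat > /tmp/request.json <<'EOF'
{
  "caption": "dreamy synthwave with a driving bassline",
  "lyrics": "[verse]\nNeon rivers under midnight skies",
  "duration": 60
}
EOF

# Validate that the file is well-formed JSON before passing it to the CLI tools.
python3 -m json.tool /tmp/request.json > /dev/null && echo "request.json OK"
```

Validating up front is cheap insurance: a malformed request fails fast here instead of partway into model loading.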
## Available models
### Text encoder
| File | Quant | Size |
|---|---|---|
| Qwen3-Embedding-0.6B-BF16.gguf | BF16 | 1.2 GB |
| Qwen3-Embedding-0.6B-Q8_0.gguf | Q8_0 | 748 MB |
Frozen Qwen3 encoder (28 layers, 1024-dim). The DiT was trained end-to-end with this exact model. Its CondEncoder projection weights (1024 to 2048) are baked into every DiT checkpoint, so the Text-Enc is architecturally locked to 0.6B.
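That lock is easy to see in the numbers: the 1024 -> 2048 CondEncoder projection alone carries about 2.1 M weights trained against the 0.6B encoder's exact 1024-dim output space. A quick check of that count (weights only, ignoring any bias term):

```shell
# Parameters in a 1024 -> 2048 linear projection (weight matrix only, no bias).
echo $((1024 * 2048))   # 2097152, ~2.1M weights baked into every DiT checkpoint
```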
### LM (Qwen3 causal, audio code generation)
| File | Params | Quant | Size |
|---|---|---|---|
| acestep-5Hz-lm-4B-BF16.gguf | 4B | BF16 | 7.9 GB |
| acestep-5Hz-lm-4B-Q8_0.gguf | 4B | Q8_0 | 4.2 GB |
| acestep-5Hz-lm-4B-Q6_K.gguf | 4B | Q6_K | 3.3 GB |
| acestep-5Hz-lm-4B-Q5_K_M.gguf | 4B | Q5_K_M | 2.9 GB |
| acestep-5Hz-lm-1.7B-BF16.gguf | 1.7B | BF16 | 3.5 GB |
| acestep-5Hz-lm-1.7B-Q8_0.gguf | 1.7B | Q8_0 | 1.9 GB |
| acestep-5Hz-lm-0.6B-BF16.gguf | 0.6B | BF16 | 1.3 GB |
| acestep-5Hz-lm-0.6B-Q8_0.gguf | 0.6B | Q8_0 | 677 MB |
The small LMs (0.6B/1.7B) are published only in BF16 and Q8_0: they are too small to survive aggressive quantization. The 4B LM has no Q4_K_M because that quant breaks audio code generation.
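As a rough sanity check on the table (sizes and parameter counts are rounded), file size divided by parameter count gives the effective bits per weight. For the 4B Q8_0 file at 4.2 GB this lands near Q8_0's nominal ~8.5 bits/weight (8-bit values plus per-block scales):

```shell
# Effective bits per weight for acestep-5Hz-lm-4B-Q8_0.gguf (~4.2 GB, ~4B params).
awk 'BEGIN { printf "%.1f bits/weight\n", 4.2e9 * 8 / 4e9 }'   # 8.4 bits/weight
```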
### DiT (flow matching diffusion transformer)
#### Standard (2B)
Available for 7 variants: turbo, sft, sftturbo50, base, turbo-shift1, turbo-shift3, turbo-continuous.
| Quant | Size per variant |
|---|---|
| BF16 | 4.5 GB |
| Q8_0 | 2.4 GB |
| Q6_K | 1.9 GB |
| Q5_K_M | 1.6 GB |
| Q4_K_M | 1.4 GB |
#### XL (4B)
Available for 4 variants: xl-turbo, xl-sft, xl-sftturbo50, xl-base.
| Quant | Size per variant |
|---|---|
| BF16 | 9.3 GB |
| Q8_0 | 5.0 GB |
| Q6_K | 3.9 GB |
| Q5_K_M | 3.3 GB |
| Q4_K_M | 2.8 GB |
Turbo: 8 steps. SFT/Base: 32-50 steps. SftTurbo50: a weight merge that recovers some of the SFT's richness while keeping Turbo's low step count.
### VAE
| File | Size |
|---|---|
| vae-BF16.gguf | 322 MB |
Always BF16 (small, bandwidth-bound, quality-critical).
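A rough parameter count can be read off the file size: BF16 stores 2 bytes per weight, so 322 MB corresponds to about 160 M VAE parameters (ignoring GGUF metadata overhead):

```shell
# BF16 = 2 bytes per weight, so params ≈ file size / 2.
awk 'BEGIN { printf "%.0fM params\n", 322e6 / 2 / 1e6 }'   # 161M params
```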
## Pipeline
Compose (ace-lm):

```
Caption -> Qwen3 LM (0.6B/1.7B/4B) -> metadata + lyrics + audio codes (5 Hz)
```

Synthesize (ace-synth / ace-server):

```
Caption + lyrics -> Text-Enc (Qwen3-Embedding-0.6B) -> CondEncoder
Audio codes (5 Hz) -> FSQ detokenizer (neural net in the DiT GGUF) -> latents (25 Hz)
LoRA (optional) -> DiT (flow matching, Euler steps) -> latents (25 Hz) -> VAE decode -> WAV 48 kHz
```

Cover modes (ace-server):

```
Source audio -> VAE encode -> latents (25 Hz) -> DiT context
Reference audio -> timbre conditioning
```
The LM and DiT were co-trained on the same music data. The LM operates at 5Hz (each token = 200ms of music, vocabulary of 64000 learned codes) and builds the global musical structure autoregressively with creative sampling. The DiT takes over at 25Hz (one frame every 40ms) and uses flow matching to render the high-frequency details: timbre, transients, vocal articulation, stereo imaging.
Both stages support batching for parallel generation.
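The rates above pin down the conversion factors between the two stages. A quick arithmetic check of the stated 5 Hz / 25 Hz / 48 kHz figures:

```shell
# One LM token covers 1/5 s of music; one DiT latent frame covers 1/25 s.
echo "$((1000 / 5)) ms per LM token"        # 200 ms
echo "$((1000 / 25)) ms per DiT frame"      # 40 ms
echo "$((25 / 5)) latent frames per code"   # 5 frames rendered per LM token
echo "$((48000 / 25)) samples per frame"    # 1920 output samples at 48 kHz
```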
## Music guide
For tips on writing effective prompts, understanding inference parameters, and getting the best results:
- A Musician's Guide (non-technical, for music makers)
- Tutorial (design philosophy, architecture, hyperparameters)
## Acknowledgements
Independent C++/GGML implementation based on ACE-Step 1.5 by ACE Studio and StepFun. All original model weights are theirs.
SFT/Turbo merge weights (standard 2B DiT) by Aryanne. SFT/Turbo merge weights (XL 4B DiT) by jeankassio.
## Links
- acestep.cpp - source code
- ACE-Step 1.5 - original Python implementation
- ACE-Step model hub - original weights