Performance report for UD-Q4_K_XL with 72 GB VRAM: 65 t/s
System:
- Nvidia RTX 4090D 48GB
- Nvidia RTX 3090 24GB
- Intel Xeon W5-3425 (12 cores)
- DDR5-4800 256GB (8 channels)
- Ubuntu 24
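For context on the hardware side, here is a back-of-envelope sketch of the system's theoretical RAM bandwidth and the per-token weight traffic of a ~10B-active MoE (the "A10B" in the model name) at the reported 4.48 BPW. The transfer width and active-parameter count are assumptions, not measurements:

```python
# Back-of-envelope only; the 64-bit channel width and the ~10B active
# parameter count (inferred from the "A10B" model name) are assumptions.
mt_per_s = 4.8e9           # DDR5-4800: 4.8 GT/s per channel
channels = 8
bytes_per_transfer = 8     # 64-bit DDR5 channel
bw_gbs = mt_per_s * channels * bytes_per_transfer / 1e9
print(f"theoretical RAM bandwidth: {bw_gbs:.1f} GB/s")        # 307.2 GB/s

active_params = 10e9       # ~10B active params per token (A10B)
gb_per_token = active_params * 4.48 / 8 / 1e9                 # 4.48 BPW
print(f"active weight traffic: {gb_per_token:.2f} GB/token")  # 5.60 GB
# 65 t/s would need ~364 GB/s of weight reads, above the RAM ceiling,
# so the reported speed implies the active weights are served from VRAM.
```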
logs:
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 86 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type q5_K: 36 tensors
llama_model_loader: - type q6_K: 59 tensors
llama_model_loader: - type mxfp4: 336 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 63.65 GiB (4.48 BPW)
llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
srv load_model: loaded multimodal model, '/root/.cache/llama.cpp/router/local-vl-qwen35-122b/unsloth_Qwen3.5-122B-A10B-GGUF_mmproj-BF16.gguf'
srv load_model: initializing slots, n_slots = 2
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
slot load_model: id 1 | task -1 | new slot, n_ctx = 131072
prompt eval time = 1322.38 ms / 1734 tokens ( 0.76 ms per token, 1311.27 tokens per second)
eval time = 36753.18 ms / 2433 tokens ( 15.11 ms per token, 66.20 tokens per second)
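The reported rates can be cross-checked directly against the raw timings in the log above:

```python
# Recompute throughput from the llama-server timing log lines.
prompt_ms, prompt_tokens = 1322.38, 1734
eval_ms, eval_tokens = 36753.18, 2433

prefill_tps = prompt_tokens / (prompt_ms / 1000)
decode_tps = eval_tokens / (eval_ms / 1000)
print(f"prefill: {prefill_tps:.2f} tok/s")  # ~1311.27, matches the log
print(f"decode:  {decode_tps:.2f} tok/s")   # ~66.20, matches the log
```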
models.ini:
version = 1
[local-vl-qwen35-122b]
ctx-size=131072
kv-unified=1
parallel=2
min-p=0.00
top-p=0.95
top-k=20
temp=0.6
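For reference, a client request mirroring the preset's sampling settings might look like the sketch below. The endpoint shape follows llama-server's OpenAI-compatible API (which also accepts `top_k` and `min_p` as extensions); this is a hypothetical illustration that only builds the JSON payload, nothing is sent:

```python
import json

# Payload matching the sampling settings in models.ini above.
# Client-side sketch only; no request is actually made here.
payload = {
    "model": "local-vl-qwen35-122b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,   # llama-server extension to the OpenAI schema
    "min_p": 0.0,  # likewise a llama-server extension
}
print(json.dumps(payload, indent=2))
```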
compose file:
services:
  llama-router:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8124
    container_name: router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router:/root/.cache/llama.cpp/router
      - ./models.ini:/app/models.ini
    entrypoint: ["./llama-server"]
    command: >
      --models-dir /root/.cache/llama.cpp/router
      --models-max 1
      --models-preset ./models.ini
      --host 0.0.0.0 --port 8080
Looks like the UD-Q4_K_XL quant will have to be re-uploaded:
@danielhanchen wrote here:
https://www.reddit.com/r/LocalLLaMA/comments/1resggh/comment/o7g2r2a/
investigating currently on UD-Q4_K_XL - I recently switched to using MXFP4, but as you noted my script most likely had some issues somewhere - I will update the community asap.
For now, using our unsloth_Qwen3.5-35B-A3B-MXFP4_MOE, which is also partially dynamic, is the correct option, or using Q4_K_M, which also uses our imatrix calibration dataset.
Great stuff, thanks for sharing! I get about 50 tok/s generation and 800 tok/s prefill on a 256k-token prompt on llama.cpp b8157. This is with 72 GB VRAM (1x 4090 + 2x 3090). I'm curious what prefill speed others get with these Qwen models. 800 tok/s is the worst case on a single huge prompt; on a 4k-token prompt I get about 1800 tok/s prefill. I'm using UD-Q3_K_XL.
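For a sense of scale, the quoted prefill rates translate into wall-clock time before the first generated token like this (plain arithmetic on the numbers above):

```python
# Time-to-first-token implied by the quoted prefill rates.
for tokens, tps in [(256_000, 800), (4_000, 1_800)]:
    print(f"{tokens:>7} tok prompt @ {tps} tok/s -> "
          f"{tokens / tps:.1f} s to first token")
# 256k tokens at 800 tok/s is over 5 minutes of prefill;
# 4k tokens at 1800 tok/s is near-instant by comparison.
```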
Looks like productive speeds. Does it significantly outperform 27B in your use case?
Would be very interested to know whether we're in the era that justifies going from 2x 24 GB to 3x 24 GB, which is a big step up in hardware complexity.
It does outperform the 27B. I haven't used it in a while, but I remember the 27B was about 30-50% slower. With the 122B model I can fit IQ4_XS with about 240k tokens of context, and the performance is great with Roo Code; when you add some MCPs to boost the model's knowledge, it's very capable. Unfortunately I can't really recommend 3 GPUs unless you're aiming for 4 GPUs long term, or you're happy with llama.cpp inference only. vLLM requires 1 or an even number of GPUs, and I imagine speeds are much better with it. I'm also limited by PCIe lanes on my setup, so my speeds should be taken with a grain of salt. I'm hoping to upgrade to a Threadripper build later this year, which would hopefully increase speeds, especially with MoE and partial offload.
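On the GPU-count point: as I understand it, vLLM's tensor parallelism requires the model's attention-head count to be divisible by `tensor_parallel_size`, which is why odd counts like 3 rarely work. A sketch with an illustrative head count (not the real model's):

```python
# Why "1 or an even number of GPUs": vLLM tensor parallelism splits
# attention heads across GPUs, so the head count must divide evenly.
heads = 64  # illustrative head count, NOT this model's actual config
for tp in (1, 2, 3, 4):
    ok = heads % tp == 0
    print(f"tensor_parallel_size={tp}: {'ok' if ok else 'not divisible'}")
# 3 GPUs fails for 64 heads; 1, 2, and 4 divide evenly.
```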
The number of PCIe lanes has very little effect on inference speed, at least with llama.cpp.
vLLM has a weird way of offloading memory, which depends on PCIe speed / number of lanes, but it mostly doesn't work anyway. With vLLM you usually need to fit fully in VRAM.
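To illustrate why fully fitting in VRAM gets tight at long context, here is a rough KV-cache sizing sketch. Every hyperparameter below is an illustrative assumption, not the real Qwen3.5 config:

```python
# Rough KV-cache sizing; all hyperparameters are ILLUSTRATIVE, not the
# real model's. Shows how long context eats VRAM on top of the weights.
n_layers = 48
n_kv_heads = 8
head_dim = 128
bytes_per_elem = 2   # f16 K and V entries
ctx = 131_072        # the context size used in this post

# 2x for K and V, per layer, per KV head, per head dim, per token.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx
print(f"KV cache @ {ctx} ctx: {kv_bytes / 2**30:.1f} GiB")
# With these assumed dims the cache alone is 24 GiB, on top of the
# model weights, which is why full-VRAM engines fill up fast.
```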