Performance report for UD-Q4_K_XL with 72 GB VRAM: 65 t/s
System:
- Nvidia RTX 4090D 48GB
- Nvidia RTX 3090 24GB
- Intel Xeon W5-3425 (12 cores)
- DDR5-4800 256GB (8 channels)
- Ubuntu 24
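For context on the hardware side, here is a back-of-envelope sketch of the system's theoretical RAM bandwidth and the per-token weight traffic of a ~10B-active MoE (the "A10B" in the model name) at the reported 4.48 BPW. The transfer width and active-parameter count are assumptions, not measurements:

```python
# Back-of-envelope only; the 64-bit channel width and the ~10B active
# parameter count (inferred from the "A10B" model name) are assumptions.
mt_per_s = 4.8e9           # DDR5-4800: 4.8 GT/s per channel
channels = 8
bytes_per_transfer = 8     # 64-bit DDR5 channel
bw_gbs = mt_per_s * channels * bytes_per_transfer / 1e9
print(f"theoretical RAM bandwidth: {bw_gbs:.1f} GB/s")        # 307.2 GB/s

active_params = 10e9       # ~10B active params per token (A10B)
gb_per_token = active_params * 4.48 / 8 / 1e9                 # 4.48 BPW
print(f"active weight traffic: {gb_per_token:.2f} GB/token")  # 5.60 GB
# 65 t/s would need ~364 GB/s of weight reads, above the RAM ceiling,
# so the reported speed implies the active weights are served from VRAM.
```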
logs:
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 86 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type q5_K: 36 tensors
llama_model_loader: - type q6_K: 59 tensors
llama_model_loader: - type mxfp4: 336 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 63.65 GiB (4.48 BPW)
llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
srv load_model: loaded multimodal model, '/root/.cache/llama.cpp/router/local-vl-qwen35-122b/unsloth_Qwen3.5-122B-A10B-GGUF_mmproj-BF16.gguf'
srv load_model: initializing slots, n_slots = 2
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
slot load_model: id 1 | task -1 | new slot, n_ctx = 131072
prompt eval time = 1322.38 ms / 1734 tokens ( 0.76 ms per token, 1311.27 tokens per second)
eval time = 36753.18 ms / 2433 tokens ( 15.11 ms per token, 66.20 tokens per second)
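The reported rates can be cross-checked directly against the raw timings in the log above:

```python
# Recompute throughput from the llama-server timing log lines.
prompt_ms, prompt_tokens = 1322.38, 1734
eval_ms, eval_tokens = 36753.18, 2433

prefill_tps = prompt_tokens / (prompt_ms / 1000)
decode_tps = eval_tokens / (eval_ms / 1000)
print(f"prefill: {prefill_tps:.2f} tok/s")  # ~1311.27, matches the log
print(f"decode:  {decode_tps:.2f} tok/s")   # ~66.20, matches the log
```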
models.ini:
version = 1
[local-vl-qwen35-122b]
ctx-size=131072
kv-unified=1
parallel=2
min-p=0.00
top-p=0.95
top-k=20
temp=0.6
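For reference, a client request mirroring the preset's sampling settings might look like the sketch below. The endpoint shape follows llama-server's OpenAI-compatible API (which also accepts `top_k` and `min_p` as extensions); this is a hypothetical illustration that only builds the JSON payload, nothing is sent:

```python
import json

# Payload matching the sampling settings in models.ini above.
# Client-side sketch only; no request is actually made here.
payload = {
    "model": "local-vl-qwen35-122b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,   # llama-server extension to the OpenAI schema
    "min_p": 0.0,  # likewise a llama-server extension
}
print(json.dumps(payload, indent=2))
```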
compose file:
services:
  llama-router:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8124
    container_name: router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router:/root/.cache/llama.cpp/router
      - ./models.ini:/app/models.ini
    entrypoint: ["./llama-server"]
    command: >
      --models-dir /root/.cache/llama.cpp/router
      --models-max 1
      --models-preset ./models.ini
      --host 0.0.0.0 --port 8080
Looks like the UD-Q4_K_XL quant will have to be re-uploaded:
@danielhanchen wrote here:
https://www.reddit.com/r/LocalLLaMA/comments/1resggh/comment/o7g2r2a/
investigating currently on UD-Q4_K_XL - I recently switched to using MXFP4, but as you noted my script most likely had some issues somewhere - I will update the community asap.
For now, using our unsloth_Qwen3.5-35B-A3B-MXFP4_MOE, which is also partially dynamic, is the correct option, or using Q4_K_M, which also uses our imatrix calibration dataset.
Great stuff, thanks for sharing! I get about 50 tok/s generation and 800 tok/s prefill on a 256k-token prompt on llama.cpp b8157. This is with 72 GB VRAM (1x 4090 + 2x 3090). I'm curious what prefill speed others get with these Qwen models. 800 tok/s is the worst case on a single huge prompt; on a 4k-token prompt I get about 1800 tok/s prefill. I'm using UD-Q3_K_XL.
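For a sense of scale, the quoted prefill rates translate into wall-clock time before the first generated token like this (plain arithmetic on the numbers above):

```python
# Time-to-first-token implied by the quoted prefill rates.
for tokens, tps in [(256_000, 800), (4_000, 1_800)]:
    print(f"{tokens:>7} tok prompt @ {tps} tok/s -> "
          f"{tokens / tps:.1f} s to first token")
# 256k tokens at 800 tok/s is over 5 minutes of prefill;
# 4k tokens at 1800 tok/s is near-instant by comparison.
```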
Looks like productive speeds. Does it significantly outperform 27B in your use case?
Would be very interested to know whether we're in the era that justifies going from 2x 24 GB to 3x 24 GB, which is a big step up in hardware complexity.
It does outperform the 27B. I haven't used it in a while, but I remember the 27B was about 30-50% slower. With the 122B model I can fit IQ4_XS with about 240k tokens of context, and the performance is great with Roo Code; when you add some MCPs to boost the model's knowledge, it's very capable. Unfortunately I can't really recommend 3 GPUs unless you're aiming for 4 GPUs long term, or you're happy with llama.cpp inference only. vLLM requires 1 or an even number of GPUs, and I imagine speeds are much better with it. I'm also limited by PCIe lanes on my setup, so my speeds should be taken with a grain of salt. I'm hoping to upgrade to a Threadripper build later this year, which would hopefully increase speeds, especially with MoE and partial offload.
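On the GPU-count point: as I understand it, vLLM's tensor parallelism requires the model's attention-head count to be divisible by `tensor_parallel_size`, which is why odd counts like 3 rarely work. A sketch with an illustrative head count (not the real model's):

```python
# Why "1 or an even number of GPUs": vLLM tensor parallelism splits
# attention heads across GPUs, so the head count must divide evenly.
heads = 64  # illustrative head count, NOT this model's actual config
for tp in (1, 2, 3, 4):
    ok = heads % tp == 0
    print(f"tensor_parallel_size={tp}: {'ok' if ok else 'not divisible'}")
# 3 GPUs fails for 64 heads; 1, 2, and 4 divide evenly.
```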
The number of PCIe lanes has very little effect on inference speed, at least with llama.cpp.
vLLM has a weird way of offloading memory, which depends on PCIe speed / number of lanes, but it mostly doesn't work anyway. With vLLM you usually need to fit fully in VRAM.
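To illustrate why fully fitting in VRAM gets tight at long context, here is a rough KV-cache sizing sketch. Every hyperparameter below is an illustrative assumption, not the real Qwen3.5 config:

```python
# Rough KV-cache sizing; all hyperparameters are ILLUSTRATIVE, not the
# real model's. Shows how long context eats VRAM on top of the weights.
n_layers = 48
n_kv_heads = 8
head_dim = 128
bytes_per_elem = 2   # f16 K and V entries
ctx = 131_072        # the context size used in this post

# 2x for K and V, per layer, per KV head, per head dim, per token.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx
print(f"KV cache @ {ctx} ctx: {kv_bytes / 2**30:.1f} GiB")
# With these assumed dims the cache alone is 24 GiB, on top of the
# model weights, which is why full-VRAM engines fill up fast.
```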