Paper: [GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models](https://arxiv.org/abs/2508.06471) (arXiv:2508.06471)
Trellis-quantized GLM-4.7-Flash — a 30B-A3B MoE model compressed to 3.78 bits per weight using sensitivity-aware mixed-precision quantization.
| Metric | Value |
|---|---|
| Effective bits | 3.78 bpw |
| Compression | 4.2× vs FP16 |
| Model size | ~14 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
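As a quick arithmetic check (not part of the release, just the numbers above), the headline figures are mutually consistent: the effective bit width reproduces both the compression ratio and the on-disk size.

```python
params = 29.3e9   # total parameters (table above)
bpw = 3.78        # effective bits per weight
fp16_bits = 16

compression = fp16_bits / bpw             # ~4.2x vs FP16
quant_gb = params * bpw / 8 / 1e9         # ~13.8 GB of quantized weights
fp16_gb = params * fp16_bits / 8 / 1e9    # ~58.6 GB in FP16

print(f"compression: {compression:.1f}x, quantized: {quant_gb:.1f} GB, fp16: {fp16_gb:.1f} GB")
```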
This is a quantized version of zai-org/GLM-4.7-Flash, the strongest model in the 30B class, balancing performance and efficiency.
GLM-4.7-Flash is a 30B-A3B Mixture-of-Experts model: roughly 30B total parameters with about 3B activated per token.
Quantized using Trellis (EXL3-style) with Metal Marlin acceleration:
| Bit Width | Tensors | Parameters | % of Model |
|---|---|---|---|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
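The effective bit width follows directly from this mix. The weighted average below (a simple check, not part of the quantization pipeline) lands on the advertised 3.78 bpw, before the small additional overhead from the per-group scales and su/sv factors described further down.

```python
# (bits, parameters in billions) taken from the table above
mix = [(6, 9.4), (3, 8.6), (2, 8.6), (4, 2.1), (5, 0.591)]

total_params = sum(p for _, p in mix)                # ~29.3B
avg_bpw = sum(b * p for b, p in mix) / total_params  # ~3.78 bpw
print(f"{total_params:.1f}B params, {avg_bpw:.2f} bits per weight")
```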
```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```
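The shard layout and weight map follow the standard HuggingFace sharded-safetensors convention, so the checkpoint can be inspected without loading the model. The snippet below is a minimal sketch assuming that convention; the contents of `quantization_index.json` are not specified here and are treated as opaque JSON.

```python
import json
from safetensors import safe_open

repo = "GLM-4.7-Flash-Trellis-MM"  # local checkout of the files listed above

# Standard HF index: maps each tensor name to the shard file that contains it.
with open(f"{repo}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]
print(f"{len(weight_map)} tensors across {len(set(weight_map.values()))} shards")

# Quantization metadata shipped alongside the shards.
with open(f"{repo}/quantization_index.json") as f:
    quant_index = json.load(f)
print(f"quantization_index.json: {len(quant_index)} top-level entries")

# Peek at the tensor names stored in the first shard without loading them.
with safe_open(f"{repo}/model-00001-of-00007.safetensors", framework="pt") as shard:
    for name in list(shard.keys())[:8]:
        print(name)
```

For actual inference, the checkpoint is loaded through Metal Marlin's `TrellisForCausalLM`: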
```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

# Load the quantized checkpoint on Apple Silicon (Metal / MPS backend).
model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)

# The tokenizer is unchanged by quantization, so it comes from the base repo.
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
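The raw `<|user|>` / `<|assistant|>` prompt string above can also be produced with the tokenizer's chat template, assuming the upstream zai-org/GLM-4.7-Flash tokenizer ships one (fall back to the manual string otherwise):

```python
# Equivalent prompt construction via the chat template (assumption: the base
# tokenizer defines one in tokenizer_config.json).
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```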
Each quantized tensor is stored as four components (inspected in the sketch after the hardware table below):

- `{name}__indices`: packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors

| Device | VRAM | Notes |
|---|---|---|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
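The four components of any quantized tensor can be read directly from the shards using the naming scheme above. This is a minimal sketch: the tensor name below is hypothetical, and each component is located through the weight map rather than assumed to share a shard.

```python
import json
from safetensors import safe_open

repo = "GLM-4.7-Flash-Trellis-MM"
name = "model.layers.0.mlp.experts.0.down_proj.weight"  # hypothetical tensor name

with open(f"{repo}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Each quantized tensor contributes four entries: __indices, __scales, __su, __sv.
for suffix in ("indices", "scales", "su", "sv"):
    key = f"{name}__{suffix}"
    shard = weight_map[key]
    with safe_open(f"{repo}/{shard}", framework="pt") as f:
        t = f.get_tensor(key)
        print(f"{key}: shard={shard}, shape={tuple(t.shape)}, dtype={t.dtype}")
```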
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
| Metric | Value |
|---|---|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |
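For a rough feel of what these throughput numbers mean end to end, the estimate below applies them to a 2K-token prompt with 256 generated tokens, as in the usage example above. It is back-of-the-envelope arithmetic, not a measured benchmark.

```python
prefill_tok_s = 42   # prefill throughput at 2K context (table above)
decode_tok_s = 5.4   # decode throughput (table above)

prompt_tokens, new_tokens = 2000, 256
total_s = prompt_tokens / prefill_tok_s + new_tokens / decode_tok_s
print(f"~{total_s:.0f} s for a {prompt_tokens}-token prompt + {new_tokens} new tokens")
```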
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
This quantized model inherits the MIT License from the original GLM-4.7-Flash model.
Base model: zai-org/GLM-4.7-Flash