# Qwen3-4B-GSM8K-Synth-35K
A Qwen3-4B model fine-tuned on 34,818 synthetic grade-school math problems using QLoRA, designed for step-by-step mathematical reasoning with chain-of-thought.
## What This Model Does

Given a math word problem, the model produces a structured reasoning chain inside `<think>` tags, then outputs the final numerical answer.
### Example

**Input:**

```
If 3x + 7 = 22, what is x?
```

**Output:**

```
<think>
Step 1: Subtract 7 from both sides: 3x = 22 - 7 = 15
Step 2: Divide by 3: x = 15 / 3 = 5
</think>
The answer is 5.0.
```
## Evaluation Results
Evaluated on the full GSM8K test set (1,319 questions) with greedy decoding and 4-bit quantization.
| Model | GSM8K Accuracy | Correct / Total | Time |
|---|---|---|---|
| Base Qwen3-4B | 74.7% | 985 / 1,319 | 104.9m |
| Qwen3-4B-GSM8K-Synth-35K | 85.0% | 1,121 / 1,319 | 41.7m |
Fine-tuning improvement: +10.3 percentage points over the base model.
The fine-tuned model also runs ~2.5x faster at inference due to shorter, more structured outputs (the base model produces verbose markdown formatting while the fine-tuned model outputs concise step-by-step solutions).
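The accuracy numbers above come down to comparing an extracted numerical answer against the reference. A minimal sketch of that scoring step (hypothetical helper names, not the actual evaluation harness):

```python
def is_correct(predicted, gold, tol=1e-4):
    """GSM8K-style scoring: compare extracted numbers with a small tolerance."""
    if predicted is None:  # extraction failed -> counted as wrong
        return False
    return abs(predicted - gold) < tol

def accuracy(predictions, golds):
    """Fraction of problems whose extracted answer matches the reference."""
    correct = sum(is_correct(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

# Two of three correct -> 0.6667
print(round(accuracy([5.0, 72.0, None], [5.0, 72.0, 3.0]), 4))
```

Under this scheme the reported 85.0% is simply 1,121 matches out of 1,319 problems.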
## Cross-Model Comparison
| Model | Params | Training Data | GSM8K Accuracy | vs Base |
|---|---|---|---|---|
| Base Qwen3-4B | 4B | — | 74.7% | — |
| Qwen3-4B-GSM8K-Synth-35K | 4B | 35K synthetic | 85.0% | +10.3 pp |
| Base Qwen3-8B | 8B | — | 79.4% | — |
| Qwen3-8B-GSM8K-Synth-50K | 8B | 50K synthetic | 86.2% | +6.8 pp |
Key takeaways:
- Synthetic-data fine-tuning provides a substantial accuracy boost at both model scales (+10.3 pp for 4B, +6.8 pp for 8B)
- The 4B fine-tuned model nearly matches the 8B fine-tuned model (85.0% vs 86.2%), despite being half the size
- The 4B fine-tuned model even surpasses the base 8B model (85.0% vs 79.4%)
## Training Details

### Base Model & Method
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Method | QLoRA (4-bit NF4 quantization) |
| Framework | Unsloth + HuggingFace TRL |
| Merge | Fully merged to 16-bit (no adapter needed at inference) |
### QLoRA Configuration
| Parameter | Value |
|---|---|
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0 |
| Trainable parameters | 66M / 4.09B (1.62%) |
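In `peft` terms, the table above corresponds roughly to the following `LoraConfig`. This is a sketch for orientation only; the actual run used Unsloth's wrapper, and the exact call may differ:

```python
from peft import LoraConfig

# Mirrors the table: rank 32, alpha 32, no dropout,
# adapters on all attention and MLP projection matrices.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```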
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 1 (per device) |
| Gradient accumulation | 16 (effective batch = 16) |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup steps | 10 |
| Optimizer | AdamW 8-bit |
| Precision | bf16 |
| Max sequence length | 4096 |
| Max grad norm | 1.0 |
| Seed | 42 |
| Total steps | 6,531 |
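The step count is internally consistent with the dataset size and effective batch; a quick sanity check using only numbers from the tables above:

```python
import math

examples = 34_818
effective_batch = 1 * 16  # per-device batch x gradient accumulation
epochs = 3

# Partial final batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2177 6531
```

This matches both the epoch boundary at step 2,177 and the 6,531 total steps in the loss table below.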
### Training Loss Curve

Epoch 1: 0.695 → 0.312 (rapid descent)
Epoch 2: 0.304 → 0.280 (steady refinement)
Epoch 3: 0.248 → 0.237 (final polish)

Mean training loss across all 3 epochs: 0.291 (loss at the final step: 0.239)
| Milestone | Loss | Epoch |
|---|---|---|
| Step 50 | 0.695 | 0.02 |
| Step 500 | 0.348 | 0.23 |
| Step 1000 | 0.327 | 0.46 |
| Step 2000 | 0.318 | 0.92 |
| Step 2177 (Epoch 1→2) | 0.304 | 1.01 |
| Step 3000 | 0.289 | 1.38 |
| Step 4000 | 0.281 | 1.84 |
| Step 4354 (Epoch 2→3) | 0.248 | 2.02 |
| Step 5000 | 0.242 | 2.30 |
| Step 6000 | 0.237 | 2.76 |
| Step 6531 | 0.239 | 3.00 |
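The learning-rate trajectory over these steps follows the cosine schedule with warmup listed in the hyperparameters. A stdlib sketch, assuming linear warmup then cosine decay to zero (the behavior of HF's `get_cosine_schedule_with_warmup`):

```python
import math

PEAK_LR, WARMUP, TOTAL = 2e-4, 10, 6531

def lr_at(step):
    """Linear warmup for 10 steps, then cosine decay from 2e-4 toward 0."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(10))    # peak learning rate: 2e-4
print(lr_at(6531))  # decayed to ~0 at the final step
```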
### Hardware & Time
| Metric | Value |
|---|---|
| GPU | NVIDIA RTX 4070 SUPER (12GB VRAM) |
| Training time | 3h 18m (11,915 seconds) |
| Throughput | 8.77 samples/sec, 0.55 steps/sec |
| Peak VRAM | ~8.1 GB |
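The throughput figures check out against the dataset size and wall-clock time:

```python
examples, epochs = 34_818, 3
seconds = 11_915
steps = 6_531

# 3 epochs over ~34.8K examples in ~3.3 hours
samples_per_sec = examples * epochs / seconds
steps_per_sec = steps / seconds
print(round(samples_per_sec, 2), round(steps_per_sec, 2))  # 8.77 0.55
```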
## Training Data
Trained on 34,818 examples from clarkkitchen22/SynthGSM8K-50K — a synthetic grade-school math dataset generated by Claude Haiku 4.5 via Anthropic's Batch API, then filtered through an 8-stage quality pipeline.
### Data Format

Each training example follows the Qwen3 ChatML format with thinking tags:

```
<|im_start|>user
{math word problem}<|im_end|>
<|im_start|>assistant
<think>
{step-by-step solution}
</think>
The answer is {number}.<|im_end|>
```
GSM8K-style calculation annotations (e.g., `<<24*3=72>>`) are stripped from solutions during preprocessing.
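A minimal sketch of that stripping step. The actual preprocessing lives in the linked pipeline repo; this regex is an assumption about how it could be done:

```python
import re

def strip_calc_annotations(solution: str) -> str:
    """Remove GSM8K-style <<expr=result>> calculator annotations."""
    return re.sub(r"<<[^>]*>>", "", solution)

print(strip_calc_annotations("He earns 24*3=<<24*3=72>>72 dollars."))
# He earns 24*3=72 dollars.
```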
### Dataset Highlights
- 50,418 total problems in the dataset (34,818 used for this training run)
- Generated via few-shot prompting from 200 real GSM8K seed problems
- 8-stage filter pipeline: structure, answer range, solution quality, AI detection, math verification, exact dedup, fuzzy dedup (TF-IDF @ 0.85), seed overlap
- Average 3.0 math operations per solution
- 92.6% integer answers, range 0–225,000
- ~$55 generation cost (Haiku 4.5 Batch API at 50% discount)
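As an illustration of the exact-dedup stage (the fuzzy TF-IDF stage is more involved), hashing a whitespace-normalized, lowercased problem text is one common approach. A stdlib sketch under that assumption, not the pipeline's actual code:

```python
import hashlib

def dedup_exact(problems):
    """Drop exact duplicates after lowercasing and collapsing whitespace."""
    seen, unique = set(), []
    for p in problems:
        key = hashlib.sha256(" ".join(p.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

# First two normalize to the same text, so only two problems survive
docs = ["A  train travels 60 km.", "a train travels 60 km.", "Sarah has 5 apples."]
print(len(dedup_exact(docs)))  # 2
```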
## Usage

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "clarkkitchen22/Qwen3-4B-GSM8K-Synth-35K"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "A store sells apples for $2 each and oranges for $3 each. "
                   "If Sarah buys 5 apples and 4 oranges, how much does she spend?",
    }
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(
    **inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95
)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
### Answer Extraction

```python
import re

def extract_answer(text):
    """Extract the numerical answer from model output."""
    # Prefer an explicit "The answer is N" / "answer: N" statement
    match = re.search(r"answer\s*(?:is|:)\s*([-\d,]+\.?\d*)", text, re.IGNORECASE)
    if match:
        return float(match.group(1).replace(",", ""))
    # Fall back to the last number appearing anywhere in the text
    matches = re.findall(r"([-\d,]+\.?\d+)", text)
    return float(matches[-1].replace(",", "")) if matches else None
```
## Intended Use
- Math tutoring: Step-by-step solutions to grade-school math problems
- Research: Studying the effect of synthetic data scale on math reasoning
- Distillation baseline: Comparing synthetic-data-trained small models against larger models
- Further fine-tuning: Starting point for domain-specific math reasoning tasks
## Limitations
- Trained on synthetic data generated by Haiku 4.5 — bounded by that model's math ability
- Optimized for GSM8K-style word problems (arithmetic, basic algebra) — not calculus, geometry, or advanced math
- All training answers are non-negative; may struggle with problems requiring negative answers
- Solutions use a specific `<think>` tag format — other prompting styles may give worse results
- Evaluated on GSM8K only — performance on other math benchmarks (MATH, MMLU-Math) not yet tested
## How It Was Built

### End-to-End Pipeline

```
200 GSM8K seeds → Claude Haiku 4.5 (Batch API) → 83K raw problems
→ 8-stage filter → 50K clean dataset → QLoRA fine-tune Qwen3-4B
→ Merge to 16-bit → Push to HuggingFace
```
### Pipeline Code

The full data generation pipeline and training code are available at github.com/goldbar123467/SynthDataGSM8K
## Citation

```bibtex
@model{qwen3_4b_gsm8k_synth_35k,
  title={Qwen3-4B-GSM8K-Synth-35K},
  author={clarkkitchen22},
  year={2026},
  base_model={Qwen/Qwen3-4B},
  training_data={clarkkitchen22/SynthGSM8K-50K},
  url={https://huggingface.co/clarkkitchen22/Qwen3-4B-GSM8K-Synth-35K}
}
```
## Acknowledgements
- Base model: Qwen/Qwen3-4B by Alibaba
- Training data: SynthGSM8K-50K — synthetic math problems from Claude Haiku 4.5
- Training framework: Unsloth (2x faster QLoRA)
- Seed data: OpenAI GSM8K