# Qwen3-4B-GSM8K-Synth-35K
A Qwen3-4B model fine-tuned on 34,818 synthetic grade-school math problems using QLoRA, designed for step-by-step mathematical reasoning with chain-of-thought.
## What This Model Does

Given a math word problem, the model produces a structured reasoning chain inside `<think>` tags, then outputs the final numerical answer.
### Example

**Input:**

```
If 3x + 7 = 22, what is x?
```

**Output:**

```
<think>
Step 1: Subtract 7 from both sides: 3x = 22 - 7 = 15
Step 2: Divide by 3: x = 15 / 3 = 5
</think>
The answer is 5.0.
```
## Evaluation Results
Evaluated on the full GSM8K test set (1,319 questions) with greedy decoding and 4-bit quantization.
| Model | GSM8K Accuracy | Correct / Total | Time |
|---|---|---|---|
| Base Qwen3-4B | 74.7% | 985 / 1,319 | 104.9m |
| Qwen3-4B-GSM8K-Synth-35K | 85.0% | 1,121 / 1,319 | 41.7m |
Fine-tuning improvement: +10.3 percentage points over the base model.
The fine-tuned model also runs ~2.5x faster at inference due to shorter, more structured outputs (the base model produces verbose markdown formatting while the fine-tuned model outputs concise step-by-step solutions).
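The accuracy numbers above come down to comparing an extracted numerical answer against the reference. A minimal sketch of that scoring step (hypothetical helper names, not the actual evaluation harness):

```python
def is_correct(predicted, gold, tol=1e-4):
    """GSM8K-style scoring: compare extracted numbers with a small tolerance."""
    if predicted is None:  # extraction failed -> counted as wrong
        return False
    return abs(predicted - gold) < tol

def accuracy(predictions, golds):
    """Fraction of problems whose extracted answer matches the reference."""
    correct = sum(is_correct(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

# Two of three correct -> 0.6667
print(round(accuracy([5.0, 72.0, None], [5.0, 72.0, 3.0]), 4))
```

Under this scheme the reported 85.0% is simply 1,121 matches out of 1,319 problems.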
## Cross-Model Comparison
| Model | Params | Training Data | GSM8K Accuracy | vs Base |
|---|---|---|---|---|
| Base Qwen3-4B | 4B | — | 74.7% | — |
| Qwen3-4B-GSM8K-Synth-35K | 4B | 35K synthetic | 85.0% | +10.3 pp |
| Base Qwen3-8B | 8B | — | 79.4% | — |
| Qwen3-8B-GSM8K-Synth-50K | 8B | 50K synthetic | 86.2% | +6.8 pp |
Key takeaways:
- Synthetic-data fine-tuning provides a substantial accuracy boost at both model scales (+10.3 pp for 4B, +6.8 pp for 8B)
- The 4B fine-tuned model nearly matches the 8B fine-tuned model (85.0% vs 86.2%), despite being half the size
- The 4B fine-tuned model even surpasses the base 8B model (85.0% vs 79.4%)
## Training Details

### Base Model & Method
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Method | QLoRA (4-bit NF4 quantization) |
| Framework | Unsloth + HuggingFace TRL |
| Merge | Fully merged to 16-bit (no adapter needed at inference) |
### QLoRA Configuration
| Parameter | Value |
|---|---|
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0 |
| Trainable parameters | 66M / 4.09B (1.62%) |
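In `peft` terms, the table above corresponds roughly to the following `LoraConfig`. This is a sketch for orientation only; the actual run used Unsloth's wrapper, and the exact call may differ:

```python
from peft import LoraConfig

# Mirrors the table: rank 32, alpha 32, no dropout,
# adapters on all attention and MLP projection matrices.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```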
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 1 (per device) |
| Gradient accumulation | 16 (effective batch = 16) |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup steps | 10 |
| Optimizer | AdamW 8-bit |
| Precision | bf16 |
| Max sequence length | 4096 |
| Max grad norm | 1.0 |
| Seed | 42 |
| Total steps | 6,531 |
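The step count is internally consistent with the dataset size and effective batch; a quick sanity check using only numbers from the tables above:

```python
import math

examples = 34_818
effective_batch = 1 * 16  # per-device batch x gradient accumulation
epochs = 3

# Partial final batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2177 6531
```

This matches both the epoch boundary at step 2,177 and the 6,531 total steps in the loss table below.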
### Training Loss Curve

Epoch 1: 0.695 → 0.312 (rapid descent)
Epoch 2: 0.304 → 0.280 (steady refinement)
Epoch 3: 0.248 → 0.237 (final polish)

Mean training loss across all 3 epochs: 0.291 (loss at the final step: 0.239)
| Milestone | Loss | Epoch |
|---|---|---|
| Step 50 | 0.695 | 0.02 |
| Step 500 | 0.348 | 0.23 |
| Step 1000 | 0.327 | 0.46 |
| Step 2000 | 0.318 | 0.92 |
| Step 2177 (Epoch 1→2) | 0.304 | 1.01 |
| Step 3000 | 0.289 | 1.38 |
| Step 4000 | 0.281 | 1.84 |
| Step 4354 (Epoch 2→3) | 0.248 | 2.02 |
| Step 5000 | 0.242 | 2.30 |
| Step 6000 | 0.237 | 2.76 |
| Step 6531 | 0.239 | 3.00 |
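The learning-rate trajectory over these steps follows the cosine schedule with warmup listed in the hyperparameters. A stdlib sketch, assuming linear warmup then cosine decay to zero (the behavior of HF's `get_cosine_schedule_with_warmup`):

```python
import math

PEAK_LR, WARMUP, TOTAL = 2e-4, 10, 6531

def lr_at(step):
    """Linear warmup for 10 steps, then cosine decay from 2e-4 toward 0."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(10))    # peak learning rate: 2e-4
print(lr_at(6531))  # decayed to ~0 at the final step
```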
### Hardware & Time
| Metric | Value |
|---|---|
| GPU | NVIDIA RTX 4070 SUPER (12GB VRAM) |
| Training time | 3h 18m (11,915 seconds) |
| Throughput | 8.77 samples/sec, 0.55 steps/sec |
| Peak VRAM | ~8.1 GB |
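The throughput figures check out against the dataset size and wall-clock time:

```python
examples, epochs = 34_818, 3
seconds = 11_915
steps = 6_531

# 3 epochs over ~34.8K examples in ~3.3 hours
samples_per_sec = examples * epochs / seconds
steps_per_sec = steps / seconds
print(round(samples_per_sec, 2), round(steps_per_sec, 2))  # 8.77 0.55
```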
## Training Data
Trained on 34,818 examples from clarkkitchen22/SynthGSM8K-50K — a synthetic grade-school math dataset generated by Claude Haiku 4.5 via Anthropic's Batch API, then filtered through an 8-stage quality pipeline.
### Data Format

Each training example follows the Qwen3 ChatML format with thinking tags:

```
<|im_start|>user
{math word problem}<|im_end|>
<|im_start|>assistant
<think>
{step-by-step solution}
</think>
The answer is {number}.<|im_end|>
```
GSM8K-style calculation annotations (e.g., `<<24*3=72>>`) are stripped from solutions during preprocessing.
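A minimal sketch of that stripping step. The actual preprocessing lives in the linked pipeline repo; this regex is an assumption about how it could be done:

```python
import re

def strip_calc_annotations(solution: str) -> str:
    """Remove GSM8K-style <<expr=result>> calculator annotations."""
    return re.sub(r"<<[^>]*>>", "", solution)

print(strip_calc_annotations("He earns 24*3=<<24*3=72>>72 dollars."))
# He earns 24*3=72 dollars.
```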
### Dataset Highlights
- 50,418 total problems in the dataset (34,818 used for this training run)
- Generated via few-shot prompting from 200 real GSM8K seed problems
- 8-stage filter pipeline: structure, answer range, solution quality, AI detection, math verification, exact dedup, fuzzy dedup (TF-IDF @ 0.85), seed overlap
- Average 3.0 math operations per solution
- 92.6% integer answers, range 0–225,000
- ~$55 generation cost (Haiku 4.5 Batch API at 50% discount)
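As an illustration of the exact-dedup stage (the fuzzy TF-IDF stage is more involved), hashing a whitespace-normalized, lowercased problem text is one common approach. A stdlib sketch under that assumption, not the pipeline's actual code:

```python
import hashlib

def dedup_exact(problems):
    """Drop exact duplicates after lowercasing and collapsing whitespace."""
    seen, unique = set(), []
    for p in problems:
        key = hashlib.sha256(" ".join(p.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

# First two normalize to the same text, so only two problems survive
docs = ["A  train travels 60 km.", "a train travels 60 km.", "Sarah has 5 apples."]
print(len(dedup_exact(docs)))  # 2
```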
## Usage

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "clarkkitchen22/Qwen3-4B-GSM8K-Synth-35K"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "A store sells apples for $2 each and oranges for $3 each. "
                   "If Sarah buys 5 apples and 4 oranges, how much does she spend?",
    }
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(
    **inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95
)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
### Answer Extraction

```python
import re

def extract_answer(text):
    """Extract the numerical answer from model output."""
    # Prefer an explicit "The answer is N" / "answer: N" statement
    match = re.search(r"answer\s*(?:is|:)\s*([-\d,]+\.?\d*)", text, re.IGNORECASE)
    if match:
        return float(match.group(1).replace(",", ""))
    # Fall back to the last number appearing anywhere in the text
    matches = re.findall(r"([-\d,]+\.?\d+)", text)
    return float(matches[-1].replace(",", "")) if matches else None
```
## Intended Use
- Math tutoring: Step-by-step solutions to grade-school math problems
- Research: Studying the effect of synthetic data scale on math reasoning
- Distillation baseline: Comparing synthetic-data-trained small models against larger models
- Further fine-tuning: Starting point for domain-specific math reasoning tasks
## Limitations
- Trained on synthetic data generated by Haiku 4.5 — bounded by that model's math ability
- Optimized for GSM8K-style word problems (arithmetic, basic algebra) — not calculus, geometry, or advanced math
- All training answers are non-negative; may struggle with problems requiring negative answers
- Solutions use a specific `<think>` tag format — other prompting styles may give worse results
- Evaluated on GSM8K only — performance on other math benchmarks (MATH, MMLU-Math) not yet tested
## How It Was Built

### End-to-End Pipeline

```
200 GSM8K seeds → Claude Haiku 4.5 (Batch API) → 83K raw problems
→ 8-stage filter → 50K clean dataset → QLoRA fine-tune Qwen3-4B
→ Merge to 16-bit → Push to HuggingFace
```
### Pipeline Code

The full data generation pipeline and training code are available at github.com/goldbar123467/SynthDataGSM8K
## Citation

```bibtex
@model{qwen3_4b_gsm8k_synth_35k,
  title={Qwen3-4B-GSM8K-Synth-35K},
  author={clarkkitchen22},
  year={2026},
  base_model={Qwen/Qwen3-4B},
  training_data={clarkkitchen22/SynthGSM8K-50K},
  url={https://huggingface.co/clarkkitchen22/Qwen3-4B-GSM8K-Synth-35K}
}
```
## Acknowledgements
- Base model: Qwen/Qwen3-4B by Alibaba
- Training data: SynthGSM8K-50K — synthetic math problems from Claude Haiku 4.5
- Training framework: Unsloth (2x faster QLoRA)
- Seed data: OpenAI GSM8K