tinyLM-8M-exp

Tiny 8.9M-parameter causal LM with a Qwen3 config and math-only novelty-gated GQA.

Architecture

| Item | Value |
| --- | --- |
| Config type | qwen3 |
| Parameters | 8.919M |
| Layers | 10 |
| Hidden size | 256 |
| MLP size | 768 |
| Query heads | 8 |
| KV heads | 4 |
| Head dim | 32 |
| RoPE theta | 2500 |
| Tied embeddings | yes |

| Attention | Value |
| --- | --- |
| Type | GQA |
| Novelty gate | math-only element-wise RMS-normalized abs-delta |
| Gate floor | 0.05 |
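
As a sanity check on the table above, the reported parameter count can be roughly reproduced from the listed dimensions. This is a minimal sketch, not the repo's own code. It assumes a standard Qwen3-style layout (bias-free q/k/v/o projections, a three-matrix gated MLP, two RMSNorms per layer plus per-head q/k norms, one final norm) and a novelty gate with no learnable parameters. The vocabulary size of 4096 is not stated in this card; it is an assumption chosen because it lands on the reported total.

```python
# Rough parameter count from the architecture table.
# ASSUMPTIONS (not stated in the model card): vocab_size=4096, bias-free
# projections, gated MLP with three matrices, per-head q/k RMSNorm,
# and a parameter-free ("math-only") novelty gate.

def count_params(vocab_size=4096, hidden=256, mlp=768, layers=10,
                 q_heads=8, kv_heads=4, head_dim=32, tied=True):
    embed = vocab_size * hidden                   # token embedding matrix
    q_proj = hidden * q_heads * head_dim          # query projection
    kv_proj = 2 * hidden * kv_heads * head_dim    # key + value projections (GQA)
    o_proj = q_heads * head_dim * hidden          # output projection
    attn = q_proj + kv_proj + o_proj
    ffn = 3 * hidden * mlp                        # gate, up, and down matrices
    norms = 2 * hidden + 2 * head_dim             # two layer norms + q/k norms
    per_layer = attn + ffn + norms
    total = embed + layers * per_layer + hidden   # + final RMSNorm
    if not tied:
        total += vocab_size * hidden              # separate LM head
    return total

total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.3f}M)")
```

Under these assumptions the sketch gives 8,918,912 parameters, which rounds to the 8.919M in the table. Roughly 1.05M of that would sit in the tied embedding matrix.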

Training

| Item | Value |
| --- | --- |
| Tokenizer | AxiomicLabs/GPT-S2-5M |
| Sequence length | 512 |
| Microbatch size | 512 |
| Gradient accumulation | 4 |
| Effective batch size | 2048 |
| Steps | 20,000 |
| Validation cadence | every 1,000 steps |
| Raw MC eval cadence | every 2,000 steps on ARC-Easy, ARC-Challenge, PIQA, HellaSwag |
| LR schedule | warmup, cosine to min by 10,000, hold min to 15,000, cosine tail to zero by 20,000 |
| Optimizer | Muon for middle 2D weights, AdamW for the rest |
| Special-token policy | BOS/EOS are document-level; `< |

| Dataset | Share | Config |
| --- | --- | --- |
| HuggingFaceFW/fineweb-edu | 60.0% | sample-100BT |
| HuggingFaceTB/smollm-corpus | 30.0% | cosmopedia-v2 only |
| epfml/FineWeb-HQ | 10.0% | default |
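
The four-phase learning-rate schedule described in the training table can be sketched as a plain function of the step. This is an illustrative reading, not the training script. The breakpoints (10,000, 15,000, 20,000) come from the table, while `peak_lr`, `min_lr`, `warmup_steps`, and the linear warmup shape are hypothetical placeholders, since the card does not state them.

```python
import math

def lr_at(step, peak_lr=1e-3, min_lr=1e-4, warmup_steps=500,
          decay_end=10_000, hold_end=15_000, total_steps=20_000):
    """Warmup -> cosine to min -> hold min -> cosine tail to zero.

    peak_lr, min_lr, and warmup_steps are hypothetical values.
    """
    if step < warmup_steps:
        # Linear warmup up to the peak learning rate (assumed shape).
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_end:
        # Cosine decay from peak_lr down to min_lr.
        t = (step - warmup_steps) / (decay_end - warmup_steps)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
    if step < hold_end:
        # Hold at the minimum learning rate.
        return min_lr
    # Cosine tail from min_lr down to zero at the final step.
    t = min(1.0, (step - hold_end) / (total_steps - hold_end))
    return 0.5 * min_lr * (1 + math.cos(math.pi * t))

for s in (0, 500, 5_000, 10_000, 15_000, 17_500, 20_000):
    print(s, f"{lr_at(s):.2e}")
```

The function is continuous at each breakpoint: the first cosine reaches `min_lr` exactly where the hold begins, and the tail starts from `min_lr` where the hold ends.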

Validation

| Metric | Value |
| --- | --- |
| Dataset | Salesforce/wikitext, wikitext-103-raw-v1, validation |
| Context / stride | 512 / 256 |
| Loss | 3.1546 |
| Perplexity | 23.44 |
| UTF-8 BPB | 1.4433 |
| Scored tokens | 365,258 |
| UTF-8 bytes | 1,151,766 |
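
The perplexity and bits-per-byte figures follow from the loss, the scored-token count, and the byte count. The short check below assumes the loss is the mean cross-entropy per scored token in nats, and that BPB is total bits divided by total UTF-8 bytes. The variable names are illustrative.

```python
import math

# Values copied from the validation table.
loss_nats = 3.1546          # mean cross-entropy per scored token (assumed nats)
scored_tokens = 365_258
utf8_bytes = 1_151_766

# Perplexity is the exponential of the mean per-token loss.
perplexity = math.exp(loss_nats)

# Bits per byte: total nats -> total bits (divide by ln 2), then per byte.
bits_per_byte = loss_nats * scored_tokens / (utf8_bytes * math.log(2))

print(f"perplexity = {perplexity:.2f}")
print(f"UTF-8 BPB  = {bits_per_byte:.4f}")
```

Both results agree with the reported 23.44 and 1.4433 to the precision shown, which is consistent with the assumed definitions.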

Load And Generate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "User01110/tinyLM-8M-exp"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
print(inputs.input_ids[0][:2].tolist())  # token IDs of the two leading special tokens: <|im_start|>, <|bos|>

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0], skip_special_tokens=True))

This repo uses a native Qwen3Config plus remote model code for the math-only novelty-gated attention block, which is why loading requires trust_remote_code=True.
