Model Card for gemma-2-2b-it-grpo-v6-checkpoints
This model is a fine-tuned version of google/gemma-2-2b-it using Group Relative Policy Optimization (GRPO). It has been trained to improve reasoning capabilities and Chain-of-Thought (CoT) generation.
Model Description
- Model type: Causal Language Model
- Language(s): English
- License: Gemma Terms of Use
- Base model: google/gemma-2-2b-it
- Dataset: Phonsiri/gemma3-instruct-reasoning-mix
- Training Method: GRPO (Reinforcement Learning) via TRL
- Developers/Authors:
- Phonsiri Thabunsri (Phonsiriwillbejommarn)
- CYP777
- Suranaree University of Technology
This model was trained with GRPO (Group Relative Policy Optimization), a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Unlike PPO-based RLHF, GRPO removes the need for a separate critic (value) model: for each prompt it samples a group of outputs and estimates the baseline from the group's rewards, which makes training more efficient and stable.
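As an illustrative sketch of the baseline idea (not the exact TRL implementation): each output's advantage is its reward centered by the group mean and scaled by the group standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Estimate per-output advantages from a group of rewards for one prompt.

    Centering by the group mean and scaling by the group standard deviation
    replaces the learned value function (critic) used in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for the same prompt, scored by a reward function
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)  # above-average outputs get positive advantages
```

Outputs that beat the group average get a positive advantage and are reinforced; below-average outputs are penalized, with no critic network required.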
Quick Start
1. Using Transformers Pipeline
```python
from transformers import pipeline

model_id = "Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints"
generator = pipeline("text-generation", model=model_id, device="cuda")

question = "A store sells notebooks for $3 each and pens for $1.50 each. Sarah buys 4 notebooks and 6 pens. How much does she pay in total?"

# The pipeline returns a list of result dicts, one per input
output = generator([{"role": "user", "content": question}], max_new_tokens=512, return_full_text=False)
print(output[0]["generated_text"])
```
2. Loading with PEFT (Adapter)
If you are loading this as an adapter on top of the base model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "google/gemma-2-2b-it"
adapter_model_id = "Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints"

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load the GRPO adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_model_id)

# Inference
prompt = "Explain the concept of GRPO in simple terms."
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
Training Procedure
The model was fine-tuned using the TRL library with the GRPOTrainer.
- Dataset: Phonsiri/gemma3-instruct-reasoning-mix (synthetic CoT data generated via Google AI Studio).
- Objective: Enhanced reasoning and multi-step problem solving.
- Group Size: For each prompt, the model generates a group of outputs and is optimized on their relative rewards.
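A minimal training sketch, assuming the standard TRL `GRPOTrainer` API; the reward function and hyperparameters here are illustrative placeholders, not the actual training configuration:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("Phonsiri/gemma3-instruct-reasoning-mix", split="train")

# Hypothetical reward: favor completions that show step-by-step reasoning
def reward_reasoning(completions, **kwargs):
    return [float("step" in c.lower()) for c in completions]

config = GRPOConfig(
    output_dir="gemma-2-2b-it-grpo",
    num_generations=8,          # group size: completions sampled per prompt
    max_completion_length=512,
)
trainer = GRPOTrainer(
    model="google/gemma-2-2b-it",
    reward_funcs=reward_reasoning,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```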
Framework Versions
- TRL: 0.27.0
- Transformers: 4.57.6
- Pytorch: 2.9.1
- Datasets: 4.3.0
- Tokenizers: 0.22.2
Performance & Benchmarks
The model was evaluated using LM Evaluation Harness across standard benchmarks. The results show improvements in reasoning tasks (ARC Challenge) and STEM knowledge, while maintaining stability in other areas compared to the base model.
Evaluation Settings:
- Limit: 200 samples per task (for faster evaluation turnaround).
- Shots: 0-shot for most tasks (standard harness configuration).
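Under these settings, a typical LM Evaluation Harness invocation would look roughly like the following (the exact task list is an assumption based on the table below):

```shell
lm_eval --model hf \
    --model_args pretrained=Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints \
    --tasks arc_challenge,gsm8k,hellaswag,mmlu \
    --num_fewshot 0 \
    --limit 200
```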
| Benchmark | Base Model | GRPO v6 | Delta | Status |
|---|---|---|---|---|
| ARC Challenge | 55.50% | 56.50% | +1.00% | 🟢 Improved |
| GSM8k | 45.50% | 45.50% | 0.00% | ⚪ Stable |
| HellaSwag | 72.50% | 72.50% | 0.00% | ⚪ Stable |
| MMLU Overall | 59.27% | 59.37% | +0.10% | ⚪ Stable |
| — MMLU STEM | 49.40% | 49.84% | +0.44% | ⚪ Stable |
| — MMLU Humanities | 61.40% | 61.09% | -0.31% | ⚪ Stable |
| — MMLU SocialSci | 68.57% | 68.87% | +0.29% | ⚪ Stable |
| — MMLU Other | 60.78% | 60.69% | -0.09% | ⚪ Stable |
| Average Accuracy | 58.20% | 58.35% | +0.15% | 🟢 Slight Improvement |
Note: The improvements in ARC Challenge (+1.00%) and MMLU STEM (+0.44%) suggest that the GRPO fine-tuning with reasoning datasets has successfully enhanced the model's logical reasoning and scientific knowledge capabilities.
Intended Use & Limitations
- Intended Use: Reasoning tasks, math problems, logical puzzles, and educational support.
- Limitations: The model is based on Gemma-2-2B (Small Language Model). It may struggle with highly complex reasoning tasks compared to larger models (7B+). Hallucinations in reasoning steps can still occur.
Acknowledgements
- Authors:
- Phonsiri Thabunsri (@Phonsiriwillbejommarn)
- CYP777 (@CYP777)
- Project Advisor:
- Supaporn Bunrit, Ph.D. (Suranaree University of Technology)
- Institutions:
- Suranaree University of Technology (SUT): For supporting the research and computing resources.
- Google DeepMind: For the open-weights Gemma 2 model.
- DeepSeek AI: For introducing the GRPO methodology.
Citations
Cite GRPO as:
```bibtex
@article{shao2024deepseekmath,
    title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year = {2024},
    eprint = {2402.03300},
    archivePrefix = {arXiv}
}
```
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
    title = {{TRL: Transformer Reinforcement Learning}},
    author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year = 2020,
    journal = {GitHub repository},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```