Model Card for gemma-2-2b-it-grpo-v6-checkpoints
This model is a fine-tuned version of google/gemma-2-2b-it using Group Relative Policy Optimization (GRPO). It has been trained to improve reasoning capabilities and Chain-of-Thought (CoT) generation.
Model Description
- Model type: Causal Language Model
- Language(s): English
- License: Gemma Terms of Use
- Base model: google/gemma-2-2b-it
- Dataset: Phonsiri/gemma3-instruct-reasoning-mix
- Training Method: GRPO (Reinforcement Learning) via TRL
- Developers/Authors:
- Phonsiri Thabunsri (Phonsiriwillbejommarn)
- CYP777
- Suranaree University of Technology
This model was trained with GRPO (Group Relative Policy Optimization), a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Unlike PPO-based RLHF, GRPO removes the need for a separate critic (value) model: for each prompt it samples a group of outputs and estimates the baseline from the group's rewards, which makes training more efficient and stable.
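As an illustrative sketch of the baseline idea (not the exact TRL implementation): each output's advantage is its reward centered by the group mean and scaled by the group standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Estimate per-output advantages from a group of rewards for one prompt.

    Centering by the group mean and scaling by the group standard deviation
    replaces the learned value function (critic) used in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for the same prompt, scored by a reward function
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)  # above-average outputs get positive advantages
```

Outputs that beat the group average get a positive advantage and are reinforced; below-average outputs are penalized, with no critic network required.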
Quick Start
1. Using Transformers Pipeline
```python
from transformers import pipeline

model_id = "Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints"
generator = pipeline("text-generation", model=model_id, device="cuda")

question = "A store sells notebooks for $3 each and pens for $1.50 each. Sarah buys 4 notebooks and 6 pens. How much does she pay in total?"

# The pipeline returns a list of result dicts, one per input
output = generator([{"role": "user", "content": question}], max_new_tokens=512, return_full_text=False)
print(output[0]["generated_text"])
```
2. Loading with PEFT (Adapter)
If you are loading this as an adapter on top of the base model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "google/gemma-2-2b-it"
adapter_model_id = "Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints"

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load the GRPO adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_model_id)

# Inference
prompt = "Explain the concept of GRPO in simple terms."
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
Training Procedure
The model was fine-tuned using the TRL library with the GRPOTrainer.
- Dataset: Phonsiri/gemma3-instruct-reasoning-mix (synthetic CoT data generated via Google AI Studio).
- Objective: Enhanced reasoning and multi-step problem solving.
- Group Size: For each prompt, the model generates a group of outputs and is optimized on their relative rewards.
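A minimal training sketch, assuming the standard TRL `GRPOTrainer` API; the reward function and hyperparameters here are illustrative placeholders, not the actual training configuration:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("Phonsiri/gemma3-instruct-reasoning-mix", split="train")

# Hypothetical reward: favor completions that show step-by-step reasoning
def reward_reasoning(completions, **kwargs):
    return [float("step" in c.lower()) for c in completions]

config = GRPOConfig(
    output_dir="gemma-2-2b-it-grpo",
    num_generations=8,          # group size: completions sampled per prompt
    max_completion_length=512,
)
trainer = GRPOTrainer(
    model="google/gemma-2-2b-it",
    reward_funcs=reward_reasoning,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```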
Framework Versions
- TRL: 0.27.0
- Transformers: 4.57.6
- Pytorch: 2.9.1
- Datasets: 4.3.0
- Tokenizers: 0.22.2
Performance & Benchmarks
The model was evaluated using LM Evaluation Harness across standard benchmarks. The results show improvements in reasoning tasks (ARC Challenge) and STEM knowledge, while maintaining stability in other areas compared to the base model.
Evaluation Settings:
- Limit: 200 samples per task (for faster evaluation turnaround).
- Shots: 0-shot for most tasks (standard harness configuration).
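Under these settings, a typical LM Evaluation Harness invocation would look roughly like the following (the exact task list is an assumption based on the table below):

```shell
lm_eval --model hf \
    --model_args pretrained=Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints \
    --tasks arc_challenge,gsm8k,hellaswag,mmlu \
    --num_fewshot 0 \
    --limit 200
```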
| Benchmark | Base Model | GRPO v6 | Delta | Status |
|---|---|---|---|---|
| ARC Challenge | 55.50% | 56.50% | +1.00% | 🟢 Improved |
| GSM8k | 45.50% | 45.50% | 0.00% | ⚪ Stable |
| HellaSwag | 72.50% | 72.50% | 0.00% | ⚪ Stable |
| MMLU Overall | 59.27% | 59.37% | +0.10% | ⚪ Stable |
| — MMLU STEM | 49.40% | 49.84% | +0.44% | ⚪ Stable |
| — MMLU Humanities | 61.40% | 61.09% | -0.31% | ⚪ Stable |
| — MMLU SocialSci | 68.57% | 68.87% | +0.29% | ⚪ Stable |
| — MMLU Other | 60.78% | 60.69% | -0.09% | ⚪ Stable |
| Average Accuracy | 58.20% | 58.35% | +0.15% | 🟢 Slight Improvement |
Note: The improvements in ARC Challenge (+1.00%) and MMLU STEM (+0.44%) suggest that the GRPO fine-tuning with reasoning datasets has successfully enhanced the model's logical reasoning and scientific knowledge capabilities.
Intended Use & Limitations
- Intended Use: Reasoning tasks, math problems, logical puzzles, and educational support.
- Limitations: The model is based on Gemma-2-2B (Small Language Model). It may struggle with highly complex reasoning tasks compared to larger models (7B+). Hallucinations in reasoning steps can still occur.
Acknowledgements
- Authors:
- Phonsiri Thabunsri (@Phonsiriwillbejommarn)
- CYP777 (@CYP777)
- Project Advisor:
- Supaporn Bunrit, Ph.D. (Suranaree University of Technology)
- Institutions:
- Suranaree University of Technology (SUT): For supporting the research and computing resources.
- Google DeepMind: For the open-weights Gemma 2 model.
- DeepSeek AI: For introducing the GRPO methodology.
Citations
Cite GRPO as:
```bibtex
@article{shao2024deepseekmath,
    title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year = {2024},
    eprint = {2402.03300},
    archivePrefix = {arXiv}
}
```
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
    title = {{TRL: Transformer Reinforcement Learning}},
    author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year = 2020,
    journal = {GitHub repository},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```