Model Card for gemma-2-2b-it-grpo-v6-checkpoints

This model is a fine-tuned version of google/gemma-2-2b-it using Group Relative Policy Optimization (GRPO). It has been trained to improve reasoning capabilities and Chain-of-Thought (CoT) generation.

Model Description

This model was trained using GRPO (Group Relative Policy Optimization), a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Unlike PPO-based RLHF, GRPO eliminates the need for a separate critic (value) model by estimating the baseline from a group of outputs sampled for the same prompt, making training more memory-efficient and stable.
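The group-relative baseline can be sketched in a few lines of Python. This is an illustrative sketch of the normalization described in the DeepSeekMath paper, not the actual training code:

```python
import statistics

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one prompt's group of sampled outputs.

    The group mean acts as the baseline (replacing a learned critic),
    and rewards are normalized by the group's standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Identical rewards carry no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers scored 1.0 (correct) or 0.0 (incorrect)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Correct answers get a positive advantage and incorrect ones a negative advantage, purely from their standing relative to the rest of the group.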

Quick Start

1. Using Transformers Pipeline

from transformers import pipeline

model_id = "Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints"
generator = pipeline("text-generation", model=model_id, device="cuda")

question = "A store sells notebooks for $3 each and pens for $1.50 each. Sarah buys 4 notebooks and 6 pens. How much does she pay in total?"

output = generator([{"role": "user", "content": question}], max_new_tokens=512, return_full_text=False)

# The pipeline returns a list of results; take the first element.
print(output[0]["generated_text"])

2. Loading with PEFT (Adapter)

If you are loading this as an adapter on top of the base model:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "google/gemma-2-2b-it"
adapter_model_id = "Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints"

# Load Base Model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

# Load Adapter
model = PeftModel.from_pretrained(base_model, adapter_model_id)

# Inference
prompt = "Explain the concept of GRPO in simple terms."
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(base_model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# generate() returns a batch of sequences; decode the first one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Procedure

The model was fine-tuned using the TRL library with the GRPOTrainer.

  • Dataset: Phonsiri/gemma3-instruct-reasoning-mix (Synthetic CoT data generated via Google AI Studio).
  • Objective: Enhanced reasoning and multi-step problem solving.
  • Group Size: The model generates a group of outputs for each prompt and optimizes based on their relative rewards.
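A minimal TRL setup along these lines might look as follows. This is an illustrative sketch only: the reward function, group size, and other hyperparameters are placeholders, not the values used for this run.

```python
from trl import GRPOTrainer, GRPOConfig

def reward_len(completions, **kwargs):
    # Placeholder reward favoring completions near 200 characters;
    # the real run would score reasoning quality instead.
    return [-abs(len(c) - 200) / 200 for c in completions]

def build_trainer(train_dataset):
    # train_dataset is expected to provide a "prompt" column.
    config = GRPOConfig(
        output_dir="gemma-2-2b-it-grpo",
        num_generations=8,         # size of the group sampled per prompt
        max_completion_length=512,
    )
    return GRPOTrainer(
        model="google/gemma-2-2b-it",
        reward_funcs=reward_len,
        args=config,
        train_dataset=train_dataset,
    )

# build_trainer(dataset).train()
```

The key GRPO-specific knob is num_generations: for each prompt, the trainer samples that many completions, scores them with the reward function, and computes advantages relative to the group.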

Framework Versions

  • TRL: 0.27.0
  • Transformers: 4.57.6
  • Pytorch: 2.9.1
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Performance & Benchmarks

The model was evaluated using LM Evaluation Harness across standard benchmarks. The results show improvements in reasoning tasks (ARC Challenge) and STEM knowledge, while maintaining stability in other areas compared to the base model.

Evaluation Settings:

  • Limit: 200 samples per task (for faster evaluation turnaround).
  • Shot: 0-shot for most tasks (standard harness configuration).
| Benchmark | Base Model | GRPO v6 | Delta | Status |
|---|---|---|---|---|
| ARC Challenge | 55.50% | 56.50% | +1.00% | 🟢 Improved |
| GSM8k | 45.50% | 45.50% | 0.00% | ⚪ Stable |
| HellaSwag | 72.50% | 72.50% | 0.00% | ⚪ Stable |
| MMLU Overall | 59.27% | 59.37% | +0.10% | ⚪ Stable |
| — MMLU STEM | 49.40% | 49.84% | +0.44% | ⚪ Stable |
| — MMLU Humanities | 61.40% | 61.09% | -0.31% | ⚪ Stable |
| — MMLU SocialSci | 68.57% | 68.87% | +0.29% | ⚪ Stable |
| — MMLU Other | 60.78% | 60.69% | -0.09% | ⚪ Stable |
| Average Accuracy | 58.20% | 58.35% | +0.15% | 🟢 Slight Improvement |

Note: The gains in ARC Challenge (+1.00%) and MMLU STEM (+0.44%) suggest that GRPO fine-tuning on reasoning data modestly improved the model's logical reasoning and scientific knowledge, though the small deltas and the 200-sample limit warrant cautious interpretation.
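An evaluation along these lines can be reproduced with the LM Evaluation Harness CLI. This is an illustrative command; the exact task list, batch size, and harness version used for this card are assumptions:

```shell
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=Phonsiri/gemma-2-2b-it-grpo-v6-checkpoints,dtype=float16 \
  --tasks arc_challenge,gsm8k,hellaswag,mmlu \
  --num_fewshot 0 \
  --limit 200 \
  --batch_size 8
```

If the repository holds a PEFT adapter rather than merged weights, point pretrained= at the base model and pass the adapter via the harness's peft= model argument instead.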

Intended Use & Limitations

  • Intended Use: Reasoning tasks, math word problems, logical puzzles, and educational support.
  • Limitations: The model is based on Gemma-2-2B, a small language model. It may struggle with highly complex reasoning compared to larger (7B+) models, and hallucinations in reasoning steps can still occur.

Acknowledgements

  • Authors:
  • Project Advisor:
    • Supaporn Bunrit, Ph.D. (Suranaree University of Technology)
  • Institutions:
    • Suranaree University of Technology (SUT): For supporting the research and computing resources.
  • Google DeepMind: For the open-weights Gemma 2 model.
  • DeepSeek AI: For introducing the GRPO methodology.

Citations

Cite GRPO as:

@article{shao2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
