KernelBench-RLVR-120b
A 120B-parameter model fine-tuned with GRPO (Group Relative Policy Optimization) for GPU kernel generation. This model was used to study compute-optimal test-time strategies in the accompanying paper, *Surprisal-Guided Selection*, where we find that Best-of-N search with surprisal-guided selection recovers oracle performance at zero additional cost.
Paper: arXiv:2602.07670 | Code: GitHub
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/KernelBench-RLVR-120b")
```
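The snippet below sketches a single generation call. It is illustrative only: the placeholder prompt and sampling settings (mirroring the training configuration table below) are assumptions, and depending on your setup you may prefer `tokenizer.apply_chat_template` over a raw prompt. See Intended Use for the expected input format.
```python
# Illustrative generation call; build `prompt` using the format described
# under "Intended Use". Sampling settings are assumptions, not prescriptive.
prompt = "..."  # PyTorch reference implementation wrapped in the documented prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.25,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```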
Model Description
This model was trained using an execution-grounded RL framework where:
- Environment: KernelBench provides deterministic execution feedback via CUDA compiler and GPU hardware
- Reward: Raw speedup (correctness-gated), normalized by a running baseline
- Algorithm: GRPO with group-relative advantages
- Evaluation: Same evaluator as training (no reward hacking possible)
| Parameter | Value |
|---|---|
| Base Model | openai/gpt-oss-120b |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| LoRA Rank | 16 |
| Training Steps | 40 |
| Learning Rate | 1e-5 |
| Temperature | 0.25 |
| Max Tokens | 1024 |
| Training Tasks | 80 (KernelBench L1 train split) |
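For illustration, here is a minimal sketch of the correctness-gated reward and group-relative advantage described above. The exact shaping and baseline normalization in the training code may differ; `running_baseline` and the function names are hypothetical.
```python
import numpy as np

def kernel_reward(correct: bool, speedup: float, running_baseline: float) -> float:
    # Correctness-gated: incorrect kernels receive zero reward; correct kernels
    # are rewarded by raw speedup normalized by a running baseline.
    if not correct:
        return 0.0
    return speedup / max(running_baseline, 1e-8)

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    # GRPO: standardize rewards within each sampled group, so no learned
    # value function is needed.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```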
Evaluation Results
Training Checkpoint (Step 40):
- Correctness: 98.4%
- Mean Speedup: 0.87x on training distribution
Best-of-N Search (Full L1 Eval, 20 tasks):
- 18/20 tasks (90%) achieve fast_1 = 1 at K=64
- Performance saturates at K=16 (99.9% on 5-task subsets)
Selection Strategy Comparison (Subset 1, 5 tasks x 2 seeds):
| Strategy | fast_1 | std | Mean Speedup |
|---|---|---|---|
| best-correct (Oracle) | 100% | 0% | 226.9x |
| surprisal-guided-top3 | 100% | 0% | 139.0x |
| surprisal-guided | 80% | 0% | 41.2x |
| random-correct | 59.2% | 2.7% | 30.0x |
| confidence-guided | 50% | 14.1% | 11.6x |
Test-Time Training Comparison (Subset 1, 3 seeds):
| Method | fast_1 | std | Rollouts |
|---|---|---|---|
| Best-of-N (K=64) | 100% | 0% | 320 |
| Batch-TTT BoA | 30.6% | 11.3% | 960 |
| SDPO Prompt-Only | 30.4% | 7.6% | 320 |
Note: fast_1 = fraction of samples that are both correct AND achieve speedup > 1x.
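For concreteness, fast_1 as defined in the note above can be computed like this (a sketch; the `(correct, speedup)` pair representation is illustrative):
```python
def fast_1(samples):
    # samples: iterable of (correct: bool, speedup: float) pairs.
    # fast_1 = fraction that are both correct and faster than the reference (> 1x).
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for correct, speedup in samples if correct and speedup > 1.0)
    return hits / len(samples)
```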
Key Findings
This model was developed as part of research on compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. Three findings:
Surprisal-guided selection recovers oracle performance. Selecting the highest-surprisal (lowest log-probability) correct sample achieves 80% fast_1 vs. 50% for confidence-guided selection (+30pp, Cohen's h = 0.64); extending to surprisal-guided-top3 matches the oracle at 100%. The model's probability distribution tracks frequency, not quality: rare, hardware-optimized kernels occupy the Expert Tail that surprisal-guided selection recovers at zero additional cost.
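A minimal sketch of surprisal-guided selection under these assumptions (the candidate fields `correct` and `logprob` are hypothetical names; `logprob` is the total log-probability the model assigned to the sampled kernel):
```python
def surprisal_guided_select(candidates, top_k=1):
    # Keep only candidates that passed the correctness check, then pick the
    # ones the model found least likely: lowest logprob = highest surprisal.
    correct = [c for c in candidates if c["correct"]]
    ranked = sorted(correct, key=lambda c: c["logprob"])  # ascending logprob
    return ranked[:top_k]
```
With `top_k` greater than 1 this returns a shortlist of high-surprisal correct samples, in the spirit of the surprisal-guided-top3 strategy in the table above.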
Search outperforms adaptation. Best-of-N at K=64 achieves 90% task success (18/20 L1 tasks). TTT's Best-of-Adaptation reaches 30.6% (3-seed mean), with an "equivalent K" below 1, i.e., worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions.
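Best-of-N with the execution-grounded evaluator can be sketched as follows (`generate_kernel` and `evaluate_kernel` are placeholders for your sampling call and KernelBench evaluation harness, not APIs from this repository):
```python
def best_of_n(task, generate_kernel, evaluate_kernel, k=64):
    # Sample k candidate kernels, score each with the execution-grounded
    # evaluator, and keep the fastest correct one (oracle-style selection).
    best = None
    for _ in range(k):
        code = generate_kernel(task)
        correct, speedup = evaluate_kernel(task, code)
        if correct and (best is None or speedup > best[1]):
            best = (code, speedup)
    return best  # None if no candidate was correct
```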
Feedback redundancy. SDPO with execution feedback (26.3%) underperforms prompt-only self-distillation (30.4%). When the environment already provides dense, continuous rewards, a teacher's interpretation of that feedback becomes redundant.
Hardware Requirements
- GPU Memory: ~240GB for bf16 inference (e.g., 8x A100 40GB, 4x A100 80GB, or 3x H100)
- Disk Space: ~240GB for model weights
- Recommended: Use `device_map="auto"` for automatic multi-GPU distribution
For single-GPU inference, consider using quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/KernelBench-RLVR-120b",
    quantization_config=quantization_config,
    device_map="auto"
)
```
Intended Use
This model is designed for GPU kernel optimization research. Given a PyTorch reference implementation, it generates optimized CUDA kernel code.
Input format:
Given the following PyTorch reference implementation:
```python
[reference code]
```
Write an optimized CUDA kernel that computes the same result.
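A small helper for constructing this prompt (a sketch; `build_prompt` is not part of this repository, and the bracketed placeholder is replaced by your reference implementation):
````python
def build_prompt(reference_code: str) -> str:
    # Wrap a PyTorch reference implementation in the prompt format shown above.
    return (
        "Given the following PyTorch reference implementation:\n"
        "```python\n"
        f"{reference_code}\n"
        "```\n"
        "Write an optimized CUDA kernel that computes the same result.\n"
    )
````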
Limitations
- Evaluated on KernelBench L1 only (250 ML workloads)
- Hardware-specific optimizations (A100)
- Extended test-time adaptation may cause regression (use BoA selection with early stopping)
- Single model size evaluated (120B)
- Surprisal-guided selection requires sufficient intra-task logprob variance; on 11/20 L1 tasks with near-identical logprobs, all selection strategies perform equivalently
Citation
If you use this model, please cite [our paper](http://arxiv.org/abs/2602.07670):
```bibtex
@article{barnes2026surprisal,
  title={Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies
         for Execution-Grounded Code Generation},
  author={Barnes, Jarrod},
  journal={arXiv preprint arXiv:2602.07670},
  year={2026},
  url={http://arxiv.org/abs/2602.07670}
}
```
Related Work
- KernelBench - Ouyang et al., 2025
- TTT-Discover - Yuksekgonul et al., 2026
- SDPO - Zeng et al., 2026
- Scalable Power Sampling - Ji et al., 2026
Evaluation results (self-reported, KernelBench L1)
| Metric | Value |
|---|---|
| task_success_rate (K=64, 20 tasks) | 90.0% |
| fast_1 (K=1, per-sample) | 53.3% |
| correctness (training distribution) | 98.4% |