Efficient RLVR Training via Weighted Mutual Information Data Selection
Abstract
InSight is an information-guided data sampling method for reinforcement learning training that improves efficiency by considering both difficulty and epistemic uncertainty through Bayesian modeling and a weighted mutual information objective.
Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting the epistemic uncertainty that arises from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to the multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, a +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
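The abstract describes scoring datapoints from a Bayesian belief over each item's latent success rate, combining a difficulty-dependent and an evidence-dependent component. The paper's exact formula is not given here, so the sketch below is purely illustrative: it uses a Beta posterior updated from rollout outcomes, binary entropy of the mean belief as the hypothetical difficulty term, and posterior variance as the hypothetical evidence term. The names `beta_update` and `acquisition_score` and the unweighted sum are assumptions, not the authors' method.

```python
import math

def beta_update(alpha, beta, successes, failures):
    """Conjugate Bayesian update of a Beta(alpha, beta) belief over a
    datapoint's latent success rate, given verifiable rollout outcomes."""
    return alpha + successes, beta + failures

def acquisition_score(alpha, beta):
    """Hypothetical InSight-style score computed from the mean belief
    p_hat = alpha / (alpha + beta) rather than noisy sampled outcomes.
    Difficulty term: binary entropy of p_hat (peaks at p_hat = 0.5).
    Evidence term: Beta posterior variance (shrinks as rollouts accumulate).
    The equal weighting of the two terms is an assumption for illustration."""
    n = alpha + beta
    p_hat = alpha / n
    eps = 1e-12  # guard against log(0)
    difficulty = -(p_hat * math.log(p_hat + eps)
                   + (1.0 - p_hat) * math.log(1.0 - p_hat + eps))
    evidence = (alpha * beta) / (n * n * (n + 1.0))  # Var of Beta(alpha, beta)
    return difficulty + evidence

# Ranking a toy pool: at the same mean belief (0.5), the item with less
# evidence scores higher; an easy item (mean 0.9) scores lowest.
pool = {"few_evidence_mid": (2, 2), "much_evidence_mid": (20, 20), "easy": (9, 1)}
ranked = sorted(pool, key=lambda k: -acquisition_score(*pool[k]))
```

This illustrates the decomposition claimed in the abstract: a difficulty-only heuristic would tie the two intermediate-success items, while the evidence term breaks the tie toward the item with fewer observed rollouts.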
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning (2026)
- Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement (2026)
- Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR (2026)
- Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing (2026)
- Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards (2026)
- On-Policy Supervised Fine-Tuning for Efficient Reasoning (2026)
- IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning (2026)
