TLDR: RiskPO is a novel post-training method for Large Language Models (LLMs) that addresses the limitations of existing mean-based reinforcement learning techniques, such as entropy collapse and limited reasoning gains. By introducing a Mixed Value-at-Risk (MVaR) objective and a bundling scheme for questions, RiskPO amplifies learning signals from challenging instances and promotes exploration. This approach has been theoretically proven to mitigate entropy collapse and empirically shown to achieve significant improvements in mathematical reasoning, multi-modal reasoning, and code generation, effectively expanding the LLM’s intrinsic reasoning capabilities.
Large Language Models (LLMs) have become incredibly powerful, but getting them to reason effectively, especially on complex tasks like advanced mathematics or code generation, remains a significant challenge. A popular approach for refining these models after their initial training is called Reinforcement Learning with Verifiable Reward (RLVR). This method uses clear, objective feedback (like a correct or incorrect answer) to guide the model’s learning.
However, current leading RLVR techniques, such as Group Relative Policy Optimization (GRPO), face a fundamental problem: they often suffer from what researchers call ‘entropy collapse.’ Imagine a student who quickly becomes overconfident and stops trying new ways to solve problems, even if their current method only works for easy ones. That’s similar to entropy collapse in LLMs – the model prematurely settles on a narrow set of solutions, limiting its ability to explore and truly master difficult reasoning paths.
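To make "entropy" concrete, here is a minimal sketch that computes the Shannon entropy of a next-token probability distribution. The two example distributions are made up for illustration and do not come from the paper; the point is only that a model that piles almost all of its probability on one option has far lower entropy than one that still spreads probability across alternatives.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative distributions over four candidate tokens (assumed values).
diverse = entropy([0.25, 0.25, 0.25, 0.25])    # still exploring
collapsed = entropy([0.97, 0.01, 0.01, 0.01])  # nearly deterministic

print(f"diverse:   {diverse:.3f}")
print(f"collapsed: {collapsed:.3f}")
```

Entropy collapse describes a policy drifting from the first situation toward the second early in training.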
The core issue, according to new research, is that these methods primarily focus on maximizing the ‘average’ reward. This ‘mean-based’ approach tends to reinforce common, high-probability answers, while neglecting those rare but crucial reasoning steps that could lead to breakthroughs on harder problems. If an LLM consistently gets all answers wrong for a particular question, the learning signal can even vanish, leaving the model without guidance on its weakest areas.
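The vanishing signal is easy to see in a small sketch of GRPO-style group-relative advantages, where each sampled answer's reward is standardized against the other answers to the same question. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one question's group of sampled answers."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps (assumed) avoids division by zero when every reward is identical.
    return [(r - mean) / (std + eps) for r in rewards]

# A question the model sometimes solves: answers get distinct advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A question the model never solves: every advantage is zero.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_wrong)
```

When every answer in the group is wrong, each reward equals the group mean, so every advantage is zero and the question contributes no gradient.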
Introducing RiskPO: A New Approach to LLM Optimization
To tackle these limitations, a team of researchers from Peking University has proposed a novel method called Risk-based Policy Optimization, or RiskPO. This innovative approach shifts away from traditional mean-based objectives and instead uses ‘principled risk measures’ to guide the LLM’s learning.
At the heart of RiskPO is a new objective function called Mixed Value-at-Risk (MVaR). Instead of just looking at the average reward, MVaR intelligently weighs different parts of the reward distribution. Crucially, it amplifies the learning signals from challenging instances – those problems where the model is currently struggling. This prevents the model from becoming overconfident too quickly and encourages it to explore more diverse and effective reasoning strategies.
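The idea of weighting regions of the reward distribution differently can be sketched as follows. This is an illustration only: this article does not give the paper's exact MVaR formula, so the function names, the quantile levels `alpha` and `beta`, the extra weight `omega`, and the way the two terms are combined are all assumptions.

```python
def quantile_range_mean(scores, lo, hi):
    """Average of the sorted scores between empirical quantile levels lo and hi."""
    ordered = sorted(scores)
    n = len(ordered)
    start = int(lo * n)
    stop = max(start + 1, int(hi * n))  # keep at least one score in the window
    window = ordered[start:stop]
    return sum(window) / len(window)

def mixed_tail_objective(scores, alpha=0.2, beta=0.8, omega=1.0):
    """Hypothetical mixed objective: up-weighted lowest tail plus the middle range."""
    tail = quantile_range_mean(scores, 0.0, alpha)
    middle = quantile_range_mean(scores, alpha, beta)
    return (1.0 + omega) * tail + middle

# Two sets of scores with the same mean (2.0) but different lower tails.
steady = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
uneven = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

print(mixed_tail_objective(steady))
print(mixed_tail_objective(uneven))
```

A mean-based objective cannot tell these two sets apart. The tail-weighted objective rates the uneven set lower because its weakest scores drag down the up-weighted tail term, so improving those weakest cases is the most direct way to raise it.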
RiskPO also introduces a clever ‘bundling scheme.’ Since a single question might only provide a simple correct/incorrect signal, RiskPO groups multiple questions into ‘bundles.’ This aggregation transforms sparse, binary feedback into a richer, more informative distribution of scores for the entire bundle. This not only provides a more stable training signal but also helps avoid the problem of zero gradients on difficult questions, ensuring the model always has something to learn from.
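A minimal sketch of the bundling idea: per-question correct/incorrect rewards are summed within each bundle, so the score can take any value from 0 up to the bundle size instead of just 0 or 1. How the paper actually forms bundles (their size and which questions go together) is not described here, so chunking consecutive questions and the bundle size of 4 are assumptions.

```python
def bundle_scores(correct, bundle_size):
    """Sum binary per-question rewards over consecutive bundles of questions."""
    return [
        sum(correct[i:i + bundle_size])
        for i in range(0, len(correct), bundle_size)
    ]

# Illustrative 0/1 rewards for twelve questions (assumed values).
rewards = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
scores = bundle_scores(rewards, bundle_size=4)

print(scores)
```

Even a question the model always gets wrong now sits inside a bundle whose total score varies with the other questions, so it no longer produces an all-identical group of rewards on its own.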
Theoretical Backing and Impressive Results
The researchers have theoretically shown that these risk-averse updates actively work to alleviate entropy collapse and promote better exploration within the LLM. This means the model is encouraged to keep trying new things and expand its problem-solving repertoire.
Empirically, RiskPO has demonstrated consistent improvements across a wide range of benchmarks, outperforming GRPO and its variants in mathematical reasoning (including the challenging AIME datasets), multi-modal reasoning, and code generation. On hard-level mathematical reasoning, for instance, RiskPO achieved an average score of 46.65, ahead of the strongest baseline. The gains were most pronounced on the most difficult AIME datasets.
Crucially, the results indicate that RiskPO isn’t just making LLMs more efficient at sampling known answers. Instead, it’s genuinely expanding the ‘reasoning boundary’ of the base models, enabling them to acquire new solution strategies and tackle problems they previously couldn’t solve, even with multiple attempts. This is reflected in its superior performance on Pass@k metrics, which measure success within multiple attempts.
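For readers unfamiliar with Pass@k, the sketch below uses the widely used unbiased estimator from code-generation evaluation: given n sampled answers of which c are correct, it estimates the probability that at least one of k randomly chosen samples is correct. This article does not say exactly how the paper computes the metric, so treat the estimator choice and the example numbers as assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate Pass@k from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 samples per problem, 1 of them correct.
print(pass_at_k(10, 1, 1))
print(pass_at_k(10, 1, 5))
print(pass_at_k(10, 0, 5))
```

A problem the model never solves scores zero at every k, which is why gains at large k suggest the model is solving problems that were previously out of reach rather than just ranking known answers better.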
By focusing on the lower tail of the reward distribution – the hardest problems – RiskPO ensures that LLMs continue to learn and improve on their weakest areas, leading to more robust and capable reasoning abilities. This research marks a significant step towards developing LLMs that can truly master complex reasoning tasks by embracing uncertainty and actively seeking out challenges. You can find the full research paper here: RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training.


