RiskPO: Enhancing LLM Reasoning by Tackling Challenging Problems with Risk-Based Optimization

TLDR: RiskPO is a novel post-training method for Large Language Models (LLMs) that addresses the limitations of existing mean-based reinforcement learning techniques, such as entropy collapse and limited reasoning gains. By introducing a Mixed Value-at-Risk (MVaR) objective and a bundling scheme for questions, RiskPO amplifies learning signals from challenging instances and promotes exploration. This approach has been theoretically proven to mitigate entropy collapse and empirically shown to achieve significant improvements in mathematical reasoning, multi-modal reasoning, and code generation, effectively expanding the LLM’s intrinsic reasoning capabilities.

Large Language Models (LLMs) have become incredibly powerful, but getting them to reason effectively, especially on complex tasks like advanced mathematics or code generation, remains a significant challenge. A popular approach for refining these models after their initial training is called Reinforcement Learning with Verifiable Reward (RLVR). This method uses clear, objective feedback (like a correct or incorrect answer) to guide the model’s learning.

However, current leading RLVR techniques, such as Group Relative Policy Optimization (GRPO), face a fundamental problem: they often suffer from what researchers call ‘entropy collapse.’ Imagine a student who quickly becomes overconfident and stops trying new ways to solve problems, even if their current method only works for easy ones. That’s similar to entropy collapse in LLMs – the model prematurely settles on a narrow set of solutions, limiting its ability to explore and truly master difficult reasoning paths.

The core issue, according to new research, is that these methods primarily focus on maximizing the ‘average’ reward. This ‘mean-based’ approach tends to reinforce common, high-probability answers, while neglecting those rare but crucial reasoning steps that could lead to breakthroughs on harder problems. If an LLM consistently gets all answers wrong for a particular question, the learning signal can even vanish, leaving the model without guidance on its weakest areas.
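The vanishing signal described above can be seen in a small sketch of a GRPO-style group-normalized advantage, where each sampled answer's reward is standardized against the other answers to the same question. The function name and the `eps` constant are illustrative choices, not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one question's group of sampled answers."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps (an illustrative constant) avoids division by zero when all rewards match.
    return [(r - mean) / (std + eps) for r in rewards]

# A question the model sometimes solves: correct answers get a positive advantage.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A question the model never solves: every advantage is zero, so no update signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_wrong)
```

When every sampled answer receives the same reward, each advantage is zero and the question contributes nothing to the update, which is the situation the article describes for the model's weakest areas.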

Introducing RiskPO: A New Approach to LLM Optimization

To tackle these limitations, a team of researchers from Peking University has proposed a novel method called Risk-based Policy Optimization, or RiskPO. This innovative approach shifts away from traditional mean-based objectives and instead uses ‘principled risk measures’ to guide the LLM’s learning.

At the heart of RiskPO is a new objective function called Mixed Value-at-Risk (MVaR). Instead of just looking at the average reward, MVaR intelligently weighs different parts of the reward distribution. Crucially, it amplifies the learning signals from challenging instances – those problems where the model is currently struggling. This prevents the model from becoming overconfident too quickly and encourages it to explore more diverse and effective reasoning strategies.
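The article does not give the MVaR formula, so the following is only a rough sketch of the general idea of weighting parts of the score distribution differently. The two-band form, the function names, and the parameters `alpha`, `beta`, and `omega` are assumptions made for illustration and should not be read as the paper's definition.

```python
def band_mean(scores, lo, hi):
    """Mean of the scores whose rank falls in the quantile band [lo, hi)."""
    ordered = sorted(scores)
    n = len(ordered)
    start, stop = int(round(lo * n)), int(round(hi * n))
    band = ordered[start:max(stop, start + 1)]  # keep at least one element
    return sum(band) / len(band)

def tail_weighted_objective(scores, alpha=0.2, beta=0.8, omega=0.5):
    """Hypothetical objective: full weight on the lowest band, weight omega
    on the middle band, and no weight on the top band."""
    return band_mean(scores, 0.0, alpha) + omega * band_mean(scores, alpha, beta)

def plain_mean(scores):
    return sum(scores) / len(scores)

base = [0, 0, 1, 2, 3, 4, 4, 5, 5, 5]
worst_improved = [1, 0, 1, 2, 3, 4, 4, 5, 5, 5]  # one of the lowest scores rises by 1
best_improved = [0, 0, 1, 2, 3, 4, 4, 5, 5, 6]   # one of the highest scores rises by 1

for name, variant in [("worst", worst_improved), ("best", best_improved)]:
    mean_gain = plain_mean(variant) - plain_mean(base)
    tail_gain = tail_weighted_objective(variant) - tail_weighted_objective(base)
    print(name, round(mean_gain, 3), round(tail_gain, 3))
```

Under these assumptions, the plain mean rewards both improvements equally, while the tail-weighted objective responds only to the improvement on the lowest score. That asymmetry is the sense in which a risk-based objective amplifies signals from instances where the model is struggling.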

RiskPO also introduces a clever ‘bundling scheme.’ Since a single question might only provide a simple correct/incorrect signal, RiskPO groups multiple questions into ‘bundles.’ This aggregation transforms sparse, binary feedback into a richer, more informative distribution of scores for the entire bundle. This not only provides a more stable training signal but also helps avoid the problem of zero gradients on difficult questions, ensuring the model always has something to learn from.
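A minimal sketch of the bundling idea follows. The article does not describe how sampled answers are paired across questions, so summing rewards by sample index is an assumption made here, and `bundle_scores` is a hypothetical helper name.

```python
def bundle_scores(rewards_per_question):
    """rewards_per_question[q][i] is the 0/1 reward of the i-th sampled
    answer to question q. Returns one score per sample index: the number
    of questions in the bundle that this sample answered correctly."""
    return [sum(sample) for sample in zip(*rewards_per_question)]

easy = [1, 1, 1, 0]
medium = [1, 0, 1, 0]
hard = [0, 0, 0, 0]  # never solved: no variation when considered alone

scores = bundle_scores([easy, medium, hard])
print(scores)
```

Taken alone, the hard question yields identical rewards for every sample. Inside the bundle, the scores take several distinct values, so the samples that include the hard question still sit in a distribution with usable variation.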

Theoretical Backing and Impressive Results

The researchers have theoretically shown that these risk-averse updates actively work to alleviate entropy collapse and promote better exploration within the LLM. This means the model is encouraged to keep trying new things and expand its problem-solving repertoire.

Empirically, RiskPO has demonstrated consistent and significant improvements across a wide range of benchmarks. It outperformed GRPO and its variants in mathematical reasoning tasks (including challenging AIME datasets), multi-modal reasoning, and code generation. For instance, on hard-level mathematical reasoning, RiskPO achieved an average score of 46.65, a notable improvement over the strongest baseline. The gains were particularly evident on the most difficult AIME datasets, where RiskPO surpassed previous methods by a significant margin.

Crucially, the results indicate that RiskPO isn’t just making LLMs more efficient at sampling known answers. Instead, it’s genuinely expanding the ‘reasoning boundary’ of the base models, enabling them to acquire new solution strategies and tackle problems they previously couldn’t solve, even with multiple attempts. This is reflected in its superior performance on Pass@k metrics, which measure how often at least one of k sampled attempts solves a problem.
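For readers unfamiliar with Pass@k, here is a short sketch of the commonly used unbiased estimator, computed from n sampled answers of which c are correct. The article does not say which estimator the authors use, so treat this as background rather than as their evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k answers, drawn
    without replacement from n samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem never solved in 10 samples scores 0 for every k up to 10.
print(pass_at_k(10, 0, 5))

# One correct answer out of 10: more attempts raise the chance of a hit.
print(pass_at_k(10, 1, 1), pass_at_k(10, 1, 5))
```

A problem the model never solves scores zero at every k, so Pass@k at large k can only rise if the model starts solving problems that were previously out of reach.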

By focusing on the lower tail of the reward distribution (the hardest problems), RiskPO ensures that LLMs continue to learn and improve on their weakest areas, leading to more robust and capable reasoning abilities. This research marks a significant step towards developing LLMs that can truly master complex reasoning tasks by embracing uncertainty and actively seeking out challenges. The full research paper is titled “RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training.”
