New RL Method Boosts LLM Reasoning Beyond Base Model Limits

TLDR: A new reinforcement learning method called RAPO (Rewards-Aware Policy Optimization) has been developed to enhance the reasoning capabilities of large language models (LLMs). Traditional RL methods often struggle to explore beyond the LLM’s initial knowledge, leading to performance plateaus. RAPO addresses this by using a ‘forward KL divergence’ for exploring entirely new solution paths and a ‘reward-aware reweighting’ mechanism for smarter exploration within known areas. Tested on Qwen2.5 models for mathematical problem-solving, RAPO consistently outperformed existing methods and even solved problems that the base models couldn’t, demonstrating its ability to unlock deeper reasoning potential.

Large Language Models (LLMs) have made incredible strides in recent years, especially in complex reasoning tasks like solving mathematical problems. A key technique behind many of these advancements is Reinforcement Learning with Verifiable Rewards (RLVR), where models learn by getting feedback on the correctness of their solutions.
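To make the idea of a verifiable reward concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each solution ends with an "Answer:" marker and that correctness means an exact string match; the function names are hypothetical and are not taken from the paper.

```python
def extract_final_answer(solution: str) -> str:
    """Return the text after the last 'Answer:' marker (hypothetical format)."""
    marker = "Answer:"
    idx = solution.rfind(marker)
    if idx == -1:
        return ""  # no marker found, so there is nothing to verify
    return solution[idx + len(marker):].strip()


def verifiable_reward(solution: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(solution)
    if not answer:
        return 0.0
    return 1.0 if answer == reference_answer.strip() else 0.0


print(verifiable_reward("2 + 2 equals four. Answer: 4", "4"))
print(verifiable_reward("I think it is five. Answer: 5", "4"))
```

Real math benchmarks usually need more forgiving answer matching (for example, treating equivalent fractions as equal), which this sketch leaves out.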

However, a significant challenge has emerged: RLVR-trained models outperform their base models when only a few attempts per problem are allowed, but this advantage often disappears or even reverses as the number of attempts grows. This suggests that current RLVR methods primarily refine existing knowledge rather than enabling true exploration beyond the model’s initial capabilities. Essentially, the models get stuck within the ‘search space’ defined by their original training.
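The effect of the number of attempts can be illustrated with a toy calculation. The sketch below uses invented per-problem success probabilities (they are not results from the paper) and assumes attempts are independent, so the chance of solving a problem within k attempts is 1 - (1 - p)^k. This quantity is commonly called pass@k.

```python
def pass_at_k(success_probs, k):
    """Average chance of solving each problem within k independent attempts."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Invented numbers: the base model is weak but can occasionally solve everything.
base_model = [0.1] * 10
# Invented numbers: the tuned model is strong on 4 problems and never solves the other 6.
rl_tuned_model = [0.9] * 4 + [0.0] * 6

for k in (1, 8, 64):
    base = pass_at_k(base_model, k)
    tuned = pass_at_k(rl_tuned_model, k)
    print(f"k={k:>2}  base={base:.3f}  rl_tuned={tuned:.3f}")
```

With these made-up numbers, the tuned model wins at a single attempt but the base model overtakes it once many attempts are allowed, because the tuned model has lost all coverage of the problems it never solves.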

The Problem with Current RLVR: Limited Exploration

Researchers attribute this limitation to a common component in RLVR: the reverse Kullback-Leibler (KL) divergence regularizer. This penalty term, while useful for stabilizing training, is ‘mode-seeking’: it keeps the trained model confined to the high-probability regions of its original, pre-trained behavior. Under this penalty, the model struggles to assign meaningful probability to solutions that are entirely new or ‘out-of-distribution’, even if those solutions could be highly rewarding.

Imagine an LLM trying to solve a math problem. If the base model assigns almost no probability to a particular type of solution, the reverse KL penalty strongly discourages the trained model from shifting probability toward it, so that solution is unlikely to be discovered no matter how long training runs. This creates a performance ceiling, limiting the model’s ability to develop fundamentally new reasoning strategies.
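A small numerical sketch shows why this happens. It assumes the usual convention in which the reverse KL penalty is KL(policy || reference), and it uses a toy problem with only two candidate solutions; the function names and numbers are illustrative, not from the paper.

```python
import math


def reverse_kl(policy, reference):
    """KL(policy || reference) = sum_i policy_i * log(policy_i / reference_i)."""
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)


def penalty_for_exploring(ref_prob_of_new_solution, mass_moved=0.5):
    """Penalty paid when the policy moves `mass_moved` onto a solution
    that the reference model considers unlikely."""
    eps = ref_prob_of_new_solution
    reference = [1.0 - eps, eps]               # [familiar solution, new solution]
    policy = [1.0 - mass_moved, mass_moved]
    return reverse_kl(policy, reference)


for eps in (1e-1, 1e-3, 1e-6, 1e-9):
    print(f"reference prob {eps:g} -> reverse KL penalty {penalty_for_exploring(eps):.2f}")
```

The penalty keeps growing as the reference probability of the new solution shrinks, so the less the base model believes in a solution, the more the regularizer pushes the trained model away from it.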

Introducing RAPO: Rewards-Aware Policy Optimization

To break through this barrier, a new algorithm called RAPO (Rewards-Aware Policy Optimization) has been proposed by Wenhao Deng, Long Wei, Chenglei Yu, and Tailin Wu from Westlake University. RAPO is designed to promote broader yet focused exploration, allowing LLMs to discover novel solutions.

RAPO introduces two main innovations:

1. Forward KL Divergence for Out-of-Distribution Exploration: Instead of the restrictive reverse KL divergence, RAPO uses a forward KL penalty. Because a forward KL penalty is weighted by the base model’s probabilities rather than the trained model’s, it costs little to place probability on solutions that were initially very unlikely or effectively absent from the base model. This is crucial for discovering truly new reasoning paths and solving problems that were previously intractable.

2. Reward-Aware Reference Policy Reweighting for In-Distribution Exploration: RAPO also dynamically adjusts how the model explores within its existing knowledge. It reweights the ‘reference policy’ (the model’s base behavior) based on the rewards received. If a region of that knowledge consistently yields low rewards, RAPO encourages more exploration there, pushing the model towards new variations. Conversely, in high-reward regions, it preserves the existing successful strategies.
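The difference between the two KL directions in the first point can be checked on a toy example. The sketch below assumes the usual definitions, reverse KL as KL(policy || reference) and forward KL as KL(reference || policy), and uses three invented candidate solutions, the last of which the reference model almost never produces.

```python
import math


def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), skipping terms where p_i is 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Invented distributions over three candidate solutions.
reference = [0.7, 0.3 - 1e-9, 1e-9]   # the third solution is out-of-distribution
explorer = [0.35, 0.15, 0.5]          # a policy that moved half its mass onto it

reverse_penalty = kl(explorer, reference)   # weighted by the policy's probabilities
forward_penalty = kl(reference, explorer)   # weighted by the reference's probabilities

print(f"reverse KL penalty: {reverse_penalty:.2f}")
print(f"forward KL penalty: {forward_penalty:.2f}")
```

In this example the reverse penalty is more than ten times the forward penalty. The forward direction is not free of constraints, though: it penalizes a policy that drops solutions the reference model favors, which is why it is often described as mass-covering.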

By combining these two mechanisms, RAPO enables a more effective and adaptive exploration strategy, both within and beyond the LLM’s initial capabilities.
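The article does not give the formula RAPO uses for reweighting, so the following is only a hypothetical stand-in that captures the described behavior: when recent rewards are low, the reference distribution is flattened so that less-favored options get more weight, and when rewards are high it is left unchanged. The tempering rule, the `floor` parameter, and the function names are all assumptions made for this illustration.

```python
import math


def reweight_reference(reference, mean_reward, floor=0.25):
    """Hypothetical reward-aware reweighting (not the paper's formula).

    Raises each reference probability to an exponent between `floor`
    (mean reward 0, flatter distribution) and 1 (mean reward 1, unchanged),
    then renormalizes.
    """
    mean_reward = min(1.0, max(0.0, mean_reward))   # clamp to [0, 1]
    exponent = floor + (1.0 - floor) * mean_reward
    powered = [q ** exponent for q in reference]
    total = sum(powered)
    return [v / total for v in powered]


def entropy(dist):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


reference = [0.85, 0.10, 0.05]   # invented reference probabilities
low_reward_ref = reweight_reference(reference, mean_reward=0.0)
high_reward_ref = reweight_reference(reference, mean_reward=1.0)

print("low reward :", [round(p, 3) for p in low_reward_ref], round(entropy(low_reward_ref), 3))
print("high reward:", [round(p, 3) for p in high_reward_ref], round(entropy(high_reward_ref), 3))
```

A flatter reference in low-reward regions gives the trained model more room to try alternatives there, while an unchanged reference in high-reward regions keeps it anchored to strategies that already work.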

Experimental Success on Challenging Math Problems

The effectiveness of RAPO was tested by training Qwen2.5 models (3B and 7B parameters) on a dataset of 8,000 mathematical problems. These models were then evaluated on challenging benchmarks like AIME2024 and AIME2025, which feature complex math questions.

The results were highly promising. RAPO consistently improved problem-solving performance, significantly outperforming traditional RLVR approaches. Crucially, RAPO-trained models were able to surpass the performance limits of their base models and successfully solve problems that the base models couldn’t tackle at all, even with many attempts. This demonstrates RAPO’s ability to truly unlock new reasoning strategies.

For a deeper dive into the technical details and experimental data, you can read the full research paper: Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration.

Looking Ahead

While RAPO marks a significant step forward, the researchers acknowledge some limitations. Its biggest advantages are seen with larger sampling budgets, meaning it might be less efficient when only a few attempts are allowed. Future work will focus on improving its efficiency for limited sampling scenarios and exploring its applicability to other domains beyond mathematical reasoning, especially those with less structured reward systems.
