TLDR: AIRL-S is a new framework that unifies reinforcement learning (RL) and search-based techniques for improving Large Language Models (LLMs) on complex reasoning tasks. It learns a “process reward model” (PRM) directly from successful reasoning examples during RL training, eliminating the need for expensive step-level labels. This PRM then guides search procedures, leading to more robust reasoning and mitigating issues like reward hacking. Across eight benchmarks, it improves the base model by an average of 9%, matching GPT-4o.
Large Language Models (LLMs) have shown incredible capabilities, but tackling complex reasoning tasks, like advanced math or scientific problems, remains a significant challenge. Traditionally, two main approaches have been used to improve LLMs at “test-time scaling” (TTS) – essentially, making them better at reasoning when they’re actually being used. These are reinforcement learning (RL) methods and search-based techniques.
RL methods, while powerful, often struggle with stability and efficiency because they rely on sparse, outcome-based rewards. Imagine only getting a reward if you solve the entire problem correctly, without any feedback on the steps you took. This makes learning difficult. On the other hand, search-based techniques use pre-trained “process reward models” (PRMs) to guide their exploration of possible solutions. The problem here is that training these PRMs requires a lot of expensive human- or LLM-generated labels for each intermediate step, and they can become less effective if the problem type changes.
Introducing AIRL-S: A Unified Approach
A new research paper titled “Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS” introduces AIRL-S, a novel framework that brings together the best of both worlds. The core idea behind AIRL-S is simple yet profound: the reward function that an LLM learns during its reinforcement learning training can inherently serve as the ideal PRM for guiding its reasoning process during search.
Instead of needing separate, costly labeled data for a PRM, AIRL-S leverages a technique called Adversarial Inverse Reinforcement Learning (AIRL) combined with Group Relative Policy Optimization (GRPO). This allows the system to learn a detailed, dynamic PRM directly from successful reasoning examples, completely eliminating the need for human-labeled intermediate steps. This is a significant cost-saving and efficiency improvement.
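To make the two ingredients more concrete, here is a minimal sketch of the textbook forms of each: the standard AIRL discriminator, D = exp(f) / (exp(f) + π(a|s)), and the group-relative advantage used in GRPO. The function names, the use of population standard deviation, and the `eps` constant are assumptions made for illustration; the article does not specify how the paper implements either piece.

```python
import math
from statistics import mean, pstdev


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def airl_discriminator(f_value: float, log_pi: float) -> float:
    """Standard AIRL discriminator D = exp(f) / (exp(f) + pi(a|s)).

    Algebraically this equals sigmoid(f - log pi). Here f_value is the
    learned reward estimate for a step and log_pi is the policy's
    log-probability of that step (both hypothetical inputs).
    """
    return _sigmoid(f_value - log_pi)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each sampled response's reward is
    normalized by the mean and standard deviation of its own group.

    Assumption: population std and eps=1e-8; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a step the discriminator finds more plausible than the policy's own probability suggests gets a score above 0.5, and responses that beat their group's average reward get a positive advantage.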
How AIRL-S Works
During the training phase, AIRL-S learns a step-by-step PRM. This PRM acts like a judge, distinguishing between good and bad reasoning steps. The LLM’s policy is then updated to generate steps that are deemed “good” by this PRM. At the same time, it also incorporates outcome-based rewards, ensuring the final answer is correct. This dual feedback mechanism helps the LLM learn more robustly.
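The dual feedback described above can be sketched roughly as follows. This is an illustration only: the article says that step-level PRM feedback and an outcome reward are both used, but not how they are combined, so the averaging, the `outcome_weight` parameter, and the cross-entropy objective for the "judge" are all assumptions.

```python
import math


def discriminator_loss(d_reference, d_policy, eps=1e-12):
    """Binary cross-entropy for the step-level judge (assumed objective).

    d_reference: judge outputs in (0, 1) on steps from successful examples.
    d_policy:    judge outputs in (0, 1) on steps generated by the policy.
    The loss pushes the first group toward 1 and the second toward 0.
    """
    loss_ref = -sum(math.log(d + eps) for d in d_reference) / len(d_reference)
    loss_pol = -sum(math.log(1.0 - d + eps) for d in d_policy) / len(d_policy)
    return loss_ref + loss_pol


def combined_reward(step_scores, answer_correct, outcome_weight=1.0):
    """Hypothetical mix of process and outcome feedback.

    Process term: mean PRM score over the reasoning steps.
    Outcome term: 1.0 if the final answer is correct, else 0.0.
    """
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = 1.0 if answer_correct else 0.0
    return process + outcome_weight * outcome
```

For example, `combined_reward([0.5, 1.0], True)` gives 0.75 from the process term plus 1.0 from the outcome term, so a chain with plausible steps and a correct answer scores higher than one that only gets the answer right.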
At inference time, when the LLM is applied to complex problems, the learned PRM plays a dual role. It acts as a “critic” for the RL-trained model’s rollouts, helping the LLM generate better reasoning chains. Crucially, it also serves as a “heuristic” to guide various search procedures, such as Monte Carlo Tree Search (MCTS), Beam Search, and Best-of-N sampling. This means the PRM helps the LLM explore different reasoning paths more effectively, select the most promising ones, and avoid common pitfalls like “reward hacking” (where the model finds shortcuts that maximize reward without truly solving the problem).
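As a rough illustration of how a PRM can act as a search heuristic, the sketch below implements Best-of-N selection and a simple beam search over reasoning steps. The `PRM` callable, the `expand` function, and the toy stand-ins at the bottom are hypothetical; a real system would call the learned reward model and the LLM instead, and the paper's actual search code may differ.

```python
from typing import Callable, List, Sequence

Step = str
Chain = List[Step]
PRM = Callable[[Chain], float]  # hypothetical: scores a partial or full chain


def best_of_n(candidates: Sequence[Chain], prm: PRM) -> Chain:
    """Best-of-N: sample N complete chains, keep the one the PRM scores highest."""
    return max(candidates, key=prm)


def beam_search(
    expand: Callable[[Chain], List[Step]],
    prm: PRM,
    beam_width: int,
    depth: int,
) -> Chain:
    """Beam search: at each depth, extend every kept chain by each proposed
    next step, then keep only the beam_width chains the PRM scores highest."""
    beams: List[Chain] = [[]]
    for _ in range(depth):
        extended = [chain + [step] for chain in beams for step in expand(chain)]
        if not extended:
            break  # no chain could be extended further
        beams = sorted(extended, key=prm, reverse=True)[:beam_width]
    return beams[0]


# Toy stand-ins so the sketch runs without a model (assumptions only).
def toy_expand(chain: Chain) -> List[Step]:
    return ["good", "bad"]


def toy_prm(chain: Chain) -> float:
    return float(sum(step == "good" for step in chain))


best_chain = beam_search(toy_expand, toy_prm, beam_width=2, depth=3)
```

With the toy scorer, the search keeps extending the chains made of "good" steps and discards the rest. Swapping in a different heuristic only requires passing a different `prm` callable, which is the sense in which one learned reward function can serve several search procedures.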
Impressive Results Across Diverse Tasks
The researchers evaluated AIRL-S across eight different benchmarks, covering a wide range of tasks including mathematics, scientific reasoning, and code generation. The unified approach improved the base model’s performance by an average of 9%, matching the performance of GPT-4o.
Furthermore, when the AIRL-S PRM was integrated into multiple search algorithms, it consistently outperformed all baseline PRMs that were trained using traditional labeled data. This highlights the generalizability and effectiveness of the AIRL-S approach and supports the paper’s central claim that the reward function learned during RL training is an effective guide for search-based reasoning.
This work represents a significant step forward in making LLMs more capable and cost-effective for complex reasoning tasks, offering a robust solution that unifies two previously separate paradigms in test-time scaling.


