
RoRecomp: Making LLMs Reason More Concisely and Efficiently

TL;DR: RoRecomp is a new plug-and-play method that improves the reasoning efficiency of Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR). It addresses the problem of verbose, inefficient responses by strategically recomposing training data into ‘priority batches’ (short-correct and long-incorrect responses) and ‘compensation batches’ (the remaining responses). This provides a clearer optimization signal for brevity without altering the reward function. RoRecomp has been shown to reduce reasoning length by up to 52.5% and unnecessary tool calls by 46.8% across various tasks, with minimal impact on performance, offering a stable way to build more concise yet capable reasoning models.

Large Language Models (LLMs) have shown incredible capabilities in complex reasoning, especially when trained with Reinforcement Learning with Verifiable Rewards (RLVR). This approach helps LLMs tackle intricate problems by rewarding them for correct outcomes. However, a significant challenge with standard RLVR training is that it often leads to overly verbose responses and inefficient exploration. Imagine an LLM trying to solve a math problem; instead of a concise solution, it might generate a very long, winding thought process, or an agent using tools might make many unnecessary calls before finding an answer. This verbosity arises because current reward systems primarily focus on the final outcome, offering no direct incentive for efficiency or brevity.

The core issue, as identified by a new research paper, stems from two main problems: high variance in estimating rewards and an inherent bias in some RL algorithms that can actually encourage longer, even incorrect, responses. When LLMs are trained with small groups of responses, the reward signals can be noisy, making it hard for the model to learn what truly efficient reasoning looks like. This often pushes the training process towards generating more verbose outputs rather than concise, accurate ones.
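
To see why small rollout groups hurt, consider the group-relative baseline that popular RLVR algorithms such as GRPO use to turn binary correctness rewards into advantages. The short sketch below is our illustration, not code from the paper; the 30% solve rate and the group sizes are assumptions.

```python
# Illustration only (not from the paper): group-relative advantage
# estimation of the kind used in GRPO-style RLVR, and how its baseline
# gets noisier as the rollout group shrinks. The 30% solve rate and the
# group sizes below are assumptions.
import random
import statistics

def group_advantages(rewards):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]

random.seed(0)
for group_size in (4, 16, 64):
    # Binary verifiable reward (1 = correct) with an assumed 30% solve rate;
    # each trial estimates the baseline (mean reward) from one rollout group.
    baselines = [
        statistics.mean(random.choices([1, 0], weights=[3, 7], k=group_size))
        for _ in range(2000)
    ]
    print(f"group={group_size:3d}  baseline std: {statistics.pstdev(baselines):.3f}")

# One small group: a single lucky correct answer dominates the signal.
print(group_advantages([1, 0, 0, 0]))
```

With groups of 4, the estimated baseline swings roughly four times as much as with groups of 64, so whether a given response looks ‘good’ depends heavily on which responses it happened to be grouped with; that noise is one opening through which length bias can creep in.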

Introducing RoRecomp: A Smarter Way to Train LLMs

To tackle this, researchers from Tencent Youtu Lab, Fudan University, and Nankai University have proposed a novel method called Rollout Response Recomposition, or RoRecomp. This isn’t a complex new algorithm, but rather a clever, plug-and-play approach that guides LLMs towards more concise reasoning by strategically reorganizing the training data itself. Instead of changing how rewards are calculated, RoRecomp changes *what* data the model learns from at each step.

RoRecomp works by separating rollout responses into two distinct types of batches for training (a simplified sketch follows the list):

  • Priority Batches: These are the stars of the show. They combine responses that are both short and correct with those that are long and incorrect. By focusing the model’s attention on these contrasting examples, RoRecomp provides a very clear signal: be concise and correct, and avoid verbose errors. This helps the model understand the value of brevity directly.
  • Compensation Batches: To ensure the model remains stable and doesn’t ‘forget’ its broader reasoning abilities, RoRecomp uses a replay buffer to store the remaining, intermediate-length responses. These are periodically used in compensation batches, acting as a regularizer to maintain overall performance and prevent the model from collapsing or becoming too narrowly focused. A dynamic schedule gradually reduces the frequency of these compensation updates, further refining the model’s ability to balance brevity and accuracy over time.
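
Concretely, the recomposition step can be pictured as a simple partition over each rollout group. The sketch below is a minimal reading of the idea, not the authors’ code: the median-length split, the buffer size, and the decay schedule are illustrative assumptions.

```python
# Minimal sketch of RoRecomp-style batch recomposition (our reading of the
# idea, not the authors' code). The median split, buffer size, and decay
# schedule are illustrative assumptions.
from collections import deque
import statistics

replay_buffer = deque(maxlen=4096)  # stores intermediate-length responses

def recompose(rollouts):
    """Split one rollout group into a priority batch plus buffered leftovers.

    Each rollout is a dict: {"tokens": <response length>, "correct": <bool>}.
    """
    median_len = statistics.median(r["tokens"] for r in rollouts)
    priority, leftovers = [], []
    for r in rollouts:
        short = r["tokens"] <= median_len
        # Priority batch: short-correct plus long-incorrect responses, the
        # contrasting pair that gives the clearest "be brief and right" signal.
        if (short and r["correct"]) or (not short and not r["correct"]):
            priority.append(r)
        else:
            leftovers.append(r)
    replay_buffer.extend(leftovers)  # intermediate cases feed compensation
    return priority

def compensation_due(step, base_interval=4, decay_every=500):
    """Dynamic schedule: compensation batches fire ever less frequently."""
    interval = base_interval + step // decay_every
    return step % interval == 0 and len(replay_buffer) > 0
```

Training then alternates between the two: every step updates on the priority batch, and whenever compensation_due(step) fires a batch is drawn from replay_buffer, so the model keeps seeing intermediate-length responses and its broader reasoning ability does not collapse.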

Impressive Results Across Diverse Scenarios

The effectiveness of RoRecomp was rigorously tested across three different settings, demonstrating substantial efficiency gains with minimal impact on performance:

  • Zero RL Training: In scenarios where RL is applied to base models to encourage efficient reasoning, RoRecomp reduced reasoning length by an impressive 27.7%. For instance, on the Minerva Math benchmark, it cut length by 41.7% while actually improving accuracy.
  • Agentic RL Training: For LLMs equipped with tools (like search engines) to solve problems, RoRecomp significantly enhanced search efficiency. It reduced unnecessary tool calls by 46.8% while simultaneously improving the F1 score (a measure of accuracy). This means the LLM used its tools more strategically and effectively.
  • Thinking Compression: When applied to compress the verbose reasoning of existing powerful reasoning models, RoRecomp delivered its largest gains: up to a 52.5% reduction in output length. On the DeepSeek-1.5B model, it cut average response length by 52.5% with only a minimal accuracy drop, and even on the strong Qwen3-8B model it trimmed length by 26.4% while marginally improving accuracy.

An interesting finding from the research is that RoRecomp primarily streamlines the ‘self-verification’ phase of an LLM’s reasoning process. While problem-understanding steps saw a more modest reduction, the number of self-verification steps and the tokens spent on them were drastically cut, suggesting that much of the lengthy self-correction in standard models is redundant. This indicates that RoRecomp encourages the model to invest more in understanding the problem upfront, leading to more direct and efficient solutions.

Unlike reward shaping methods, which modify the reward function itself and can be tricky to calibrate, RoRecomp intervenes at the data composition level. This makes it a more stable and simpler alternative for enhancing reasoning efficiency. The paper, available at arXiv:2509.25958, highlights that data composition is a powerful, yet often overlooked, lever for optimizing efficiency in LLMs.
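
For contrast, here is what the reward-shaping route looks like, with a hypothetical length penalty (the weight is made up); RoRecomp sidesteps this calibration problem entirely because it leaves the verifiable reward untouched.

```python
# Hypothetical reward shaping for brevity: the penalty weight below is
# made up, and mis-tuning it either barely shortens outputs or starts
# punishing correct-but-necessarily-long solutions.
def shaped_reward(correct: bool, tokens: int, penalty: float = 1e-4) -> float:
    return float(correct) - penalty * tokens

# RoRecomp keeps the reward binary and untouched...
def verifiable_reward(correct: bool) -> float:
    return float(correct)
# ...and moves the brevity pressure into which responses share a batch.
```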

In conclusion, RoRecomp offers a practical and effective solution to the problem of verbosity in LLM reasoning. By intelligently recomposing training data, it guides models to be more concise and efficient without sacrificing their problem-solving capabilities, paving the way for more streamlined and powerful AI agents.

