
RoRecomp: Making LLMs Reason More Concisely and Efficiently

TL;DR: RoRecomp is a new plug-and-play method that improves the reasoning efficiency of Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR). It addresses the problem of verbose, inefficient responses by strategically recomposing training data into ‘priority batches’ (short-correct and long-incorrect responses) and ‘compensation batches’ (the remaining responses). This provides a clearer optimization signal for brevity without altering the reward function. RoRecomp has been shown to reduce reasoning length by up to 52.5% and unnecessary tool calls by 46.8% across various tasks, with minimal impact on performance, offering a stable way to build more concise yet capable reasoning models.

Large Language Models (LLMs) have shown incredible capabilities in complex reasoning, especially when trained with Reinforcement Learning with Verifiable Rewards (RLVR). This approach helps LLMs tackle intricate problems by rewarding them for correct outcomes. However, a significant challenge with standard RLVR training is that it often leads to overly verbose responses and inefficient exploration. Imagine an LLM trying to solve a math problem; instead of a concise solution, it might generate a very long, winding thought process, or an agent using tools might make many unnecessary calls before finding an answer. This verbosity arises because current reward systems primarily focus on the final outcome, offering no direct incentive for efficiency or brevity.

The core issue, as identified by a new research paper, stems from two main problems: high variance in estimating rewards and an inherent bias in some RL algorithms that can actually encourage longer, even incorrect, responses. When LLMs are trained with small groups of responses, the reward signals can be noisy, making it hard for the model to learn what truly efficient reasoning looks like. This often pushes the training process towards generating more verbose outputs rather than concise, accurate ones.
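
To see why small rollout groups hurt, consider the group-relative baseline that popular RLVR algorithms such as GRPO use to turn binary correctness rewards into advantages. The short sketch below is our illustration, not code from the paper; the 30% solve rate and the group sizes are assumptions.

```python
# Illustration only (not from the paper): group-relative advantage
# estimation of the kind used in GRPO-style RLVR, and how its baseline
# gets noisier as the rollout group shrinks. The 30% solve rate and the
# group sizes below are assumptions.
import random
import statistics

def group_advantages(rewards):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]

random.seed(0)
for group_size in (4, 16, 64):
    # Binary verifiable reward (1 = correct) with an assumed 30% solve rate;
    # each trial estimates the baseline (mean reward) from one rollout group.
    baselines = [
        statistics.mean(random.choices([1, 0], weights=[3, 7], k=group_size))
        for _ in range(2000)
    ]
    print(f"group={group_size:3d}  baseline std: {statistics.pstdev(baselines):.3f}")

# One small group: a single lucky correct answer dominates the signal.
print(group_advantages([1, 0, 0, 0]))
```

With groups of 4, the estimated baseline swings roughly four times as much as with groups of 64, so whether a given response looks ‘good’ depends heavily on which responses it happened to be grouped with; that noise is one opening through which length bias can creep in.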

Introducing RoRecomp: A Smarter Way to Train LLMs

To tackle this, researchers from Tencent Youtu Lab, Fudan University, and Nankai University have proposed a novel method called Rollout Response Recomposition, or RoRecomp. This isn’t a complex new algorithm, but rather a clever, plug-and-play approach that guides LLMs towards more concise reasoning by strategically reorganizing the training data itself. Instead of changing how rewards are calculated, RoRecomp changes *what* data the model learns from at each step.

RoRecomp works by separating rollout responses into two distinct types of batches for training (a simplified sketch follows the list):

  • Priority Batches: These are the stars of the show. They combine responses that are both short and correct with those that are long and incorrect. By focusing the model’s attention on these contrasting examples, RoRecomp provides a very clear signal: be concise and correct, and avoid verbose errors. This helps the model understand the value of brevity directly.
  • Compensation Batches: To ensure the model remains stable and doesn’t ‘forget’ its broader reasoning abilities, RoRecomp uses a replay buffer to store the remaining, intermediate-length responses. These are periodically used in compensation batches, acting as a regularizer to maintain overall performance and prevent the model from collapsing or becoming too narrowly focused. A dynamic schedule gradually reduces the frequency of these compensation updates, further refining the model’s ability to balance brevity and accuracy over time.
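
Concretely, the recomposition step can be pictured as a simple partition over each rollout group. The sketch below is a minimal reading of the idea, not the authors’ code: the median-length split, the buffer size, and the decay schedule are illustrative assumptions.

```python
# Minimal sketch of RoRecomp-style batch recomposition (our reading of the
# idea, not the authors' code). The median split, buffer size, and decay
# schedule are illustrative assumptions.
from collections import deque
import statistics

replay_buffer = deque(maxlen=4096)  # stores intermediate-length responses

def recompose(rollouts):
    """Split one rollout group into a priority batch plus buffered leftovers.

    Each rollout is a dict: {"tokens": <response length>, "correct": <bool>}.
    """
    median_len = statistics.median(r["tokens"] for r in rollouts)
    priority, leftovers = [], []
    for r in rollouts:
        short = r["tokens"] <= median_len
        # Priority batch: short-correct plus long-incorrect responses, the
        # contrasting pair that gives the clearest "be brief and right" signal.
        if (short and r["correct"]) or (not short and not r["correct"]):
            priority.append(r)
        else:
            leftovers.append(r)
    replay_buffer.extend(leftovers)  # intermediate cases feed compensation
    return priority

def compensation_due(step, base_interval=4, decay_every=500):
    """Dynamic schedule: compensation batches fire ever less frequently."""
    interval = base_interval + step // decay_every
    return step % interval == 0 and len(replay_buffer) > 0
```

Training then alternates between the two: every step updates on the priority batch, and whenever compensation_due(step) fires a batch is drawn from replay_buffer, so the model keeps seeing intermediate-length responses and its broader reasoning ability does not collapse.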

Impressive Results Across Diverse Scenarios

The effectiveness of RoRecomp was rigorously tested across three different settings, demonstrating substantial efficiency gains with minimal impact on performance:

  • Zero RL Training: In scenarios where RL is applied to base models to encourage efficient reasoning, RoRecomp reduced reasoning length by an impressive 27.7%. For instance, on the Minerva Math benchmark, it cut length by 41.7% while actually improving accuracy.
  • Agentic RL Training: For LLMs equipped with tools (like search engines) to solve problems, RoRecomp significantly enhanced search efficiency. It reduced unnecessary tool calls by 46.8% while simultaneously improving the F1 score (a measure of accuracy). This means the LLM used its tools more strategically and effectively.
  • Thinking Compression: When applied to compress the verbose reasoning of existing powerful reasoning models, RoRecomp delivered its largest gains: up to a 52.5% reduction in output length. On the DeepSeek-1.5B model, it cut average response length by 52.5% with only a minimal accuracy drop, and even on the strong Qwen3-8B model it trimmed length by 26.4% while marginally improving accuracy.

An interesting finding from the research is that RoRecomp primarily streamlines the ‘self-verification’ phase of an LLM’s reasoning process. While problem-understanding steps saw a more modest reduction, the number of self-verification steps and the tokens spent on them were drastically cut, suggesting that much of the lengthy self-correction in standard models is redundant. This indicates that RoRecomp encourages the model to invest more in understanding the problem upfront, leading to more direct and efficient solutions.

Unlike reward shaping methods, which modify the reward function itself and can be tricky to calibrate, RoRecomp intervenes at the data composition level. This makes it a more stable and simpler alternative for enhancing reasoning efficiency. The paper, available at arXiv:2509.25958, highlights that data composition is a powerful, yet often overlooked, lever for optimizing efficiency in LLMs.
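
For contrast, here is what the reward-shaping route looks like, with a hypothetical length penalty (the weight is made up); RoRecomp sidesteps this calibration problem entirely because it leaves the verifiable reward untouched.

```python
# Hypothetical reward shaping for brevity: the penalty weight below is
# made up, and mis-tuning it either barely shortens outputs or starts
# punishing correct-but-necessarily-long solutions.
def shaped_reward(correct: bool, tokens: int, penalty: float = 1e-4) -> float:
    return float(correct) - penalty * tokens

# RoRecomp keeps the reward binary and untouched...
def verifiable_reward(correct: bool) -> float:
    return float(correct)
# ...and moves the brevity pressure into which responses share a batch.
```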

In conclusion, RoRecomp offers a practical and effective solution to the problem of verbosity in LLM reasoning. By intelligently recomposing training data, it guides models to be more concise and efficient without sacrificing their problem-solving capabilities, paving the way for more streamlined and powerful AI agents.

