TLDR: This research introduces Length Controlled Preference Optimization (LCPO), a novel method to significantly reduce the output length of Large Reasoning Models (LRMs) without sacrificing their reasoning performance. By analyzing and filtering reasoning paths and using a specialized preference optimization technique, LCPO achieves over 50% length reduction across various math benchmarks with minimal training data, addressing issues of high computational cost and ‘overthinking’ in current LRMs.
Large Reasoning Models (LRMs) have shown impressive capabilities in tackling complex problems by generating detailed, step-by-step thought processes, often referred to as Chain-of-Thought (CoT) reasoning. While effective, this approach frequently leads to extremely long outputs, which can be computationally expensive and sometimes even result in the model ‘overthinking’ simple tasks, producing redundant or incorrect information.
Current efforts to make these models more efficient often involve a trade-off: either reasoning quality is compromised, or extensive computational resources are required for training. This paper, titled ‘Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization,’ addresses these challenges head-on.
The Problem with Lengthy Reasoning
Imagine an LRM solving a relatively easy math problem, yet it generates thousands of tokens to arrive at the answer. This isn’t just inefficient; it significantly increases the computational and memory demands, limiting how these powerful models can be used in real-world applications. Moreover, overly long outputs can indicate ‘overthinking,’ where the model expends unnecessary effort on simple queries, sometimes leading to errors.
Introducing Length Controlled Preference Optimization (LCPO)
Researchers Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, and Mengdi Zhang propose a new method called Length Controlled Preference Optimization (LCPO). Their approach focuses on finding a balance between effective reasoning and efficiency by reducing the length of the generated outputs.
LCPO first analyzes the ‘generation space’ of LRMs to identify inherently shorter, yet equally effective, reasoning paths. The authors do this by generating multiple outputs for a given problem and then filtering these ‘trajectories’ based on an estimate of problem difficulty. The result is a dataset of concise, high-quality reasoning examples.
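To make the sampling-and-filtering idea concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the `Trajectory` class, the difficulty proxy (the fraction of sampled outputs that are wrong), the `max_difficulty` threshold, and the choice of pairing the shortest correct output against the longest correct one are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    """One sampled model output for a problem (hypothetical container)."""
    text: str
    num_tokens: int
    correct: bool


def estimate_difficulty(samples: List[Trajectory]) -> float:
    """Assumed difficulty proxy: fraction of sampled outputs that are wrong."""
    return sum(not s.correct for s in samples) / len(samples)


def build_pair(
    samples: List[Trajectory], max_difficulty: float = 0.5
) -> Optional[Tuple[Trajectory, Trajectory]]:
    """Return (chosen, rejected) = (shortest correct, longest correct), or None.

    Problems that look too hard, or that lack two correct outputs of
    different lengths, are skipped.
    """
    if not samples or estimate_difficulty(samples) > max_difficulty:
        return None
    correct = sorted((s for s in samples if s.correct), key=lambda s: s.num_tokens)
    if len(correct) < 2 or correct[0].num_tokens == correct[-1].num_tokens:
        return None
    return correct[0], correct[-1]


# Made-up samples for a single problem.
samples = [
    Trajectory("short solution", 120, True),
    Trajectory("long solution", 900, True),
    Trajectory("wrong solution", 400, False),
    Trajectory("medium solution", 300, True),
]
pair = build_pair(samples)
```

In this toy example, one of four samples is wrong, so the assumed difficulty is 0.25 and the problem is kept. The 120-token output becomes the preferred example and the 900-token output the dispreferred one.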
Next, LCPO uses a technique called ‘preference optimization.’ Unlike complex online reinforcement learning methods that demand vast resources, LCPO operates in an ‘offline’ manner, making it much more efficient. The core innovation in LCPO lies in how it balances the implicit reward associated with the model’s negative log-likelihood (NLL) loss, enabling it to effectively learn length preferences even with very limited training data.
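This article does not spell out LCPO's exact objective, so the sketch below shows something more generic: a standard DPO-style offline preference loss for one (shorter, longer) pair, plus a length-normalized NLL term on the preferred output. The function name, the `beta` and `nll_weight` parameters, and the way the two terms are combined are assumptions for illustration, not the paper's formulation.

```python
import math


def dpo_with_nll_loss(
    logp_chosen: float,        # policy log-prob of the preferred (shorter) output
    logp_rejected: float,      # policy log-prob of the dispreferred (longer) output
    ref_logp_chosen: float,    # frozen reference-model log-prob of the preferred output
    ref_logp_rejected: float,  # frozen reference-model log-prob of the dispreferred output
    chosen_len: int,           # token count of the preferred output
    beta: float = 0.1,         # assumed strength of the implicit reward
    nll_weight: float = 1.0,   # assumed weight on the NLL term
) -> float:
    # Implicit reward margin: how much more the policy favors the chosen
    # output over the rejected one, relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written with log1p.
    preference_loss = math.log1p(math.exp(-margin))
    # Per-token negative log-likelihood of the chosen output.
    nll_loss = -logp_chosen / chosen_len
    return preference_loss + nll_weight * nll_loss


# Policy identical to the reference: the margin is zero.
baseline = dpo_with_nll_loss(-50.0, -60.0, -50.0, -60.0, chosen_len=100)
# Policy has shifted probability toward the shorter output.
improved = dpo_with_nll_loss(-40.0, -60.0, -50.0, -60.0, chosen_len=100)
```

Because everything is computed from pre-collected pairs and a frozen reference model, no new samples need to be generated during training, which is what makes this kind of method ‘offline.’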
Remarkable Results and Efficiency
The experiments conducted using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models across six different math reasoning benchmarks (including MATH-500 and GSM8K) yielded impressive results. LCPO successfully reduced the average output length by over 50% across most benchmarks, all while maintaining the original model’s reasoning performance. For instance, on MATH-500, the average output length was reduced by 57.07% while accuracy was largely preserved.
What’s particularly noteworthy is LCPO’s efficiency. It requires only about 800 training samples and just 50 training steps, far less than previous methods, which often need hundreds of thousands of samples and many more steps. This makes LCPO a practical option for fine-tuning LRMs.
The research also highlights that LCPO can adaptively provide smaller length reductions for tasks where the model’s reasoning mode is less variable, ensuring valuable information is not lost. Furthermore, the method demonstrates strong generalizability, effectively reducing output length even in out-of-distribution scenarios like the MMLU dataset, which covers diverse subjects beyond math.
Interestingly, LCPO also helps address the ‘overthinking’ phenomenon. For easier problems, LRMs sometimes generate disproportionately long outputs. After training with LCPO, the average generation length becomes positively correlated with difficulty: easier problems receive shorter, more appropriate responses, and in some cases accuracy on these simpler tasks even improves.
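One simple way to check such a relationship is to compute the Pearson correlation between problem difficulty and average output length. The sketch below does this in plain Python. The difficulty levels and token counts are made-up numbers chosen only to show the calculation; they are not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: difficulty level and average tokens per level.
difficulty = [1, 2, 3, 4, 5]
lengths_before = [2100, 1900, 2000, 2200, 2050]  # roughly flat: easy problems get long outputs
lengths_after = [400, 650, 900, 1300, 1700]      # length grows with difficulty

r_before = pearson(difficulty, lengths_before)
r_after = pearson(difficulty, lengths_after)
```

With these invented numbers, `r_after` is close to 1 while `r_before` is much weaker, which is the pattern the article describes for a model that no longer overthinks easy problems.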
This work represents a significant step towards making powerful Large Reasoning Models more efficient and practical for a wider range of applications. Full details are in the paper, ‘Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization.’


