TLDR: PEAR (Phase Entropy Aware Reward) is a reward mechanism for Large Reasoning Models (LRMs) that targets excessively long explanations. It builds on an observed correlation between a model’s predictive uncertainty (entropy) and its response length: PEAR penalizes high entropy during the initial ‘thinking phase’ to encourage conciseness, while allowing flexibility in the ‘final answer phase’ to maintain accuracy. On four mathematical reasoning benchmarks, this reduces response length by 37.8% to 59.4% with accuracy decreases of less than 1%, and the approach also holds up on out-of-distribution tasks.
Large Reasoning Models (LRMs) have shown impressive capabilities in tackling complex reasoning tasks, often by generating detailed step-by-step explanations, known as Chain-of-Thought (CoT). While powerful, these explanations can become excessively long, filled with redundant steps that increase computational costs and make the models less user-friendly. The challenge has been to make these models more concise without sacrificing their accuracy.
A recent research paper, PEAR: Phase Entropy Aware Reward for Efficient Reasoning, by Chen Huang, Wei Lu, and Wenxuan Zhang, introduces an innovative solution to this problem. Their work is based on a crucial observation: there’s a consistent link between a model’s internal uncertainty, measured by ‘entropy,’ and the length of its responses at different stages of reasoning.
Understanding Entropy in Reasoning
The researchers found that during the ‘thinking phase’ – where the model explores various reasoning paths – the entropy is typically high. This reflects an exploratory behavior, often leading to longer, more diverse responses. In contrast, the ‘final answer phase’ exhibits lower entropy, indicating a more confident and deterministic solution. This insight suggests that entropy at different reasoning stages can act as a powerful control mechanism to balance conciseness and performance.
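To make the per-phase comparison concrete, here is a minimal sketch of how token-level entropy could be averaged separately over the two phases. It assumes the phases are separated by a `</think>` delimiter token and that the next-token probability distribution at each step is available; the function names and the toy distributions are illustrative and not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def phase_mean_entropy(tokens, dists, end_think="</think>"):
    """Return (mean thinking-phase entropy, mean answer-phase entropy).

    Assumption: everything before `end_think` is the thinking phase and
    everything after it is the final answer phase. If the delimiter is
    missing, the whole response counts as thinking.
    """
    split = tokens.index(end_think) if end_think in tokens else len(tokens)
    entropies = [token_entropy(d) for d in dists]
    return mean(entropies[:split]), mean(entropies[split + 1:])


# Toy example with made-up distributions (one per generated token).
tokens = ["Let", "me", "try", "</think>", "42"]
dists = [[0.5, 0.5], [0.25] * 4, [0.5, 0.5], [1.0], [0.9, 0.1]]
think_h, answer_h = phase_mean_entropy(tokens, dists)
print(f"thinking: {think_h:.3f} nats, answer: {answer_h:.3f} nats")
```

In this toy input the thinking tokens are drawn from flatter distributions than the answer token, so the thinking-phase mean comes out higher, which is the pattern the researchers describe.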
Introducing PEAR: A New Reward Mechanism
Based on this understanding, the paper proposes Phase Entropy Aware Reward (PEAR). This is a novel reward mechanism that incorporates phase-dependent entropy into the model’s training process. Instead of treating all parts of a generated response uniformly, PEAR applies different penalties based on the reasoning phase:
- It penalizes excessive entropy during the ‘thinking phase.’ This encourages models to generate more focused and efficient reasoning traces, cutting down on unnecessary exploration.
- It allows for moderate exploration (or even a slight increase in entropy) during the ‘final answer phase.’ This helps maintain flexibility and completeness in the final solution, ensuring accuracy isn’t compromised.
This adaptive control over response length doesn’t rely on explicit length targets or rigid truncation rules, making it a more flexible and model-driven approach.
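The idea of phase-dependent penalties can be sketched as a simple reward function. This is not the paper's exact formulation: the function name, the weights `w_think` and `w_answer`, and the clipping at zero are hypothetical choices made here only to illustrate how a penalty on thinking-phase entropy can coexist with tolerance for answer-phase entropy.

```python
def phase_entropy_reward(correct, h_think, h_answer,
                         w_think=0.5, w_answer=0.1):
    """Illustrative phase-aware reward (hypothetical form and weights).

    correct  -- whether the final answer is right
    h_think  -- mean token entropy over the thinking phase
    h_answer -- mean token entropy over the final answer phase
    """
    base = 1.0 if correct else 0.0
    # Thinking-phase entropy is penalized; answer-phase entropy offsets
    # part of that penalty, so moderate exploration there is not punished.
    penalty = max(0.0, w_think * h_think - w_answer * h_answer)
    return base - penalty


# Two correct responses with the same answer-phase entropy:
focused = phase_entropy_reward(True, h_think=0.2, h_answer=0.3)
rambling = phase_entropy_reward(True, h_think=1.0, h_answer=0.3)
print(focused > rambling)  # the more focused reasoning trace scores higher
```

Note that no length target appears anywhere in the function: in this sketch, shorter responses would only emerge indirectly, through the pressure toward lower thinking-phase entropy.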
Impressive Results and Generalization
Experiments covered four widely used reasoning benchmarks: GSM8K, MATH500, AIME24, and AMC23. Across them, PEAR reduced response length by 37.8% to 59.4% while keeping accuracy competitive, with decreases of less than 1%. This indicates that guiding models to lower their entropy during the thinking phase removes redundant reasoning steps without sacrificing correctness.
Furthermore, PEAR showed strong ‘out-of-distribution’ (OOD) robustness, meaning it performed well even on tasks different from its training data. This suggests that phase-dependent entropy is a broadly applicable signal for controlling reasoning efficiency, one that generalizes across diverse reasoning challenges.
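For readers who want to check figures like these against their own runs, the reported reduction is a relative change in response length. The helper below is a small sketch; the token counts in the example are made up for illustration and are not results from the paper.

```python
def length_reduction_pct(baseline_tokens, new_tokens):
    """Relative reduction in response length, as a percentage."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - new_tokens) / baseline_tokens


# Hypothetical averages: 1000 tokens before training, 622 after.
print(f"{length_reduction_pct(1000, 622):.1f}%")
```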
Impact on Reasoning Behavior
The analysis also revealed how PEAR influences the model’s internal reasoning. It consistently reduced overall entropy, with the most significant reduction occurring in the thinking phase. This indicates that PEAR successfully steers models towards more confident and focused reasoning. Interestingly, the final answer phase showed a slight increase in entropy, suggesting that the model retains flexibility when articulating its conclusions.
In essence, PEAR offers a sophisticated way to make Large Reasoning Models more efficient and practical. By leveraging the intrinsic signal of phase-dependent entropy, it enables models to generate shorter, more focused explanations without compromising their problem-solving capabilities, making them more suitable for real-world applications where efficiency is key.


