Enhancing Large Language Model Reasoning with Concise Outputs

TLDR: This research addresses the ‘overthinking’ problem in Large Reasoning Models (LRMs): they generate verbose, redundant responses that drive up computational cost. Existing fixes such as length penalties often cause ‘length collapse’ or ‘training collapse.’ The paper proposes a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path, and a Conciseness Reward Function (CRF) that applies the conciseness score only to correct answers and incorporates annealing and difficulty coefficients. In theory, the approach reduces variance and improves convergence; in practice, it achieves an 8.1% accuracy improvement and a 19.9% token reduction on Qwen2.5-7B and generalizes to other LLMs such as Llama and Mistral, offering a better balance between reasoning effectiveness and efficiency.

Large Language Models (LLMs) have advanced markedly in reasoning capability, giving rise to what are known as Large Reasoning Models (LRMs). However, a common challenge with these models, such as DeepSeek-R1 and OpenAI o1, is their tendency to ‘overthink.’ They often generate verbose responses filled with redundant or irrelevant steps, which increases the number of generated tokens and, with it, the computational cost of each answer.

Previous attempts to address this ‘overthinking’ issue typically involved adding length penalties to the reward functions used in reinforcement learning. While seemingly straightforward, researchers have identified two critical problems with this approach: ‘length collapse’ and ‘training collapse.’ Length collapse occurs when models drastically reduce token length but at the expense of impaired logical reasoning, often leading to rote memorization of answers. Training collapse, on the other hand, sees both the reward and the number of tokens decline, indicating a failure in effective learning.

A Novel Approach to Efficient Reasoning

A new research paper, titled “Efficient Reasoning via Reward Model,” introduces an innovative pipeline to tackle these challenges. The authors, Yuhao Wang, Xiaopeng Li, Cheng Gong, Ziru Liu, Suiyun Zhang, Rui Liu, and Xiangyu Zhao from City University of Hong Kong and Huawei Research, propose a two-pronged solution: a Conciseness Reward Model (CRM) and a Conciseness Reward Function (CRF).

The Conciseness Reward Model (CRM)

The CRM is designed to score the conciseness of a reasoning path. Training such a model typically requires extensive human-annotated data, which is resource-intensive and prone to inconsistencies. To overcome this, the researchers developed a unique pipeline:

They started with the DeepMath-103K dataset, known for its challenging problems and diverse mathematical topics.
Using a powerful LLM (Qwen2.5-Math-72B-Instruct), they generated both concise and redundant solutions for mathematical questions.
Another LLM (Qwen2.5-72B-Instruct) was then employed to score these solutions for conciseness, considering factors like repetition avoidance, step relevance, and token efficiency.
Only the most discriminative data (very concise solutions with high scores and very redundant ones with low scores) was retained.
Finally, a smaller LLM (Qwen2.5-3B-Instruct) was fine-tuned using this high-quality dataset to become the Conciseness Reward Model.
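The filtering step in this pipeline can be illustrated with a minimal Python sketch. It is not the authors' code: the 0-10 score scale, the thresholds `high` and `low`, and the names `ScoredSolution` and `keep_discriminative` are assumptions introduced only for illustration. The sketch keeps a solution only if the judge's score clearly agrees with how the solution was generated.

```python
from typing import List, NamedTuple


class ScoredSolution(NamedTuple):
    """One generated solution plus the judge model's conciseness score."""
    question: str
    solution: str
    intended: str   # "concise" or "redundant": how the solution was prompted
    score: float    # judge score; a 0-10 scale is assumed here


def keep_discriminative(samples: List[ScoredSolution],
                        high: float = 8.0,
                        low: float = 3.0) -> List[ScoredSolution]:
    """Keep only clear-cut examples (thresholds are hypothetical).

    Concise solutions must score at least `high`; redundant solutions
    must score at most `low`. Everything in between is discarded.
    """
    kept = []
    for s in samples:
        if s.intended == "concise" and s.score >= high:
            kept.append(s)
        elif s.intended == "redundant" and s.score <= low:
            kept.append(s)
    return kept


if __name__ == "__main__":
    data = [
        ScoredSolution("q1", "short proof", "concise", 9.0),
        ScoredSolution("q1", "long, repetitive proof", "redundant", 2.0),
        ScoredSolution("q2", "medium proof", "concise", 6.0),    # ambiguous
        ScoredSolution("q2", "padded proof", "redundant", 5.0),  # ambiguous
    ]
    for s in keep_discriminative(data):
        print(s.question, s.intended, s.score)
```

Discarding the ambiguous middle of the score range trades dataset size for cleaner labels, which is the stated reason for retaining only the most discriminative data.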

The Conciseness Reward Function (CRF)

The CRM provides a conciseness score for any given reasoning path. However, simply adding this score to the traditional outcome reward can still lead to the collapse issues. The core innovation of this work lies in the Conciseness Reward Function (CRF), which establishes an explicit dependency: the conciseness score is applied only when the answer is correct. This crucial design choice helps mitigate reward hacking, where models might prioritize brevity over correctness.

The CRF also incorporates two additional mechanisms:

Annealing Coefficient: This coefficient gradually reduces the weight of the conciseness score as training progresses, helping to stabilize the learning process.
Difficulty Coefficient: Recognizing that harder problems naturally require longer solutions, this coefficient dynamically adjusts the conciseness penalty based on the estimated difficulty of the question. This ensures that models aren’t overly penalized for providing necessary detail on complex problems.
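A minimal Python sketch of how these pieces might combine is shown here. The article does not give the paper's exact formula, so everything specific in this sketch is an assumption: the linear annealing schedule, the use of the pass rate among sampled responses as a difficulty estimate, the multiplicative combination, and the names `CRFConfig`, `annealing_coefficient`, `difficulty_coefficient`, and `conciseness_reward`. The one property taken directly from the description is that the conciseness score contributes only when the answer is correct.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CRFConfig:
    base_weight: float = 0.5   # hypothetical initial weight of the conciseness term
    total_steps: int = 1000    # hypothetical length of the training run


def annealing_coefficient(step: int, total_steps: int) -> float:
    """Linearly decay from 1.0 to 0.0 over training (assumed schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def difficulty_coefficient(group_correct: Sequence[bool]) -> float:
    """Scale the conciseness term by the pass rate of the sampled group.

    Assumption: a low pass rate signals a hard question, so the
    conciseness term is down-weighted and longer solutions are
    penalized less.
    """
    if not group_correct:
        return 1.0
    return sum(group_correct) / len(group_correct)


def conciseness_reward(correct: bool,
                       conciseness_score: float,
                       step: int,
                       group_correct: Sequence[bool],
                       cfg: CRFConfig) -> float:
    """Outcome reward plus a conciseness bonus gated on correctness."""
    outcome = 1.0 if correct else 0.0
    if not correct:
        # Wrong answers never receive conciseness credit.
        return outcome
    weight = (cfg.base_weight
              * annealing_coefficient(step, cfg.total_steps)
              * difficulty_coefficient(group_correct))
    return outcome + weight * conciseness_score


if __name__ == "__main__":
    cfg = CRFConfig(base_weight=0.5, total_steps=100)
    # A short but wrong answer earns nothing, however concise it is.
    print(conciseness_reward(False, 0.9, 0, [True, False], cfg))
    # A correct, fully concise answer on an easy question early in training.
    print(conciseness_reward(True, 1.0, 0, [True, True], cfg))
```

Because the bonus is gated on correctness, a brief incorrect response can never outscore a correct one in this sketch, which is the dependency the CRF is described as enforcing.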

Theoretical and Practical Advantages

From a theoretical standpoint, the researchers demonstrate that their new reward function offers significant benefits, including variance reduction in gradient estimates and improved convergence properties during reinforcement learning. This means more stable and efficient training for LRMs.

The practical results are equally compelling. Extensive experiments across five mathematical benchmark datasets showed that the proposed framework achieved an 8.1% accuracy improvement and a 19.9% reduction in response token length on the Qwen2.5-7B model, compared to existing methods. Furthermore, the method proved compatible and effective with other LLMs, including Llama and Mistral, demonstrating its generalizability.

A case study highlighted the effectiveness of CRF. While traditional methods often produced verbose or even ‘rote memorized’ answers, the CRF-trained model generated significantly shorter, yet logically sound, reasoning paths, keeping accuracy while cutting length.

This research marks a significant step towards developing more efficient and effective large reasoning models, ensuring they provide not just correct answers, but also clear, concise, and computationally less expensive reasoning. The implementation code and datasets are publicly available for further research and reproduction, fostering continued advancements in the field. You can find more details in the full research paper.
