Enhancing Large Language Model Reasoning with Concise Outputs

TLDR: This research addresses the ‘overthinking’ problem in Large Reasoning Models (LRMs): verbose, redundant responses that drive up computational cost. Existing fixes such as length penalties often cause ‘length collapse’ or ‘training collapse.’ The paper proposes a Conciseness Reward Model (CRM) that scores how concise a reasoning path is, and a Conciseness Reward Function (CRF) that applies that score only when the answer is correct, scaled by annealing and difficulty coefficients. The authors argue theoretically that this reduces variance and improves convergence. In experiments, it achieves an 8.1% accuracy improvement and a 19.9% reduction in response tokens on Qwen2.5-7B, and it generalizes to other LLMs such as Llama and Mistral, giving a better balance between reasoning effectiveness and efficiency.

Large Language Models (LLMs) have made rapid progress in reasoning, giving rise to what are known as Large Reasoning Models (LRMs). However, these models, such as DeepSeek-R1 and OpenAI o1, tend to ‘overthink’: they often generate verbose responses filled with redundant or irrelevant steps, which raises computational cost and makes their outputs less efficient.

Previous attempts to address this ‘overthinking’ issue typically involved adding length penalties to the reward functions used in reinforcement learning. While seemingly straightforward, researchers have identified two critical problems with this approach: ‘length collapse’ and ‘training collapse.’ Length collapse occurs when models drastically reduce token length but at the expense of impaired logical reasoning, often leading to rote memorization of answers. Training collapse, on the other hand, sees both the reward and the number of tokens decline, indicating a failure in effective learning.
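To make the failure mode concrete, here is a minimal sketch of a length-penalized reward. The function name, the linear penalty, and the parameters `alpha` and `max_tokens` are assumptions for illustration; the article does not give the exact form used by earlier methods.

```python
def length_penalized_reward(is_correct: bool, num_tokens: int,
                            max_tokens: int = 4096, alpha: float = 0.5) -> float:
    """Outcome reward minus a penalty that grows linearly with response length.

    Hypothetical baseline: `alpha` and `max_tokens` are illustrative choices,
    not values taken from the paper.
    """
    outcome = 1.0 if is_correct else 0.0
    penalty = alpha * min(num_tokens, max_tokens) / max_tokens
    return outcome - penalty


# With a strong penalty, a short wrong answer outscores a long correct one,
# which is the kind of incentive that can push a model toward length collapse.
short_wrong = length_penalized_reward(False, 10, alpha=1.5)
long_correct = length_penalized_reward(True, 4096, alpha=1.5)
print(short_wrong > long_correct)
```

Because the penalty applies regardless of correctness, the model can raise its reward simply by writing less, whether or not the reasoning survives.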

A Novel Approach to Efficient Reasoning

A new research paper, titled “Efficient Reasoning via Reward Model,” introduces an innovative pipeline to tackle these challenges. The authors, Yuhao Wang, Xiaopeng Li, Cheng Gong, Ziru Liu, Suiyun Zhang, Rui Liu, and Xiangyu Zhao from City University of Hong Kong and Huawei Research, propose a two-pronged solution: a Conciseness Reward Model (CRM) and a Conciseness Reward Function (CRF).

The Conciseness Reward Model (CRM)

The CRM is designed to score the conciseness of a reasoning path. Training such a model typically requires extensive human-annotated data, which is resource-intensive and prone to inconsistencies. To overcome this, the researchers developed a unique pipeline:

  • They started with the DeepMath-103K dataset, known for its challenging problems and diverse mathematical topics.
  • Using a powerful LLM (Qwen2.5-Math-72B-Instruct), they generated both concise and redundant solutions for mathematical questions.
  • Another LLM (Qwen2.5-72B-Instruct) was then employed to score these solutions for conciseness, considering factors like repetition avoidance, step relevance, and token efficiency.
  • Only the most discriminative data (very concise solutions with high scores and very redundant ones with low scores) was retained.
  • Finally, a smaller LLM (Qwen2.5-3B-Instruct) was fine-tuned using this high-quality dataset to become the Conciseness Reward Model.
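The filtering step in the pipeline above could look roughly like the following sketch. The record layout, the 0 to 10 score scale, and the thresholds `high` and `low` are all assumptions; the article only says that very concise, high-scoring solutions and very redundant, low-scoring ones were kept.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSolution:
    question: str
    solution: str
    style: str    # "concise" or "redundant": the style the generator was asked for
    score: float  # conciseness score from the judge LLM (assumed 0-10 scale)


def keep_discriminative(samples: List[ScoredSolution],
                        high: float = 8.0, low: float = 3.0) -> List[ScoredSolution]:
    """Keep only clear-cut examples; thresholds are hypothetical."""
    kept = []
    for s in samples:
        if s.style == "concise" and s.score >= high:
            kept.append(s)   # concise solution the judge also rates highly
        elif s.style == "redundant" and s.score <= low:
            kept.append(s)   # redundant solution the judge also rates poorly
    return kept
```

Dropping the ambiguous middle of the score range leaves the reward model with training examples whose labels are least likely to be noisy.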

The Conciseness Reward Function (CRF)

The CRM provides a conciseness score for any given reasoning path. However, simply adding this score to the traditional outcome reward can still lead to the collapse issues. The core innovation of this work lies in the Conciseness Reward Function (CRF), which establishes an explicit dependency: the conciseness score is applied only when the answer is correct. This crucial design choice helps mitigate reward hacking, where models might prioritize brevity over correctness.
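The difference between adding the score unconditionally and gating it on correctness can be sketched as follows. Both functions, the 0 to 1 score range, and the `weight` parameter are illustrative assumptions rather than the paper's exact formulation.

```python
def additive_reward(is_correct: bool, conciseness_score: float,
                    weight: float = 0.5) -> float:
    """Naive combination: the conciseness term is added no matter what."""
    outcome = 1.0 if is_correct else 0.0
    return outcome + weight * conciseness_score


def gated_reward(is_correct: bool, conciseness_score: float,
                 weight: float = 0.5) -> float:
    """Gated combination: conciseness counts only for correct answers."""
    if not is_correct:
        return 0.0
    return 1.0 + weight * conciseness_score


# A terse but wrong answer still earns reward under the additive form,
# and earns nothing under the gated form.
print(additive_reward(False, 0.9))
print(gated_reward(False, 0.9))
```

Under the gated form, being brief can never compensate for being wrong, which removes the most direct route to reward hacking.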

The CRF also incorporates two additional mechanisms:

  • Annealing Coefficient: This coefficient gradually reduces the weight of the conciseness score as training progresses, helping to stabilize the learning process.
  • Difficulty Coefficient: Recognizing that harder problems naturally require longer solutions, this coefficient dynamically adjusts the conciseness penalty based on the estimated difficulty of the question. This ensures that models aren’t overly penalized for providing necessary detail on complex problems.
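One way the two coefficients might scale the conciseness term is sketched below. The cosine schedule, the use of a rollout pass rate as the difficulty estimate, and every name and default value here are assumptions; the article does not state the paper's actual formulas.

```python
import math


def annealing_coefficient(step: int, total_steps: int) -> float:
    """Assumed cosine decay from 1 at the start of training to 0 at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def difficulty_coefficient(pass_rate: float) -> float:
    """Assumed difficulty proxy: the fraction of sampled answers that are correct.

    A low pass rate (hard question) yields a small coefficient, so the
    pressure toward brevity is weaker on hard questions.
    """
    return min(max(pass_rate, 0.0), 1.0)


def conciseness_reward(is_correct: bool, conciseness_score: float,
                       step: int, total_steps: int, pass_rate: float,
                       base_weight: float = 0.5) -> float:
    """Outcome reward plus a correctness-gated, scaled conciseness term."""
    if not is_correct:
        return 0.0  # the conciseness score is ignored for wrong answers
    weight = (base_weight
              * annealing_coefficient(step, total_steps)
              * difficulty_coefficient(pass_rate))
    return 1.0 + weight * conciseness_score
```

In this sketch, the conciseness term matters most early in training and on easy questions, and fades to zero by the final step.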

Theoretical and Practical Advantages

From a theoretical standpoint, the researchers demonstrate that their new reward function offers significant benefits, including variance reduction in gradient estimates and improved convergence properties during reinforcement learning. This means more stable and efficient training for LRMs.

The practical results are equally compelling. Extensive experiments across five mathematical benchmark datasets showed that the proposed framework achieved an 8.1% accuracy improvement and a 19.9% reduction in response token length on the Qwen2.5-7B model, compared to existing methods. Furthermore, the method proved compatible and effective with other LLMs, including Llama and Mistral, demonstrating its generalizability.

A case study highlighted the effectiveness of CRF. While traditional methods often produced verbose or even ‘rote memorized’ answers, the CRF-trained model generated significantly shorter, yet logically sound, reasoning paths. This balance between accuracy and efficiency is a key breakthrough.

This research marks a significant step towards developing more efficient and effective large reasoning models, ensuring they provide not just correct answers, but also clear, concise, and computationally less expensive reasoning. The implementation code and datasets are publicly available for further research and reproduction, fostering continued advancements in the field. You can find more details in the full research paper.

Nikhil Patel
https://blogs.edgentiq.com
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
