Teaching LLMs to Be Concise: A New Approach to Efficient Reasoning

TLDR: A new curriculum learning strategy for large language models (LLMs) called “Train Long, Think Short” uses Group Relative Policy Optimization (GRPO) to improve reasoning efficiency. It starts with generous token budgets and gradually reduces them, forcing models to first explore complex solutions and then distill them into shorter, more efficient reasoning steps. This approach leads to higher accuracy and better token usage compared to traditional fixed-budget training, demonstrating that progressive constraint is a powerful inductive bias for training efficient reasoning models.

Large Language Models (LLMs) have become very good at understanding and generating human-like text, but equipping them with strong reasoning abilities remains a key challenge. Imagine an LLM trying to solve a complex math problem; it needs to work through multiple steps, much like a human would. Two main methods have traditionally been used to improve this reasoning: supervised fine-tuning, where models learn from human-provided step-by-step solutions, and reinforcement learning (RL), where models learn from feedback on their completed reasoning.

One promising RL approach is Group Relative Policy Optimization (GRPO), which helps LLMs learn from sparse feedback by sampling several responses to the same prompt and scoring each one relative to the others. Alongside this, there has been a focus on controlling the length of an LLM’s output, aiming for efficiency without sacrificing accuracy. However, many existing methods use a fixed length budget during training, which doesn’t account for how models naturally learn: first exploring broadly, then refining and compressing their knowledge.
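To make the "comparing multiple generated responses" idea concrete, here is a minimal sketch of the group-relative scoring step commonly associated with GRPO: each response's reward is normalized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; the second and fourth are correct.
rewards = [0.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat_group = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that beat their group's average get a positive advantage and the rest a negative one, so even a sparse right/wrong signal yields a usable ranking within each group.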

Introducing ‘Train Long, Think Short’

A new research paper titled “Train Long, Think Short: Curriculum Learning for Efficient Reasoning” introduces a novel curriculum learning strategy to address this. Authored by Hasan Abed Al Kader Hammoud, Kumail Alhamoud, Abed Hammoud, Elie Bou-Zeid, Marzyeh Ghassemi, and Bernard Ghanem, this work proposes a dynamic training approach where the LLM starts with a generous token budget for its reasoning process. Over time, this budget is gradually tightened, forcing the model to distill its effective solution strategies into more concise and efficient reasoning steps.

This method is built on GRPO and uses a reward that balances three signals: correctness (ensuring the answer is right), length efficiency (encouraging the model to stay within the shrinking token budget), and formatting adherence (making sure the output follows a structured format, like separating the thinking process from the final answer using special tags).
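A rough sketch of how three such signals could be combined into one scalar reward is shown below. Everything specific here is an assumption: the `<think>`/`<answer>` tag names, the weights, the all-or-nothing length term, and the whitespace split standing in for a real tokenizer. The paper's actual reward may differ on all of these.

```python
import re

# Hypothetical tag layout; the article only says "special tags" are used.
PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def composite_reward(output, gold_answer, budget,
                     w_correct=1.0, w_length=0.5, w_format=0.2):
    """Weighted sum of correctness, length, and format signals (weights assumed)."""
    match = PATTERN.match(output.strip())
    thinking = match.group(1) if match else output
    answer = match.group(2).strip() if match else ""

    # Whitespace split is a stand-in for counting real tokenizer tokens.
    n_tokens = len(thinking.split())

    correct = 1.0 if answer == gold_answer else 0.0
    within_budget = 1.0 if n_tokens <= budget else 0.0
    well_formed = 1.0 if match else 0.0
    return w_correct * correct + w_length * within_budget + w_format * well_formed


good = composite_reward("<think>3 + 4 = 7</think><answer>7</answer>", "7", budget=10)
untagged = composite_reward("The answer is 7", "7", budget=10)
```

In this sketch an untagged response cannot earn the correctness term at all, because the answer is only read from inside the answer tags.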

How the Curriculum Works

The core idea is a progressively decaying token budget. The model begins with a large budget, allowing it to explore various reasoning paths and discover effective problem-solving patterns. As training continues, the budget shrinks exponentially. This forces the model to become more efficient, compressing its learned strategies into shorter, yet still accurate, reasoning traces. This mimics how a student might first take ample time to solve a problem, then gradually learn to solve it more quickly and concisely.
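As an illustration, an exponentially shrinking budget can be written as a step-wise schedule that multiplies the budget by a fixed factor at regular intervals. The starting budget, decay factor, interval, and floor below are made-up values for the sketch, not the settings used in the paper.

```python
def token_budget(step, initial_budget=256, decay=0.5, interval=100, min_budget=32):
    """Step-wise exponential decay of the reasoning token budget.

    The budget is multiplied by `decay` once every `interval` training
    steps and never drops below `min_budget`. All defaults are
    illustrative assumptions.
    """
    n_decays = step // interval
    return max(min_budget, int(initial_budget * decay ** n_decays))


# Budget seen by the model at a few points during training.
schedule = [token_budget(step) for step in (0, 100, 200, 300, 400)]
```

Early steps leave room for long exploratory traces, while later steps only give length credit to much shorter ones.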


Key Findings and Benefits

The researchers ran experiments with the Qwen2.5-7B model on mathematical reasoning datasets such as GSM8K (grade-school math) and MATH500 (competition-level math). They compared their curriculum learning approach against a base model and a fixed-budget GRPO baseline. The main findings:

Improved Accuracy and Efficiency: Curriculum learning consistently outperformed fixed-budget training. Models trained with the curriculum achieved higher accuracy while using significantly fewer tokens, demonstrating both better performance and greater efficiency.
Consistency Across Tasks: The gains were observed across both easier (GSM8K) and harder (MATH500) reasoning tasks, and even generalized well to out-of-distribution problems.
Tunable Trade-offs: The study showed that adjusting the weights of the reward components (correctness vs. length) allows for a controllable trade-off between solution quality and token efficiency. Prioritizing correctness led to slightly longer but more accurate outputs, while emphasizing length produced highly compressed traces.
Impact of Decay Schedule: The rate at which the budget decays also matters. Faster, more aggressive decays favored efficiency, while a gentler, linear decay schedule often led to better accuracy on complex reasoning tasks, suggesting that a smoother compression trajectory can help models retain intricate reasoning strategies.
Reward Function Shape: The specific shape of the length reward function (triangular vs. a flat band) also influenced outcomes. A triangular reward, which incentivizes exploring the full budget before compression, generally yielded higher accuracy compared to a flat-band reward, which might encourage over-compression too early.
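The difference between the two length-reward shapes is easier to see in code. The sketch below assumes one plausible form for each: a triangular reward that peaks exactly at the budget, and a flat-band reward that pays in full anywhere at or under the budget. The exact functional forms, including the linear falloff to zero at twice the budget, are assumptions rather than the paper's definitions.

```python
def triangular_reward(n_tokens, budget):
    """Rises linearly to 1.0 at `budget`, then falls linearly to 0 at 2 * budget."""
    if n_tokens <= 0 or n_tokens >= 2 * budget:
        return 0.0
    if n_tokens <= budget:
        return n_tokens / budget
    return (2 * budget - n_tokens) / budget


def flat_band_reward(n_tokens, budget):
    """Full reward at or under `budget`, then falls linearly to 0 at 2 * budget."""
    if n_tokens <= budget:
        return 1.0
    if n_tokens >= 2 * budget:
        return 0.0
    return (2 * budget - n_tokens) / budget


budget = 100
lengths = (25, 100, 150)
triangular = [triangular_reward(n, budget) for n in lengths]
flat_band = [flat_band_reward(n, budget) for n in lengths]
```

Under these assumed shapes, a 25-token trace earns full length reward from the flat band but only a quarter of it from the triangle, which matches the intuition that a flat band can reward compressing too early.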

This research highlights that the training dynamic itself can be a powerful mechanism for optimization. By progressively constraining the model’s reasoning budget, it learns to be both effective and efficient, producing concise solutions without needing explicit user hints at inference time. This work offers a promising direction for developing more practical and cost-effective LLMs for complex reasoning tasks.

For more detail, see the full research paper, “Train Long, Think Short: Curriculum Learning for Efficient Reasoning.”
