LessIsMore: A Training-Free Approach to Efficient AI Reasoning

TLDR: LessIsMore is a new training-free sparse attention mechanism for large language models that improves efficiency and accuracy in reasoning tasks. It achieves this by identifying global attention patterns (spatial and recency locality) and using a unified token selection process combined with a stable “recency window” for critical recent tokens. This allows models to attend to significantly fewer tokens, leading to faster decoding and shorter generation lengths without sacrificing accuracy, outperforming existing sparse attention methods.

Large language models (LLMs) have become incredibly powerful for complex reasoning tasks, but their impressive performance often comes with a significant computational cost. This is especially true for tasks that require generating many tokens, like solving intricate math problems or complex logical derivations. Traditional “full attention” mechanisms, where the model considers every single token at each step, can be very slow and memory-intensive, sometimes taking tens of minutes for a single problem.

To address this, researchers have developed “sparse attention” mechanisms. These methods aim to reduce computational overhead by having the model focus only on a crucial subset of tokens, rather than all of them. While promising, existing sparse attention techniques often struggle with reasoning tasks. They tend to lose accuracy over long generation sequences because small errors in token selection can accumulate, leading to a degradation in the model’s ability to maintain logical consistency.

A new research paper introduces an innovative approach called LessIsMore, a training-free sparse attention mechanism designed specifically for reasoning tasks. Unlike previous methods that often require expensive retraining or suffer from accuracy drops, LessIsMore maintains, and in some cases even improves, accuracy while significantly speeding up the decoding process. It achieves this by rethinking how attention mechanisms select important tokens.

Understanding the Core Problem

The challenge with reasoning tasks is their “decode-heavy” nature. They start with short prompts but generate extremely long, multi-step outputs. For instance, a model might generate over 30,000 tokens to solve one math problem. Full attention becomes a bottleneck here. While sparse attention offers a solution, current methods often make “selection errors” that compound over thousands of generated tokens, causing the model to lose track of critical information. This not only hurts accuracy but can also paradoxically lengthen the generation process as the model tries to recover.
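To get a rough feel for why decode-heavy workloads are costly, the sketch below counts how many cached key positions a single attention head reads over an entire generation, with and without a fixed token budget. The function name, the 100-token prompt, and the 2,048-token budget are illustrative assumptions, not figures from the paper, and the count ignores the cost of the selection step itself.

```python
from typing import Optional


def attended_positions(prompt_len: int, gen_len: int, budget: Optional[int] = None) -> int:
    """Count key positions one attention head reads across all decode steps.

    Hypothetical helper for illustration only. `budget=None` models full
    attention; otherwise each step reads at most `budget` positions.
    """
    total = 0
    for step in range(gen_len):
        # Tokens visible at this decode step: the prompt plus everything
        # generated so far, including the current token.
        context = prompt_len + step + 1
        total += context if budget is None else min(context, budget)
    return total


# Assumed workload: a short prompt followed by a 30,000-token reasoning trace.
full = attended_positions(prompt_len=100, gen_len=30_000)
sparse = attended_positions(prompt_len=100, gen_len=30_000, budget=2_048)
print(f"full attention:   {full:,} positions")
print(f"2,048-token cap:  {sparse:,} positions")
print(f"ratio:            {full / sparse:.1f}x")
```

The count grows quadratically with generation length under full attention but only linearly once a budget caps each step, which is why a method that can keep accuracy at a small budget matters most for long reasoning traces.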

LessIsMore: A New Perspective on Attention

The LessIsMore team conducted a detailed analysis of how reasoning models pay attention to tokens. They discovered two key patterns, which they call “localities,” that challenge conventional wisdom:

Spatial Locality Across Attention Heads: Traditionally, it was thought that each “attention head” within a model had a specialized role and needed its own unique set of important tokens. However, LessIsMore found significant overlap in the importance of tokens across different attention heads, especially within the same “key-value group.” This suggests that a unified, global selection of tokens might be more effective than individual head-specific selections.
Recency Locality of Recent Tokens: The researchers observed that tokens generated most recently consistently receive high attention in subsequent steps. This “recency window” remains relatively constant in size throughout the reasoning process, reflecting how each logical step builds directly on the previous ones.
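One simple way to probe the first pattern is to compare the top-k token sets that different heads would pick and measure how much they overlap. The sketch below does this on toy attention scores; the helper names, the scores, and the overlap metric are assumptions made for illustration and are not necessarily how the paper's analysis was carried out.

```python
def top_k_indices(scores, k):
    """Return the set of indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def mean_pairwise_overlap(head_scores, k):
    """Average fraction of shared top-k tokens over every pair of heads."""
    tops = [top_k_indices(scores, k) for scores in head_scores]
    pairs = [(a, b) for i, a in enumerate(tops) for b in tops[i + 1:]]
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)


# Toy scores (made up): three heads attending over the same six cached tokens.
toy_heads = [
    [0.40, 0.05, 0.30, 0.05, 0.10, 0.10],
    [0.35, 0.10, 0.25, 0.05, 0.05, 0.20],
    [0.30, 0.02, 0.08, 0.05, 0.35, 0.20],
]
overlap = mean_pairwise_overlap(toy_heads, k=2)
print(f"mean pairwise top-2 overlap: {overlap:.2f}")
```

A value close to 1.0 would mean the heads largely agree on which tokens matter, which is the situation in which a single shared selection is a reasonable substitute for per-head selections.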

How LessIsMore Works

LessIsMore leverages these insights through two main techniques:

1. Unified Attention Head Selection: Instead of each attention head picking its own top tokens, LessIsMore aggregates the top token selections from all heads into a single, unified set. This combined set is then globally ranked, and only the most important tokens are kept within a predefined budget. This approach simplifies the process, improves accuracy by capturing globally important tokens, and makes token retrieval more efficient.

2. Stable Recency Window: To account for the importance of recent information, LessIsMore reserves a fixed proportion of its total token budget specifically for the most recently generated tokens. This guarantees that the immediate context each reasoning step builds on is always attended to. Reserving this share for recent tokens helps preserve accuracy while keeping the process computationally efficient.
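The two techniques above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_tokens`, the 25% default for `recency_ratio`, and the choice to rank candidates by their highest score across heads are all assumptions made here for the example.

```python
def select_tokens(head_scores, budget, recency_ratio=0.25):
    """Pick one shared set of token indices for all heads to attend to.

    head_scores: one list of attention scores per head, all over the same
                 cached tokens (index 0 is the oldest token).
    budget:      total number of tokens the heads may attend to.
    recency_ratio: share of the budget reserved for the newest tokens
                 (assumed value, chosen for illustration).
    """
    num_tokens = len(head_scores[0])
    if budget >= num_tokens:
        return list(range(num_tokens))  # everything fits, nothing to prune

    # Stable recency window: always keep the most recent tokens.
    recent_n = min(num_tokens, max(1, int(budget * recency_ratio)))
    older_n = num_tokens - recent_n
    recent = list(range(older_n, num_tokens))
    remaining = budget - recent_n

    # Unified selection: each head nominates its top tokens among the older
    # ones, and the nominations are merged into a single candidate pool.
    candidates = {}
    for scores in head_scores:
        nominated = sorted(range(older_n), key=lambda i: scores[i], reverse=True)
        for i in nominated[:remaining]:
            # Assumption: a candidate's global rank is its best score in any head.
            candidates[i] = max(candidates.get(i, float("-inf")), scores[i])

    # Global ranking: keep only as many candidates as the budget allows.
    kept = sorted(candidates, key=lambda i: candidates[i], reverse=True)[:remaining]
    return sorted(kept) + recent


# Toy scores (made up): two heads over eight cached tokens.
heads = [
    [0.30, 0.05, 0.20, 0.05, 0.10, 0.05, 0.15, 0.10],
    [0.25, 0.02, 0.03, 0.35, 0.05, 0.05, 0.10, 0.15],
]
selected = select_tokens(heads, budget=4, recency_ratio=0.5)
print(selected)
```

Because every head shares the same index set, the key-value entries only need to be gathered once per step rather than once per head, which is one plausible reading of why the article describes retrieval as more efficient.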

Impressive Results

LessIsMore was evaluated on popular reasoning benchmarks like AIME-24/25, GPQA-Diamond, and MATH500, using models like Qwen3-8B and Qwen3-4B. The results are compelling:

Accuracy: LessIsMore consistently achieved the highest accuracy across all tasks and token budgets, often matching or even surpassing the performance of full attention. For example, on the challenging AIME-24 task, LessIsMore achieved nearly lossless performance even with a very small token budget (2K tokens), significantly outperforming other sparse attention methods.
Efficiency: LessIsMore demonstrated an average decoding speed-up of 1.1 times compared to full attention. Compared to existing sparse attention methods, it attended to at most half as many tokens and achieved a 1.13 times end-to-end speed-up, partly because it generated 7% shorter reasoning sequences without sacrificing accuracy. This is a crucial point, as other sparse attention methods often lead to longer generation times due to accumulated errors.

The paper highlights that LessIsMore’s global selection strategy, combined with its stable recency window, provides a more robust way to estimate token importance, which generalizes effectively across the entire reasoning process. This innovative, training-free approach offers a significant step forward in making large reasoning models more efficient without compromising their accuracy.

For more technical details, you can read the full research paper here.
