TLDR: LessIsMore is a new training-free sparse attention mechanism for large language models that improves efficiency and accuracy in reasoning tasks. It achieves this by identifying global attention patterns (spatial and recency locality) and using a unified token selection process combined with a stable “recency window” for critical recent tokens. This allows models to attend to significantly fewer tokens, leading to faster decoding and shorter generation lengths without sacrificing accuracy, outperforming existing sparse attention methods.
Large language models (LLMs) have become incredibly powerful for complex reasoning tasks, but their impressive performance often comes with a significant computational cost. This is especially true for tasks that require generating many tokens, like solving intricate math problems or complex logical derivations. Traditional “full attention” mechanisms, where the model considers every single token at each step, can be very slow and memory-intensive, sometimes taking tens of minutes for a single problem.
To address this, researchers have developed “sparse attention” mechanisms. These methods aim to reduce computational overhead by having the model focus only on a crucial subset of tokens, rather than all of them. While promising, existing sparse attention techniques often struggle with reasoning tasks. They tend to lose accuracy over long generation sequences because small errors in token selection can accumulate, leading to a degradation in the model’s ability to maintain logical consistency.
A new research paper introduces an innovative approach called LessIsMore, a training-free sparse attention mechanism designed specifically for reasoning tasks. Unlike previous methods that often require expensive retraining or suffer from accuracy drops, LessIsMore maintains, and in some cases even improves, accuracy while significantly speeding up the decoding process. It achieves this by rethinking how attention mechanisms select important tokens.
Understanding the Core Problem
The challenge with reasoning tasks is their “decode-heavy” nature. They start with short prompts but generate extremely long, multi-step outputs. For instance, a model might generate over 30,000 tokens to solve one math problem. Full attention becomes a bottleneck here. While sparse attention offers a solution, current methods often make “selection errors” that compound over thousands of generated tokens, causing the model to lose track of critical information. This not only hurts accuracy but can also paradoxically lengthen the generation process as the model tries to recover.
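To make the bottleneck concrete, the following rough sketch counts how many cached tokens a model reads over a whole generation, with and without a fixed token budget. The function name `attended_tokens`, the 200-token prompt in the usage lines, and the simplified cost model (one read per attended token per decoding step) are illustrative assumptions, not figures or code from the paper.

```python
def attended_tokens(prompt_len, gen_len, budget=None):
    """Count token reads summed over all decoding steps.

    Simplified cost model (an assumption for illustration): at each step the
    model reads every token currently in context, or at most `budget` tokens
    when a sparse attention budget is set.
    """
    total = 0
    for step in range(gen_len):
        context = prompt_len + step  # prompt plus tokens generated so far
        total += context if budget is None else min(context, budget)
    return total


# Hypothetical workload: a short prompt followed by a 30,000-token answer.
full = attended_tokens(prompt_len=200, gen_len=30_000)
sparse = attended_tokens(prompt_len=200, gen_len=30_000, budget=2_048)
print(f"full attention reads {full / sparse:.1f}x more tokens than a 2K budget")
```

Under full attention the count grows roughly quadratically with generation length, while a fixed budget keeps it roughly linear. This is why decode-heavy reasoning workloads stand to benefit the most from sparse attention, provided the selection stays accurate.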
LessIsMore: A New Perspective on Attention
The LessIsMore team conducted a detailed analysis of how reasoning models pay attention to tokens. They discovered two key patterns, which they call “localities,” that challenge conventional wisdom:
- Spatial Locality Across Attention Heads: Traditionally, it was thought that each “attention head” within a model had a specialized role and needed its own unique set of important tokens. However, the authors found significant overlap in which tokens different attention heads consider important, especially among heads within the same “key-value group.” This suggests that a unified, global selection of tokens might be more effective than individual head-specific selections.
- Recency Locality of Recent Tokens: The researchers observed that tokens generated most recently consistently receive high attention in subsequent steps. This “recency window” remains relatively constant in size throughout the reasoning process, reflecting how each logical step builds directly on the previous ones.
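Both localities can be checked on any set of attention scores. The sketch below is one possible way to measure them, not the paper's analysis code: `mean_pairwise_overlap` reports how much the top-k token sets of different heads agree, and `recent_attention_mass` reports what share of attention falls on the last few tokens. The function names and the toy scores are hypothetical.

```python
def top_k_set(scores, k):
    """Indices of the k highest-scoring tokens for one head (assumes k <= len(scores))."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(order[:k])


def mean_pairwise_overlap(head_scores, k):
    """Average fraction of shared top-k tokens over all pairs of heads."""
    sets = [top_k_set(scores, k) for scores in head_scores]
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    if not pairs:
        raise ValueError("need at least two heads to compare")
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)


def recent_attention_mass(weights, window):
    """Share of one head's attention that lands on the last `window` tokens."""
    return sum(weights[-window:]) / sum(weights)


# Toy scores for two heads over four tokens (made-up numbers).
heads = [[0.9, 0.8, 0.1, 0.0],
         [0.7, 0.1, 0.6, 0.0]]
print(mean_pairwise_overlap(heads, k=2))
print(recent_attention_mass([0.1, 0.1, 0.4, 0.4], window=2))
```

A high overlap value would support sharing one token set across heads, and a consistently high recent-attention share would support reserving part of the budget for the newest tokens.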
How LessIsMore Works
LessIsMore leverages these insights through two main techniques:
1. Unified Attention Head Selection: Instead of each attention head picking its own top tokens, LessIsMore aggregates the top token selections from all heads into a single, unified set. This combined set is then globally ranked, and only the most important tokens are kept within a predefined budget. This approach simplifies the process, improves accuracy by capturing globally important tokens, and makes token retrieval more efficient.
2. Stable Recency Window: To account for the importance of recent information, LessIsMore reserves a fixed proportion of its total token budget specifically for the most recently generated tokens. This ensures that critical contextual information, vital for step-by-step reasoning, is always maintained. Dedicating this fixed share of the budget to recent tokens helps preserve accuracy while keeping the process computationally efficient.
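The two techniques above can be sketched together in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `select_tokens`, the `recency_ratio` default of 0.25, and the choice to rank the merged candidates by their highest score across heads are all assumptions made for the example.

```python
def select_tokens(head_scores, budget, recency_ratio=0.25):
    """Pick one shared set of token indices for all heads.

    head_scores[h][t] is head h's importance score for token t.
    `recency_ratio` (assumed value) is the share of the budget reserved
    for the most recent tokens.
    """
    num_tokens = len(head_scores[0])
    if num_tokens <= budget:
        return list(range(num_tokens))  # everything fits, nothing to drop

    # Stable recency window: always keep the newest tokens.
    recent_n = int(budget * recency_ratio)
    recent_start = num_tokens - recent_n
    recent = list(range(recent_start, num_tokens))
    remaining = budget - recent_n

    # Unified selection, step 1: every head nominates its own top tokens
    # among the older tokens; nominations are merged into one candidate pool.
    best_score = {}
    for scores in head_scores:
        nominated = sorted(range(recent_start),
                           key=lambda t: scores[t], reverse=True)[:remaining]
        for t in nominated:
            best_score[t] = max(best_score.get(t, float("-inf")), scores[t])

    # Step 2: rank the pool globally (assumption: by best score across heads)
    # and keep only what fits in the remaining budget.
    kept = sorted(best_score, key=lambda t: best_score[t], reverse=True)[:remaining]
    return sorted(kept + recent)


# Toy example: two heads, eight tokens, a budget of four with half reserved
# for recent tokens (made-up scores).
heads = [[0.9, 0.1, 0.2, 0.1, 0.80, 0.0, 0.3, 0.3],
         [0.1, 0.7, 0.0, 0.1, 0.85, 0.0, 0.5, 0.5]]
print(select_tokens(heads, budget=4, recency_ratio=0.5))  # [0, 4, 6, 7]
```

Because every head ends up attending to the same index set, the selected keys and values only need to be gathered once per step rather than once per head, which is one plausible reading of why the article describes token retrieval as more efficient.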
Impressive Results
LessIsMore was evaluated on popular reasoning benchmarks like AIME-24/25, GPQA-Diamond, and MATH500, using models like Qwen3-8B and Qwen3-4B. The results are compelling:
- Accuracy: LessIsMore consistently achieved the highest accuracy across all tasks and token budgets, often matching or even surpassing the performance of full attention. For example, on the challenging AIME-24 task, LessIsMore achieved nearly lossless performance even with a very small token budget (2K tokens), significantly outperforming other sparse attention methods.
- Efficiency: LessIsMore demonstrated an average decoding speed-up of 1.1 times compared to full attention. Compared with existing sparse attention methods, it attended to no more than half as many tokens and achieved a 1.13 times end-to-end speed-up, partly because its reasoning outputs were 7% shorter with no loss of accuracy. This is a crucial point, as other sparse attention methods often lead to longer generation times due to accumulated errors.
The paper highlights that LessIsMore’s global selection strategy, combined with its stable recency window, provides a more robust way to estimate token importance, which generalizes effectively across the entire reasoning process. This innovative, training-free approach offers a significant step forward in making large reasoning models more efficient without compromising their accuracy.
For more technical details, see the full research paper.


