TLDR: Large Language Models (LLMs) can “think” better by using more computation during inference (test-time scaling), but this is slow due to repetitive steps. Speculative decoding speeds this up without losing quality. A new benchmark evaluates various speculative decoding methods (model-based, training-based, N-gram-based) on these reasoning tasks. Findings show that N-gram methods handle repetition well, hybrid methods deliver the best overall speedup, and training-based methods’ performance depends heavily on their training.
Large Language Models (LLMs) are becoming incredibly powerful, especially when tackling complex reasoning tasks. To enhance their problem-solving abilities, a technique called “test-time scaling” allows LLMs to “think” more deeply by allocating extra computational resources during inference. While effective, this approach often leads to significant delays because LLMs tend to generate redundant and repetitive reasoning steps, creating a bottleneck for real-time applications.
Enter speculative decoding, a promising solution designed to mitigate this inefficiency. This technique aims to accelerate the generation process without compromising the quality of the LLM’s output. However, its effectiveness in the specific, repetition-rich context of test-time scaling has not been thoroughly explored until now.
A New Benchmark for Efficient LLM Reasoning
Researchers have introduced the first comprehensive benchmark to evaluate how different speculative decoding methods perform when accelerating LLM test-time scaling. This benchmark provides a consistent way to compare various approaches across common test-time scaling paradigms, such as Best-of-N sampling and multi-round thinking. The study categorizes speculative decoding into three main families: model-based, training-based, and N-gram-based methods.
The evaluation was conducted using popular reasoning datasets like AIME 2024, AIME 2025, MATH-500, and GPQA, and involved LLMs such as DeepSeek-R1-Distill-Llama-8B and Qwen3-8B. The goal was to understand which methods are best suited for making LLM reasoning faster and more practical.
Understanding Test-Time Scaling
The benchmark focuses on two primary test-time scaling frameworks:
- Best-of-N Sampling: In this method, the LLM generates multiple (N) candidate solutions or reasoning paths. A separate verifier then scores and selects the best one, improving the overall output quality.
- Multi-Round Thinking: Inspired by human self-correction, this approach involves the LLM iteratively refining its answers over several rounds. The model re-evaluates its previous output, leading to a progressively improved response.
Both methods, while powerful, are computationally intensive, making them ideal candidates for acceleration through speculative decoding.
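The two paradigms above can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: `generate` and `verifier` are hypothetical stand-ins for an LLM sampling call and a reward/verifier model, and the refinement prompt format is an assumption.

```python
def best_of_n(prompt, generate, verifier, n=4):
    """Best-of-N sampling: draw N candidate solutions and keep the one
    the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier)

def multi_round(prompt, generate, rounds=3):
    """Multi-round thinking: feed each answer back to the model so it can
    re-evaluate and refine its previous output."""
    answer = generate(prompt)
    for _ in range(rounds - 1):
        answer = generate(
            f"{prompt}\nPrevious answer: {answer}\nRe-check and refine."
        )
    return answer
```

Both loops call the model N (or `rounds`) times end to end, which is exactly why their wall-clock cost grows so quickly and why accelerating each generation call matters.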
How Speculative Decoding Works
Speculative decoding speeds up LLM inference by using a smaller, faster “draft” mechanism to propose a sequence of candidate tokens. These candidates are then quickly verified in a single pass by the larger, original target model. This allows multiple tokens to be generated per target model evaluation, significantly reducing the time taken, all while guaranteeing the final output is identical to what the original model would produce.
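The draft-then-verify loop can be sketched as follows. For readability this is the simplified greedy variant: the target is queried one position at a time and the longest agreeing draft prefix is accepted. Real implementations verify all draft positions in a single batched forward pass and use rejection sampling so that sampled outputs match the target distribution exactly; `target_next` and `draft_propose` are hypothetical stand-ins for those model calls.

```python
def speculative_decode(target_next, draft_propose, prompt_tokens, max_new=32, k=4):
    """Greedy speculative decoding sketch.

    target_next(tokens) -> the target model's (greedy) next token for a prefix.
    draft_propose(tokens, k) -> k cheap candidate tokens from the drafter.
    """
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new:
        draft = draft_propose(tokens, k)
        accepted = []
        for tok in draft:
            # Accept draft tokens only while the target model agrees.
            if target_next(tokens + accepted) == tok:
                accepted.append(tok)
            else:
                break
        # One extra token from the target: either the correction for the
        # first rejected draft token, or the continuation after a fully
        # accepted draft. Every step therefore yields >= 1 verified token.
        accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt_tokens) + max_new]
```

Because verification only ever keeps tokens the target itself would have produced, the output is identical to plain decoding; the speedup comes from accepting several draft tokens per target-model evaluation.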
The benchmark examined several methods across three categories:
- Model-Based Speculative Sampling (e.g., SpS): Uses a smaller, general-purpose LLM as the drafter. It generates high-quality drafts but its acceleration is limited if the draft model is still relatively large compared to the target model.
- Training-Based Speculative Decoding (e.g., EAGLE-3): Involves training a specialized draft model or adding lightweight decoding heads. These methods can achieve high acceptance rates but require significant upfront training resources and their performance is highly dependent on the quality and scope of this training.
- N-gram-Based Methods (e.g., SAM, Recycling, Lookahead): These are training-free and adaptive. They build drafts by retrieving token sequences from a dynamic cache of recent outputs, excelling at capturing and reusing repetitive patterns.
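To make the N-gram family concrete, here is a minimal retrieval drafter: it indexes every n-gram seen so far and, when the current suffix matches an earlier occurrence, proposes the tokens that followed it last time. This is a toy dictionary lookup, not SAM’s suffix automaton or Lookahead’s parallel scheme, but it captures why repetitive reasoning traces make such drafters cheap and effective.

```python
class NGramDrafter:
    """Training-free drafter: index (n-gram -> position after its first
    occurrence) over the generated text, then propose the continuation
    that previously followed the current suffix."""

    def __init__(self, n=3, span=4):
        self.n, self.span = n, span
        self.index = {}    # n-gram tuple -> position right after its first occurrence
        self.history = []  # all tokens generated so far (the dynamic cache)

    def update(self, new_tokens):
        """Fold freshly generated tokens into the cache."""
        for tok in new_tokens:
            self.history.append(tok)
            if len(self.history) >= self.n:
                key = tuple(self.history[-self.n:])
                self.index.setdefault(key, len(self.history))

    def propose(self):
        """Return up to `span` draft tokens, or [] if the suffix is novel."""
        if len(self.history) < self.n:
            return []
        pos = self.index.get(tuple(self.history[-self.n:]))
        if pos is None or pos == len(self.history):
            return []  # suffix has no earlier occurrence to copy from
        return self.history[pos:pos + self.span]
```

Because the cache persists across generations, drafts get better as more text accumulates, which is also the intuition behind the progressive multi-turn gains discussed below.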
Key Findings from the Benchmark
The extensive experiments revealed several important insights:
- Training-Based Methods: While showing promising acceleration, their performance is closely tied to the training process, which can limit their adaptability to diverse reasoning scenarios. For instance, EAGLE-3 showed varying performance depending on the training data size for different LLMs.
- N-gram-Based Methods: These methods are particularly effective at capturing and leveraging repetitive patterns in reasoning. SAM, for example, demonstrated superior efficiency in suffix matching, often outperforming training-based methods in speedup. However, N-gram methods are sensitive to sampling temperature: their acceleration benefits shrink as output diversity increases. Recycling, a probabilistic N-gram method, was more robust to temperature changes but incurred higher computational overhead.
- Hybrid Speculative Decoding (e.g., SAM[EAGLE-3]): This approach combines the strengths of training-based and N-gram methods. SAM[EAGLE-3] achieved the highest overall speedup across all tested scenarios by leveraging EAGLE-3’s token speculation capabilities and SAM’s ability to capture repetitive patterns. However, it also inherited the temperature sensitivity of its N-gram component.
- Progressive Acceleration: Retrieval-based N-gram methods like SAM and PIA showed a unique ability to achieve progressively greater efficiency gains across multiple reasoning turns. By reusing relevant intermediate results from prior turns, they reduce decoding steps in subsequent turns.
- Computational Overhead: N-gram-based methods generally incur lower draft generation time overhead, allowing more computational resources to be dedicated to the crucial decoding phase.
Looking Ahead
This benchmark underscores the significant potential of speculative decoding to make LLM test-time scaling more efficient and practical. It highlights the value of integrating N-gram-based methods with other approaches to balance acceleration for both repetitive and diverse reasoning paths. The findings suggest a need for more refined and dynamic hybrid strategies to fully unlock the potential of these acceleration techniques. For more in-depth details, you can read the full research paper here.


