TLDR: A new research paper introduces “temperature scaling” as a powerful method to enhance the reasoning abilities of large language models (LLMs) during inference. While increasing the number of samples (K) in test-time scaling (TTS) eventually plateaus, varying the sampling temperature allows LLMs to solve a wider range of “hard” problems, effectively expanding their reasoning boundary. This approach can make base models perform comparably to more complex reinforcement learning-trained models, and an efficient multi-temperature voting method is proposed to reduce computational overhead.
Large Language Models (LLMs) have shown impressive abilities in tackling complex problems, especially when given multiple attempts to reason through a solution. This approach, known as Test-Time Scaling (TTS), involves generating several reasoning traces and then selecting the best one. Traditionally, researchers have focused on increasing the number of samples, or ‘K’, to improve accuracy. However, a new study reveals that this strategy has its limits; beyond a certain point, simply generating more samples doesn’t lead to further gains, and some challenging questions remain unsolved.
A recent paper titled “On the Role of Temperature Sampling in Test-Time Scaling” by Yuheng Wu, Azalia Mirhoseini, and Thierry Tambe from Stanford University, introduces a novel dimension for scaling LLM reasoning: temperature sampling. The authors demonstrate that while increasing ‘K’ at a fixed temperature only explores a part of an LLM’s potential, varying the sampling temperature can unlock a much broader range of problem-solving capabilities.
Understanding Temperature in LLMs
In LLMs, ‘temperature’ is a crucial parameter that controls the randomness of token generation. A low temperature (approaching 0.0) makes the model’s output more deterministic, tending toward the most probable next token at every step. A higher temperature, on the other hand, flattens the probability distribution, encouraging the model to explore a wider variety of less probable tokens, thus increasing diversity in its responses.
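Mechanically, temperature divides the model’s raw logits before the softmax. A minimal sketch of that operation (using plain Python lists rather than any particular inference library):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index from raw logits after temperature scaling.

    Temperature -> 0 approaches greedy (argmax) decoding; higher
    temperatures flatten the softmax distribution, making less
    probable tokens more likely to be chosen.
    """
    if temperature <= 0:
        # The zero-temperature limit: always pick the most probable token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the softmax probabilities.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

At temperature 1.0 the logits pass through unchanged; as temperature grows, the gap between likely and unlikely tokens shrinks, which is exactly the diversity knob that temperature scaling exploits.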
The core insight of this research is that different sampling temperatures enable LLMs to solve different subsets of problems. A question that might be unsolvable at one temperature could become solvable at another. This suggests that a single-temperature approach limits the model’s overall reasoning boundary.
Temperature Scaling: A New Dimension for Improvement
The researchers propose ‘temperature scaling,’ where samples are distributed across multiple temperatures rather than being concentrated at a single one. Their experiments, conducted across various Qwen3 models (0.6B, 1.7B, 4B, 8B) and five reasoning benchmarks (AIME 2024/2025, MATH500, LiveCodeBench, Hi-ToM), showed significant improvements. On average, temperature scaling yielded an additional 7.3 points over single-temperature TTS. For instance, Qwen3-4B on AIME 2025 saw a remarkable 13.3-point gain.
This effect is particularly pronounced for ‘hard’ questions. While ‘easy’ questions can be solved by LLMs regardless of the temperature setting, ‘hard’ questions often require specific temperatures to be cracked. By sampling across a range of temperatures, the model is more likely to hit the ‘sweet spot’ for these difficult problems, effectively expanding its reasoning boundary.
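The basic idea can be sketched as splitting a fixed sample budget K evenly across several temperatures and then aggregating with a majority vote. The `generate` function below is a hypothetical stand-in for drawing one reasoning trace and returning its final answer; the even split and plain voting are illustrative assumptions, not the paper’s exact allocation:

```python
from collections import Counter

def temperature_scaled_vote(generate, question, temperatures, k_total):
    """Distribute a budget of k_total samples evenly across temperatures,
    then return the most common final answer (majority vote).

    `generate(question, temperature)` is a hypothetical stand-in for
    sampling one reasoning trace and extracting its answer string.
    """
    per_temp = max(1, k_total // len(temperatures))
    answers = []
    for t in temperatures:
        for _ in range(per_temp):
            answers.append(generate(question, t))
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```

Compared with spending all K samples at one temperature, this spreads the same budget over several regions of the model’s output distribution, raising the chance that at least one temperature lands in the ‘sweet spot’ for a hard question.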
Matching RL-Trained Models Without Extra Training
One of the most compelling findings is that temperature scaling allows base LLMs to achieve performance comparable to models trained with Reinforcement Learning (RL), without the need for costly and time-consuming post-training. This is a significant advantage, as RL training is resource-intensive. The paper illustrates that while simply scaling ‘K’ might narrow the performance gap between base and RL-trained models, it doesn’t eliminate it. However, by also scaling across temperatures, the base model can reach a similar level of success.
Efficient Temperature Scaling
Recognizing that sampling across many temperatures could increase computational overhead, the authors also designed an efficient multi-temperature voting method. This strategy helps identify and ‘early exit’ easy questions, which are reliably solved by any temperature with high probability, thus focusing computational resources on the harder problems. This method resulted in substantial computation reductions (e.g., 54.4% on MATH500 for Qwen3-8B) while maintaining nearly the same performance gains.
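One way to picture the early-exit idea (a sketch of the concept, not the paper’s exact algorithm): sample at one temperature first, and if the answers already agree strongly, declare the question ‘easy’ and stop; otherwise keep spending budget at the remaining temperatures. The `generate` function and the threshold value here are illustrative assumptions:

```python
from collections import Counter

def early_exit_vote(generate, question, temperatures,
                    k_per_temp=4, agree_threshold=0.75):
    """Multi-temperature voting with early exit for easy questions.

    After each temperature's batch, check whether the leading answer
    already dominates; if so, stop and save the remaining budget.
    `generate(question, temperature)` is a hypothetical stand-in for
    one sampled reasoning trace returning a final answer string.
    Returns (answer, number_of_samples_used).
    """
    answers = []
    for t in temperatures:
        answers.extend(generate(question, t) for _ in range(k_per_temp))
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= agree_threshold:
            return top, len(answers)  # easy question: exit early
    # Hard question: fall back to a majority vote over everything sampled.
    top, _ = Counter(answers).most_common(1)[0]
    return top, len(answers)
```

Easy questions exit after the first batch, while hard questions consume the full multi-temperature budget, which is how the method concentrates computation where it matters.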
In conclusion, this research highlights that Test-Time Scaling is more powerful than previously understood. Temperature scaling offers a straightforward yet highly effective way to unlock the latent reasoning potential of base LLMs, making them more capable and competitive. For more details, see the full paper, “On the Role of Temperature Sampling in Test-Time Scaling.”