TLDR: This research introduces novel multi-temperature strategies for Reinforcement Learning (RL) in Large Language Models (LLMs). It proposes applying different temperatures to “reasoning” (high-entropy) and “knowledge” (low-entropy) tokens during generation to balance exploration and factual accuracy. Additionally, it explores sampling multiple responses per prompt using a range of temperatures. These methods significantly improve LLM reasoning performance on benchmarks without extra computational cost, offering a more robust and effective way to train LLMs.
Large Language Models (LLMs) have become incredibly powerful, excelling in tasks from understanding language to generating code and solving complex math problems. While these models are pre-trained with vast amounts of knowledge, refining their reasoning abilities often requires additional strategies. Reinforcement Learning (RL) has emerged as a promising technique to enhance these higher-order reasoning skills, such as logical inference and problem-solving, without altering the model’s core knowledge.
A crucial but often overlooked aspect in RL for LLMs is “temperature scaling.” This mechanism directly influences the balance between exploration (trying new things) and exploitation (using known good strategies) during the text generation process. Traditionally, a single, uniform temperature value is applied across all tokens and contexts. However, this approach can limit the diversity of outputs and potentially degrade quality because different types of tokens and stages of generation have varying needs for exploration.
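To ground the idea, here is a minimal sketch of how temperature scaling works at the sampling step: the model's logits are divided by a temperature before the softmax, so values above 1 flatten the distribution (more exploration) and values below 1 sharpen it (more exploitation). The function name and toy vocabulary below are illustrative, not taken from the paper.

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sample one token id per row from temperature-scaled logits.

    logits: (batch, vocab_size) raw next-token scores.
    temperature > 1 flattens the distribution (more exploration);
    temperature < 1 sharpens it (more exploitation).
    """
    scaled = logits / max(temperature, 1e-6)        # guard against division by zero
    probs = torch.softmax(scaled, dim=-1)           # convert scores to probabilities
    return torch.multinomial(probs, num_samples=1)  # draw one token per sequence

# Example: one decoding step over a toy vocabulary of 5 tokens
logits = torch.tensor([[2.0, 1.0, 0.5, 0.1, -1.0]])
print(sample_with_temperature(logits, temperature=1.3))
```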
Recent research has highlighted that tokens within LLMs play distinct roles during reasoning. Some are "high-entropy reasoning tokens," where the model is less certain and needs to explore different logical paths. Others are "low-entropy knowledge tokens," where the model is more confident and needs to maintain factual accuracy. Prior methods have typically encouraged exploration only indirectly, for example by restricting policy updates to selected tokens, rather than explicitly promoting exploratory behavior during token generation itself.
A new approach introduces a complementary strategy that actively promotes exploration during sampling by applying distinct temperature settings for different token types. This method uses higher temperatures for reasoning tokens to encourage active exploration, while maintaining lower temperatures for knowledge tokens to preserve factual correctness. The researchers also systematically investigated various multi-temperature scheduling strategies and their impact within reinforcement learning contexts.
The core of this innovative method involves a dynamic temperature mechanism guided by the “entropy” (a measure of uncertainty) of individual tokens during generation. When a token has high entropy, indicating high uncertainty, a higher temperature is applied to encourage more diverse sampling. Conversely, for tokens with low entropy, a lower temperature is used to ensure stable and accurate generation. This adaptive approach allows the model to explore more when uncertain and focus more when confident, dynamically adjusting throughout the sequence.
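The sketch below shows one way such entropy-guided temperature selection could be implemented: compute the entropy of the next-token distribution, then sample with a higher temperature above a threshold and a lower one below it. The threshold and the two temperature values are illustrative placeholders, not the paper's settings.

```python
import torch

def entropy_adaptive_sample(logits: torch.Tensor,
                            entropy_threshold: float = 1.5,
                            high_temp: float = 1.2,
                            low_temp: float = 0.7) -> torch.Tensor:
    """Pick a per-sequence temperature from the entropy of the next-token distribution.

    High-entropy (uncertain, "reasoning") steps use high_temp to explore;
    low-entropy (confident, "knowledge") steps use low_temp for stability.
    The threshold and temperatures here are illustrative, not the paper's values.
    """
    base_probs = torch.softmax(logits, dim=-1)
    entropy = -(base_probs * torch.log(base_probs + 1e-12)).sum(dim=-1)  # (batch,)
    temps = torch.where(entropy > entropy_threshold,
                        torch.full_like(entropy, high_temp),
                        torch.full_like(entropy, low_temp))
    scaled = logits / temps.unsqueeze(-1)  # apply the chosen temperature per sequence
    return torch.multinomial(torch.softmax(scaled, dim=-1), num_samples=1)
```

Because the entropy is recomputed at every decoding step, the effective temperature can switch many times within a single response, matching the adaptive behavior described above.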
Beyond this token-level control, the research also proposes “multi-temperature sampling per prompt.” Instead of generating responses with a single fixed temperature, the policy simultaneously generates candidate responses under several different temperatures. This creates a richer, more diverse pool of potential answers, allowing the RL system to select the best one. This strategy helps to mitigate the risk of choosing a suboptimal single temperature, especially since the “best” temperature can change as training progresses.
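A minimal sketch of this rollout strategy follows, assuming a generic `generate_fn` wrapper around any sampling-based decoder; the temperature list is an illustrative choice, not the paper's schedule. The resulting pool of candidates is what the RL step scores, so no single fixed temperature has to be committed to up front.

```python
from typing import Callable, List, Sequence

def multi_temperature_rollouts(prompt: str,
                               generate_fn: Callable[[str, float], str],
                               temperatures: Sequence[float] = (0.6, 0.8, 1.0, 1.2)) -> List[dict]:
    """Collect one candidate response per temperature for a single prompt.

    generate_fn is any sampling-based decoder (e.g. a thin wrapper around a
    model's generate call). The candidate pool is then passed to the RL
    objective, which rewards the strongest responses in the pool.
    """
    return [{"temperature": t, "response": generate_fn(prompt, t)}
            for t in temperatures]
```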
Empirical evaluations on several challenging reasoning benchmarks, including AIME24, AIME25, Minerva, and Olympiad, demonstrated significant improvements in the reasoning performance of LLMs. For instance, the token-level sampling method substantially improved the reasoning performance of Qwen2.5-1.5B-Math, with gains of +6% on AIME24, +1% on AIME25, and +4.8% on Minerva, all without additional computational cost. Multi-temperature sampling also proved resilient, even when some temperatures were set far outside the empirically stable range.
The findings suggest that both token-level temperature sampling and multiple temperature sampling contribute to better exploration by leveraging higher temperatures, while maintaining stability through lower-temperature sampling. The research also explored how to progressively increase temperature during training, finding that well-timed “spikes” (increments at intervals) can yield notable improvements, and a linear increase can be a robust alternative.
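As a rough illustration of these schedules, the helper below returns a temperature for a given training step, supporting either a linear ramp or periodic "spikes." All constants and the function signature are assumptions for the sketch, not values reported in the paper.

```python
def scheduled_temperature(step: int,
                          total_steps: int,
                          base_temp: float = 0.8,
                          max_temp: float = 1.2,
                          spike_every: int = 0,
                          spike_delta: float = 0.2) -> float:
    """Return the sampling temperature to use at a given training step.

    spike_every == 0: linear increase from base_temp to max_temp over training.
    spike_every > 0: jump by spike_delta every spike_every steps ("spikes"),
    capped at max_temp. All constants here are illustrative.
    """
    if spike_every > 0:
        return min(base_temp + spike_delta * (step // spike_every), max_temp)
    frac = step / max(total_steps - 1, 1)
    return base_temp + (max_temp - base_temp) * frac
```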
This work offers valuable new insights into configuring temperature effectively for RL-based LLM training, paving the way for more capable and controllable language models. You can read the full research paper here.


