TLDR: Traditional entropy regularization, effective in other RL domains, struggles with Large Language Models (LLMs) due to their vast response spaces and sparse optimal outputs, leading to performance stagnation. This research introduces AEnt, an innovative entropy control method that uses a “clamped entropy” bonus, focusing exploration on a smaller, more relevant set of tokens. Coupled with an automatically adjusted coefficient, AEnt effectively reduces bias and enhances exploration, consistently outperforming existing methods in math-reasoning tasks by stabilizing policy entropy and improving learning efficiency.
Reinforcement Learning (RL) has emerged as a powerful technique for training Large Language Models (LLMs), leading to remarkable advancements in areas like math, coding, and planning. However, a critical component of traditional RL algorithms—entropy control—has shown surprisingly little benefit when applied to LLMs. This research paper, titled “ON ENTROPY CONTROL IN LLM-RL ALGORITHMS” by Han Shen of Ant Group, delves into why conventional entropy regularization falls short in the LLM-RL setting and proposes an innovative solution called AEnt.
For many RL tasks, such as those in robotics and games, entropy regularization is vital. It encourages the policy (the LLM’s decision-making process) to remain exploratory and avoid getting stuck in suboptimal actions. By adding an “entropy bonus” to the reward, the algorithm is incentivized to maintain a certain level of randomness in its choices. This prevents the policy from over-reinforcing a few actions and ensures a broader search for optimal solutions. However, studies have consistently found that this approach yields minimal to no gains in LLM-RL training, a stark contrast to its effectiveness in other domains.
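To make the mechanism concrete, here is a minimal sketch of a policy-gradient loss with a conventional full-vocabulary entropy bonus. This is an illustrative PyTorch example, not the paper's implementation; the function name, shapes, and the `entropy_coef` value are assumptions for the sake of the sketch.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits, actions, advantages, entropy_coef=0.01):
    """Policy-gradient loss plus a conventional entropy bonus (illustrative only).
    `logits`: [batch, vocab], `actions`/`advantages`: [batch]."""
    log_probs = F.log_softmax(logits, dim=-1)   # log pi(token | context)
    probs = log_probs.exp()

    # Reinforce the sampled tokens, weighted by their advantage estimates.
    chosen_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * chosen_log_probs).mean()

    # Entropy over the *entire* vocabulary: H(pi) = -sum_v pi(v) log pi(v).
    entropy = -(probs * log_probs).sum(dim=-1).mean()

    # Subtracting the bonus rewards the optimizer for keeping entropy high.
    return pg_loss - entropy_coef * entropy
```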
The core issue, as argued by the paper, lies in the unique characteristics of LLMs. They operate within an “extremely large response space,” meaning there are hundreds of thousands of possible tokens (words or sub-word units) to choose from at each step. Compounding this is the “sparsity of optimal outputs,” where only a tiny fraction of these vast possibilities leads to a correct or desired response. Traditional entropy regularization, by trying to make the policy more uniform across this immense space, introduces a significant bias. It spreads the policy’s probability too thinly, making it inefficient and often ineffective in finding the truly optimal, yet rare, actions.
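A quick back-of-the-envelope calculation shows why the bias grows with vocabulary size: the maximum achievable entropy is log of the support size, so a full-vocabulary bonus keeps "paying" the policy for spreading probability over tokens that can never appear in a correct answer. The vocabulary sizes below are illustrative; 150k is roughly the scale of modern LLM tokenizers.

```python
import math

# Maximum entropy (in nats) of a uniform distribution over a support of size N.
for vocab_size in (100, 10_000, 150_000):
    print(vocab_size, round(math.log(vocab_size), 2))
# 100     -> 4.61
# 10000   -> 9.21
# 150000  -> 11.92
```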
To address this, the paper introduces **AEnt**, an adaptive entropy regularization method that incorporates "token space clamping." Instead of considering the entire vocabulary, AEnt evaluates entropy on a much smaller, more relevant subset of tokens. This "clamped entropy" is defined on the top-probability tokens of the LLM's current policy. The intuition is that although the full vocabulary is massive, the most promising tokens are usually among those the model already assigns relatively high probability, even if they are not the single most likely ones. By focusing exploration within this more compact and reasonable response set, AEnt significantly reduces the entropy-induced bias, making the regularization more effective.
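The sketch below computes an entropy restricted to the top-probability tokens, using a nucleus-style (top-p) cutoff as the clamping rule. This is one plausible reading of "clamped entropy"; the paper's exact clamping criterion, threshold, and renormalization choices may differ, and the `top_p=0.99` value here is purely an assumption.

```python
import torch
import torch.nn.functional as F

def clamped_entropy(logits, top_p=0.99):
    """Entropy over the smallest set of highest-probability tokens covering
    `top_p` of the mass (illustrative sketch). `logits`: [batch, vocab]."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)

    # Keep a token if the mass accumulated *before* it is still below top_p.
    keep = (cumulative - sorted_probs) < top_p
    kept = sorted_probs * keep

    # Renormalize within the clamped set and compute entropy on it alone.
    kept = kept / kept.sum(dim=-1, keepdim=True)
    log_kept = torch.where(kept > 0, kept.log(), torch.zeros_like(kept))
    return -(kept * log_kept).sum(dim=-1)
```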
Furthermore, AEnt features an “automatically adjusted coefficient” for the entropy bonus. Unlike conventional methods that use a fixed coefficient, which can lead to drastic fluctuations in entropy and hinder performance, AEnt dynamically adjusts this coefficient during training. If the clamped entropy is too low, the coefficient is increased to encourage more exploration. If it’s too high, the coefficient is decreased to reduce bias and prioritize reward maximization. This adaptive control helps to stabilize the policy entropy, preventing it from collapsing too early or exploding uncontrollably, which can lead to inefficient, repetitive reasoning patterns.
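The adjustment logic can be pictured as a simple feedback controller on the entropy coefficient. The sketch below is only an illustration of that idea; the paper's actual update rule, target entropy, step size, and bounds are not reproduced here, and all names and constants are assumptions.

```python
def update_entropy_coef(coef, clamped_ent, target_ent,
                        step_size=0.01, coef_min=0.0, coef_max=0.05):
    """Proportional-style controller for the entropy-bonus coefficient (sketch).
    Raise the bonus when clamped entropy is below target; lower it otherwise."""
    if clamped_ent < target_ent:
        coef += step_size * (target_ent - clamped_ent)   # push for more exploration
    else:
        coef -= step_size * (clamped_ent - target_ent)   # cut bias, favor reward
    return min(max(coef, coef_min), coef_max)
```

After each training step, the controller nudges the coefficient toward whatever value keeps the clamped entropy near its target, which is what prevents the early collapse or uncontrolled growth described above.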
The effectiveness of AEnt was rigorously tested in math-reasoning tasks using different base models and datasets. The experiments consistently showed that AEnt outperforms baselines, including standard policy optimization methods and conventional entropy-regularized approaches. A key observation was that while baseline methods experienced entropy collapse or wild fluctuations, AEnt maintained stable policy entropy, allowing for continued performance improvement long after other methods saturated. This stability also translated into more compact and efficient responses from the LLM without sacrificing accuracy.
In essence, AEnt provides a much-needed remedy for the challenges of entropy control in LLM-RL. By intelligently narrowing the focus of exploration and adaptively managing the entropy bonus, it unlocks the potential benefits of entropy regularization for large language models. This work paves the way for more efficient and robust training of LLMs, particularly in complex reasoning tasks. You can read the full research paper for more technical details at arxiv.org/pdf/2509.03493.


