TLDR: A new research paper introduces GTPO and GRPO-S, two novel algorithms that enhance Large Language Model (LLM) reasoning by addressing the limitations of coarse-grained reward assignment in traditional Reinforcement Learning (RL). By dynamically weighting rewards based on the policy entropy of individual tokens (GTPO) or sequences (GRPO-S), the methods provide more precise feedback, focusing learning on critical decision points. Experiments show these entropy-weighted approaches significantly improve LLM performance, increasing model entropy, response length, and overall reasoning capabilities compared to existing baselines.
Large Language Models (LLMs) have made incredible strides in complex tasks like mathematics and coding, largely thanks to Reinforcement Learning (RL). Algorithms such as Group Relative Policy Optimization (GRPO) have been instrumental in this advancement. However, a significant challenge persists: the way rewards are assigned during training is often too simplistic, applying a uniform reward to an entire sequence of tokens. This ‘all-or-nothing’ approach means that if a long reasoning process, like a 50-step mathematical proof, has 49 correct steps but one final error, the entire sequence receives no reward. This coarse-grained feedback significantly hinders the model’s ability to learn from its nearly correct attempts, especially in long-chain reasoning tasks.
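To make the coarse-grained problem concrete, here is a minimal sketch of group-relative reward normalization in the style of GRPO. It assumes binary correctness rewards and a population standard deviation; the function names and the `eps` constant are illustrative and do not come from the paper. Note that every token in a response ends up with the same advantage.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's scalar reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Coarse-grained credit: every token gets the same sequence-level value."""
    return [advantage] * num_tokens


# Four sampled responses to one prompt; 1.0 = correct final answer, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A failed 5-token response: all tokens are penalized equally,
# no matter how many of its steps were actually correct.
token_adv = broadcast_to_tokens(advantages[1], 5)
```

Under this scheme a response that goes wrong only at the last step is indistinguishable from one that is wrong throughout.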
A New Approach: Dynamic Entropy Weighting
A recent research paper, “GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy,” proposes a solution to this problem: Dynamic Entropy Weighting. The core idea is that in correct responses, tokens where the model’s policy exhibits high entropy often correspond to critical decision points or moments of uncertainty. For instance, when an LLM is deciding which mathematical theorem to apply, the probability mass is spread across several plausible continuations, so its entropy tends to rise. The researchers propose using this uncertainty as a guide for assigning rewards, allowing for more precise policy updates.
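Policy entropy at a single token position is the Shannon entropy of the model's next-token distribution. The sketch below computes it from raw logits in plain Python; the four-entry "vocabulary" and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# One dominant candidate: the model is nearly certain, so entropy is low.
confident = token_entropy([10.0, 0.0, 0.0, 0.0])

# Four equally likely candidates: entropy reaches its maximum, log(4).
uncertain = token_entropy([1.0, 1.0, 1.0, 1.0])
```

A position like `uncertain` is the kind of decision point the entropy-weighting idea aims to single out.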
Group Token Policy Optimization (GTPO)
One of the key contributions is Group Token Policy Optimization (GTPO). This algorithm aims for the most fine-grained credit assignment by designing a unique, entropy-weighted reward for each individual token within a sequence. For successful sequences, tokens that were generated with higher entropy (indicating more uncertainty or exploration at that specific step) receive a relatively higher reward. This concentrates the learning signal on the critical decision points in a reasoning path instead of spreading one uniform reward across every token.
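The paper's exact weighting formula is not reproduced here, so the following is only a minimal sketch of the general idea, under stated assumptions: each token's weight is a blend of a uniform term and its entropy relative to the sequence mean, controlled by a hypothetical mixing coefficient `alpha`, and unsuccessful sequences are left with a uniform reward. None of these names or choices should be read as the authors' method.

```python
def entropy_weighted_token_rewards(sequence_reward, entropies, alpha=0.5):
    """Spread a successful sequence's reward over its tokens by entropy.

    Hypothetical scheme: weight_t = (1 - alpha) + alpha * H_t / mean(H),
    so the weights average to 1 and the total reward mass is preserved.
    """
    n = len(entropies)
    if sequence_reward <= 0 or n == 0:
        # Assumption: only successful sequences are reshaped.
        return [sequence_reward] * n
    mean_h = sum(entropies) / n
    if mean_h == 0:
        return [sequence_reward] * n
    return [
        sequence_reward * ((1 - alpha) + alpha * h / mean_h)
        for h in entropies
    ]


# A correct 5-token response with one high-entropy decision point (index 2).
entropies = [0.1, 0.1, 1.0, 0.1, 0.2]
token_rewards = entropy_weighted_token_rewards(1.0, entropies)
```

In this sketch the token at index 2 receives the largest share of the reward, while the total across all tokens still equals what uniform assignment would have given.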
Sequence-Level Group Relative Policy Optimization (GRPO-S)
Complementing GTPO, the paper also introduces Sequence-Level Group Relative Policy Optimization (GRPO-S). While GTPO focuses on individual tokens, GRPO-S provides a lightweight alternative that adjusts the reward for an entire sequence based on its average token entropy. This method strikes a balance between performance and computational efficiency, still leveraging the insight that higher average entropy in successful sequences indicates valuable exploration.
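A sequence-level variant can be sketched in the same spirit. The example below assumes each successful response's reward is scaled by its mean token entropy relative to the other successful responses in the group; the coefficient `beta` and the normalization are illustrative assumptions, not the paper's formula.

```python
def sequence_level_shaped_rewards(rewards, token_entropies, beta=0.5):
    """Scale each successful sequence's reward by its average token entropy.

    Hypothetical scheme: weight_i = (1 - beta) + beta * H_i / H_ref, where
    H_i is the mean token entropy of response i and H_ref is the average
    of H_i over the successful responses in the group.
    """
    mean_h = [sum(h) / len(h) for h in token_entropies]
    successful = [i for i, r in enumerate(rewards) if r > 0]
    if not successful:
        return list(rewards)
    h_ref = sum(mean_h[i] for i in successful) / len(successful)
    shaped = []
    for i, r in enumerate(rewards):
        if r > 0 and h_ref > 0:
            shaped.append(r * ((1 - beta) + beta * mean_h[i] / h_ref))
        else:
            # Assumption: failed sequences keep their original reward.
            shaped.append(r)
    return shaped


# Two correct responses (one low-entropy, one high-entropy) and one failure.
rewards = [1.0, 1.0, 0.0]
token_entropies = [[0.2, 0.4], [0.6, 0.8], [0.5, 0.5]]
shaped = sequence_level_shaped_rewards(rewards, token_entropies)
```

Only one scalar per sequence is computed, which is why this style of shaping is cheaper than storing and weighting a separate reward for every token.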
Theoretical Foundations and Experimental Validation
The researchers provide a theoretical analysis, rooted in variance reduction arguments, to support the design of their objective function and to examine its convergence properties. The proposed methods therefore rest on a theoretical argument as well as on empirical results. Experiments were conducted using the Qwen2.5-32B model, benchmarking GTPO and GRPO-S against a strong baseline called DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization).
The results were compelling: both GTPO and GRPO-S led to an increase in the model’s entropy, which in turn caused an increase in response length. More importantly, these methods significantly raised the performance ceiling of the policy, indicating that the entropy-weighting mechanism is indeed a key driver for enhancing deep reasoning in LLMs. By focusing learning signals on critical decision points, the models are encouraged to engage in deeper thinking and surpass previous performance limits.
Looking Ahead
While promising, the work acknowledges some limitations. Entropy is a heuristic and might not perfectly capture reasoning importance in all scenarios. Additionally, GTPO incurs some extra computational and storage overhead for entropy calculation, though the authors consider it manageable. Future research directions include extending entropy weighting to other alignment algorithms such as Direct Preference Optimization (DPO), and exploring more complex credit assignment heuristics beyond entropy, potentially involving a lightweight credit model to predict token contributions.
In conclusion, this research highlights that designing more principled credit assignment mechanisms, particularly by leveraging the intrinsic uncertainty of models through entropy, is crucial for advancing LLMs from simple imitation to truly deep reasoning capabilities.


