TLDR: This research introduces Curiosity-Driven Exploration (CDE), a new framework to improve Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). CDE addresses poor exploration in RLVR by using intrinsic curiosity signals from both the actor (perplexity of generated responses) and the critic (variance of value estimates from a multi-head architecture). This approach penalizes overconfident errors, promotes diverse correct responses, and guides the model to explore uncertain regions. Empirically, CDE achieves significant performance gains on mathematical reasoning benchmarks and mitigates ‘calibration collapse,’ a phenomenon where LLM confidence decouples from correctness.
Large Language Models (LLMs) have shown strong potential in complex tasks like mathematics and coding, largely thanks to advancements in training paradigms such as Reinforcement Learning with Verifiable Rewards (RLVR). RLVR optimizes models directly on the correctness of their final answers, which simplifies training by removing the need for intricate reward models. However, current RLVR methods often struggle with a fundamental challenge in reinforcement learning: exploration. This leads to issues like premature convergence, where the model settles on suboptimal solutions too quickly, and entropy collapse, where the policy's output distribution narrows and diverse exploration stops.
The Exploration Dilemma in LLMs
The core problem lies in the exploration-exploitation dilemma. Existing exploration strategies, such as simple entropy bonuses or epsilon-greedy policies, often fall short when applied to the vast and complex environments of LLMs. More principled methods, like count-based exploration, which rewards visiting rarely seen states, face their own hurdles. These methods typically require computationally intensive operations or rely on highly expressive representations of reasoning paths, which become impractical for the long and intricate chains of thought generated by LLMs. Attempts to simplify count-based methods, such as using hash-based pseudo-counts, have also proven problematic: distinct, complex reasoning trajectories often collapse into the same or similar hash grids, so their counts no longer distinguish novel paths from familiar ones.
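To make the collapse problem concrete, here is a minimal sketch of a count-based bonus driven by hash-based pseudo-counts. The helper names (coarse_hash, count_bonus), the deliberately crude length-based hash, and the 1/sqrt(N) bonus form are illustrative assumptions, not code from the paper.

```python
import math
from collections import Counter


def coarse_hash(trajectory: str, num_buckets: int = 16) -> int:
    """Hypothetical coarse hash: bucket a reasoning trace by its word count only."""
    return len(trajectory.split()) % num_buckets


def count_bonus(counts: Counter, key: int, scale: float = 1.0) -> float:
    """Count-based bonus scale / sqrt(N), where N is the visit count of the bucket."""
    counts[key] += 1
    return scale / math.sqrt(counts[key])


counts = Counter()
trace_a = "factor the quadratic then test both roots"
trace_b = "complete the square and take square roots"

# Two different solution strategies land in the same bucket, so the second
# one is treated as already visited and receives a smaller novelty bonus.
bonus_a = count_bonus(counts, coarse_hash(trace_a))
bonus_b = count_bonus(counts, coarse_hash(trace_b))
```

In this toy setup the second trace is a genuinely different strategy, yet it earns a reduced bonus because the hash cannot tell the two apart. A more expressive hash would reduce such collisions, at the cost of the heavier representations described above.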
Introducing Curiosity-Driven Exploration (CDE)
To address these challenges, researchers Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, and Dong Yu introduced a novel framework called Curiosity-Driven Exploration (CDE). This approach draws inspiration from how children learn, driven by an intrinsic sense of curiosity rather than external counts of experiences. CDE formalizes this curiosity using signals from both the ‘actor’ (the part of the LLM that generates responses) and the ‘critic’ (the part that estimates the value of those responses).
Actor Curiosity: Learning from Surprise
For the actor, curiosity is measured by the perplexity (PPL) of its generated response. Perplexity quantifies how ‘surprising’ a generated response is to the model itself. A higher perplexity indicates that the response is less probable under the model’s current policy, suggesting it lies in an underexplored region. This perplexity signal is used as an exploration bonus, added to the original reward. To prevent the model from chasing the bonus by generating random, high-perplexity but low-quality responses (a form of reward hacking), an adaptive clipping mechanism is used. This keeps the bonus to a fraction of the original reward, encouraging exploration without sacrificing quality. The authors show theoretically that this perplexity-based bonus penalizes overconfident errors and promotes diversity among correct responses, leading to better-calibrated models.
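The clipped bonus can be sketched as follows. This is a minimal illustration under stated assumptions: the bonus is taken as the log of perplexity (the mean negative log-likelihood per token), the cap is kappa times the absolute reward, and rewards are +1 for correct and -1 for incorrect. The names alpha and kappa and the exact functional form are hypothetical and not confirmed by the article.

```python
import math


def mean_nll(token_logprobs):
    """Mean negative log-likelihood per token, which equals log(perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """Perplexity of a response given the log-probability of each sampled token."""
    return math.exp(mean_nll(token_logprobs))


def shaped_reward(reward, token_logprobs, alpha=1.0, kappa=0.5):
    """Add a clipped curiosity bonus to a verifiable reward.

    alpha scales the bonus and kappa caps it at a fraction of |reward|;
    both are assumed hyperparameters for this sketch.
    """
    bonus = alpha * mean_nll(token_logprobs)
    return reward + min(bonus, kappa * abs(reward))


confident = [-0.1, -0.1, -0.1]   # model was sure of every token
surprising = [-2.0, -2.0]        # model found its own tokens unlikely

wrong_confident = shaped_reward(-1.0, confident)
wrong_uncertain = shaped_reward(-1.0, surprising)
right_surprising = shaped_reward(1.0, surprising)
```

Under these assumptions a confidently wrong response ends up with a lower shaped reward (about -0.9) than an uncertain wrong one (-0.5), and an unusual correct response is capped at 1.5 rather than being rewarded without bound.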
Critic Curiosity: Navigating Uncertainty
The critic’s curiosity is derived from the uncertainty in its value estimates. In actor-critic frameworks, the critic estimates the expected future reward. Regions with sparse data lead to higher uncertainty in these estimates. CDE approximates this uncertainty using a multi-head critic architecture, where multiple ‘heads’ (individual critics) share a common LLM backbone but are trained on different subsets of data. The standard deviation across the value estimates from these multiple heads serves as a principled curiosity signal. High disagreement among the heads indicates an under-explored region, prompting the policy to investigate further. This multi-head critic bonus has been theoretically linked to classical count-based exploration bonuses, providing a scalable and efficient way to guide exploration in complex LLM environments.
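The critic signal reduces to a disagreement measure across heads. The sketch below assumes the per-head value estimates are already available as plain floats, and uses random boolean masks as one plausible way to give each head a different subset of the data. The function names, the keep_prob parameter, and the masking scheme are illustrative assumptions rather than the paper's implementation.

```python
import random
import statistics


def bootstrap_masks(num_samples, num_heads, keep_prob=0.5, seed=0):
    """For each training sample, pick which heads are trained on it.

    Returns a num_samples x num_heads grid of booleans, so each head
    ends up fitting a different random subset of the data.
    """
    rng = random.Random(seed)
    return [
        [rng.random() < keep_prob for _ in range(num_heads)]
        for _ in range(num_samples)
    ]


def critic_bonus(head_values):
    """Curiosity signal: standard deviation of the heads' value estimates."""
    return statistics.pstdev(head_values)


masks = bootstrap_masks(num_samples=8, num_heads=4)

# Heads agree on a well-covered state and disagree on a rarely seen one.
familiar_state = critic_bonus([0.5, 0.5, 0.5, 0.5])
novel_state = critic_bonus([0.1, 0.9, 0.3, 0.7])
```

Because the heads share one backbone, the extra cost over a single critic is only the additional output heads, which fits the article's description of the method as scalable.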
Empirical Success and Key Insights
The CDE framework demonstrated consistent performance gains across various challenging mathematical reasoning benchmarks, including AIME, AMC, and MATH. It achieved an improvement of roughly 3 points over standard RLVR methods such as GRPO and PPO on the AIME benchmarks. Notably, multi-head PPO consistently outperformed vanilla PPO, with performance generally increasing with the number of heads. The research also highlighted the importance of a dynamic bonus weight, showing that decaying the exploration bonus over time is crucial for stable convergence: it allows aggressive exploration early on and a gradual shift to exploitation. A significant finding was the identification of a phenomenon called ‘calibration collapse’ in naive RLVR training, where the model’s confidence decouples from its correctness. The PPL bonus in CDE was shown to mitigate this, helping the model remain confident when correct and cautious when incorrect, thereby improving its overall calibration and interpretability.
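A decaying bonus weight can be sketched with a simple schedule. The article says only that the bonus is decayed over time. The linear shape, the function name bonus_weight, and its default values are assumptions chosen for illustration.

```python
def bonus_weight(step, total_steps, initial_weight=1.0, final_weight=0.0):
    """Linearly interpolate the exploration-bonus weight over training.

    The weight starts at initial_weight, reaches final_weight at
    total_steps, and stays there afterwards.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_weight + (final_weight - initial_weight) * progress


# Early steps weight the curiosity bonus heavily; late steps rely on the
# verifiable reward alone.
schedule = [bonus_weight(step, total_steps=100) for step in (0, 50, 100, 200)]
```

The shaped reward at a given step would then be the verifiable reward plus this weight times the curiosity bonus, so exploration pressure fades as training converges.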
Conclusion
Curiosity-Driven Exploration offers a lightweight yet powerful approach to enhance reinforcement learning for LLMs. By leveraging intrinsic curiosity signals from both the actor and the critic, CDE addresses the challenges of poor exploration, leading to more stable training and improved reasoning abilities in LLMs. The framework’s empirical success and theoretical underpinnings suggest a promising direction for future research, particularly in refining reward design and tackling issues like LLM hallucination. For more details, see the full research paper, ‘CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models.’


