Boosting LLM Reasoning with Intrinsic Curiosity

TLDR: This research introduces Curiosity-Driven Exploration (CDE), a new framework to improve Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). CDE addresses poor exploration in RLVR by using intrinsic curiosity signals from both the actor (perplexity of generated responses) and the critic (variance of value estimates from a multi-head architecture). This approach penalizes overconfident errors, promotes diverse correct responses, and guides the model to explore uncertain regions. Empirically, CDE achieves significant performance gains on mathematical reasoning benchmarks and mitigates ‘calibration collapse,’ a phenomenon where LLM confidence decouples from correctness.

Large Language Models (LLMs) have shown strong results on complex tasks like mathematics and coding, largely thanks to training paradigms such as Reinforcement Learning with Verifiable Rewards (RLVR). RLVR optimizes a model directly on the correctness of its final answers, which simplifies training by removing the need for an intricate learned reward model. However, current RLVR methods often struggle with a fundamental challenge in reinforcement learning: exploration. Poor exploration shows up as premature convergence, where the model settles on suboptimal solutions too quickly, and as entropy collapse, where the policy's output distribution narrows and it stops producing diverse candidate solutions.

The Exploration Dilemma in LLMs

The core problem lies in the exploration-exploitation dilemma. Existing exploration strategies, such as simple entropy bonuses or epsilon-greedy policies, often fall short when applied to the vast and complex environments of LLMs. More principled methods, like count-based exploration, which reward visiting rarely seen states, face their own hurdles. These methods typically require computationally intensive operations or rely on highly expressive representations of reasoning paths, which become impractical for the long and intricate chains of thought generated by LLMs. Attempts to simplify count-based methods, such as using hash-based pseudo-counts, have also proven problematic, as complex reasoning trajectories often collapse into similar hash grids, undermining their effectiveness.
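To make the hash-collision problem concrete, here is a toy sketch of a count-based bonus of the classical form beta / sqrt(N), where N is the visit count of the bucket a trajectory hashes into. The class name HashCountBonus, the CRC32 hash, and the parameters num_buckets and beta are all illustrative assumptions, not taken from the paper.

```python
import math
import zlib
from collections import Counter


class HashCountBonus:
    """Toy count-based exploration bonus: beta / sqrt(N(bucket)).

    Hypothetical helper for illustration only.
    """

    def __init__(self, num_buckets, beta=1.0):
        self.num_buckets = num_buckets
        self.beta = beta
        self.counts = Counter()

    def bucket(self, trajectory):
        # Deterministic hash of the token sequence, folded into a fixed grid.
        key = " ".join(trajectory).encode("utf-8")
        return zlib.crc32(key) % self.num_buckets

    def visit(self, trajectory):
        """Record one visit and return the bonus for this trajectory."""
        b = self.bucket(trajectory)
        self.counts[b] += 1
        return self.beta / math.sqrt(self.counts[b])


# With a degenerate one-cell grid, two unrelated reasoning paths share a count,
# so the second path is treated as "already seen".
coarse = HashCountBonus(num_buckets=1)
first = coarse.visit(["x", "=", "2"])
second = coarse.visit(["try", "induction", "on", "n"])
```

The one-bucket grid is an extreme case chosen to make the effect obvious: once distinct trajectories land in the same cell, the bonus can no longer tell a genuinely new reasoning path from a repeated one.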

Introducing Curiosity-Driven Exploration (CDE)

To address these challenges, researchers Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, and Dong Yu introduced a novel framework called Curiosity-Driven Exploration (CDE). This approach draws inspiration from how children learn, driven by an intrinsic sense of curiosity rather than external counts of experiences. CDE formalizes this curiosity using signals from both the ‘actor’ (the part of the LLM that generates responses) and the ‘critic’ (the part that estimates the value of those responses).

Actor Curiosity: Learning from Surprise

For the actor, curiosity is measured by the perplexity (PPL) of its generated response. Perplexity essentially quantifies how ‘surprising’ a generated sentence is to the model. A higher perplexity indicates that the response is less probable under the model’s current policy, suggesting it might be an underexplored region. This perplexity signal is then used as an exploration bonus, added to the original reward. To prevent the model from simply generating random, high-perplexity but low-quality responses (a phenomenon known as reward hacking), an adaptive clipping mechanism is used. This ensures the bonus remains a fraction of the original reward, encouraging exploration without sacrificing quality. Theoretically, this perplexity-based bonus inherently penalizes overconfident errors and promotes diversity among correct responses, leading to better-calibrated models.
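The actor bonus described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of log-perplexity (mean negative log-likelihood) as the surprise signal, and the parameters alpha (bonus scale) and kappa (cap as a fraction of the reward magnitude) are all hypothetical choices.

```python
import math


def mean_nll(token_logprobs):
    """Mean negative log-likelihood of the sampled tokens (log-perplexity)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """Perplexity of a response: exp of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_logprobs))


def shaped_reward(reward, token_logprobs, alpha=0.1, kappa=0.5):
    """Add a clipped surprise bonus to the verifiable reward.

    alpha and kappa are assumed hyperparameters: alpha scales the bonus,
    and kappa caps it at a fraction of |reward| so that surprise alone
    cannot dominate the correctness signal.
    """
    bonus = alpha * mean_nll(token_logprobs)
    cap = kappa * abs(reward)
    return reward + min(bonus, cap)
```

For example, a four-token response whose tokens each had probability 0.5 has perplexity 2. Note that in this sketch a reward of exactly zero leaves no room for a bonus, since the cap scales with the reward magnitude; how the cap behaves for incorrect answers depends on the reward convention in use.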

Critic Curiosity: Navigating Uncertainty

The critic’s curiosity is derived from the uncertainty in its value estimates. In actor-critic frameworks, the critic estimates the expected future reward. Regions with sparse data lead to higher uncertainty in these estimates. CDE approximates this uncertainty using a multi-head critic architecture, where multiple ‘heads’ (individual critics) share a common LLM backbone but are trained on different subsets of data. The standard deviation across the value estimates from these multiple heads serves as a principled curiosity signal. High disagreement among the heads indicates an under-explored region, prompting the policy to investigate further. This multi-head critic bonus has been theoretically linked to classical count-based exploration bonuses, providing a scalable and efficient way to guide exploration in complex LLM environments.
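As a rough sketch of the critic signal, the snippet below takes value estimates from several heads and uses their standard deviation as the curiosity bonus. The input layout (one list of per-token values per head), the averaging over token positions, and the function name critic_bonus are assumptions made for illustration; the shared backbone and the per-head data subsets are not modeled here.

```python
import statistics


def critic_bonus(head_values):
    """Disagreement among critic heads, averaged over token positions.

    head_values[k][t] is the value estimate of head k at token position t.
    Returns the mean (over t) of the population standard deviation across heads.
    """
    if len(head_values) < 2:
        raise ValueError("need at least two heads to measure disagreement")
    num_tokens = len(head_values[0])
    if num_tokens == 0 or any(len(h) != num_tokens for h in head_values):
        raise ValueError("all heads must score the same non-empty token sequence")
    per_token_std = [
        statistics.pstdev([head[t] for head in head_values])
        for t in range(num_tokens)
    ]
    return sum(per_token_std) / num_tokens
```

When every head returns the same values the bonus is zero, which matches the intuition that a well-covered region needs no extra exploration; the bonus grows as the heads diverge.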

Empirical Success and Key Insights

The CDE framework delivered consistent gains across challenging mathematical reasoning benchmarks, including AIME, AMC, and MATH. On the AIME benchmarks it improved on standard RLVR methods such as GRPO and PPO by roughly 3 points. Notably, multi-head PPO consistently outperformed vanilla PPO, with performance generally increasing with the number of heads. The research also highlighted the importance of a dynamic bonus weight: decaying the exploration bonus over time proved crucial for stable convergence, allowing aggressive exploration early on and a gradual shift to exploitation. Another significant finding was a phenomenon the authors call ‘calibration collapse’ in naive RLVR training, where the model’s confidence decouples from its correctness. The PPL bonus in CDE was shown to mitigate this, helping the model remain confident when correct and cautious when incorrect, thereby improving its overall calibration and interpretability.
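The idea of a decaying bonus weight can be illustrated with two common schedules. The specific shapes (linear and cosine), the function names, and the initial weight w0 are assumptions for this sketch; the article only states that the weight should decay over training.

```python
import math


def _progress(step, total_steps):
    """Fraction of training completed, clamped to [0, 1]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def linear_decay(step, total_steps, w0=1.0):
    """Bonus weight that falls in a straight line from w0 to 0."""
    return w0 * (1.0 - _progress(step, total_steps))


def cosine_decay(step, total_steps, w0=1.0):
    """Bonus weight that follows half a cosine wave from w0 down to 0."""
    return w0 * 0.5 * (1.0 + math.cos(math.pi * _progress(step, total_steps)))
```

Either weight would multiply the curiosity bonus before it is added to the reward, so exploration dominates early steps and fades as training approaches its end. A cosine shape keeps the weight higher for longer at the start than a linear one does.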

Conclusion

Curiosity-Driven Exploration offers a lightweight yet powerful approach to enhancing reinforcement learning for LLMs. By leveraging intrinsic curiosity signals from both the actor and the critic, CDE addresses the problem of poor exploration, leading to more stable training and improved reasoning abilities in LLMs. The framework’s empirical success and theoretical underpinnings suggest a promising direction for future research, particularly in refining reward design and tackling issues like LLM hallucination. For more details, see the full research paper, “CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models.”
