Unlocking Advanced Reasoning in LLMs: A Two-Phase Learning Approach

TLDR: This research paper introduces a new understanding of how Reinforcement Learning (RL) improves Large Language Models’ (LLMs) reasoning abilities. It proposes that LLMs develop an “emergent reasoning hierarchy” with two phases: first mastering basic procedural skills, then shifting to exploring and mastering high-level strategic planning. Based on this, the authors developed HICRA (Hierarchy-Aware Credit Assignment), an RL algorithm that specifically focuses optimization on these high-impact planning tokens, significantly outperforming existing methods by fostering more effective strategic exploration.

Large Language Models (LLMs) have shown remarkable progress in complex reasoning tasks, largely thanks to Reinforcement Learning (RL). However, the exact mechanisms behind this success have remained a mystery, often leading to puzzling observations like sudden “aha moments,” improved performance with longer outputs (“length-scaling”), and complex changes in how models predict the next word (token-level entropy).

A new research paper, titled “Emergent Hierarchical Reasoning in LLMs Through Reinforcement Learning,” by Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, and Wenhu Chen, sheds light on these phenomena. The authors propose that these seemingly disparate occurrences are actually hallmarks of a single, coherent process: RL helps LLMs develop an emergent functional reasoning hierarchy, much like how humans separate high-level strategic planning from low-level procedural execution. You can read the full paper here.

The Two-Phase Learning Journey

The core insight of this work is that an LLM’s learning process dynamically shifts its focus. Initially, the model concentrates on building a reliable foundation of low-level procedural skills. Think of this as mastering basic arithmetic or grammar. Once these foundational skills are solid, the learning bottleneck decisively shifts. Performance gains are then driven by the exploration and mastery of high-level strategic planning – essentially, learning how to combine those basic skills to solve more complex problems.

To analyze this, the researchers introduced “Strategic Grams” (SGs) as a way to functionally distinguish between high-level planning tokens and low-level execution tokens. Planning tokens are phrases that guide the logical flow, such as “we can use the fact that” (deduction), “let’s try a different approach” (branching), or “but the problem mentions that” (backtracking). Execution tokens, on the other hand, are concrete steps like calculations or variable substitutions.
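For illustration, here is a minimal sketch of how such a functional split might be operationalized: matching a reasoning trace against a hand-curated list of Strategic Grams. The phrase lexicon and category names below are illustrative assumptions, not the paper’s exact list.

```python
# Hypothetical mini-lexicon of Strategic Grams, keyed by strategy type.
# The example phrases come from the article; the full lexicon is an assumption.
STRATEGIC_GRAMS = {
    "deduction": ["we can use the fact that", "it follows that"],
    "branching": ["let's try a different approach", "alternatively"],
    "backtracking": ["but the problem mentions that", "wait, that contradicts"],
}

def tag_planning_spans(text: str) -> list[tuple[str, str]]:
    """Return (category, phrase) pairs for each Strategic Gram found in text.
    Anything not matched would be treated as low-level execution content."""
    lowered = text.lower()
    hits = []
    for category, phrases in STRATEGIC_GRAMS.items():
        for phrase in phrases:
            if phrase in lowered:
                hits.append((category, phrase))
    return hits

print(tag_planning_spans(
    "We can use the fact that x is even. Let's try a different approach."
))
# [('deduction', 'we can use the fact that'), ('branching', "let's try a different approach")]
```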

Evidence from Training Dynamics

The paper presents compelling evidence for this two-phase dynamic across various LLM families. In the early stages of training, models rapidly improve their procedural correctness. This is observed through a sharp decrease in both the “perplexity” (a measure of the model’s surprise at its own output; lower values mean higher confidence) and the “token-level entropy” (a measure of uncertainty over the next token) of execution tokens. This means the model quickly becomes confident and reliable in its basic operations, building a “toolbox” of procedural skills.
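Both diagnostics are straightforward to compute from a model’s raw logits. The sketch below is a minimal illustration (assuming PyTorch and next-token logits aligned with the realized tokens; the function name is ours); to reproduce the phase-one measurement, one would restrict it to execution-token positions.

```python
import torch
import torch.nn.functional as F

def perplexity_and_token_entropy(logits: torch.Tensor, target_ids: torch.Tensor):
    """logits: (seq_len, vocab_size); target_ids: (seq_len,) realized token ids."""
    log_probs = F.log_softmax(logits, dim=-1)                  # (seq_len, vocab)
    # Negative log-likelihood of the tokens the model actually produced.
    nll = -log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    perplexity = nll.mean().exp()                              # exp(mean NLL)
    # Shannon entropy of the predictive distribution at each position.
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return perplexity.item(), token_entropy.mean().item()
```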

Once this procedural foundation is established, the focus shifts. The “semantic entropy” of strategic grams – a measure of the diversity of the model’s high-level strategic plans – shows a distinct and steady increase. This indicates that the model is actively expanding its repertoire of strategies, not just converging on one. This strategic diversification directly correlates with improved accuracy and longer, more sophisticated reasoning chains, demonstrating that strategic planning becomes the primary driver for advanced reasoning.
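Unlike token-level entropy, semantic entropy is computed over whole strategies rather than individual tokens. A minimal sketch, assuming each sampled rollout has already been bucketed into a strategy label (how that bucketing is done is the hard part and is elided here):

```python
from collections import Counter
import math

def semantic_entropy(strategy_labels: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of strategies."""
    counts = Counter(strategy_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# One dominant strategy -> low semantic entropy:
print(semantic_entropy(["casework"] * 8 + ["induction"] + ["contradiction"]))        # ~0.64
# A diversifying repertoire -> higher semantic entropy:
print(semantic_entropy(["casework"] * 4 + ["induction"] * 3 + ["contradiction"] * 3))  # ~1.09
```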

Explaining Puzzling Phenomena

This emergent reasoning hierarchy offers a unified explanation for previously observed behaviors:

  • “Aha moments” are interpreted as the model discovering and internalizing a new, powerful strategy.
  • “Length-scaling”, where longer outputs lead to better performance, is consistent with the use of more sophisticated strategies that naturally involve thorough planning and logical deliberation, thus elongating the reasoning trace.

Crucially, the paper highlights that “semantic entropy” is a superior metric for tracking strategic exploration compared to misleading metrics like aggregate “token-level entropy,” which can decrease even when strategic exploration is increasing.

Introducing HICRA: Hierarchy-Aware Credit Assignment

This understanding of a dynamic learning bottleneck led to the development of a new algorithm: Hierarchy-Aware Credit Assignment (HICRA). Existing RL algorithms like GRPO (Group Relative Policy Optimization) apply optimization pressure uniformly across all tokens, diluting the learning signal. HICRA addresses this inefficiency by concentrating optimization efforts on high-impact planning tokens.

HICRA modifies the reward signal to amplify credits for planning tokens in successful trajectories and dampen penalties for them in unsuccessful ones. This targeted approach ensures that the model’s learning capacity is focused where it matters most – on developing and reinforcing effective high-level reasoning strategies. Experiments show that HICRA significantly outperforms strong baselines across multiple models and benchmarks, validating its effectiveness in unlocking advanced reasoning.
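To make the mechanism concrete, here is a minimal sketch of hierarchy-aware credit assignment layered on top of GRPO-style per-token advantages. The amplification factor `alpha` and the symmetric amplify/dampen rule are illustrative assumptions, not the paper’s exact formulation.

```python
import torch

def hicra_style_advantages(
    advantages: torch.Tensor,     # (seq_len,) per-token advantages from GRPO
    planning_mask: torch.Tensor,  # (seq_len,) bool, True on planning tokens
    alpha: float = 2.0,           # illustrative amplification factor
) -> torch.Tensor:
    out = advantages.clone()
    rewarded = planning_mask & (advantages > 0)   # planning tokens, successful trajectory
    penalized = planning_mask & (advantages < 0)  # planning tokens, failed trajectory
    out[rewarded] *= alpha   # amplify credit for good strategic choices
    out[penalized] /= alpha  # dampen penalties so strategies are not over-punished
    return out               # execution-token advantages pass through unchanged
```

The design intuition: execution tokens still receive the ordinary learning signal, so procedural skills are maintained, while the extra optimization pressure lands on the planning tokens that gate strategic exploration.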

Targeted vs. Indiscriminate Exploration

The research further justifies HICRA by comparing its targeted exploration with indiscriminate exploration methods, such as entropy regularization. While entropy regularization can increase overall token-level entropy, it often fails to improve performance because it encourages non-productive verbosity across low-level tokens. HICRA, by contrast, boosts semantic entropy, leading to better validation accuracy by focusing exploration on the strategic aspects of reasoning.
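For contrast, standard entropy regularization adds a uniform bonus over all tokens to the policy loss, as in the sketch below (the coefficient `beta` is illustrative). Because the bonus is indiscriminate, it inflates uncertainty on low-level execution tokens just as much as on planning tokens.

```python
import torch

def loss_with_entropy_bonus(policy_loss: torch.Tensor,
                            log_probs: torch.Tensor,  # (seq_len, vocab_size)
                            beta: float = 0.01) -> torch.Tensor:
    """Classic entropy regularization: subtracting the mean entropy from the
    loss pushes ALL token distributions toward higher entropy, regardless of
    whether a position carries strategic or merely procedural content."""
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return policy_loss - beta * entropy
```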

However, HICRA’s effectiveness depends on a foundational level of procedural correctness in the base model. If this foundation is lacking, HICRA’s enforced strategic exploration can be counterproductive, leading to unstable learning. This suggests future work could focus on more adaptive, model-aware hierarchical methods.

A New Compass for AI Development

The paper concludes by emphasizing the importance of semantic entropy as a reliable compass for measuring strategic exploration, especially when traditional metrics like token-level entropy or Pass@K can be misleading. This work not only provides a deeper understanding of how LLMs learn to reason but also offers a clear blueprint for designing more principled and efficient RL algorithms, paving the way for more advanced and robust AI reasoning capabilities.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach out to her at: [email protected]
