Unpacking AI's Inner Workings: How Training Shapes Reasoning Circuits

TLDR: This research paper investigates how post-training methods like supervised fine-tuning (SFT) and reinforcement learning (RL) alter the internal architecture of large reasoning models by causing “emergent attention heads” to appear. SFT and distillation lead to a cumulative addition of stable reasoning heads, which can improve complex problem-solving but also introduce “overthinking” errors. In contrast, RL (GRPO) dynamically activates and prunes heads, optimizing existing pathways. The study also found that “think on/off” models don’t have dedicated “thinking” heads; instead, “think off” activates a broader, less efficient compensatory network. These findings reveal a trade-off between complex reasoning capabilities and reliable elementary computations, guiding future AI training design.

Large reasoning models, the powerhouses behind many advanced AI capabilities, have shown remarkable prowess in tackling complex, multi-step problems. However, the inner workings of how these models achieve such feats, especially after specialized training, have largely remained a mystery. A recent research paper titled “Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training” by Yein Park, Minbyul Jeong, and Jaewoo Kang delves into this very enigma, shedding light on the architectural changes that occur within these models during post-training. You can read the full paper here.

The researchers used a technique called circuit analysis to peer inside the “black box” of large reasoning models. Their focus was on understanding how post-training methods, such as supervised fine-tuning (SFT) and reinforcement learning (RL), lead to the emergence of specialized “attention heads.” These attention heads are essentially components within the model’s neural network that learn to focus on specific parts of the input, and the study reveals they play a crucial role in structured reasoning and computation.
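As a rough illustration of what “emergent” means here, suppose each checkpoint's circuit has already been reduced to a set of (layer, head) pairs. That reduction is an assumption made for this sketch, not the paper's procedure, and the function name and example indices below are hypothetical. Comparing a base model with a post-trained one then comes down to set differences:

```python
def diff_heads(base_heads, tuned_heads):
    """Compare two circuits given as collections of (layer, head) pairs.

    Returns the heads that appear only after post-training ("emergent"),
    the heads that disappear ("lost"), and the heads present in both.
    """
    base, tuned = set(base_heads), set(tuned_heads)
    return {
        "emergent": sorted(tuned - base),
        "lost": sorted(base - tuned),
        "shared": sorted(base & tuned),
    }


# Hypothetical head indices, chosen only for illustration.
base_circuit = {(3, 1), (10, 4), (17, 0)}
tuned_circuit = {(3, 1), (17, 0), (21, 5), (24, 2)}

result = diff_heads(base_circuit, tuned_circuit)
print(result["emergent"])  # [(21, 5), (24, 2)]
print(result["lost"])      # [(10, 4)]
```

Running the same comparison across several training checkpoints would show whether new heads accumulate steadily or come and go.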

Different Training, Different Heads

The study compared several training regimes, focusing primarily on the Qwen family of models. The researchers found that the way these emergent attention heads evolve differs significantly depending on the training method:

Distillation and Supervised Fine-Tuning (SFT): These methods tend to foster a cumulative addition of stable reasoning heads. Distillation, which involves a smaller model learning from a larger “teacher” model, and SFT, where models are fine-tuned on curated datasets, both introduce a substantial number of new heads. These heads are often found in early-to-mid layers for distillation and mid-to-late layers for SFT. While these new heads enable sophisticated problem-solving, the researchers also observed a potential downside: “over-thinking” failure modes, where models might make calculation errors or get stuck in logical loops on simpler tasks. This suggests a trade-off where complex reasoning can sometimes come at the cost of elementary computations.
Group Relative Policy Optimization (GRPO): In contrast to the additive nature of SFT and distillation, GRPO, a reinforcement learning algorithm, operates in a more dynamic “search mode.” Here, attention heads are iteratively activated, evaluated, and pruned based on how well they contribute to the task reward signal. This results in a smaller, more targeted set of emergent heads. GRPO acts like a “scalpel,” optimizing existing knowledge and computational pathways rather than building entirely new ones. This dynamic process reflects an explore-exploit trade-off, where the model constantly refines its architecture to favor effective strategies.
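GRPO scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative advantage, assuming scalar rewards and a population standard deviation (implementations differ on this detail and on the epsilon term). The function name is illustrative, and this is not code from the paper:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled responses.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Responses that beat their group average receive a positive advantage and are reinforced, while the rest are pushed down. This relative signal is what drives the updates under which the article's “activate, evaluate, prune” dynamic is observed.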

The “Think On/Off” Mechanism

The paper also investigated models with a controllable “think on/off” feature, like Qwen3-8B. Surprisingly, the “think on” mode doesn’t rely on a dedicated set of “thinking” heads. Instead, it leverages an optimized circuit already embedded within its structure. When explicit reasoning is turned “off,” the model compensates by activating a broader, but less efficient, set of compensatory heads. This observation suggests that the model has internalized a highly efficient mechanism for selecting reasoning pathways. Ablating or scaling down these “think off” heads in certain scenarios actually improved performance, especially for complex tasks, by clarifying the model’s reasoning pathways. However, these numerous emergent heads in “think off” mode can also be a crucial asset for robust problem-solving when a larger computational budget is available, allowing for the exploration of diverse computational pathways.
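Ablating or scaling a head can be pictured as multiplying that head's contribution to the layer output by a factor before the contributions are summed, where a factor of 0 removes the head entirely. The sketch below assumes per-head contributions are already available as plain vectors. The function name, head indices, and numbers are hypothetical, and the paper's actual intervention procedure may differ:

```python
def combine_heads(head_contributions, scales=None):
    """Sum per-head contribution vectors, applying an optional scale per head.

    head_contributions: list of equal-length vectors, one per attention head.
    scales: dict mapping head index -> multiplier (default 1.0 for each head).
    """
    scales = scales or {}
    dim = len(head_contributions[0])
    output = [0.0] * dim
    for head_index, contribution in enumerate(head_contributions):
        factor = scales.get(head_index, 1.0)
        for i, value in enumerate(contribution):
            output[i] += factor * value
    return output


# Three hypothetical heads, each contributing a 2-dimensional vector.
heads = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]

print(combine_heads(heads))            # [111.0, 222.0]  all heads intact
print(combine_heads(heads, {1: 0.0}))  # [101.0, 202.0]  head 1 ablated
print(combine_heads(heads, {1: 0.5}))  # [106.0, 212.0]  head 1 scaled down
```

Comparing task accuracy with and without such an intervention is one way to test whether a given head helps or hinders the model on a particular kind of problem.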

Connecting Circuits to Performance

The findings establish a clear link between these circuit-level dynamics and the model’s overall performance. Strengthened heads enable sophisticated problem-solving strategies for difficult problems. However, this can also introduce “over-thinking” issues on simpler tasks. This inherent tension highlights a critical challenge in training policy design: balancing the development of effective reasoning strategies with the assurance of reliable, flawless execution across a range of tasks.

This research provides a valuable mechanistic perspective on how post-training fundamentally reshapes the internal architecture of reasoning models. It moves beyond high-level performance metrics to offer a granular understanding of the functional changes, paving the way for more principled, interpretable, and robust training strategies for the next generation of AI models.
