TL;DR: This research paper reveals a paradox in how large language models learn to reason: although exploration is known to improve performance, common post-training methods such as reinforcement learning tend to reinforce existing, easy reasoning paths while neglecting crucial rare ones. Using a Tree-structured Markov Chain model, the authors prove that these methods induce a “squeezing effect” that prioritizes consistency over accuracy, causing the model to forget complex solutions. The paper demonstrates that exploration, even within the model’s existing knowledge, is vital for preserving these rare but correct reasoning paths, and proposes strategies such as rejecting easy problems and KL regularization to counteract this bias.
Foundation models, the powerful AI systems underpinning many modern applications, possess vast knowledge. However, when it comes to intricate, task-specific reasoning, they often hit a wall. To overcome this, researchers employ various post-training strategies, such as Reinforcement Learning with Verifiable Rewards (RLVR) and inference scaling with Outcome or Process Reward Models (ORM/PRM).
Intriguingly, while recent studies highlight the crucial role of “exploration” and “entropy stability” in boosting performance on complex tasks, empirical evidence presents a puzzling paradox. These advanced post-training methods typically reinforce existing, well-trodden reasoning paths rather than genuinely expanding the model’s reasoning scope. This raises a fundamental question: if new reasoning patterns aren’t emerging, why does exploration help at all?
A new research paper, titled “Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning,” by Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, and Taiji Suzuki, delves into this very paradox. The authors propose a novel theoretical framework to understand how exploration, even when confined to the model’s existing knowledge, remains essential for solving challenging problems. You can read the full paper here.
Modeling the Mind: Tree-structured Markov Chains
To unravel this mystery, the researchers adopt a sophisticated yet understandable approach. They view each reasoning step—from simplifying a fraction to discovering a complex symmetry—as a low- or high-probability transition within a Multi-task Tree-structured Markov Chain (TMC). In this model, the initial training of a foundation model is akin to “discovering” a tree-like graph of potential reasoning paths. Post-training, then, becomes a process of “reweighting” these Chain-of-Thought (CoT) paths, essentially deciding which paths are more likely to be taken.
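To make the abstraction concrete, below is a minimal, hypothetical Python sketch of such a tree (the structure and probabilities are illustrative assumptions, not the paper’s construction): each node has weighted transitions to child steps, and a CoT is a root-to-leaf path sampled from those weights.

```python
import random

# Hypothetical tree-structured Markov chain over reasoning steps:
# each node maps to (child step, transition probability) pairs.
TMC = {
    "root": [("simplify_fraction", 0.7), ("spot_symmetry", 0.3)],  # common vs. rare first step
    "simplify_fraction": [("routine_answer", 1.0)],
    "spot_symmetry": [("hard_answer", 1.0)],
}
LEAVES = {"routine_answer", "hard_answer"}

def sample_cot(node="root"):
    """Sample one Chain-of-Thought as a root-to-leaf path through the tree."""
    path = [node]
    while node not in LEAVES:
        children, probs = zip(*TMC[node])
        node = random.choices(children, weights=probs, k=1)[0]
        path.append(node)
    return path

# In this picture, pre-training discovers the tree's structure, while
# post-training reweights the transition probabilities along its edges.
print(sample_cot())
```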
The Squeezing Effect and the Bias Towards Consistency
Within this tractable model, the paper rigorously proves several phenomena observed in empirical studies:
- The Squeezing Effect of RLVR: Reinforcement Learning with Verifiable Rewards, while seemingly beneficial, induces a “squeezing effect”: it reduces the diversity (entropy) of reasoning paths and inadvertently causes the model to “forget” some correct but less common solutions, because it prioritizes paths that are frequently rewarded (see the toy sketch after this list).
- Consistency Over Accuracy: Inference scaling methods using Outcome or Process Reward Models (ORM/PRM) tend to reward consistency rather than true accuracy. This means they favor reasoning patterns that are common and frequently observed, even if these aren’t always the most accurate for every problem instance. Neural verifiers, in essence, become prone to validating what’s typical rather than what’s truly correct.
- The Merit of Rare Thoughts: The paper highlights that difficult problem instances are often solved by “rare, high-uncertainty” Chains-of-Thought generated by the base model. These are the less obvious, less frequent reasoning paths that hold the key to complex solutions. However, these crucial rare CoTs are precisely what get squeezed out by RLVR or are unfavored by consistency-seeking inference scaling.
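The squeezing effect can be imitated with a deliberately tiny toy model (a minimal sketch under assumed numbers and a simplified update rule, not the paper’s TMC analysis): when easy instances dominate the data, a reward-driven update keeps reinforcing the common CoT, the rare CoT’s probability mass is squeezed toward zero, and the policy’s entropy falls.

```python
import math
import random

P_EASY = 0.9                                     # assumed fraction of easy instances
policy = {"common_cot": 0.85, "rare_cot": 0.15}  # assumed base-model path probabilities
LR = 0.5

def reward(cot, instance):
    # The common CoT solves easy instances; only the rare CoT solves hard ones.
    return 1.0 if (cot == "common_cot") == (instance == "easy") else 0.0

def entropy(p):
    return -sum(q * math.log(q) for q in p.values() if q > 0)

for _ in range(200):
    instance = "easy" if random.random() < P_EASY else "hard"
    cot = random.choices(list(policy), weights=list(policy.values()), k=1)[0]
    # Simplified REINFORCE-style update in logit space, then renormalize.
    logits = {c: math.log(p) for c, p in policy.items()}
    logits[cot] += LR * (reward(cot, instance) - 0.5)   # 0.5 acts as a crude baseline
    z = sum(math.exp(v) for v in logits.values())
    policy = {c: math.exp(v) / z for c, v in logits.items()}

print(policy, "entropy:", round(entropy(policy), 3))
# Because easy instances dominate, the common CoT is rewarded far more often:
# the rare CoT is squeezed out and entropy drops, i.e., the correct-but-rare
# solution path for hard instances is gradually forgotten.
```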
Why Exploration is Indispensable
The collective weight of these findings offers a powerful resolution to the initial paradox. Exploration, even if it doesn’t lead to entirely new reasoning structures, is vital because it preserves access to these rare but crucial Chains-of-Thought. Without exploration, these unique paths, essential for tackling difficult cases, would be lost or overlooked by post-training methods that inadvertently prioritize commonality and simplicity.
Strategies to Foster Deeper Reasoning
Building on their theoretical insights, the researchers propose and prove the effectiveness of several exploration strategies:
- Rejecting Easy Instances: By actively discarding instances that are easily solved by existing, well-learned CoTs, models are compelled to focus on harder problems. This curriculum-style filtering helps preserve and reinforce the rare CoTs needed for complex challenges (a minimal sketch follows this list).
- KL Regularization: Incorporating KL (Kullback-Leibler) regularization during training helps maintain the diversity of reasoning paths. This prevents the model from collapsing into a narrow set of highly confident, but potentially incomplete, solutions, thereby preserving its broad problem-solving capabilities across multiple tasks.
- Gibbs Sampling (Soft-BoN/DPRM-AS): For inference scaling, methods such as Soft Best-of-N (Soft-BoN) and Doob’s h-Transform-induced Process Reward Model (DPRM-AS) offer a principled way to balance reward maximization against preserving the base model’s inherent diversity. These approaches can be tuned so that rare but valuable CoTs are not overlooked during solution generation (see the second sketch below).
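Below is a minimal sketch of the “rejecting easy instances” idea (the threshold, sample counts, and helper functions are illustrative assumptions, not the paper’s recipe): estimate how often the current policy already solves each instance, and drop the instances it solves almost surely so the training signal concentrates on harder ones.

```python
import random

def solve_rate(sample_fn, verify_fn, instance, k=8):
    """Monte Carlo estimate of how often the current policy solves an instance."""
    return sum(verify_fn(sample_fn(instance), instance) for _ in range(k)) / k

def reject_easy(batch, sample_fn, verify_fn, max_rate=0.9):
    """Keep only instances the policy does not already solve almost surely."""
    return [x for x in batch if solve_rate(sample_fn, verify_fn, x) < max_rate]

# Hypothetical stand-ins for a policy and a verifier.
def sample_fn(instance):
    p_correct = 0.99 if instance == "easy" else 0.20
    return "correct" if random.random() < p_correct else "wrong"

def verify_fn(answer, instance):
    return answer == "correct"

batch = ["easy"] * 8 + ["hard"] * 2
print(reject_easy(batch, sample_fn, verify_fn))
# The easy instances are usually filtered out; the hard ones remain and keep
# exercising the rare CoTs they require.
```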
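And here is a minimal sketch of the Soft-BoN idea (the helper names, reward values, and temperature are illustrative assumptions, not the paper’s implementation): rather than returning the single highest-reward candidate, Soft-BoN samples among the N candidates with probability proportional to exp(reward/τ), the finite-sample counterpart of a KL-regularized, Gibbs-tilted version of the base distribution.

```python
import math
import random

def soft_best_of_n(sample_fn, reward_fn, n=8, tau=0.5):
    """Soft Best-of-N: draw n candidates from the base model and pick one with
    probability proportional to exp(reward / tau).  Larger tau preserves more
    of the base model's diversity; tau -> 0 recovers ordinary Best-of-N."""
    candidates = [sample_fn() for _ in range(n)]
    weights = [math.exp(reward_fn(c) / tau) for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# Hypothetical stand-ins: a base model that rarely emits the rare CoT, and a
# consistency-biased verifier that scores the typical CoT slightly higher.
def sample_fn():
    return random.choices(["common_cot", "rare_cot"], weights=[0.85, 0.15], k=1)[0]

def reward_fn(cot):
    return 1.0 if cot == "common_cot" else 0.8

picks = [soft_best_of_n(sample_fn, reward_fn) for _ in range(1000)]
print("rare CoT selected in", picks.count("rare_cot"), "of 1000 draws")
# With a hard argmax (tau -> 0) the rare CoT would almost never be returned;
# the softened selection keeps it in play, in the spirit of the Gibbs-tilted
# target pi*(c) proportional to pi_base(c) * exp(r(c) / tau).
```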
Empirical Validation
The theoretical findings do not remain purely abstract: simulations on the Tree-structured Markov Chain model corroborate them. These simulations clearly show that standard RL fine-tuning and ORM/PRM-based inference methods heavily favor easy-to-reason CoTs, leading to a “simplicity bias” and a “forgetting” effect on secondary tasks. In contrast, diversity-promoting methods such as rejecting easy instances, KL-regularized GRPO, Soft-BoN, and DPRM-AS successfully balance easy and hard reasoning paths, while also preserving the model’s ability to perform across multiple tasks.
Looking Ahead
This research offers a significant step towards understanding the intricate dynamics of post-training reasoning in foundation models. While acknowledging limitations such as the abstract nature of the TMC framework and the complexities of real-world large-scale models, the paper provides crucial insights. It underscores that for AI to truly excel at complex reasoning, strategies must actively counteract the inherent bias towards simplicity and consistency, ensuring that the valuable “rare thoughts” are not just preserved, but actively nurtured.


