TLDR: DeepSearch is a new framework that integrates Monte Carlo Tree Search (MCTS) directly into the training process of Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. This approach targets the training plateaus caused by limited exploration by letting models explore reasoning paths systematically. DeepSearch achieves state-of-the-art accuracy on mathematical reasoning tasks for 1.5-billion-parameter models while using about 5.7 times fewer GPU hours than extended training.
Large language models (LLMs) are becoming increasingly adept at complex reasoning tasks, a capability often boosted by a technique called Reinforcement Learning with Verifiable Rewards (RLVR). However, a significant challenge has emerged: after thousands of training steps, these models often hit a performance ceiling, showing diminishing returns despite massive computational investment. This limitation stems from how models currently explore solutions during training, often missing crucial reasoning paths.
DeepSearch is a framework that aims to overcome this bottleneck by embedding Monte Carlo Tree Search (MCTS) directly into the RLVR training process. Unlike previous methods that only use tree search during inference (when the model is generating an answer), DeepSearch makes this structured exploration part of the training loop itself. This shift allows models to systematically explore a wider range of reasoning possibilities and assign credit more precisely to individual steps in a solution.
Addressing the Exploration Bottleneck
The core idea behind DeepSearch is to focus on “training-time exploration.” Traditional RLVR relies on limited “rollouts” – essentially, trying out a few reasoning paths. If these paths are too narrow, the model doesn’t learn how to navigate the full complexity of a problem. DeepSearch, by contrast, uses MCTS to expand the reasoning frontier systematically during training, providing richer learning signals than just knowing if the final outcome was correct.
DeepSearch introduces several key innovations to achieve this:
Global Frontier Selection: Instead of just looking at immediate next steps, DeepSearch prioritizes the most promising nodes across the entire search tree. This helps the model avoid getting stuck in suboptimal local solutions.
Entropy-based Guidance: This mechanism helps identify “confident negative examples” – incorrect reasoning paths where the model was surprisingly certain. Learning from these confident mistakes provides valuable supervision.
Adaptive Replay Buffer: To ensure efficiency, DeepSearch uses a smart replay buffer that stores correct solutions found earlier. This prevents redundant computation for already-solved problems and focuses MCTS on truly challenging ones.
How DeepSearch Works
At its heart, DeepSearch modifies the MCTS framework. When exploring a problem, it generates multiple candidate reasoning steps. If a correct solution is found, that path is reinforced. If not, and there are incorrect paths, it identifies the most “confident” incorrect path (one where the model was least uncertain in its wrong decisions) to learn from. This fine-grained feedback, combined with a “heuristic score backup” system, helps update the model’s understanding of good and bad reasoning steps.
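The selection of a "confident" incorrect path described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each incorrect path is stored as a list of per-step probability distributions, and the names `step_entropy`, `mean_path_entropy`, and `most_confident_negative` are hypothetical.

```python
import math

def step_entropy(probs):
    """Shannon entropy (in nats) of one step's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_path_entropy(path):
    """Average entropy over all decisions along one reasoning path."""
    return sum(step_entropy(p) for p in path) / len(path)

def most_confident_negative(incorrect_paths):
    """Return the name of the incorrect path with the lowest average entropy.

    Assumption: low average entropy stands in for "the model was certain".
    """
    return min(incorrect_paths, key=lambda name: mean_path_entropy(incorrect_paths[name]))

# Toy data: two incorrect paths, each with two decisions.
incorrect_paths = {
    "hesitant": [[0.5, 0.5], [0.4, 0.6]],
    "confident": [[0.9, 0.1], [0.95, 0.05]],
}
chosen = most_confident_negative(incorrect_paths)
print(chosen)
```

In this toy example the path whose distributions are sharply peaked is picked as the negative example, because its wrong decisions were made with little uncertainty.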
The “hybrid selection strategy” is crucial. It uses a local Upper Confidence Bounds for Trees (UCT) algorithm for comparing sibling reasoning steps, but also employs a novel global frontier selection. This global approach evaluates all potential expansion points across the entire search tree simultaneously, using a “frontier priority score” that considers the quality of parent nodes, the uncertainty of the current step, and the depth of the search.
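To make the two selection modes concrete, here is a rough sketch. The UCT formula is the standard one; the frontier score is only an assumed weighted sum of the three factors named above (parent quality, step uncertainty, depth). The weights, the `tanh` squashing, and the choice to penalize entropy are illustrative assumptions, not the paper's definition.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    visits: int = 0
    value_sum: float = 0.0
    parent_q: float = 0.0   # quality estimate of the parent node
    entropy: float = 0.0    # uncertainty of the step leading to this node
    depth: int = 0

def uct(child, parent_visits, c=1.4):
    """Standard UCT score, used locally to compare sibling nodes."""
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = child.value_sum / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def frontier_priority(node, w_quality=1.0, w_entropy=0.5, w_depth=0.1):
    """Hypothetical global score over all expandable nodes.

    Assumption: higher parent quality and greater depth raise priority,
    higher entropy lowers it. Flip the sign of w_entropy to favor
    uncertain steps instead.
    """
    return (w_quality * math.tanh(node.parent_q)
            - w_entropy * node.entropy
            + w_depth * node.depth)

# Local comparison among siblings that share a parent with 12 visits.
siblings = [Node("a", visits=10, value_sum=6.0), Node("b", visits=2, value_sum=1.0)]
best_sibling = max(siblings, key=lambda n: uct(n, parent_visits=12))

# Global comparison across every frontier node in the tree.
frontier = [
    Node("x", parent_q=0.8, entropy=0.2, depth=3),
    Node("y", parent_q=-0.5, entropy=0.1, depth=5),
]
best_frontier = max(frontier, key=frontier_priority)
print(best_sibling.name, best_frontier.name)
```

The design point is that UCT only ranks children of one parent, while the frontier score ranks every expandable node at once, so a promising branch elsewhere in the tree can win over the locally best sibling.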
Efficiency and Performance
DeepSearch was evaluated on challenging mathematical reasoning benchmarks, including AIME, AMC, MATH, Minerva, and Olympiad problems. DeepSearch-1.5B achieved an average accuracy of 62.95%, setting a new state of the art for 1.5-billion-parameter reasoning models. This represents a 1.25 percentage point improvement over the previous best model, Nemotron-Research-Reasoning-Qwen-1.5B v2.
Perhaps even more striking is its computational efficiency. DeepSearch achieved these results using roughly 5.7 times fewer GPU hours than extended training approaches that simply scale up the number of training steps. For instance, DeepSearch reached its peak performance in 330 GPU hours, while an extended training baseline consumed 1,883.2 GPU hours for a slightly lower accuracy (1,883.2 / 330 ≈ 5.7). This suggests that strategic exploration, rather than brute-force computation, is key to advancing RLVR methodologies.
The framework’s adaptive training strategy, which progressively filters challenging problems and reuses cached solutions, plays a vital role in this efficiency. Filtering focuses computational resources on the problems the model still fails, while reusing cached solutions keeps already-solved problems in training so that the model does not “forget” them.
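The filter-and-reuse idea can be sketched as follows. This is a simplified illustration under stated assumptions: the `ReplayBuffer` class, the `filter_hard` and `plan_batch` helpers, and the 0.25 pass-rate threshold are all hypothetical, chosen only to show the control flow.

```python
from dataclasses import dataclass, field

@dataclass
class ReplayBuffer:
    """Caches one correct solution per problem (hypothetical structure)."""
    solutions: dict = field(default_factory=dict)

    def add(self, problem_id, solution):
        # Keep the first correct solution found for a problem.
        self.solutions.setdefault(problem_id, solution)

    def get(self, problem_id):
        return self.solutions.get(problem_id)

def filter_hard(pass_rates, threshold=0.25):
    """Keep only problems the model still solves rarely (assumed threshold)."""
    return [pid for pid, rate in pass_rates.items() if rate < threshold]

def plan_batch(problem_ids, buffer):
    """Split problems into cached ones and ones that need a full tree search."""
    reuse = {pid: buffer.get(pid) for pid in problem_ids if buffer.get(pid) is not None}
    search = [pid for pid in problem_ids if pid not in reuse]
    return reuse, search

pass_rates = {"p1": 0.9, "p2": 0.1, "p3": 0.0}
hard = filter_hard(pass_rates)           # p1 is dropped as already mastered

buffer = ReplayBuffer()
buffer.add("p2", "cached correct solution")

reuse, search = plan_batch(hard, buffer)
print(reuse, search)
```

Here only `p3` would trigger a new tree search; `p2` is trained on from its cached solution, which is how the expensive search budget ends up concentrated on unsolved problems.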
A New Direction for AI Reasoning
DeepSearch represents a significant step forward in how large language models learn to reason. By bridging the gap between sophisticated inference-time search capabilities and the training process, it offers a new paradigm for scaling RLVR. This work suggests that the future of reasoning model development lies not just in making models bigger or training them longer, but in fundamentally rethinking how we structure the learning process to mirror the complex reasoning patterns we expect from advanced AI.


