TLDR: A study on PRM-guided tree search for mathematical reasoning in LLMs found that it did not significantly outperform simpler Best-of-N methods, despite higher computational costs. Key reasons include Process Reward Models (PRMs) poorly approximating the value of intermediate states, reliability that decreases with reasoning depth, and limited generalization to out-of-distribution problems. Monte Carlo Tree Search and beam search were the most effective tree search methods, but the overall findings suggest that current PRMs are insufficient to guide complex, dynamic mathematical reasoning in LLMs.
Large Language Models (LLMs) have made impressive strides in various domains, including mathematical reasoning. Traditionally, methods like Chain-of-Thought (CoT) prompting combined with Best-of-N (BoN) selection have been popular. CoT breaks down problems into sequential steps, and BoN picks the best solution from several generated candidates, often guided by a Process Reward Model (PRM) that evaluates each step.
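Concretely, Best-of-N selection with a PRM can be sketched in a few lines. Everything below is illustrative: `score_steps` is a hypothetical stand-in for a learned PRM (a real one is a model call, not a length heuristic), and the candidate solutions are made up.

```python
def score_steps(steps):
    # Hypothetical stand-in for a PRM: one score in [0, 1] per step.
    # A real PRM is a learned model; this length heuristic is a placeholder.
    return [min(1.0, len(step) / 40) for step in steps]

def solution_score(steps):
    # Aggregate step scores; taking the minimum means one weak step
    # drags down the whole solution (a common aggregation choice).
    return min(score_steps(steps))

def best_of_n(candidates):
    # Best-of-N: score each *complete* candidate solution, keep the best.
    return max(candidates, key=solution_score)

detailed = ["Substitute x = 3 into 2x + 1 to get 2 * 3 + 1",
            "2 * 3 + 1 simplifies to 6 + 1, which is 7"]
terse = ["x = 3", "Answer: 7"]
best = best_of_n([terse, detailed])
```

Note that the PRM scores individual steps, but Best-of-N only ever compares finished solutions; that distinction becomes important below.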
However, mathematical problem-solving isn’t always a straight line. It often involves exploring multiple strategies, trying out partial solutions, and backtracking when errors occur. This branching, exploratory nature isn’t fully captured by the linear structure of Chain-of-Thought. This is where the idea of tree search comes in, aiming to mimic human-like exploration by evaluating multiple partial reasoning paths.
A recent study investigated whether PRM-guided tree search could enhance mathematical reasoning in LLMs by allowing them to explore these diverse paths. The researchers developed an adaptive algorithm to maximize PRM scores across the complex action space inherent in tree search. They used the Qwen2.5-Math-7B-Instruct LLM and its associated Qwen2.5-Math-PRM-7B Process Reward Model as a case study, testing various tree search algorithms against Best-of-N across 23 different mathematical problems.
The findings revealed some critical limitations. First, despite their higher computational costs, tree search methods showed no statistically significant improvement over the simpler Best-of-N approach. For this particular LLM and PRM pairing, the added complexity and expense of tree search did not translate into better accuracy.
Among the different PRM-guided tree search methods, Monte Carlo Tree Search (MCTS) and beam search emerged as the top performers. However, even these methods couldn’t consistently outperform Best-of-N. The study also highlighted a significant issue with the Process Reward Models themselves: they struggled to accurately estimate the value of intermediate reasoning steps. Their reliability tended to degrade as the reasoning depth increased, suggesting problems with how credit was assigned to earlier steps in a long reasoning chain.
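Of these, beam search is the easiest to sketch. The snippet below is a minimal, generic illustration rather than the study's implementation: `expand` stands in for an LLM proposing next reasoning steps, `score` stands in for the PRM evaluating a partial path, and the toy usage at the end uses 0/1 tokens instead of text steps.

```python
import heapq

def beam_search(initial, expand, score, beam_width=2, depth=3):
    """PRM-guided beam search sketch: at each depth, keep only the
    beam_width partial paths with the highest intermediate scores.
    Because pruning happens on partial solutions, any inaccuracy in
    intermediate scores compounds as the search goes deeper."""
    beam = [initial]
    for _ in range(depth):
        # `expand` is a hypothetical stand-in for LLM step generation.
        expansions = [path + [step] for path in beam for step in expand(path)]
        if not expansions:
            break
        # `score` is a hypothetical stand-in for a PRM scoring a prefix.
        beam = heapq.nlargest(beam_width, expansions, key=score)
    return max(beam, key=score)

# Toy usage: the "PRM" counts ones, so the search should
# converge on the all-ones path.
best = beam_search([], expand=lambda p: [0, 1], score=sum,
                   beam_width=2, depth=3)
```

The key structural point: every call to `score` here is on a *partial* path, which is exactly where the study found PRMs to be least reliable.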
Furthermore, the PRMs demonstrated poor generalization capabilities. Their effectiveness was notably higher on problems similar to those they were trained on (in-distribution) compared to new, unfamiliar problem types (out-of-distribution). This generalization gap persisted across most reasoning steps, limiting the practical utility of PRM-guided tree search in diverse mathematical contexts.
The core reason for this underperformance, as identified by the researchers, is tree search’s greater reliance on these unreliable intermediate PRM scores to guide its exploration. In contrast, Best-of-N primarily evaluates only complete solutions, making it less susceptible to the inaccuracies of intermediate step evaluations. These results suggest that while tree search holds promise for complex reasoning, current Process Reward Models may not be accurate enough to effectively guide such dynamic exploration in LLMs. Future advancements in mathematical reasoning with LLMs might require the development of more robust and reliable reward models.
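This mechanism is easy to see in a toy example. The numbers below are invented purely for illustration (they are not from the paper): an inflated intermediate PRM score overrates branch A, so a width-1 tree search commits to it, while Best-of-N, which only compares complete paths, still recovers the better solution.

```python
# Toy two-level search tree. Values are made-up intermediate PRM scores;
# branch A looks good early, but its continuation is poor.
tree = {
    "root": {"A": 0.9, "B": 0.6},   # noisy intermediate scores
    "A": {"A1": 0.2},               # A's high early score was misleading
    "B": {"B1": 0.95},
}

def greedy_tree_search(tree):
    # Width-1 tree search: commit to the best-looking step at each depth,
    # so an early scoring error can never be undone.
    path, node = [], "root"
    while node in tree:
        node = max(tree[node], key=tree[node].get)
        path.append(node)
    return path

def best_of_n_paths(tree):
    # Best-of-N over complete paths: aggregate each full trajectory's
    # step scores (here, by taking the minimum) and only then compare.
    complete = [["A", "A1"], ["B", "B1"]]
    def full_score(path):
        scores, node = [], "root"
        for step in path:
            scores.append(tree[node][step])
            node = step
        return min(scores)
    return max(complete, key=full_score)
```

Both procedures consult the same noisy scores, but only the tree search is forced to act on them before the full trajectory is visible.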
Also Read:
- Beyond Final Answers: How RLVR Affects LLM Reasoning Traces
- Mapping the Thought Process of Language Models in Math
You can read the full research paper for more technical details and experimental results here: Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs.


