TLDR: MASPRM is a Process Reward Model for Multi-Agent Systems (MAS) that provides per-action, per-agent value estimates to guide inference-time search. It learns from MAS-MCTS rollouts without human annotations. MASPRM-guided decoding, especially when combined with an outcome reward model, raises exact match accuracy over a single straight-through MAS pass by +30.7 points on GSM8K and +22.9 points on MATH. A MASPRM trained on GSM8K also transfers zero-shot to MATH, improving exact match by 8.4 points without retraining.
Multi-Agent Systems (MAS), in which multiple specialized AI agents collaborate to solve complex problems, are a promising direction for artificial intelligence. Deploying them reliably in real-world scenarios, however, remains difficult. Traditional methods often struggle with sparse feedback: they only learn whether the final answer is right or wrong, which offers little guidance on which intermediate steps were helpful. As a result, errors can propagate through the system and computational resources are wasted on unpromising paths.
A new research paper introduces a novel solution to this problem: the Multi-Agent System Process Reward Model, or MASPRM. This innovative model acts as an inference-time controller, providing real-time, per-action, and per-agent value estimates to guide the MAS through its problem-solving process. Imagine a team of experts working on a project; MASPRM is like a smart manager who can assess the value of each expert’s contribution at every step, ensuring the team stays on the most productive track.
How MASPRM Works
At its core, MASPRM estimates the value of an intermediate state within the multi-agent dialogue. This means it can tell how much closer a specific message or action from an agent brings the system to a correct solution. Crucially, MASPRM is trained using a technique called “search-generated supervision” from multi-agent Monte Carlo Tree Search (MCTS) rollouts. This is a significant advantage because it doesn’t require tedious, step-by-step human annotations. Instead, it learns by propagating the final outcome rewards back to the individual actions, effectively teaching itself what good progress looks like.
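The idea of propagating final outcome rewards back to individual actions can be illustrated with a minimal sketch. It is not the paper's implementation: the `Node` structure, the helper names, and the 1.0/0.0 reward convention are assumptions made for illustration. Each rollout's final reward is added to every state on its path, and the averaged value at each state becomes a training target for the reward model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One inter-agent state in the search tree (hypothetical structure)."""
    state: str                               # dialogue so far, as text
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    def value(self) -> float:
        # Mean outcome reward of all rollouts that passed through this state.
        return self.total_reward / self.visits if self.visits else 0.0


def add_child(parent: Node, state: str) -> Node:
    child = Node(state=state, parent=parent)
    parent.children.append(child)
    return child


def backup(leaf: Node, reward: float) -> None:
    """Propagate a terminal reward from a leaf up to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def training_targets(root: Node) -> List[Tuple[str, float]]:
    """Collect (state, value estimate) pairs for every visited state."""
    targets, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.visits > 0:
            targets.append((node.state, node.value()))
        stack.extend(node.children)
    return targets


# Toy tree: assumed reward is 1.0 for a correct final answer, 0.0 otherwise.
root = Node(state="question")
plan_a = add_child(root, "agent1: plan A")
plan_b = add_child(root, "agent1: plan B")
backup(add_child(plan_a, "agent2: answer 42"), 1.0)   # correct
backup(add_child(plan_a, "agent2: answer 41"), 0.0)   # incorrect
backup(add_child(plan_b, "agent2: answer 7"), 0.0)    # incorrect
```

In this toy tree, "plan A" ends up with a higher value than "plan B" because one of its two rollouts reached a correct answer, even though no step was labeled by hand.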
The researchers highlight several unique challenges in developing such a model for MAS compared to single-agent systems:
- Granular Steps: In MAS, a single turn can involve multiple substeps like planning, tool invocation, or cross-agent summarization. MASPRM defines scores at the level of inter-agent states, not just individual tokens.
- Schedule and Topology Dependence: The value of an intermediate state depends on which agent acts next and what operation they perform, a complexity MASPRM accounts for.
- Heterogeneous Agents: MAS often involve agents with different roles, tools, and even base models. MASPRM is designed to be robust to these varying agent identities and capabilities.
- Partial Observability: Agents in a MAS typically only see a subset of the global state. MASPRM scores each inter-agent state based solely on the information available to the acting agent at that moment.
Guiding Inference and Achieving Results
During the inference phase (when the MAS is actively solving a problem), MASPRM guides two main search strategies: step-level beam search (SBS) and value-guided MCTS. It focuses computation on promising branches and prunes unproductive ones early, making the entire process more efficient.
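A minimal sketch of step-level beam search with a value function shows how this pruning works. The function names (`propose`, `value_fn`) and the toy scoring below are illustrative assumptions, not the paper's interface. In a real system `propose` would sample candidate messages from the acting agent, and `value_fn` would be the trained MASPRM.

```python
from typing import Callable, List


def step_level_beam_search(
    initial_state: str,
    propose: Callable[[str, int], List[str]],   # (state, step) -> candidate next states
    value_fn: Callable[[str], float],           # stand-in for the process reward model
    num_steps: int,
    beam_width: int,
) -> List[str]:
    """Keep only the `beam_width` highest-valued states after each agent step."""
    beam = [initial_state]
    for step in range(num_steps):
        # Expand every surviving state with the candidates for this step.
        candidates = [nxt for state in beam for nxt in propose(state, step)]
        if not candidates:
            break
        # Score candidates and prune low-value branches early.
        candidates.sort(key=value_fn, reverse=True)
        beam = candidates[:beam_width]
    return beam


# Toy stand-ins: each step appends "a" or "b"; states with more "a" score higher.
def toy_propose(state: str, step: int) -> List[str]:
    return [state + "a", state + "b"]


def toy_value(state: str) -> float:
    return float(state.count("a"))


result = step_level_beam_search("", toy_propose, toy_value, num_steps=3, beam_width=2)
```

With a beam width of 2, only two partial dialogues are carried forward at each step, so the number of value-model calls grows linearly with the number of steps rather than exponentially.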
The researchers evaluated MASPRM on two math benchmarks, GSM8K and MATH. MASPRM-guided decoding, especially when combined with an outcome reward model (ORM) that evaluates the final answer, substantially improved exact match (EM) accuracy. Compared with a single straight-through MAS pass, it raised EM by +30.7 points on GSM8K and by +22.9 points on MATH. In addition, a MASPRM trained on GSM8K transferred zero-shot to MATH, improving EM by +8.4 points without any retraining, which suggests that its learned progress signals generalize beyond the training dataset.
A Complementary and Efficient Approach
The research demonstrates that MASPRM-guided inference consistently outperforms policy-only baselines and even outcome-only models when compute budgets are matched. The combination of MASPRM’s process-level guidance during search and an ORM for final answer verification yielded the strongest accuracies. This makes MASPRM a valuable plug-in model that can enhance existing multi-agent workflows, making them more reliable and compute-aware without altering their base policies or requiring extensive manual annotations.
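The division of labor described here, process-level guidance during search followed by an outcome check at the end, can be sketched as a final reranking step. The function and the toy scorer below are hypothetical and only illustrate the pattern. The candidates are assumed to be the finished solutions that survive the process-guided search.

```python
from typing import Callable, List


def select_with_orm(candidates: List[str], orm_score: Callable[[str], float]) -> str:
    """Return the finished candidate that the outcome reward model scores highest."""
    if not candidates:
        raise ValueError("no candidates to select from")
    return max(candidates, key=orm_score)


# Toy ORM stand-in: prefers candidates whose final answer is 42.
def toy_orm(candidate: str) -> float:
    return 1.0 if candidate.endswith("42") else 0.0


best = select_with_orm(["answer: 41", "answer: 42", "answer: 7"], toy_orm)
```

Because this step only reranks completed candidates, it leaves the agents' base policies untouched, in keeping with the plug-in usage described above.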
For more details, you can read the full research paper here.


