TLDR: This paper introduces new methods for training Vision-Language Process Reward Models (VL-PRMs) to improve multimodal reasoning. It proposes a hybrid data synthesis framework that combines MCTS with VLM judgments and adds perception-focused supervision. The study also evaluates several test-time scaling strategies. It finds that VL-PRMs boost VLM performance across diverse tasks, especially when used for one-shot full-solution evaluation; that they uncover latent reasoning abilities in stronger VLMs; that explicit perception error detection helps; and that smaller VL-PRMs can be as effective as larger ones.
A new research paper examines Vision-Language Process Reward Models (VL-PRMs), aiming to enhance the reasoning capabilities of vision-language models when dealing with both visual and textual information. Process Reward Models (PRMs) are crucial because they provide feedback at each step of a model’s reasoning process, rather than just at the end. This step-by-step guidance helps improve the reliability and accuracy of complex reasoning tasks, especially in error-prone areas such as mathematical problem-solving or abstract visual puzzles.
While PRMs have been widely explored in text-only scenarios, their application to Vision-Language Models (VLMs) has been limited. Existing VL-PRMs often rely on a technique called Monte Carlo Tree Search (MCTS) for creating training data. However, this method can sometimes lead to noisy supervision signals, which might hinder a model’s ability to generalize across different tasks. This new work seeks to broaden our understanding of VL-PRMs by exploring various strategies for building datasets, training models, and scaling their performance during testing.
Innovations in Data and Training
The researchers introduce a novel hybrid data synthesis framework. This framework combines the MCTS approach with judgments from a powerful VLM, specifically ‘o4-mini’, to generate more accurate step-level labels for training. More accurate labels give the VL-PRM a cleaner training signal. A key aspect of this new approach is ‘perception-focused supervision’. This allows the PRM to explicitly identify errors that occur at the visual grounding stage of reasoning, that is, when the model is trying to understand what it sees in an image. Errors at this early stage can propagate and lead to incorrect final answers, so detecting them early is vital.
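To make the idea of noisy MCTS-derived labels concrete, here is a minimal sketch. It assumes a common convention from the PRM literature in which a step's score is the fraction of sampled continuations that reach the reference answer; the article does not spell out the paper's exact procedure. The names `StepLabel`, `mc_step_score`, and `judge_correct` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StepLabel:
    """One step-level training sample (hypothetical layout, not the paper's schema)."""
    step_text: str
    mc_score: float      # rollout-based estimate for this step
    judge_correct: bool  # verdict from a judge VLM on the same step


def mc_step_score(rollout_answers: Sequence[str], reference: str) -> float:
    """Fraction of continuations sampled from a step that end at the reference answer.

    Assumption: this is one standard way to turn rollouts into a step score;
    the paper's exact scoring rule may differ.
    """
    if not rollout_answers:
        raise ValueError("need at least one rollout")
    hits = sum(answer == reference for answer in rollout_answers)
    return hits / len(rollout_answers)


# A step that misreads the image can still be followed by rollouts that land on
# the right answer (for example on a multiple-choice question), so the rollout
# score alone looks healthy while a judge would flag the step.
label = StepLabel(
    step_text="The bar for 2021 reads 40.",  # hypothetical perception error
    mc_score=mc_step_score(["B", "B", "A", "B"], reference="B"),
    judge_correct=False,
)
print(label.mc_score)
```

Keeping both signals side by side shows where they disagree; how the paper resolves such disagreements is not described in this article.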
To support their research, the team developed a new dataset called VL-PRM300K. This dataset contains approximately 300,000 image-question pairs and 1.32 million step-level samples. Unlike previous datasets that heavily focused on advanced math, VL-PRM300K includes a diverse range of visual question answering (VQA) skills, such as document and chart understanding, OCR, general commonsense knowledge, grade-school science, elementary math, and a significant portion of abstract reasoning problems from the RAVEN dataset. This broader dataset aims to improve the generalizability of VL-PRMs across various multimodal reasoning tasks.
Strategies for Test-Time Scaling
The paper systematically evaluates multiple strategies for using VL-PRMs during inference, known as test-time scaling (TTS). These strategies guide VLMs towards more accurate solutions:
- Guided Greedy Search: At each step of generating a solution, the VLM proposes several candidate next steps. The VL-PRM scores each candidate, and the highest-scoring step is chosen, guiding the VLM’s generation process.
- One-shot Search: Instead of evaluating individual steps, the VLM generates several complete candidate solutions. The VL-PRM then scores each full solution in a single pass, and the highest-scoring complete solution is selected. This method is often more computationally efficient.
- Step-Score Aggregation: The VLM generates complete solutions, and the VL-PRM assigns a probability of correctness to each step within these solutions. These step-level scores are then aggregated (e.g., averaged) to determine an overall score for each solution, with the highest-scoring solution being chosen.
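The three strategies above can be sketched as selection routines. This is an illustrative sketch, not the paper's implementation: the function names, the `propose` callback (standing in for the VLM), and the scorer callbacks (standing in for the VL-PRM) are all assumptions, and the mean is used as the aggregation only because the text gives averaging as an example.

```python
from typing import Callable, List, Sequence

# Hypothetical callback types standing in for the VLM and the VL-PRM.
Proposer = Callable[[Sequence[str]], List[str]]      # steps so far -> candidate next steps
StepScorer = Callable[[Sequence[str], str], float]   # (steps so far, candidate step) -> score
SolutionScorer = Callable[[Sequence[str]], float]    # full solution -> score


def guided_greedy_search(propose: Proposer, score_step: StepScorer,
                         max_steps: int) -> List[str]:
    """Build a solution step by step, keeping the highest-scoring candidate each time."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidates = propose(steps)
        if not candidates:  # the proposer signals that the solution is complete
            break
        best = max(candidates, key=lambda c: score_step(steps, c))
        steps.append(best)
    return steps


def one_shot_search(solutions: Sequence[List[str]],
                    score_solution: SolutionScorer) -> List[str]:
    """Score each complete solution once and return the best one."""
    return max(solutions, key=score_solution)


def step_score_aggregation(solutions: Sequence[List[str]],
                           score_step: StepScorer) -> List[str]:
    """Score every step of every solution, average per solution, return the best."""
    def mean_step_score(solution: List[str]) -> float:
        if not solution:
            return 0.0
        scores = [score_step(solution[:i], step) for i, step in enumerate(solution)]
        return sum(scores) / len(scores)

    return max(solutions, key=mean_step_score)
```

In this sketch, Guided Greedy Search calls the scorer once per candidate per step, while One-shot Search calls it once per complete solution, which is consistent with the efficiency remark above.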
Key Insights and Findings
The experiments, conducted across five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision), revealed several significant insights:
- Using VL-PRMs as Outcome Reward Models (ORMs) during test-time scaling, that is, with the One-shot Search strategy, can outperform using them to guide step-by-step selection. This suggests that evaluating the entire solution holistically can be more effective.
- Surprisingly, smaller VL-PRMs can be as effective as, or even surpass, larger models in detecting process errors. For instance, a 3B parameter VL-PRM outperformed a 7B variant in error detection.
- VL-PRMs have the ability to uncover latent reasoning abilities in stronger VLM backbones. Models that perform similarly without PRM guidance show substantial gains when guided by VL-PRMs, indicating that PRMs help these models explore more reliable reasoning paths.
- Perception-level supervision leads to significant improvements in test-time scaling performance. Explicitly training VL-PRMs to detect visual grounding errors is crucial.
- The TTS strategies improve performance on advanced math reasoning datasets, even though the VL-PRMs were not specifically trained on such complex math data. This suggests that training on general VQA and abstract reasoning tasks enhances general logical reasoning, which then benefits mathematical reasoning.
The research also highlights that using a strong external VLM (like o4-mini) as a judge for step-level correctness during dataset construction is substantially more effective than relying solely on MCTS-derived scores. Furthermore, the One-shot Search strategy consistently outperformed Guided Greedy Search and Step-Score Aggregation. Step-Score Aggregation in particular tended to overestimate confidence in incorrect solutions, because a solution with many correct intermediate steps can still receive a high aggregate score even when one flawed step leads to a wrong answer.
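A small worked example illustrates the overestimation effect. The step scores below are made up for illustration and are not taken from the paper; averaging is used because the article names it as an example aggregation.

```python
def mean(scores):
    """Arithmetic mean of a non-empty list of step scores."""
    return sum(scores) / len(scores)


# Hypothetical step scores: a long solution with nine well-scored steps and one
# badly scored step, versus a short solution whose steps are all plausible.
long_but_wrong = [0.95] * 9 + [0.10]
short_and_right = [0.85, 0.85, 0.85]

# The single low score is diluted by the nine high ones, so the flawed solution
# ends up with the higher average and would be selected.
print(mean(long_but_wrong) > mean(short_and_right))
```

Here the flawed solution averages about 0.865 against 0.85 for the sound one, so the one step that matters most barely moves the aggregate.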
This work, detailed in the paper available at arXiv, motivates further research and supports the advancement of Vision-Language Models by providing a comprehensive exploration of VL-PRM design, training, and test-time scaling strategies. The findings underscore the potential of VL-PRMs to significantly improve multimodal reasoning capabilities.


