TLDR: Researchers introduce VideoRewardBench, a comprehensive new benchmark for evaluating multimodal reward models (MRMs) in video understanding. It addresses the limitations of previous benchmarks with a large, diverse dataset covering perception, knowledge, reasoning, and safety. Evaluations of 28 MRMs reveal significant performance gaps: even leading models achieve only moderate accuracy and struggle particularly with short-form perception, knowledge, and reasoning tasks. The study also examines how inference-time scaling and video frame counts affect different MRM types.
Multimodal Reward Models (MRMs) are becoming increasingly vital for the development, training, and evaluation of Large Vision-Language Models (LVLMs). These models assess the quality of AI-generated responses, helping to align them with human preferences. However, evaluating the MRMs themselves, especially in the complex domain of video understanding, remains challenging.
Existing benchmarks for video-based MRMs have been limited in several ways. They often feature a small number of questions, lack diversity in question types, don’t cover a wide range of evaluation dimensions, and fail to thoroughly analyze different categories of MRMs. These gaps have made it difficult to truly understand the capabilities and limitations of these advanced AI systems.
To address these critical issues, researchers have introduced a groundbreaking new benchmark called VideoRewardBench. This is the first comprehensive benchmark specifically designed to evaluate multimodal reward models in video understanding. It covers four crucial aspects: perception (how well models understand what they see), knowledge (their ability to apply specialized information), reasoning (their capacity for logical thought), and safety (their awareness of potentially harmful content).
VideoRewardBench boasts a high-quality preference dataset of 1,563 annotated samples, including 1,482 unique videos and 1,559 distinct questions. This is a significant leap, offering 15 times more questions than the most extensive prior benchmark. Each sample in the dataset is a triplet: a video-text prompt, a ‘chosen’ (preferred) response, and a ‘rejected’ (less preferred) response. The dataset was curated using an AI-assisted data pipeline to ensure quality and difficulty.
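To make the triplet format concrete, here is a minimal Python sketch of what one sample might look like. The field names and example values are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PreferenceSample:
    """One VideoRewardBench-style preference triplet (fields are illustrative)."""
    video_path: str  # the video half of the video-text prompt
    question: str    # the text half of the video-text prompt
    chosen: str      # the preferred response
    rejected: str    # the less-preferred response

sample = PreferenceSample(
    video_path="videos/clip_0001.mp4",
    question="What safety hazard appears in the final scene?",
    chosen="A worker climbs a ladder without a harness near the scaffold edge.",
    rejected="The video shows people working at a construction site.",
)
```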
A thorough evaluation was conducted across 28 different multimodal reward models, spanning three main categories: generative, discriminative, and semi-scalar. The results from VideoRewardBench highlight significant limitations in current MRMs. Even the top-performing proprietary model, GPT-4o, achieved only 57.0% overall accuracy, while the leading open-source model, Qwen2.5-VL-72B, reached merely 53.3%. This leaves substantial headroom and shows that open-source models, even those specifically trained for reward modeling, still lag behind the best proprietary systems.
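Accuracy on a preference benchmark like this is typically computed by checking whether the model assigns the chosen response a higher score than the rejected one. Below is a minimal sketch under that assumption, with `model_score` standing in as a hypothetical wrapper around whichever MRM is being evaluated:

```python
def preference_accuracy(model_score, samples):
    """Fraction of triplets where the MRM scores the chosen response higher.

    model_score: callable (video_path, question, response) -> float
    samples: iterable of PreferenceSample-like objects
    """
    correct, total = 0, 0
    for s in samples:
        chosen_score = model_score(s.video_path, s.question, s.chosen)
        rejected_score = model_score(s.video_path, s.question, s.rejected)
        correct += chosen_score > rejected_score
        total += 1
    return correct / total if total else 0.0
```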
The analysis further revealed that most models struggle particularly in short-form perception, knowledge, and reasoning tasks. This suggests that these areas represent significant challenges for current MRMs, even for some ‘slow-thinking’ critic-trained models that might be expected to perform better in complex reasoning.
Interestingly, the study also provided insights into how different factors affect MRM performance:
Inference-Time Scaling
Inference-time scaling, a strategy where multiple responses are sampled and aggregated, generally improved performance for generative and semi-scalar MRMs. For instance, Claude-3.7-Sonnet saw a 10.6% improvement, and RL-trained models like R1-Reward benefited significantly more (14.3% gain) compared to non-critic-trained base models. However, discriminative MRMs, which output deterministic scores, did not show performance gains from this method.
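As a rough illustration, inference-time scaling for a generative judge can be as simple as sampling several stochastic verdicts and taking a majority vote. This sketch assumes a hypothetical `judge` callable that returns "A" or "B" for one sampled comparison (e.g., a generative MRM prompted at temperature > 0):

```python
from collections import Counter

def scaled_judgment(judge, video, question, resp_a, resp_b, k=8):
    """Aggregate k sampled verdicts from a generative judge by majority vote.

    judge: callable returning "A" or "B" for one stochastic comparison.
    """
    votes = Counter(judge(video, question, resp_a, resp_b) for _ in range(k))
    return votes.most_common(1)[0][0]
```

A deterministic discriminative scorer returns the same output on every call, so aggregating repeated samples adds nothing, which is consistent with the finding above.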
Impact of Video Frame Count
The number of input video frames also had varying effects on different MRM types. Critic-trained generative MRMs showed a clear performance improvement as more frames were provided. For example, LLaVA-Critic-72B improved from 52.0% to 63.0% when frame count increased from 1 to 64. In contrast, non-critic-trained generative MRMs showed less pronounced gains, and semi-scalar MRMs were least affected by frame count variations.
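Frame-count experiments like these usually rely on uniform temporal sampling of the video. Below is a minimal sketch of that step using OpenCV (`cv2`), which is an implementation choice of ours rather than something specified in the study:

```python
import cv2  # pip install opencv-python

def sample_frames(video_path, num_frames=64):
    """Uniformly sample num_frames frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    # Evenly spaced frame indices from the first frame to the last.
    step = (total - 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```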
The findings from VideoRewardBench offer valuable guidance for the future development of multimodal reward models in video understanding, and this challenging benchmark is expected to drive significant advances in the field. The dataset and code are available via the link in the research paper.