
Unlocking Deeper AI Understanding of Human Videos with HV-MMBench

TLDR: HV-MMBench is a new, comprehensive benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on understanding human-centric videos. It features 15 diverse tasks, multiple question formats (multiple-choice, fill-in-the-blank, true/false, open-ended), and covers a wide range of video scenarios and durations. Initial evaluations show that while MLLMs perform well on structured, closed-form questions, they struggle significantly with open-ended generative tasks, particularly causal reasoning, indicating a reliance on superficial patterns over genuine understanding. The benchmark aims to guide the development of MLLMs towards more robust human behavior comprehension.

Multimodal Large Language Models (MLLMs) have shown impressive progress in understanding visual information, including both images and videos. However, their ability to truly comprehend human-centric video data, which is abundant in the real world, has remained largely unexplored. This gap exists primarily because there hasn’t been a comprehensive and high-quality benchmark specifically designed for this complex area.

Existing benchmarks often focus on simpler aspects like video generation quality or basic action recognition. They tend to overlook the essential perceptual and cognitive abilities required for understanding human behavior in real-world scenarios. Furthermore, many current evaluation methods rely on a single question format and overly simplistic metrics, which fail to capture the nuances of MLLM performance.

Introducing HV-MMBench: A New Standard for Human-Centric Video Understanding

To address these limitations, researchers have introduced HV-MMBench, a meticulously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. This new benchmark offers several key features that set it apart:

  • Diverse Evaluation Dimensions: HV-MMBench includes 15 distinct tasks, ranging from fundamental attribute perception, such as estimating age and recognizing emotions, to advanced cognitive reasoning tasks like predicting social relationships and intentions. This wide array of tasks allows for a thorough assessment of a model’s capabilities.

  • Varied Data Types and Question Formats: The benchmark incorporates multiple-choice, fill-in-the-blank, true/false, and open-ended question formats. Combined with diverse evaluation metrics, this approach provides a more accurate and robust reflection of model performance, moving beyond simple accuracy to assess deeper understanding (a sketch of what a single benchmark item might look like follows this list).

  • Multi-Domain Video Coverage: HV-MMBench spans over 50 distinct visual scenarios, enabling comprehensive evaluation across a wide range of fine-grained scene variations. This ensures that models are tested on their ability to generalize across different real-world contexts.

  • Extensive Temporal Coverage: The benchmark includes videos ranging from short clips (10 seconds) to long-form content (up to 30 minutes). This temporal diversity supports a systematic analysis of models’ ability to reason across different contextual lengths, from instantaneous actions to complex, unfolding events.
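
To make these formats concrete, here is a minimal sketch of how a single benchmark item might be represented in code. The field names and values are illustrative assumptions, not HV-MMBench's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: these field names are assumptions, not HV-MMBench's schema.
@dataclass
class BenchmarkItem:
    video_id: str                        # source video identifier
    task: str                            # one of the 15 tasks, e.g. "emotion_recognition"
    question_format: str                 # "multiple_choice" | "fill_in_blank" | "true_false" | "open_ended"
    question: str
    answer: str                          # gold answer, or the answer key for closed-form items
    options: Optional[list[str]] = None  # correct choice plus distractors, for multiple-choice items

item = BenchmarkItem(
    video_id="vid_000123",
    task="causal_reasoning",
    question_format="multiple_choice",
    question="Why does the person suddenly stand up?",
    answer="B",
    options=["A. To answer the phone", "B. To greet a visitor",
             "C. To stretch", "D. To leave the room"],
)
```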

How HV-MMBench Was Built

The construction of the HV-MMBench dataset involved a three-step process: video collection and pre-processing, automated question-answer annotation, and rigorous manual quality review. Videos were sourced from publicly available datasets like UltraVideo and OpenHumanVid, covering seven core domains such as daily life, professional activities, and social interactions. The dataset ultimately comprises 1,200 high-quality videos, most exceeding 1080p resolution.

An automated annotation pipeline was designed to generate question-answer pairs. This involved first labeling video attributes (e.g., identifying whether a video is suitable for an emotion recognition question) and then using advanced MLLMs to generate questions and answers based on predefined templates. Distractors (incorrect answer choices) were also carefully designed to be challenging. Finally, a two-stage quality review process, combining automated filtering with expert human verification, ensured the reliability and diversity of the dataset.
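
As a rough illustration of how such a pipeline fits together, here is a hypothetical Python sketch. The two helper functions stand in for the MLLM calls described above; they are placeholders, not the authors' actual code.

```python
# Hypothetical sketch of a template-driven annotation loop. The two helpers
# below are placeholders standing in for MLLM calls, not a real API.

TEMPLATES = {
    "emotion_recognition": "What emotion does the main person display in this video?",
    "intention_prediction": "What is the person most likely trying to do next?",
}

def label_attributes(video_path: str) -> dict[str, bool]:
    """Stage 1 (stub): ask an MLLM which question types a video can support."""
    return {"emotion_recognition": True, "intention_prediction": False}

def generate_qa(video_path: str, task: str, template: str) -> dict:
    """Stage 2 (stub): ask an MLLM to fill the template with a question,
    a gold answer, and deliberately challenging distractors."""
    return {"question": template, "answer": "...", "options": ["..."]}

def annotate(video_paths: list[str]) -> list[dict]:
    items = []
    for path in video_paths:
        attrs = label_attributes(path)          # which tasks suit this video?
        for task, template in TEMPLATES.items():
            if attrs.get(task):                 # skip unsuitable tasks
                qa = generate_qa(path, task, template)
                items.append({"video": path, "task": task, **qa})
    return items  # candidates then pass through automated filtering + human review

print(annotate(["clips/daily_life_0001.mp4"]))
```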

Key Findings from Benchmarking MLLMs

Researchers evaluated several state-of-the-art open-source MLLMs on HV-MMBench. The results revealed significant insights into current model capabilities:

  • Strong Performance in Closed-Form Tasks: Models generally performed well on multiple-choice and true/false questions, especially in high-level cognitive tasks like intention inference and causal reasoning. This suggests that MLLMs are effective at selecting correct answers when provided with options.

  • Significant Drop in Generative Tasks: In stark contrast, model performance dropped sharply on fill-in-the-blank and open-ended question formats, particularly for causal reasoning. For instance, a model that achieved over 94% accuracy on multiple-choice causal reasoning tasks saw its F1 score drop to near zero on open-ended causal generation (see the scoring sketch after this list). This indicates that models often rely on superficial patterns or pre-trained knowledge to answer closed-form questions, rather than engaging in genuine, deep reasoning.

  • Limitations in Fine-Grained Perception: Models also showed consistently low accuracy on tasks requiring fine-grained visual perception, such as face recognition, highlighting persistent limitations in understanding subtle visual details.
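
The exact scoring protocol isn't spelled out here, but SQuAD-style token-overlap F1 is a common way to score free-form answers against a gold reference, and the sketch below assumes that style of metric for illustration. It shows why a vague generated answer can score near zero even when a model picks the right option in a multiple-choice setting.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the gold answer.
    Shown for illustration; HV-MMBench's actual scoring may differ."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A faithful answer scores high; a vague, generic one scores near zero.
print(token_f1("he stood up to greet a visitor", "he stood up to greet a visitor"))   # 1.0
print(token_f1("something happened in the video", "he stood up to greet a visitor"))  # 0.0
```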

These findings suggest two critical bottlenecks in existing open-source MLLMs: weak generalization in generative tasks and insufficient grounding in fine-grained perception. By incorporating a diverse set of task types and question formats, HV-MMBench systematically uncovers these limitations and establishes a rigorous evaluation framework to guide the development of future MLLMs.

This new benchmark is a crucial step towards advancing MLLMs to achieve a deeper and more reliable understanding of complex human behaviors in videos. For more details, refer to the full research paper.

