TLDR: v-HUB is a new visual-centric video humor understanding benchmark designed to evaluate multimodal large language models (MLLMs). It consists of minimally verbal short videos, drawn from silent films and online sources, paired with rich annotations. Its evaluation tasks (Caption Matching, Humor Explanation, Open-ended QA) reveal that MLLMs rely heavily on linguistic cues, struggle with visual and subtle humor inference, and show limited cross-modal fusion, though audio and visual text can offer some help.
Artificial intelligence models are becoming increasingly sophisticated, but one area where they still face significant challenges is understanding humor. Humor, often relying on complex reasoning, social nuances, and cultural contexts, is difficult even for humans to fully grasp. This challenge is amplified when it comes to multimodal humor, especially in videos where visual cues play a dominant role.
Addressing this gap, a new research paper introduces v-HUB, a novel visual-centric video humor understanding benchmark. Developed by researchers including Zhengpeng Shi, Hengli Li, and Yanpeng Zhao, v-HUB aims to gauge and diagnose the capacity of multimodal large language models (MLLMs) to comprehend humor primarily through visual information. You can read the full paper here: v-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs.
v-HUB is built upon a carefully curated collection of minimally verbal short videos. These videos are sourced from two complementary domains: classic silent films, particularly those of Charlie Chaplin, and user-generated funny videos from online platforms. The key criterion for selection was that the humor in these clips must be appreciable purely through visual cues, reflecting real-world scenarios where humor transcends spoken language.
Each video in the v-HUB dataset comes with rich annotations. These include detailed captions, descriptions of events, and explanations of the humor. These annotations support various evaluation tasks designed to thoroughly test MLLMs. The benchmark features three distinct tasks:
Caption Matching
This task challenges MLLMs to correctly associate videos with their corresponding captions. Unlike typical caption matching, v-HUB’s design requires models to go beyond surface-level descriptions and appreciate nuanced humor, often extended by creative captions that enhance the comedic effect.
Humor Explanation
In this generative task, models must identify the humorous elements within each video and provide coherent explanations, referencing relevant visual or auditory cues. This evaluates the MLLMs’ ability to articulate *why* something is funny.
Open-ended QA
To provide a broader assessment of video reasoning skills, v-HUB includes an open-ended question-answering task. These questions, automatically generated and manually verified, cover temporal, descriptive, and causal aspects of the video content, extending the benchmark beyond humor-specific reasoning.
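To make the three tasks concrete, here is a minimal sketch of how a v-HUB-style evaluation item and its task prompts might be represented. The field and function names (`HumorClip`, `caption_matching_prompt`, etc.) are illustrative assumptions, not the benchmark’s actual schema or released code.

```python
# Hypothetical sketch: one annotated clip and prompt builders for the three tasks.
# Names and structure are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class HumorClip:
    video_path: str                 # minimally verbal short video
    caption: str                    # human-written, often creative caption
    event_description: str          # what happens in the clip
    humor_explanation: str          # why the clip is funny
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # open-ended Q/A


def caption_matching_prompt(clip: HumorClip, distractors: list[str]) -> str:
    """Ask the model to pick the caption that best matches the video."""
    options = [clip.caption] + distractors  # in practice the order would be shuffled
    lines = [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    return "Which caption best fits this video?\n" + "\n".join(lines)


def humor_explanation_prompt(_: HumorClip) -> str:
    """Generative task: the model must articulate why the clip is funny."""
    return "Explain what makes this video humorous, citing visual or audio cues."


def open_ended_qa_prompt(question: str) -> str:
    """Open-ended QA covering temporal, descriptive, and causal aspects."""
    return f"Watch the video and answer: {question}"
```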
The researchers evaluated a diverse set of MLLMs, encompassing both open-source and proprietary models, including specialized Video-LLMs and versatile OmniLLMs that can process audio. The evaluations were conducted under three settings: Text-Only (models receive human-written descriptions), Video-Only (models receive raw video frames without audio), and Video+Audio (models receive both visual and auditory signals).
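The three settings amount to giving the model different slices of the same clip. The sketch below, reusing the hypothetical `HumorClip` from above, shows one way the input configurations could be assembled; `model.generate` and the input dictionary layout are assumptions, not any particular model’s API.

```python
# Hypothetical sketch of the three evaluation settings described above.
def build_inputs(clip: "HumorClip", setting: str) -> dict:
    if setting == "text_only":
        # The model sees only a human-written description of the clip.
        return {"text": clip.event_description}
    if setting == "video_only":
        # The model sees raw video frames with the audio track stripped.
        return {"video": clip.video_path, "audio": None}
    if setting == "video_audio":
        # The model sees both visual and auditory signals.
        return {"video": clip.video_path, "audio": clip.video_path}
    raise ValueError(f"unknown setting: {setting}")


def evaluate(model, clips, setting: str, make_prompt) -> list[str]:
    """Run one setting end to end (model.generate is an assumed interface)."""
    return [
        model.generate(prompt=make_prompt(clip), **build_inputs(clip, setting))
        for clip in clips
    ]
```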
The experimental results from v-HUB reveal significant shortcomings in current MLLMs’ ability to understand visual-centric humor. A consistent finding was that models perform substantially better with text-only inputs compared to video-based evaluations. For instance, a model’s performance on Open-ended QA dropped significantly when moving from text descriptions to raw video. This suggests a heavy reliance on linguistic cues and underdeveloped cross-modal fusion capabilities, meaning MLLMs struggle to effectively integrate visual and auditory signals.
Furthermore, the benchmark exposed MLLMs’ limited capacity for subtle humor inference. Even in the most favorable text-only conditions, models struggled with the Caption Matching task, which requires connecting creative, non-obvious text to visual humor. This difficulty was magnified when processing raw video data.
The study also found that while incorporating audio provides a marginal but consistent performance boost, this gain is minimal compared to the contribution of text. Interestingly, visual text (like embedded captions or subtitles) proved more effective in aiding humor understanding than sound cues alone, and could even compensate for the absence of informative sound. Background knowledge also played a crucial role, with MLLMs performing better when such context was explicitly provided.
Finally, the research highlighted that MLLMs face greater difficulty comprehending humor in historically distant videos, such as Charlie Chaplin’s silent films, compared to contemporary user-generated content. This underscores the sensitivity of humor understanding to temporal and cultural contexts.
In conclusion, v-HUB presents a new and challenging benchmark that exposes the weaknesses of current MLLMs in visual-centric humor understanding. It emphasizes the need for enhancing their visual reasoning capabilities and highlights the potential of integrating richer modalities like sound for complex video understanding tasks, pushing the boundaries of AI’s ability to truly ‘get’ a joke.


