TLDR: v-HUB is a new visual-centric video humor understanding benchmark designed to evaluate multimodal large language models (MLLMs). It consists of minimally verbal short videos, drawn from silent films and online sources, paired with rich annotations. Its evaluation tasks (Caption Matching, Humor Explanation, Open-ended QA) reveal that MLLMs rely heavily on linguistic cues, struggle with visual and subtle humor inference, and show limited cross-modal fusion, though audio and visual text can offer some help.
Artificial intelligence models are becoming increasingly sophisticated, but one area where they still face significant challenges is understanding humor. Humor, often relying on complex reasoning, social nuances, and cultural contexts, is difficult even for humans to fully grasp. This challenge is amplified when it comes to multimodal humor, especially in videos where visual cues play a dominant role.
Addressing this gap, a new research paper introduces v-HUB, a novel visual-centric video humor understanding benchmark. Developed by researchers including Zhengpeng Shi, Hengli Li, and Yanpeng Zhao, v-HUB aims to gauge and diagnose the capacity of multimodal large language models (MLLMs) to comprehend humor primarily through visual information. You can read the full paper here: v-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs.
v-HUB is built upon a carefully curated collection of minimally verbal short videos. These videos are sourced from two complementary domains: classic silent films, particularly those of Charlie Chaplin, and user-generated funny videos from online platforms. The key criterion for selection was that the humor in these clips must be appreciable purely through visual cues, reflecting real-world scenarios where humor transcends spoken language.
Each video in the v-HUB dataset comes with rich annotations. These include detailed captions, descriptions of events, and explanations of the humor. These annotations support various evaluation tasks designed to thoroughly test MLLMs. The benchmark features three distinct tasks:
Caption Matching
This task challenges MLLMs to correctly associate videos with their corresponding captions. Unlike typical caption matching, v-HUB’s design requires models to go beyond surface-level descriptions and appreciate nuanced humor, often extended by creative captions that enhance the comedic effect.
Humor Explanation
In this generative task, models must identify the humorous elements within each video and provide coherent explanations, referencing relevant visual or auditory cues. This evaluates the MLLMs’ ability to articulate *why* something is funny.
Open-ended QA
To provide a broader assessment of video reasoning skills, v-HUB includes an open-ended question-answering task. These questions, automatically generated and manually verified, cover temporal, descriptive, and causal aspects of the video content, extending the benchmark beyond humor-specific reasoning.
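To make the three tasks concrete, here is a minimal sketch of how a v-HUB-style evaluation item and its task prompts might be represented. The field and function names (`HumorClip`, `caption_matching_prompt`, etc.) are illustrative assumptions, not the benchmark’s actual schema or released code.

```python
# Hypothetical sketch: one annotated clip and prompt builders for the three tasks.
# Names and structure are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class HumorClip:
    video_path: str                 # minimally verbal short video
    caption: str                    # human-written, often creative caption
    event_description: str          # what happens in the clip
    humor_explanation: str          # why the clip is funny
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # open-ended Q/A


def caption_matching_prompt(clip: HumorClip, distractors: list[str]) -> str:
    """Ask the model to pick the caption that best matches the video."""
    options = [clip.caption] + distractors  # in practice the order would be shuffled
    lines = [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    return "Which caption best fits this video?\n" + "\n".join(lines)


def humor_explanation_prompt(_: HumorClip) -> str:
    """Generative task: the model must articulate why the clip is funny."""
    return "Explain what makes this video humorous, citing visual or audio cues."


def open_ended_qa_prompt(question: str) -> str:
    """Open-ended QA covering temporal, descriptive, and causal aspects."""
    return f"Watch the video and answer: {question}"
```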
The researchers evaluated a diverse set of MLLMs, encompassing both open-source and proprietary models, including specialized Video-LLMs and versatile OmniLLMs that can process audio. The evaluations were conducted under three settings: Text-Only (models receive human-written descriptions), Video-Only (models receive raw video frames without audio), and Video+Audio (models receive both visual and auditory signals).
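The three settings amount to giving the model different slices of the same clip. The sketch below, reusing the hypothetical `HumorClip` from above, shows one way the input configurations could be assembled; `model.generate` and the input dictionary layout are assumptions, not any particular model’s API.

```python
# Hypothetical sketch of the three evaluation settings described above.
def build_inputs(clip: "HumorClip", setting: str) -> dict:
    if setting == "text_only":
        # The model sees only a human-written description of the clip.
        return {"text": clip.event_description}
    if setting == "video_only":
        # The model sees raw video frames with the audio track stripped.
        return {"video": clip.video_path, "audio": None}
    if setting == "video_audio":
        # The model sees both visual and auditory signals.
        return {"video": clip.video_path, "audio": clip.video_path}
    raise ValueError(f"unknown setting: {setting}")


def evaluate(model, clips, setting: str, make_prompt) -> list[str]:
    """Run one setting end to end (model.generate is an assumed interface)."""
    return [
        model.generate(prompt=make_prompt(clip), **build_inputs(clip, setting))
        for clip in clips
    ]
```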
The experimental results from v-HUB reveal significant shortcomings in current MLLMs’ ability to understand visual-centric humor. A consistent finding was that models perform substantially better with text-only inputs compared to video-based evaluations. For instance, a model’s performance on Open-ended QA dropped significantly when moving from text descriptions to raw video. This suggests a heavy reliance on linguistic cues and underdeveloped cross-modal fusion capabilities, meaning MLLMs struggle to effectively integrate visual and auditory signals.
Furthermore, the benchmark exposed MLLMs’ limited capacity for subtle humor inference. Even in the most favorable text-only conditions, models struggled with the Caption Matching task, which requires connecting creative, non-obvious text to visual humor. This difficulty was magnified when processing raw video data.
The study also found that while incorporating audio provides a marginal but consistent performance boost, this gain is minimal compared to the contribution of text. Interestingly, visual text (like embedded captions or subtitles) proved more effective in aiding humor understanding than sound cues alone, and could even compensate for the absence of informative sound. Background knowledge also played a crucial role, with MLLMs performing better when such context was explicitly provided.
Finally, the research highlighted that MLLMs face greater difficulty comprehending humor in historically distant videos, such as Charlie Chaplin’s silent films, compared to contemporary user-generated content. This underscores the sensitivity of humor understanding to temporal and cultural contexts.
In conclusion, v-HUB presents a new and challenging benchmark that exposes the weaknesses of current MLLMs in visual-centric humor understanding. It emphasizes the need for enhancing their visual reasoning capabilities and highlights the potential of integrating richer modalities like sound for complex video understanding tasks, pushing the boundaries of AI’s ability to truly ‘get’ a joke.


