TLDR: This research introduces MVFNDB, a new benchmark for evaluating multimodal large language models (MLLMs) in video fake news detection. Unlike previous benchmarks, MVFNDB assesses MLLMs’ perception, understanding, and reasoning processes, not just final accuracy, using 10 tasks and 9,730 human-annotated questions. Empirical analysis reveals distinct features of fake vs. real videos (e.g., text color/placement, key footage distribution). Experiments show that direct video stream processing (like Gemini 2.5-Flash) outperforms frame-based methods, and MLLMs struggle with fine-grained visual details like font color. The study also highlights the importance of dynamic frame sampling and temporal localization for better detection.
In an era where misinformation spreads rapidly through digital platforms, the ability to accurately detect fake news, especially in video format, has become critically important. Recent advancements in multi-modal large language models (MLLMs) offer promising avenues for tackling this challenge. However, traditional methods for evaluating these models often fall short, treating the detection process as a ‘black box’ and focusing solely on the final outcome rather than the intricate steps involved.
A new research paper, titled “Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection,” introduces a groundbreaking solution: the MVFNDB (Multimodal Video Fake News Detection Benchmark). This benchmark is designed to provide a more comprehensive and fine-grained assessment of MLLMs’ capabilities in detecting video fake news. The paper was authored by Yakun Cui, Fushuo Huo, Weijie Shi, Juntao Dai, Hang Du, Zhenghao Zhu, Sirui Han, and Yike Guo from institutions including The Hong Kong University of Science and Technology, The Hong Kong Polytechnic University, Beijing University of Posts and Telecommunications, and Peking University. You can read the full paper here.
Addressing the Limitations of Current Benchmarks
The researchers highlight several key limitations of existing video-based fake news detection benchmarks. Firstly, many were designed for older classification models, which operate differently from modern MLLMs. Secondly, there’s a lack of interpretable results; current benchmarks don’t allow us to understand why a model made a certain decision, making it hard to improve. Thirdly, they often focus only on the final result, ignoring the complex process of identifying fake news, which involves analyzing multiple video features.
The MVFNDB aims to overcome these issues by offering a benchmark that aligns with MLLMs’ processing paradigms, provides interpretable results, and evaluates the entire detection process from perception to understanding and reasoning.
The MVFNDB Benchmark: A Deeper Dive
The MVFNDB is built upon the FakeSV dataset, which contains videos from real-world social media platforms like Douyin and Kuaishou. It features 10 distinct tasks and includes 9,730 human-annotated video-related questions. These tasks are meticulously designed to evaluate MLLMs’ abilities in three core areas:
- Perception: How well models can accurately identify fine-grained characteristics in videos, such as key elements, text color, and spatial position of text.
- Understanding: How well models can grasp the main content and theme of the news video.
- Reasoning: How well models can utilize multi-modal information and general knowledge to verify the veracity of videos and generate evidence-based inferences.
The benchmark also introduces a novel framework called MVFND-CoT, which combines reasoning based on both creator-added content (like overlaid text) and original shooting footage.
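The paper does not publish the MVFND-CoT prompt itself, but a minimal sketch of the idea, reasoning separately over creator-added content and original footage before combining them, might look like the following. The function name and all prompt wording here are illustrative assumptions, not the authors' actual template.

```python
def mvfnd_cot_prompt(overlay_text: str, footage_summary: str) -> str:
    """Hypothetical two-branch chain-of-thought prompt in the spirit of
    MVFND-CoT: analyze creator-added content and original shooting
    footage separately, then combine the two analyses into a verdict.
    All wording is illustrative, not the paper's actual prompt."""
    return (
        "Step 1 - Creator-added content: analyze the overlaid text below "
        "for emotional framing, font color, and placement.\n"
        f"Overlay text: {overlay_text}\n"
        "Step 2 - Original footage: assess whether the footage shows "
        "on-site shooting, close-ups of characters, or official "
        "declarations.\n"
        f"Footage summary: {footage_summary}\n"
        "Step 3 - Combine both analyses and answer REAL or FAKE with "
        "evidence-based reasoning."
    )

prompt = mvfnd_cot_prompt(
    "SHOCKING: entire city underwater!",
    "stock disaster clips, no identifiable on-site reporting",
)
print(prompt)
```

Separating the two reasoning branches mirrors the paper's observation that creator-added text and original footage carry different authenticity signals.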
Empirical Insights into Fake vs. Real Videos
To inform the design of the benchmark tasks, the researchers conducted an empirical analysis of real and fake news videos. They uncovered fascinating differences:
- Color Distribution: Fake news videos favor font colors with hue values of 0-5° (saturated reds often associated with heightened emotion), while real news prefers the 25-30° range (a more formal palette).
- Spatial Distribution of Text: Text in fake news videos tends to be randomly placed, possibly to obscure original footage. In contrast, real news videos show a more concentrated and deliberate text placement.
- Key Footage Distribution: Real news videos generally contain more on-site shooting, close-ups of characters, and official declarations, especially towards the end of the video. Fake news, however, might place close-ups at the beginning and often lacks extensive on-site footage.
These findings provide concrete features that MLLMs can learn to distinguish between authentic and fabricated content.
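The color finding is straightforward to operationalize: convert a detected font color from RGB to HSV and check which hue band it falls in. The sketch below uses Python's standard `colorsys` module; the `classify_font_hue` helper and its labels are my own illustration of the heuristic, not code from the paper.

```python
import colorsys

def hue_degrees(r: int, g: int, b: int) -> float:
    """Convert an RGB color (0-255 per channel) to its HSV hue in degrees."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360

def classify_font_hue(rgb) -> str:
    """Heuristic based on the paper's empirical analysis: hues of 0-5
    degrees (saturated reds) are more common in fake news overlays,
    while 25-30 degrees is more common in real news. The labels are
    illustrative; hue alone is a weak signal, not a verdict."""
    hue = hue_degrees(*rgb)
    if 0 <= hue <= 5:
        return "fake-leaning"
    if 25 <= hue <= 30:
        return "real-leaning"
    return "neutral"

print(classify_font_hue((230, 15, 10)))   # saturated red overlay
print(classify_font_hue((235, 130, 40)))  # formal orange tone
```

In practice such a feature would be one input among many, combined with text placement and footage-distribution cues rather than used on its own.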
Experimental Results and Key Takeaways
The study conducted comprehensive experiments with various MLLMs, including proprietary models like Gemini 2.5-Flash and GPT-4o-mini, and open-source models like Qwen2.5-VL and InternVL3. Key insights from these experiments include:
- Model Performance Varies: The Gemini 2.5-Flash model showed superior performance across most tasks, largely because it processes video streams directly, maintaining temporal coherence better than frame-based methods. Among open-source models, Qwen2.5-VL-72B performed best, owing to its larger scale and dynamic frame sampling.
- Challenges with Color Perception: All models struggled with accurately perceiving the font color of creator-added text, with the best model achieving only 47.47% accuracy. This suggests that MLLMs, often derived from language models, prioritize semantic understanding over fine-grained visual details like color.
- Temporal Grounding is Hard: Identifying specific time ranges for key elements within news videos proved challenging, especially since news videos are typically shorter and have fewer distinct key elements compared to general video datasets.
- Optimizing Frame Sampling: The research found that simply increasing the number of sampled frames doesn’t always improve accuracy. Instead, an optimal number of frames exists, which increases with video duration. Dynamic frame sampling strategies were more effective, especially for shorter news videos.
- Importance of Key Elements: Models perform better when there are sufficient key elements in the video and when they possess strong temporal localization capabilities. As the number of key elements increases, less capable models struggle to capture crucial information.
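The frame-sampling finding suggests a budget that grows with video duration rather than a fixed count. Here is a minimal sketch of such a policy; the specific numbers (`base`, `per_30s`, `cap`) and both function names are illustrative assumptions, since the paper reports the trend, not a formula.

```python
def dynamic_frame_count(duration_s: float, base: int = 8,
                        per_30s: int = 8, cap: int = 64) -> int:
    """Illustrative policy: scale the frame budget with video length,
    since the optimal number of frames increases with duration and
    over-sampling does not help. The constants are assumptions."""
    extra = int(duration_s // 30) * per_30s
    return min(base + extra, cap)

def sample_timestamps(duration_s: float) -> list[float]:
    """Evenly spaced timestamps (seconds) covering the clip for the
    chosen budget, taken at the midpoint of each segment."""
    n = dynamic_frame_count(duration_s)
    step = duration_s / n
    return [round(step * (i + 0.5), 2) for i in range(n)]

print(dynamic_frame_count(20))    # short clip -> small budget
print(dynamic_frame_count(120))   # longer clip -> larger budget
print(sample_timestamps(20))
```

A real pipeline would replace the uniform spacing with content-aware selection (e.g., around detected key elements), which is where the temporal-localization capability the paper highlights comes in.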
The MVFNDB and the insights derived from this research lay a strong foundation for future advancements in MLLMs for video fake news detection, moving towards more transparent, process-oriented, and accurate verification systems.