TLDR: Current video search (Moment Retrieval) typically focuses on finding a single relevant clip for a query, which often oversimplifies real-world scenarios. This research introduces Multi-Moment Retrieval (MMR) to identify all relevant temporal segments. The paper presents QV-M2, the first fully human-annotated dataset specifically for MMR, along with new evaluation metrics. It also proposes FlashMMR, a novel framework featuring a Multi-Moment Post-Verification module that refines moment boundaries and enhances semantic consistency. FlashMMR significantly outperforms existing methods, setting a strong baseline for this more comprehensive form of video understanding.
In the evolving landscape of artificial intelligence, understanding and interacting with video content remains a significant challenge. One key area is Moment Retrieval (MR), where the goal is to pinpoint specific video segments that match a natural language query. Traditionally, most methods have focused on Single-Moment Retrieval (SMR), assuming that a query corresponds to just one relevant moment in a video. However, real-world scenarios are often far more complex.
Imagine searching an instructional video for “cutting vegetables.” An SMR system might only show you the first instance of chopping. But what if the video features several different types of vegetables being cut at various points? A single moment simply isn’t enough to capture the full context of the query. This limitation highlights a crucial gap between current MR techniques and practical applications.
Introducing Multi-Moment Retrieval (MMR)
To bridge this gap, researchers Zhuo Cao, Heming Du, Bingqing Zhang, Xin Yu, Xue Li, and Sen Wang from The University of Queensland have introduced a new paradigm: Multi-Moment Retrieval (MMR). Their work, detailed in the paper “When One Moment Isn’t Enough: Multi-Moment Retrieval with Cross-Moment Interactions,” addresses the need to identify multiple relevant, non-overlapping moments for a single query.
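To make the task definition concrete, the snippet below is a minimal, hypothetical sketch of what an MMR annotation looks like: one natural language query paired with several non-overlapping temporal spans. The field names and timestamps are illustrative and are not drawn from QV-M2.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MMRAnnotation:
    """One MMR example: a query paired with all of its relevant moments."""
    query: str
    moments: List[Tuple[float, float]]  # non-overlapping (start_sec, end_sec) spans

# Illustrative values only; not taken from the QV-M2 dataset.
example = MMRAnnotation(
    query="a person is cutting vegetables",
    moments=[(12.0, 25.0), (88.0, 102.0), (210.0, 224.0)],
)

# An SMR system would return a single span; MMR asks for all three.
print(len(example.moments))  # 3
```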
A New Dataset for Realistic Video Understanding: QV-M2
A major hurdle for advancing MMR has been the lack of suitable datasets and evaluation metrics. To tackle this, the team developed QV-M2 (QVHighlights Multi-Moment Dataset). This high-quality, human-annotated dataset is built upon the widely used QVHighlights dataset but explicitly accounts for queries with multiple relevant moments. QV-M2 features 2,212 new queries linked to 1,341 videos, covering a total of 6,384 annotated temporal moments. On average, each query in QV-M2 corresponds to 2.9 moments, well above the single moment per query assumed by earlier datasets.
The creation of QV-M2 involved a meticulous manual annotation process, ensuring detailed queries that capture actors, actions, and contexts, including context-dependent and even negative queries. This rigorous approach makes QV-M2 the first fully human-annotated dataset specifically designed for MMR benchmarking.
Comprehensive Evaluation with New Metrics
Alongside the dataset, the researchers also proposed a comprehensive suite of new evaluation metrics tailored for MMR. These metrics extend standard measures like mean Average Precision (mAP) and Intersection-over-Union (IoU) to effectively assess performance in multi-moment scenarios. Key new metrics include Generalized mAP (G-mAP), Mean IoU@k, and Mean Recall@k, which provide a more nuanced understanding of how well models identify all relevant moments and their temporal accuracy.
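As a point of reference, the sketch below shows how temporal IoU between two moments is computed, together with one plausible reading of a Mean IoU@k-style score in which each ground-truth moment is greedily matched to its best-overlapping prediction among the top k. The IoU formula is standard; the matching rule is an assumption and may differ from the paper's exact definitions.

```python
from typing import List, Optional, Set, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-Union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou_at_k(predictions: List[Segment],
                  ground_truths: List[Segment],
                  k: int = 3) -> float:
    """Hypothetical Mean IoU@k: greedily match each ground-truth moment to its
    best unmatched prediction among the top k (assumed sorted by confidence),
    then average the matched IoUs over all ground-truth moments."""
    top_k = predictions[:k]
    used: Set[int] = set()
    ious: List[float] = []
    for gt in ground_truths:
        best_iou, best_idx = 0.0, None  # type: float, Optional[int]
        for i, pred in enumerate(top_k):
            if i not in used:
                iou = temporal_iou(pred, gt)
                if iou > best_iou:
                    best_iou, best_idx = iou, i
        if best_idx is not None:
            used.add(best_idx)
        ious.append(best_iou)
    return sum(ious) / len(ious) if ious else 0.0
```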
FlashMMR: A Novel Framework for Multi-Moment Retrieval
To further advance the field, the paper introduces FlashMMR, a novel framework explicitly designed for MMR. FlashMMR extends traditional SMR pipelines by incorporating a crucial Multi-Moment Post-Verification module. This module is key to refining moment boundaries and ensuring that retrieved moments are semantically consistent with the query, while also filtering out low-confidence or irrelevant predictions.
The framework involves several stages: Feature Extraction and Fusion, where video and text features are aligned; Multi-Scale Temporal Processing, which captures temporal variations across different moment durations; and the Post-Verification Module, which refines initial predictions through structured post-processing and semantic consistency control. This sophisticated pipeline helps FlashMMR achieve robust multi-moment alignment.
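The sketch below is an illustrative skeleton of that three-stage flow, not the authors' implementation: the callables for fusion, proposal, and verification are placeholders, and the final non-overlap filtering step is an assumption about how multiple moments might be kept.

```python
from typing import Callable, List, Tuple

Moment = Tuple[float, float, float]  # (start_sec, end_sec, confidence)

def flashmmr_sketch(
    video,                               # raw video or precomputed clip features
    query: str,
    encode_and_fuse: Callable,           # stage 1: query-conditioned video features
    propose_moments: Callable,           # stage 2: multi-scale proposals -> List[Moment]
    verify_moment: Callable,             # stage 3: refines one moment -> Moment
    min_confidence: float = 0.5,
) -> List[Moment]:
    """Illustrative skeleton of the three stages described above; the callables
    are placeholders, not the authors' actual modules."""
    # 1) Feature extraction and fusion: align video and text representations.
    fused = encode_and_fuse(video, query)

    # 2) Multi-scale temporal processing: candidate moments at several durations.
    candidates = propose_moments(fused)

    # 3) Multi-moment post-verification: refine boundaries, check semantic
    #    consistency with the query, and drop low-confidence predictions.
    verified: List[Moment] = []
    for moment in candidates:
        start, end, conf = verify_moment(fused, moment)
        if conf >= min_confidence:
            verified.append((start, end, conf))

    # Keep non-overlapping moments, highest confidence first (an assumption
    # about how the final multi-moment output is assembled).
    verified.sort(key=lambda m: m[2], reverse=True)
    kept: List[Moment] = []
    for m in verified:
        if all(min(m[1], k[1]) <= max(m[0], k[0]) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m[0])
```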
Promising Results and Future Directions
Extensive experiments demonstrate the effectiveness of both the QV-M2 dataset and the FlashMMR framework. Models trained with QV-M2 consistently show improved performance, highlighting the dataset’s value. FlashMMR itself significantly outperforms prior state-of-the-art methods across all MMR metrics. For instance, on QV-M2, FlashMMR achieved improvements of 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3 over the previous best method.
An ablation study confirmed the critical role of the Post-Verification module, showing consistent performance gains across both QV-M2 and QVHighlights datasets. While these advancements mark a significant step forward, the authors acknowledge that challenges remain, such as exploring more advanced verification strategies and the need for even larger, high-quality annotated datasets to support future progress.
This research lays a strong foundation for more realistic and challenging video temporal grounding scenarios, pushing the boundaries of how AI understands and retrieves information from complex video content. You can read the full research paper here.


