TLDR: Researchers developed MARLCC, a multi-agent reinforcement learning framework for video moment retrieval. It uses evidential learning to allow agents to compete and resolve conflicts, leading to improved accuracy in finding video moments. Crucially, it can detect ‘out-of-scope’ queries (queries with no matching video moment) in a zero-shot manner by observing high conflict among agents, eliminating the need for extra training.
Video moment retrieval is a fascinating area of artificial intelligence that helps us quickly find specific moments within long, untrimmed videos using simple text queries. Imagine trying to find “the person starts cooking with a pan” in a two-hour movie – this technology aims to pinpoint that exact scene for you. This capability is incredibly useful for various applications, from searching movie scenes to monitoring surveillance footage for specific events or analyzing athlete performance.
Traditionally, video moment retrieval models assume that the queried moment exists somewhere in the video. A significant challenge arises when a user’s query has no corresponding moment in the video at all, which is known as an “out-of-scope” query. Current systems often struggle with this: they either require additional training to detect such queries or fail to integrate different models effectively when their results conflict.
A new research paper titled “Who Can We Trust? Scope-Aware Video Moment Retrieval with Multi-Agent Conflict” by Chaochen Wu, Guan Luo, Meiyun Zuo, and Zhitao Fan introduces a novel approach to tackle these issues. The researchers propose a reinforcement learning-based model that not only accurately locates video moments but also intelligently handles conflicts between different models and identifies out-of-scope queries without needing extra training.
A Multi-Agent System for Smarter Retrieval
The core of their innovation is a multi-agent system (MAS) framework called MARLCC (Multi-Agent RL Competition and Conflict). It runs multiple independent “agents” (essentially different AI models) on the same video moment retrieval task. One of these agents is a newly proposed model called ESRL (Evidential Scanner for RL-based MR), which scans the entire video to find moment boundaries and uses evidential learning to attach evidence to its predictions.
Evidential learning is a key component. When an agent makes a prediction about a moment’s location, it also generates “evidence” and “uncertainty” for that prediction. Think of it as the agent not just saying “this is the moment,” but also “I’m this confident about it.” This allows the system to understand how reliable each agent’s output is.
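To make “evidence” and “uncertainty” concrete, here is a minimal Python sketch of one common formulation from evidential deep learning, in which non-negative evidence values parameterize a Dirichlet distribution and the uncertainty shrinks as total evidence grows. The article does not specify which formulation ESRL uses, so this formulation, the function name `evidential_summary`, and the example numbers are all assumptions for illustration only.

```python
def evidential_summary(evidence):
    """Turn per-class evidence into belief masses and one uncertainty value.

    Assumed formulation (subjective logic over a Dirichlet distribution):
      alpha_k = evidence_k + 1, S = sum(alpha),
      belief_k = evidence_k / S, uncertainty = K / S.
    Beliefs and uncertainty always sum to 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


# Hypothetical example: strong evidence for the first of two outcomes.
beliefs, uncertainty = evidential_summary([8.0, 0.0])
print(beliefs, uncertainty)
```

With no evidence at all, the uncertainty is 1; as evidence accumulates, it approaches 0, which matches the intuition of an agent reporting how confident it is.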
Competition and Conflict Among Agents
MARLCC leverages two main concepts: competition and conflict. In the “competition” aspect, different agents independently propose their best-located moments. The system then uses the “evidence” generated by each agent to determine a “trusted IoU” (Intersection over Union) score, an evidence-based estimate of how well the predicted moment overlaps with the actual moment. The agent with the highest trusted IoU is declared the “winner,” and its result is chosen as the final output. This allows the system to combine the strengths of various agents.
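The competition step can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the `Proposal` class, its field names, the agent names, and the scores are hypothetical, and the trusted IoU is simply taken as a number each agent supplies, since the article does not give the formula that derives it from evidence. The `temporal_iou` helper shows the standard IoU for two time intervals.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    agent: str          # hypothetical agent name
    start: float        # predicted start time in seconds
    end: float          # predicted end time in seconds
    trusted_iou: float  # assumed to be supplied by the agent's evidence


def temporal_iou(a, b):
    """Standard IoU of two (start, end) time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def pick_winner(proposals):
    """Return the proposal whose agent reports the highest trusted IoU."""
    if not proposals:
        raise ValueError("at least one proposal is required")
    return max(proposals, key=lambda p: p.trusted_iou)


# Hypothetical proposals from three agents for the same query.
proposals = [
    Proposal("agent_a", 12.0, 20.5, trusted_iou=0.61),
    Proposal("agent_b", 11.0, 19.0, trusted_iou=0.74),
    Proposal("agent_c", 30.0, 41.0, trusted_iou=0.22),
]
winner = pick_winner(proposals)
print(winner.agent, winner.start, winner.end)
```

Selecting a single winner, rather than averaging the boundaries, keeps one agent's poor prediction from dragging down another's good one.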
The “conflict” aspect is where the system shines in detecting out-of-scope queries. The researchers observed a significant phenomenon: when a query is out-of-scope (meaning there’s no matching moment in the video), the different agents tend to have much higher disagreement or “conflict” in their proposed moment locations. This conflict is measured by the difference in their predicted start and end timestamps. By setting a threshold, MARLCC can identify these high-conflict scenarios as out-of-scope queries in a “zero-shot” manner – meaning it doesn’t need to be specifically trained on out-of-scope examples. This is a major advantage for real-world applications.
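A minimal Python sketch of this idea follows. The article says only that conflict is measured from differences in predicted start and end timestamps and compared with a threshold, so the specifics here are assumptions: the score is the mean pairwise absolute difference in start and end times, normalized by video duration, and the threshold value of 0.5 is an arbitrary placeholder.

```python
from itertools import combinations


def conflict_score(moments, duration):
    """Mean pairwise disagreement between agents' (start, end) predictions.

    Assumed measure: |start_i - start_j| + |end_i - end_j| averaged over
    all agent pairs and divided by the video duration.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    pairs = list(combinations(moments, 2))
    if not pairs:
        return 0.0  # a single agent cannot conflict with itself
    total = sum(abs(s1 - s2) + abs(e1 - e2) for (s1, e1), (s2, e2) in pairs)
    return total / (len(pairs) * duration)


def is_out_of_scope(moments, duration, threshold=0.5):
    """Flag the query as out-of-scope when agents disagree strongly."""
    return conflict_score(moments, duration) > threshold


# Hypothetical predictions on a 100-second video.
agreeing = [(10.0, 20.0), (12.0, 21.0), (11.0, 19.0)]
scattered = [(0.0, 10.0), (50.0, 90.0), (30.0, 40.0)]
print(is_out_of_scope(agreeing, 100.0), is_out_of_scope(scattered, 100.0))
```

Because the check uses only the agents' existing outputs, it needs no out-of-scope training examples; in practice the threshold would have to be chosen for the dataset at hand.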
Improved Performance and Real-World Applications
Extensive experiments on benchmark datasets like Charades-STA and ActivityNet-Captions demonstrated the effectiveness of MARLCC. The system achieved state-of-the-art results compared to other reinforcement learning-based methods, and even outperformed some non-RL approaches. The ability to detect out-of-scope queries with high accuracy without additional training is particularly valuable, as it prevents the model’s primary moment retrieval ability from being weakened by an extra detection task.
This research opens new avenues for more robust and intelligent video search applications. When a query has no match, the system can say so rather than return a potentially incorrect or irrelevant result. The findings also highlight the power of modeling competition and conflict within multi-agent systems to enhance reinforcement learning performance.


