Interactive AI Streamlines Video Dataset Collection

TLDR: VC-Agent is an interactive AI system that uses multimodal large language models (MLLMs) to automate and scale the collection of customized video datasets from the internet. It works through iterative user feedback, where users provide initial queries, confirm or reject proposed videos, and offer comments. This feedback continuously refines the agent’s filtering policies (both acceptance and rejection), allowing it to efficiently gather high-quality, specific video data with minimal human effort, significantly reducing the time and labor traditionally required for such tasks.

Collecting large, specific video datasets for training artificial intelligence models has traditionally required extensive manual effort, which is both time-consuming and costly. Imagine needing thousands of videos of ‘cats lying down, but no black cats, and only close-up shots with a single cat.’ Manually sifting through the internet for clips that meet such precise criteria is a daunting prospect.

This challenge is precisely what a new interactive AI system, called VC-Agent, aims to solve. Developed by a team of researchers including Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, and Xiaoguang Han, VC-Agent is designed to automate and scale the collection of customized video datasets from the internet with minimal user input.

What is VC-Agent and How Does It Work?

VC-Agent is presented as the first interactive agent powered by Multimodal Large Language Models (MLLMs) for gathering tailored video datasets. It operates through a series of user-friendly interactions, progressively refining both its understanding of what the user wants and its video filtering.

The process begins with a simple user interface. A user starts by providing an ‘initial query,’ a rough textual description of the desired video dataset, such as “Please build a petting cat dataset.” The agent then proposes a selection of candidate videos. Users review these videos, providing ‘confirmations’ by accepting or rejecting clips. Crucially, for rejected videos, users can offer ‘comments’ explaining their reasons, like “No black cat” or “Cat should lie down.”

Behind this intuitive interface, VC-Agent employs sophisticated ‘agent functions’ to process and act on user feedback:

Video Proposal: Based on the initial query, the agent generates keywords to search public video platforms. It then uses advanced video grounding models to identify and extract only the most relevant segments from the downloaded videos, ensuring that only pertinent clips are considered.
Filtering Policy: This is the core intelligence of VC-Agent, featuring two dynamically updated policies:

Attribute-Aware Rejection Policy: When a user rejects a video and provides a comment (e.g., “No black cat”), the agent’s MLLMs summarize the comment into specific attributes (here, the attribute ‘appearance’ with the value ‘black’) and record them in a ‘negative standard table.’ In subsequent rounds, any candidate video matching these negative attributes is discarded.
Template-Based Acceptance Policy: For videos confirmed by the user, the MLLMs analyze and describe their content. These descriptions are then aggregated into ‘positive criterion templates,’ which serve as examples of what to accept. Future videos that align with these templates are retained.
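
The two policies can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the class name FilteringPolicy, its methods, and the min_overlap threshold are hypothetical, and each video is assumed to have already been described by an MLLM as a plain attribute-to-value dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FilteringPolicy:
    """Hypothetical container for the two feedback-driven policies."""

    # Negative standard table: attribute name -> values the user rejected.
    negative_table: dict[str, set[str]] = field(default_factory=dict)
    # Positive criterion templates: descriptions of videos the user accepted.
    positive_templates: list[dict[str, str]] = field(default_factory=list)

    def add_rejection(self, attribute: str, value: str) -> None:
        """Record one rejected attribute value, e.g. ("appearance", "black")."""
        self.negative_table.setdefault(attribute, set()).add(value)

    def add_acceptance(self, description: dict[str, str]) -> None:
        """Store the description of an accepted video as a template."""
        self.positive_templates.append(description)

    def is_rejected(self, video: dict[str, str]) -> bool:
        """True if any attribute of the video hits the negative table."""
        return any(
            video.get(attribute) in values
            for attribute, values in self.negative_table.items()
        )

    def matches_template(self, video: dict[str, str], min_overlap: float = 0.5) -> bool:
        """True if the video shares enough attributes with some template.

        min_overlap is an assumed threshold; with no templates yet,
        nothing is filtered out on the acceptance side.
        """
        if not self.positive_templates:
            return True
        for template in self.positive_templates:
            if not template:
                continue
            shared = sum(1 for key, value in template.items() if video.get(key) == value)
            if shared / len(template) >= min_overlap:
                return True
        return False

    def keep(self, video: dict[str, str]) -> bool:
        """Rejection is checked first, then acceptance."""
        return not self.is_rejected(video) and self.matches_template(video)


policy = FilteringPolicy()
policy.add_rejection("appearance", "black")                        # "No black cat"
policy.add_acceptance({"pose": "lying down", "shot": "close-up"})  # a confirmed clip

white_cat = {"appearance": "white", "pose": "lying down", "shot": "close-up"}
black_cat = {"appearance": "black", "pose": "lying down", "shot": "close-up"}
print(policy.keep(white_cat), policy.keep(black_cat))
```

In the system described above, the matching is done by MLLMs reasoning over video content rather than by exact string comparison; the dictionaries here only stand in for that step.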

This entire process is iterative. With each round of user interaction, the filtering policies become more refined and accurate. The agent even includes a ‘user-assisted double-check’ strategy for videos with low confidence scores, prompting users for specific feedback on ambiguous attributes, further enhancing robustness.
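
One way to picture the double-check step is as a simple confidence split. The sketch below is an assumption-laden illustration: the function name split_for_double_check, the (video_id, accepted, confidence) tuple layout, and the 0.7 threshold are all hypothetical and do not come from the paper.

```python
def split_for_double_check(decisions, min_confidence=0.7):
    """Separate confident decisions from those that need user review.

    decisions: iterable of (video_id, accepted, confidence) tuples, where
    confidence is assumed to be a score in [0, 1] attached to the agent's
    accept/reject decision. min_confidence is an assumed threshold.
    """
    automatic = []      # (video_id, accepted) pairs handled without the user
    needs_review = []   # video ids sent back to the user for feedback
    for video_id, accepted, confidence in decisions:
        if confidence < min_confidence:
            needs_review.append(video_id)
        else:
            automatic.append((video_id, accepted))
    return automatic, needs_review


automatic, needs_review = split_for_double_check(
    [("clip_a", True, 0.95), ("clip_b", False, 0.40), ("clip_c", False, 0.90)]
)
print(automatic)
print(needs_review)
```

Only the ambiguous clips reach the user, which keeps the number of manual checks per round small.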

Once the user is consistently satisfied with the filtered videos over several rounds, VC-Agent transitions into a fully ‘automatic scale-up’ mode, using the finalized policies to collect a large-scale dataset without further human intervention.
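
The article does not say exactly how “consistently satisfied” is measured. As a minimal sketch, one could track, for each round, the fraction of proposed videos the user accepted and switch modes once that rate stays high for several consecutive rounds. The function name ready_for_scale_up, the window of 3 rounds, and the 0.9 threshold are assumptions made for illustration.

```python
def ready_for_scale_up(round_agreement, window=3, min_agreement=0.9):
    """Decide whether to switch to automatic collection.

    round_agreement: per-round fraction of filtered videos the user
    accepted, oldest first. window and min_agreement are assumed values.
    """
    if len(round_agreement) < window:
        return False  # not enough rounds observed yet
    recent = round_agreement[-window:]
    return all(rate >= min_agreement for rate in recent)


history = [0.55, 0.80, 0.92, 0.95, 1.00]
print(ready_for_scale_up(history))
```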

Impact and Benefits

According to the authors’ experiments and user studies, VC-Agent is both effective and efficient. It outperforms traditional video retrieval methods and other MLLMs, especially on complex and detailed requirements. Users reported high satisfaction with the data quality and noted a drastic reduction in collection time: work that might take weeks or months manually takes a matter of hours of actual user interaction with VC-Agent, followed by automated processing.

The collected datasets have real-world benefits. For instance, data gathered by VC-Agent was used to fine-tune text-to-video generative models, resulting in more realistic and detailed video outputs for specialized text inputs. Similarly, it helped improve pose estimation models for biped cartoon characters, a task where existing methods often struggle due to their focus on human subjects.

VC-Agent represents a significant step forward in customized data collection, offering a powerful tool for researchers and developers to build high-quality, domain-specific video datasets efficiently. For more details, you can refer to the full research paper: VC-Agent: An Interactive Agent for Customized Video Dataset Collection.
