Interactive AI Streamlines Video Dataset Collection

TLDR: VC-Agent is an interactive AI system that uses multimodal large language models (MLLMs) to automate and scale the collection of customized video datasets from the internet. Users supply an initial query, accept or reject the videos the agent proposes, and comment on their rejections. The agent uses this feedback to refine its acceptance and rejection policies in each round, so it can gather high-quality, specific video data with far less time and manual labor than such tasks usually require.

Collecting large and specific video datasets for training artificial intelligence models has traditionally been a monumental task, often requiring extensive manual effort, which is both time-consuming and costly. Imagine needing thousands of videos of ‘cats lying down, but no black cats, and only close-up shots with a single cat.’ Manually sifting through the vastness of the internet for such precise criteria is a daunting prospect.

This challenge is precisely what a new interactive AI system, called VC-Agent, aims to solve. Developed by a team of researchers including Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, and Xiaoguang Han, VC-Agent is designed to automate and scale the collection of customized video datasets from the internet with minimal user input.

What is VC-Agent and How Does It Work?

VC-Agent is presented by its authors as the first interactive agent powered by Multimodal Large Language Models (MLLMs) for gathering tailored video datasets. It operates through a series of user-friendly interactions, progressively refining both its understanding of what the user wants and its video filtering.

The process begins with a simple user interface. A user starts by providing an ‘initial query,’ a rough textual description of the desired video dataset, such as “Please build a petting cat dataset.” The agent then proposes a selection of candidate videos. Users review these videos, providing ‘confirmations’ by accepting or rejecting clips. Crucially, for rejected videos, users can offer ‘comments’ explaining their reasons, like “No black cat” or “Cat should lie down.”
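
To make the three kinds of user input concrete, here is a minimal sketch of how one session's feedback could be recorded. The class and field names (`Confirmation`, `Session`, `rejection_comments`) and the clip IDs are illustrative assumptions, not taken from VC-Agent itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Confirmation:
    """One user decision on a proposed clip (hypothetical structure)."""
    video_id: str
    accepted: bool
    comment: Optional[str] = None  # free-text reason, used for rejections


@dataclass
class Session:
    """An initial query plus the feedback gathered in each round."""
    initial_query: str
    rounds: List[List[Confirmation]] = field(default_factory=list)

    def add_round(self, confirmations: List[Confirmation]) -> None:
        self.rounds.append(list(confirmations))

    def rejection_comments(self) -> List[str]:
        # Collect every comment attached to a rejected clip, in order.
        return [
            c.comment
            for round_ in self.rounds
            for c in round_
            if not c.accepted and c.comment
        ]


session = Session("Please build a petting cat dataset.")
session.add_round([
    Confirmation("clip_001", accepted=True),
    Confirmation("clip_002", accepted=False, comment="No black cat"),
    Confirmation("clip_003", accepted=False, comment="Cat should lie down"),
])
print(session.rejection_comments())
```

The comments on rejected clips are the raw material the agent later turns into filtering rules.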

Behind this intuitive interface, VC-Agent employs sophisticated ‘agent functions’ to process and act on user feedback:

  • Video Proposal: Based on the initial query, the agent generates keywords to search public video platforms. It then uses advanced video grounding models to identify and extract only the most relevant segments from the downloaded videos, ensuring that only pertinent clips are considered.
  • Filtering Policy: This is the core intelligence of VC-Agent, featuring two dynamically updated policies:
    • Attribute-Aware Rejection Policy: When a user rejects a video and provides a comment (e.g., “No black cat”), the agent’s MLLMs summarize these specific attributes (like ‘appearance’ and ‘black’) into a ‘negative standard table.’ In subsequent rounds, any candidate video matching these negative attributes will be discarded.
    • Template-Based Acceptance Policy: For videos confirmed by the user, the MLLMs analyze and describe their content. These descriptions are then aggregated into ‘positive criterion templates,’ which serve as examples of what to accept. Future videos that align with these templates are retained.
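
The two policies can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: the MLLM calls are replaced by hand-written attribute dictionaries and a simple word-overlap (Jaccard) score, and every name and the `threshold` value are hypothetical rather than part of VC-Agent.

```python
from typing import Dict, List, Set

# "Negative standard table": attribute category -> disallowed values.
negative_table: Dict[str, Set[str]] = {}
# "Positive criterion templates": descriptions of accepted clips, as word sets.
positive_templates: List[Set[str]] = []


def add_rejection(category: str, value: str) -> None:
    """Record an attribute summarized from a rejection comment."""
    negative_table.setdefault(category, set()).add(value)


def add_acceptance(description: str) -> None:
    """Record the description of a clip the user accepted."""
    positive_templates.append(set(description.lower().split()))


def violates(attrs: Dict[str, str]) -> bool:
    """True if any attribute of the candidate matches a negative entry."""
    return any(attrs.get(cat) in vals for cat, vals in negative_table.items())


def template_score(description: str) -> float:
    """Best Jaccard overlap between the candidate and any positive template."""
    words = set(description.lower().split())
    if not positive_templates or not words:
        return 0.0
    return max(len(words & t) / len(words | t) for t in positive_templates)


def keep(attrs: Dict[str, str], description: str, threshold: float = 0.3) -> bool:
    """Discard on a negative match first, then require a template match."""
    if violates(attrs):
        return False
    return template_score(description) >= threshold


add_rejection("appearance", "black")                  # from "No black cat"
add_acceptance("a tabby cat lying down in close-up")  # from an accepted clip

print(keep({"appearance": "black"}, "a black cat lying down in close-up"))
print(keep({"appearance": "white"}, "a white cat lying down in close-up"))
print(keep({"appearance": "brown"}, "a dog running on the beach"))
```

Checking the rejection table before the templates means a single forbidden attribute removes a clip even if it otherwise resembles accepted examples.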

This entire process is iterative. With each round of user interaction, the filtering policies become more refined and accurate. The agent even includes a ‘user-assisted double-check’ strategy for videos with low confidence scores, prompting users for specific feedback on ambiguous attributes, further enhancing robustness.
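
One way to picture the double-check step is as a confidence threshold on each attribute judgement. The sketch below assumes the agent attaches a confidence score to every attribute it evaluates. The function name, the data layout, and the 0.6 cutoff are illustrative assumptions, not details from the paper.

```python
from typing import Dict, List, Tuple


def attributes_to_double_check(
    judgements: Dict[str, Tuple[bool, float]],
    min_confidence: float = 0.6,  # hypothetical cutoff
) -> List[str]:
    """Return the attributes whose judgement is too uncertain to trust.

    `judgements` maps an attribute name to (passes_filter, confidence).
    """
    return sorted(
        name
        for name, (_, confidence) in judgements.items()
        if confidence < min_confidence
    )


candidate = {
    "appearance": (True, 0.95),
    "pose": (True, 0.45),       # ambiguous: is the cat really lying down?
    "shot_type": (False, 0.90),
}
print(attributes_to_double_check(candidate))
```

Only the uncertain attributes are sent back to the user, which keeps the extra questions short and specific.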

Once the user is consistently satisfied with the filtered videos over several rounds, VC-Agent transitions into a fully ‘automatic scale-up’ mode, using the finalized policies to collect a large-scale dataset without further human intervention.
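
The article does not say how "consistently satisfied" is measured. A simple stand-in, shown here purely as an assumption, is to require the user's acceptance rate to stay above a threshold for several consecutive rounds. The `window` and `min_rate` values are hypothetical.

```python
from typing import Sequence


def ready_for_scale_up(
    acceptance_rates: Sequence[float],
    window: int = 3,        # hypothetical number of consecutive rounds
    min_rate: float = 0.9,  # hypothetical satisfaction threshold
) -> bool:
    """True once the last `window` rounds all meet the acceptance threshold."""
    if len(acceptance_rates) < window:
        return False
    return all(rate >= min_rate for rate in acceptance_rates[-window:])


print(ready_for_scale_up([0.50, 0.70, 0.92, 0.95, 0.93]))
print(ready_for_scale_up([0.95, 0.95]))
print(ready_for_scale_up([0.95, 0.80, 0.95]))
```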

Impact and Benefits

According to the researchers' experiments and user studies, VC-Agent outperforms traditional video retrieval methods and other MLLMs, especially on complex and detailed requirements. Users reported high satisfaction with the data quality and a drastic reduction in collection time: work that might take weeks or months by hand needs only a few hours of user interaction with VC-Agent, followed by automated processing.

The collected datasets have real-world benefits. For instance, data gathered by VC-Agent was used to fine-tune text-to-video generative models, resulting in more realistic and detailed video outputs for specialized text inputs. Similarly, it helped improve pose estimation models for biped cartoon characters, a task where existing methods often struggle due to their focus on human subjects.

VC-Agent represents a significant step forward in customized data collection, offering a powerful tool for researchers and developers to build high-quality, domain-specific video datasets efficiently. For more details, you can refer to the full research paper: VC-Agent: An Interactive Agent for Customized Video Dataset Collection.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
