
Video-As-Prompt: A Unified Framework for Semantic Video Generation

TLDR: Video-As-Prompt (VAP) introduces a new method for controlling video generation that uses a reference video as a semantic guide. It pairs a frozen Video Diffusion Transformer with a Mixture-of-Transformers (MoT) expert and a temporally biased position embedding. Trained on VAP-Data, a new large-scale dataset, VAP achieves state-of-the-art results among open-source methods, offering unified and generalizable semantic control across conditions such as concept, style, motion, and camera, and even demonstrating zero-shot capabilities.

A new research paper introduces a groundbreaking approach to video generation called Video-As-Prompt (VAP), aiming to solve the long-standing challenge of achieving unified and generalizable semantic control. This innovative method, developed by researchers from Intelligent Creation Lab, ByteDance, and The Chinese University of Hong Kong, reframes video generation as an in-context process, using a reference video as a direct semantic guide.

Current methods for controlling video generation often fall short. Structure-based controls, which rely on pixel-aligned conditions like depth or pose, can introduce unwanted artifacts when applied to semantic tasks. Other approaches are either too specific, requiring costly fine-tuning for each semantic condition (like a ‘Ghibli style’ or ‘Hitchcock camera zoom’), or involve designing unique architectures for every task. These limitations hinder the creation of a single, versatile model that can handle diverse semantic controls and generalize to new, unseen conditions.

VAP tackles these issues by leveraging a reference video as a ‘video prompt’ that directly conveys the desired semantics. This prompt guides a pre-trained, frozen Video Diffusion Transformer (DiT) through a clever addition: a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture is crucial because it prevents ‘catastrophic forgetting,’ ensuring the core DiT model retains its powerful generation abilities while gaining new control. The MoT expert and the frozen DiT work in parallel, exchanging information to synchronize guidance at each layer.
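To make the parallel-branch idea concrete, here is a minimal PyTorch sketch of how a frozen DiT layer and a trainable expert layer might exchange information through attention over a joint token sequence. This is an illustration under stated assumptions, not the authors' implementation: the names (`Block`, `VAPLayer`), the simplified block structure, and all dimensions are invented for clarity.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Bare-bones pre-norm transformer block standing in for one DiT layer."""
    def __init__(self, d, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, q_tokens, kv_tokens):
        # Queries come from this branch only; keys/values come from the joint
        # sequence, so the two branches exchange information at every layer.
        q = self.norm1(q_tokens)
        kv = self.norm1(kv_tokens)
        attn_out, _ = self.attn(q, kv, kv)
        x = q_tokens + attn_out
        return x + self.mlp(self.norm2(x))

class VAPLayer(nn.Module):
    """One layer: a frozen DiT branch updates target tokens, a trainable
    expert branch updates reference-prompt tokens, both attending over the
    concatenated sequence."""
    def __init__(self, d):
        super().__init__()
        self.dit = Block(d)      # stands in for a pretrained DiT layer; frozen
        self.expert = Block(d)   # plug-and-play expert; the only part trained
        for p in self.dit.parameters():
            p.requires_grad = False

    def forward(self, target, reference):
        joint = torch.cat([reference, target], dim=1)   # shared context
        return self.dit(target, joint), self.expert(reference, joint)

layer = VAPLayer(64)
target = torch.randn(2, 16, 64)      # noisy target-video tokens (B, L, d)
reference = torch.randn(2, 16, 64)   # reference-video prompt tokens
target, reference = layer(target, reference)  # shapes are preserved
```

Freezing the DiT branch while training only the expert is what preserves the base model's generative prior: the pretrained weights never receive gradient updates, so there is nothing to catastrophically forget.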

A key innovation in VAP is its ‘temporally biased position embedding.’ Traditional position embeddings can impose a false pixel-level mapping between the reference and target videos, leading to unsatisfactory results. VAP’s biased embedding shifts the reference prompt’s temporal indices, effectively removing this spurious prior and aligning with the temporal order expected for in-context generation, while maintaining spatial consistency.
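The effect of the biased indices can be shown in a few lines. The sketch below builds 3D (t, h, w) position indices for a video's tokens; the specific shift rule (moving the reference back by its own length) and all shapes are assumptions for illustration, not the paper's exact scheme.

```python
import torch

def video_positions(num_frames, height, width, t_offset=0):
    """Return (T*H*W, 3) rows of (t, h, w) indices, with frames shifted by t_offset."""
    t = torch.arange(num_frames) + t_offset
    h = torch.arange(height)
    w = torch.arange(width)
    return torch.cartesian_prod(t, h, w)   # one (t, h, w) triple per token

T, H, W = 8, 4, 4
target_pos = video_positions(T, H, W)       # target frames occupy t in [0, 8)
# Naive choice: the reference reuses the target's indices, telling the model
# the two videos are frame-by-frame, pixel-aligned -- a spurious prior for
# semantic (non-aligned) control.
naive_ref_pos = video_positions(T, H, W)
# Temporally biased choice: shift the reference back by its own length so it
# reads as preceding context (t in [-8, 0)), while h and w stay shared so
# spatial structure remains comparable.
biased_ref_pos = video_positions(T, H, W, t_offset=-T)
```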

To support this new paradigm and encourage further research, the team also built VAP-Data, which is currently the largest dataset for semantic-controlled video generation. This extensive dataset comprises over 100,000 paired videos across 100 semantic conditions, providing a robust foundation for training and evaluating unified semantic control models. The conditions are categorized into concept (like entity transformation), style, motion (human and non-human), and camera movement.
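For readers curious what a training pair in such a dataset might look like, here is a purely illustrative Python schema; the `Sample` class and its field names are assumptions, not the released VAP-Data format.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    reference_video: str    # clip carrying the semantic condition
    target_video: str       # paired clip exhibiting that condition
    category: str           # one of: "concept", "style", "motion", "camera"
    condition: str          # e.g. "style/ghibli" or "camera/hitchcock_zoom"
    reference_caption: str
    target_caption: str

pair = Sample(
    reference_video="ref_0001.mp4",
    target_video="tgt_0001.mp4",
    category="camera",
    condition="camera/hitchcock_zoom",
    reference_caption="A dolly zoom closing in on a street corner.",
    target_caption="A dolly zoom closing in on a mountain cabin.",
)
```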

The performance of VAP is impressive. As a single, unified model, it sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that is competitive with leading condition-specific commercial models. VAP demonstrates strong zero-shot generalization, meaning it can apply unseen semantic conditions from new reference videos without additional training. This capability is a significant step towards truly general-purpose, controllable video generation.

The researchers conducted extensive comparisons, showing that VAP outperforms existing structure-controlled methods, which struggle with semantic tasks due to their pixel-mapping bias. It also surpasses condition-specific fine-tuning approaches like LoRA, which, while achieving strong semantic alignment, require a separate model for each condition and lack generalization. VAP’s unified approach, treating all semantic conditions as a reference-video prompt, is a major differentiator.

While VAP represents a significant advance, the authors acknowledge certain limitations. The VAP-Data dataset, though large, is synthetic and derived from other generative models, so it may inherit their stylistic biases or artifacts. Generation quality can also be influenced by the alignment between reference and target captions, as well as by structural similarity between subjects. Future work will explore larger-scale, real-world datasets and more sophisticated multi-reference control mechanisms. For more technical details, see the full research paper.

In conclusion, Video-As-Prompt offers a powerful and flexible framework for semantic-controlled video generation, moving beyond fragmented, condition-specific solutions towards a unified and generalizable model that can interpret and transfer complex semantic cues from reference videos.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
