TLDR: A new research paper introduces a two-stage training strategy for instruction-based video editing that sharply reduces the need for large, expensive paired datasets. By pretraining a foundation model on roughly 1 million unpaired video clips to learn basic editing concepts, then fine-tuning it on fewer than 150,000 high-quality editing pairs, the method outperforms existing approaches by 12% in instruction following and 15% in editing quality, making high-quality video editing more accessible and efficient.
Instruction-based video editing, where users simply describe desired changes in text and have a video transformed accordingly, has long been a challenging frontier in artificial intelligence. While image editing has advanced rapidly, video editing lags behind because of the immense difficulty and cost of creating large datasets of paired original and edited videos. Imagine needing millions of examples of a video before and after a specific edit; the resources required are staggering.
A new research paper, titled “In-Context Learning with Unpaired Clips for Instruction-Based Video Editing,” introduces an innovative approach to overcoming this data scarcity. Authored by Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin, the work proposes a low-cost pretraining strategy that leverages readily available, unpaired video clips. This lets a foundational video generation model learn general editing capabilities, such as adding, replacing, or deleting elements, from simple instructions.
A Two-Stage Training Breakthrough
The core of this new framework lies in its two-stage training strategy. First, a foundation video generation model, built upon HunyuanVideoT2V, undergoes a pretraining phase using approximately 1 million real video clips. During this stage, the model learns fundamental editing concepts by observing natural variations between different clips from the same scene. For instance, if two clips from a continuous video segment show a person moving slightly or an object changing position, the model learns to interpret these differences as potential ‘edits’ and generates instructions describing them.
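In code, building one such pretraining example might look something like the minimal sketch below. Here `scene_clips` (clips cut from one continuous segment) and `caption_model.describe_difference` are hypothetical stand-ins for illustration, not the authors' actual components:

```python
import random

def make_pseudo_edit_pair(scene_clips, caption_model):
    """Treat the natural differences between two clips from the same
    scene as a pseudo 'edit'. All names here are illustrative."""
    source, target = random.sample(scene_clips, 2)
    # A vision-language model phrases the differences between the clips
    # as an editing instruction, e.g. "move the cup closer to the edge".
    instruction = caption_model.describe_difference(source, target)
    return {"source": source, "target": target, "instruction": instruction}
```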
This pretraining is crucial because real video clips offer high visual quality, free from the artifacts often found in synthetically generated data. By learning from these diverse, real-world examples, the model develops a strong understanding of how to preserve contextual information like scene layout, character identity, and object appearance, which is vital for realistic video editing.
Following this extensive pretraining, the model enters a supervised fine-tuning (SFT) stage. Here, it is refined using a much smaller dataset of fewer than 150,000 high-quality, curated editing pairs. These pairs are specifically designed to teach the model more complex and stylized editing tasks, further enhancing its ability to follow precise instructions and improve overall editing quality. The researchers found that this small amount of high-quality data is sufficient to significantly boost the model’s performance without causing it to overfit.
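Conceptually, both stages optimize the same editing objective and differ only in their data. A rough Python sketch of the schedule, where the loss API, loaders, and variable names are assumptions rather than details from the paper:

```python
def run_stage(model, optimizer, loader):
    """One pass over a stage's data; editing_loss is an assumed
    placeholder for the model's (source, instruction) -> target objective."""
    for batch in loader:
        loss = model.editing_loss(batch["source"], batch["instruction"], batch["target"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1: ~1M pseudo-pairs from unpaired clips teach basic edit concepts.
run_stage(model, optimizer, pretrain_loader)
# Stage 2: <150k curated pairs sharpen instruction following and quality.
run_stage(model, optimizer, sft_loader)
```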
Smart Data Curation
The success of this two-stage approach relies heavily on intelligent data curation. For the pretraining phase, raw videos are segmented into short clips, and two clips from the same segment are randomly selected as the ‘original’ and ‘pseudo-edited’ videos. An AI system then generates an instruction describing the differences between them. The resulting instructions are augmented by rewriting action verbs (e.g., ‘replace’ to ‘change’) and filtered to discard trivial ones.
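A sketch of what that augmentation and filtering could look like in practice; the synonym table and the triviality heuristic below are examples of the idea, not the paper's exact rules:

```python
import random

# Example verb-rewriting table; the paper's actual synonym set may differ.
VERB_SYNONYMS = {
    "replace": ["change", "swap"],
    "add": ["insert", "place"],
    "delete": ["remove", "erase"],
}

def rewrite_verbs(instruction: str) -> str:
    # Randomly swap known action verbs for a synonym to diversify phrasing.
    return " ".join(
        random.choice(VERB_SYNONYMS[word.lower()])
        if word.lower() in VERB_SYNONYMS else word
        for word in instruction.split()
    )

def is_trivial(instruction: str) -> bool:
    # Drop near-empty or no-op instructions, e.g. "keep everything the same".
    return len(instruction.split()) < 3 or "keep" in instruction.lower()
```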
For the SFT stage, a synthetic data pipeline was designed to create high-quality editing pairs. This involves using a video inpainting model (VACE) to modify specific regions of a video based on masks and captions, then generating an editing instruction for the transformation. A rigorous multi-stage filtering process, including evaluation by advanced AI models like Qwen2.5-VL and GPT-5, ensures only the highest quality samples are retained, balancing different editing types to prevent bias.
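The filtering stage could be structured as a series of gates plus a balancing quota, as in the sketch below. The `vlm_judge` and `llm_judge` objects stand in for scoring calls to models such as Qwen2.5-VL and GPT-5, and the threshold and quota values are assumptions:

```python
from collections import Counter

def filter_sft_samples(samples, vlm_judge, llm_judge, quota_per_type, threshold=0.8):
    """Multi-stage quality filter with per-edit-type balancing (illustrative)."""
    kept, counts = [], Counter()
    for sample in samples:
        if vlm_judge.score(sample) < threshold:   # visual-quality gate
            continue
        if llm_judge.score(sample) < threshold:   # instruction-faithfulness gate
            continue
        if counts[sample["edit_type"]] >= quota_per_type:  # balance edit types
            continue
        counts[sample["edit_type"]] += 1
        kept.append(sample)
    return kept
```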
Superior Performance
The experimental results are compelling. The new method significantly outperforms existing instruction-based video editing approaches in both instruction following and the visual fidelity of the generated videos, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality over previous state-of-the-art models. It also scored higher on metrics such as subject consistency, motion smoothness, and temporal flickering, indicating highly stable and natural-looking edits.
Ablation studies confirmed the effectiveness of the two-stage strategy, showing that pretraining on video clips is essential for acquiring basic editing capabilities and that the subsequent fine-tuning rapidly extends these capabilities to a wider range of tasks. This innovative approach offers a promising path forward for instruction-based video editing, making high-quality video transformations more accessible and efficient. You can read the full research paper here: In-Context Learning with Unpaired Clips for Instruction-Based Video Editing.


