TLDR: O-DisCo-Edit is a new video editing framework that uses a single, flexible “object distortion control” (O-DisCo) signal to handle various editing tasks like object removal, swaps, and style transfer. This unified approach simplifies training, reduces resource needs, and achieves high-fidelity, realistic video edits while preserving unedited areas, outperforming current state-of-the-art methods.
Video editing has seen incredible advancements thanks to AI, particularly with diffusion models. However, making precise and controllable edits to videos, especially when dealing with various object properties, has remained a significant challenge. Existing methods often require different ‘control signals’ for each specific editing task, leading to complex model designs and demanding substantial training resources.
A new research paper, titled “O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing,” introduces a groundbreaking solution to these problems. Authored by Yuqing Chen, Junjie Wang, Lin Liu, Ruihang Chu, Xiaopeng Zhang, Qi Tian, and Yujiu Yang, this work presents O-DisCo-Edit, a unified framework that simplifies and enhances video editing.
The Core Innovation: Object Distortion Control (O-DisCo)
At the heart of O-DisCo-Edit is a novel concept called Object Distortion Control (O-DisCo). This signal, based on random and adaptive noise, is incredibly flexible. It can encapsulate a wide range of editing instructions within a single, unified representation. This means that instead of needing separate controls for different tasks, O-DisCo-Edit can use one signal for many types of edits, making the model design much simpler and significantly reducing the training resources required.
How O-DisCo-Edit Works
The framework operates in two main phases:
- **Training with Random Distortion:** During the training phase, the model uses Random Object Distortion Control (R-O-DisCo). This involves intentionally distorting the colors and fine details of objects in the reference video by applying random arithmetic operations and mosaic-like effects. This process teaches the model to generate video content guided by the first frame’s appearance, rather than just copying existing visual information, which builds robustness and adaptability.
- **Inference with Adaptive Control:** For actual video editing, the model employs Adaptive Object Distortion Control (A-O-DisCo), achieved by dynamically modifying the contrast of, and injecting noise into, the editable regions of the video. An ‘adaptive controller’ determines the right amount of contrast, noise intensity, and blur based on similarities between the reference image and the video frames, allowing highly precise, multi-grained control over the editing process.
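To make the two distortion signals concrete, here is a minimal sketch: a random color perturbation plus a mosaic effect inside an object mask (the R-O-DisCo flavor), and a contrast-plus-noise perturbation (the A-O-DisCo flavor). The function names, parameter ranges, and the simple `strength` knob standing in for the adaptive controller are illustrative assumptions, not the paper’s actual implementation.

```python
import numpy as np

def random_object_distortion(frame, mask, rng=None):
    """R-O-DisCo-flavored distortion (illustrative): perturb colors with a
    random arithmetic operation and mosaic the masked object region."""
    rng = rng or np.random.default_rng()
    out = frame.astype(np.float32).copy()

    # Random arithmetic color perturbation inside the mask.
    if rng.choice(["add", "mul"]) == "add":
        out[mask] += rng.uniform(-60, 60, size=3)
    else:
        out[mask] *= rng.uniform(0.5, 1.5, size=3)

    # Mosaic-like effect: nearest-neighbor down/upsampling, applied in the mask.
    block = int(rng.integers(4, 16))
    h, w = out.shape[:2]
    small = out[::block, ::block]
    mosaic = np.repeat(np.repeat(small, block, axis=0), block, axis=1)[:h, :w]
    out[mask] = mosaic[mask]
    return np.clip(out, 0, 255).astype(np.uint8)

def adaptive_object_distortion(frame, mask, strength, rng=None):
    """A-O-DisCo-flavored distortion (illustrative): modify contrast and
    inject noise in the editable region; `strength` in [0, 1] stands in
    for the adaptive controller's similarity-based setting."""
    rng = rng or np.random.default_rng()
    out = frame.astype(np.float32).copy()
    region = out[mask]
    mean = region.mean(axis=0)
    region = mean + (region - mean) * (1.0 + strength)        # contrast shift
    region += rng.normal(0.0, 40.0 * strength, region.shape)  # injected noise
    out[mask] = region
    return np.clip(out, 0, 255).astype(np.uint8)
```

Both functions leave pixels outside the mask untouched, which mirrors the framework’s principle of confining the control signal to the editable region.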
Beyond O-DisCo, the framework includes two other crucial components:
- **“Copy-Form” Preservation (CFP) Module:** This module is designed to faithfully preserve the non-edited regions of the video. It ensures that areas outside the edited object remain consistent and natural, preventing unwanted changes or artifacts.
- **Identity Preservation (IDP) Module:** To maintain the appearance of edited objects throughout the video, especially during complex movements or occlusions, the IDP module extracts position-agnostic ‘ID tokens’ from the reference image. These tokens act as a global guide, reinforcing the object’s identity and ensuring consistency.
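The CFP module’s goal can be pictured as a masked compositing step: keep generated pixels inside the editable region and copy the source video everywhere else. The sketch below shows only that intended effect, not the module’s actual architecture; the function name and mask handling are assumptions for illustration.

```python
import numpy as np

def copy_form_blend(edited_frame, source_frame, edit_mask):
    """Illustrative masked composite: generated content inside the
    editable region, untouched source pixels outside it."""
    m = edit_mask.astype(np.float32)[..., None]   # H x W -> H x W x 1
    blended = m * edited_frame + (1.0 - m) * source_frame
    return blended.astype(source_frame.dtype)
```

In practice the paper’s CFP module operates inside the generative model rather than as a post-hoc blend, but the invariant it enforces is the same: pixels outside the edit mask should match the source video exactly.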
Achieving State-of-the-Art Performance
Extensive experiments and human evaluations consistently show that O-DisCo-Edit outperforms both specialized and multi-task state-of-the-art methods across a variety of video editing tasks. These tasks include:
- Object removal
- Outpainting (extending video boundaries)
- Object internal motion transfer (e.g., transferring the motion of flowing milk)
- Lighting transfer
- Color change
- Object swap
- Object addition
- Style transfer
For instance, in object removal, O-DisCo-Edit avoids the background damage and object overlaps seen in other methods. For outpainting, it creates well-blended, continuous results, free of the grainy textures and box-like artifacts produced by baselines. Its ability to accurately capture intricate internal object motions and to transfer lighting variations is likewise highlighted as superior.
A New Perspective on Video Editing
The O-DisCo-Edit framework offers a fresh perspective on video editing research. It demonstrates that a single, unified control signal can be both versatile and precise without sacrificing efficiency. This approach dramatically simplifies the training process and reduces resource demands, paving the way for more accessible and powerful video editing tools in the future.
For more details, you can read the full research paper here.


