TLDR: ROSE is a novel framework for video object removal that effectively eliminates objects along with their environmental side effects like shadows, reflections, and lighting changes. It addresses data scarcity by generating a large-scale synthetic dataset using a 3D rendering engine. The framework employs a diffusion transformer model with reference-based erasing, mask augmentation, and an explicit difference mask predictor to localize and remove object-correlated areas. ROSE outperforms existing methods and introduces a new benchmark, ROSE-Bench, for comprehensive evaluation of side effect removal.
Video editing has seen remarkable advancements, especially with the rise of generative AI models. However, a persistent challenge in video object removal has been the accurate elimination of an object’s environmental effects, such as its shadows, reflections, and changes in lighting. Often, existing tools struggle to remove these subtle yet crucial details, leading to unnatural or incomplete results.
A new research paper introduces a framework called ROSE, which stands for “Remove Objects with Side Effects in Videos.” This innovative system systematically addresses how objects influence their surroundings, categorizing these interactions into five common cases: shadows, reflections, light, translucency, and mirror effects.
Overcoming Data Scarcity with Synthetic Worlds
One of the biggest hurdles in developing models that can handle these side effects is the lack of paired video data—videos of a scene both with and without a specific object and its corresponding environmental impact. To tackle this, the ROSE team leveraged a 3D rendering engine (such as Unreal Engine) to generate synthetic data. They developed a fully automatic pipeline to create a vast, paired dataset. This dataset features diverse scenes, objects, camera angles, and trajectories, ensuring that the model learns from a wide range of realistic scenarios.
The data preparation pipeline involves collecting virtual environments, splitting them into scenes with candidate objects, and then automatically generating multiple camera views. A key advantage of using a 3D engine is the ability to create perfectly accurate object masks. The system then renders two versions of each video: one with the object present and one with the object removed, ensuring perfect spatial and temporal alignment. This meticulous process allows for pixel-wise supervised learning, which is critical for understanding and removing subtle side effects.
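The core of this pipeline—rendering the same scene twice along an identical camera trajectory—can be sketched as below. The `render` function here is a stand-in stub (a real pipeline would call the engine's own API); only the pairing logic is meant to be illustrative.

```python
import numpy as np

def render(scene, camera, *, hide_object):
    """Stand-in for a real engine render call; it synthesizes deterministic
    frames per (scene, camera) so the pairing logic below is runnable."""
    T, H, W = 8, 64, 64
    rng = np.random.default_rng(abs(hash((scene, camera))) % 2**32)
    frames = rng.random((T, H, W, 3)).astype(np.float32)
    if not hide_object:
        # Fake the object and its shading influence in a fixed region.
        frames[:, 20:40, 20:40] *= 0.5
    return frames

def render_pair(scene, camera):
    """Render the scene with and without the object using the SAME camera
    trajectory, so the two videos are aligned pixel-for-pixel and
    frame-for-frame -- the property that enables pixel-wise supervision."""
    with_obj = render(scene, camera, hide_object=False)
    without_obj = render(scene, camera, hide_object=True)
    assert with_obj.shape == without_obj.shape
    return with_obj, without_obj
```

Because both passes share every scene and camera parameter, the two videos differ only where the object (and its shading influence) was present.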
How ROSE Works
ROSE is implemented as a video inpainting model built upon a diffusion transformer architecture. Unlike previous methods that might only feed the non-object area into the model, ROSE takes the entire video as input. This “reference-based erasing” approach allows the model to use the complete video as guidance, helping it to better localize and understand the object-correlated areas and their side effects.
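One plausible way to realize this conditioning—not necessarily the paper's exact layout—is to concatenate the masked video, the untouched reference video, and the mask along the channel axis before feeding the diffusion transformer:

```python
import numpy as np

def build_condition(video, mask):
    """video: (T, C, H, W) floats; mask: (T, 1, H, W), 1 inside the user mask.
    A plain inpainting model would see only `masked`; reference-based
    erasing also passes the full video as guidance. Channel-wise
    concatenation is an illustrative choice, not the confirmed layout."""
    masked = video * (1.0 - mask)                        # hide the object region
    return np.concatenate([masked, video, mask], axis=1)  # (T, 2C + 1, H, W)
```

The key point is that the model can still "see" the object in the reference channels, which helps it localize the shadows and reflections that the object casts outside the mask.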
To make the model robust to real-world variations in user-provided masks, ROSE incorporates a mask augmentation strategy during training. This includes using original precise masks, sparse point-wise masks, bounding box masks, and both dilated and eroded masks. This exposure to diverse mask types improves the model’s ability to generalize to imperfect inputs.
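The five mask styles above can be sketched as a single augmentation function; the kernel iterations and point counts here are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from scipy import ndimage

def augment_mask(mask, style, rng):
    """mask: (H, W) bool object mask. `style` is one of the five mask
    types used during training; sizes below are illustrative."""
    ys, xs = np.nonzero(mask)
    if style == "precise" or len(ys) == 0:
        return mask.copy()
    if style == "points":                       # sparse point-wise mask
        out = np.zeros_like(mask)
        idx = rng.choice(len(ys), size=min(10, len(ys)), replace=False)
        out[ys[idx], xs[idx]] = True
        return out
    if style == "bbox":                         # tight bounding box
        out = np.zeros_like(mask)
        out[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
        return out
    if style == "dilate":                       # over-segmented mask
        return ndimage.binary_dilation(mask, iterations=int(rng.integers(1, 6)))
    if style == "erode":                        # under-segmented mask
        return ndimage.binary_erosion(mask, iterations=int(rng.integers(1, 4)))
    raise ValueError(f"unknown style: {style}")
```

At training time, a style would be sampled at random per clip so the model never overfits to pixel-perfect masks.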
Furthermore, ROSE introduces an explicit supervision mechanism through a “difference mask predictor.” This predictor is trained to identify all areas in the video that are affected by the object’s removal, beyond just the object itself. By comparing the original and edited videos, a ground-truth difference mask is computed, highlighting areas like shadows or reflections. This additional supervision helps the model to be highly sensitive to these subtle visual effects.
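Given the pixel-aligned paired renders, computing such a ground-truth difference mask is a simple thresholded comparison; the threshold value here is an illustrative choice:

```python
import numpy as np

def difference_mask(video_with, video_without, thresh=0.05):
    """Paired renders of shape (T, H, W, 3) in [0, 1]. Any pixel whose
    value changes when the object is removed -- the object itself plus
    its shadows, reflections, and lighting changes -- lands in the mask."""
    diff = np.abs(video_with.astype(np.float32) - video_without.astype(np.float32))
    return diff.max(axis=-1) > thresh   # (T, H, W) boolean mask
```

This mask is typically larger than the object mask, and that gap—shadow and reflection pixels outside the object—is exactly what the difference mask predictor is trained to find.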
Benchmarking Performance
To thoroughly evaluate the model’s performance across various side effect removal challenges, the researchers also developed a new benchmark called ROSE-Bench. This benchmark includes both synthetic and realistic video data, covering common scenarios and the five specific side effect categories. Experimental results demonstrate that ROSE significantly outperforms existing video object erasing models and shows strong generalization capabilities to real-world video scenarios.
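Because the benchmark provides paired ground truth, evaluation can compare the edited output directly against the object-free video. The paper's exact metric suite isn't reproduced here; PSNR below is simply a common choice for paired video inpainting evaluation:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between an edited video and its
    object-free ground truth; higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```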
For more technical details, you can read the full research paper here.
Looking Ahead
While ROSE marks a significant step forward in video object removal, the researchers acknowledge areas for future improvement. These include optimizing for real-time performance and exploring an even broader range of environmental effects to further bridge the gap between synthetic and real-world applications. Despite some limitations, such as potential flickering artifacts under large motion and increased inference time for long videos, ROSE sets a new standard for handling complex visual artifacts in video editing.