TLDR: Researchers have developed a new dataset called Transition-Aware Video (TAV) to improve AI models’ ability to generate videos with multiple coherent scene transitions. By post-training models like OpenSora-Plan on this dataset, which contains video clips explicitly labeled with scene changes, the models become significantly better at understanding and creating multi-scene videos from text prompts, without compromising visual quality.
Recent AI models have made rapid progress in generating video content from simple text descriptions. These models excel at creating short clips depicting a single scene, producing high-quality visuals that can be hard to distinguish from real footage. However, a significant challenge remains: generating longer videos that feature coherent and natural scene transitions. Current models frequently struggle to understand when a prompt calls for a scene change, largely because they are trained primarily on datasets composed of single-scene video clips.
This limitation means that when a user asks for a video with multiple distinct scenes, existing open-source models often fail to deliver the correct number of transitions or maintain overall coherence. For instance, if prompted to create a video showing “Superman flying across the city, then seeing Batman fighting the Joker on a rooftop,” a typical model might only generate a single, continuous scene, or produce a jarring, incoherent shift.
Introducing the Transition-Aware Video (TAV) Dataset
To address this critical gap, researchers have proposed a novel solution: the Transition-Aware Video (TAV) dataset. This dataset is specifically designed to teach video generation models how to recognize and implement scene transitions effectively. The TAV dataset is built from preprocessed video clips that explicitly contain multiple scene transitions.
The creation of the TAV dataset involved a meticulous process. First, 10-second video clips were extracted from the Panda-70M dataset, with each clip centered around a detected scene cut. This ensures that every clip in the TAV dataset clearly showcases a transition point. To further enhance the learning process, a large language model (LLM) was employed to generate separate, detailed descriptions for each individual scene within these clips. These scene-wise descriptions were then combined into a single, explicit prompt format, such as “Previous scene: [description of scene 1]; Next scene: [description of scene 2]”. This structured prompting helps the AI model understand the explicit requirement for a scene change.
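The two preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the names `clip_window` and `transition_prompt` are hypothetical, and shifting the window when a cut lies near the start or end of a video is an assumption, since the article does not say how such cases were handled.

```python
from dataclasses import dataclass

CLIP_SECONDS = 10.0  # clip length stated in the article


@dataclass
class Clip:
    start: float  # seconds from the start of the source video
    end: float


def clip_window(cut_time: float, video_duration: float,
                clip_seconds: float = CLIP_SECONDS) -> Clip:
    """Return a clip of `clip_seconds` centered on a detected scene cut.

    Assumption: if the centered window would run past either end of the
    video, it is shifted inward, so the cut is then no longer exactly
    in the middle of the clip.
    """
    if video_duration < clip_seconds:
        raise ValueError("video is shorter than the requested clip length")
    start = cut_time - clip_seconds / 2
    # Clamp the start so the whole window stays inside the video.
    start = max(0.0, min(start, video_duration - clip_seconds))
    return Clip(start, start + clip_seconds)


def transition_prompt(previous_scene: str, next_scene: str) -> str:
    """Join two scene-wise descriptions into the explicit prompt format."""
    return (f"Previous scene: {previous_scene.strip()}; "
            f"Next scene: {next_scene.strip()}")
```

For example, `clip_window(30.0, 60.0)` yields a window from 25 to 35 seconds, and `transition_prompt` would receive the two LLM-written scene descriptions for that clip.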
Experimenting with Post-Training
To validate the effectiveness of the TAV dataset, an experiment was conducted using the OpenSora-Plan v1.3.1 model. This state-of-the-art text-to-video model was subjected to a “post-training” phase using the newly created TAV dataset. The researchers evaluated the model’s performance across three distinct groups of prompts:
- Group A: Prompts describing a single scene without any indication of transition (e.g., “Superman flying across the building”). This group tested the model’s ability to maintain its performance on simpler tasks.
- Group B: Prompts implying a scene transition through two sentences, but without explicit transition keywords (e.g., “Superman is flying across the building, and then sees Batman fighting the Joker on a rooftop”).
- Group C: Prompts explicitly instructing a scene transition using the “Previous scene: …; Next scene: …” format.
The key metrics observed included the average number of generated scenes (segments), aesthetic quality, overall consistency, dynamic degrees, and imaging quality.
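To make the "average number of segments" metric concrete, here is a rough sketch of how it could be computed. The article does not say which scene detector the researchers used, so the frame-difference rule, the `threshold` value, and the function names below are all assumptions for illustration. Each frame is represented as a flat list of pixel intensities in [0, 1].

```python
from statistics import mean


def count_segments(frames, threshold=0.3):
    """Count scenes in one video as (number of detected cuts) + 1.

    Assumption: a cut is declared wherever the mean absolute pixel
    difference between consecutive frames exceeds `threshold`. Real
    scene detectors are more robust than this simple rule.
    """
    if not frames:
        return 0
    cuts = 0
    for prev, curr in zip(frames, frames[1:]):
        diff = mean(abs(a - b) for a, b in zip(prev, curr))
        if diff > threshold:
            cuts += 1
    return cuts + 1


def average_segments(videos, threshold=0.3):
    """Average segment count over all videos generated for a prompt group."""
    return mean(count_segments(v, threshold) for v in videos)
```

Under this rule, a video that holds one image throughout counts as one segment, while a video that switches abruptly from a dark frame to a bright frame counts as two.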
Significant Improvements in Multi-Scene Generation
The results of the experiment were encouraging. The model post-trained on the TAV dataset showed a marked increase in its ability to generate multiple scenes. While the baseline model struggled to produce more than one scene, even with prompts explicitly requiring two, the post-trained model consistently generated an average of two or more segments for prompts in Groups B and C. This demonstrates a clear improvement in the model’s understanding of scene transition requirements.
Crucially, this enhancement in multi-scene generation did not come at the cost of visual quality. The post-trained model maintained, and in some cases even improved, dynamic consistency and temporal smoothness, leading to more coherent motion and fluid scene transitions. Aesthetic and imaging quality metrics also gradually improved during training, eventually matching or even exceeding those of the baseline model.
Furthermore, the study found that the post-trained model remained proficient at generating single-scene videos (Group A prompts), showcasing its versatility. It also demonstrated improved understanding and response to prompts that only implicitly suggested a scene change (Group B), highlighting the broader impact of the TAV dataset.
Looking Ahead
This research marks a significant step towards creating more sophisticated and versatile video generation models capable of producing longer, story-driven content with seamless scene transitions. By explicitly teaching models to recognize and handle these transitions, the TAV dataset offers a promising path to overcoming a major hurdle in AI-generated video. For more in-depth information, see the full research paper.


