TLDR: Align4Gen is a new framework that improves video diffusion model training by aligning the video generator’s internal features with those from pre-trained self-supervised vision encoders. It introduces a metric (IICR) to select optimal encoders and fuses complementary features from image-trained models like DINOv2 and SAM2.1 Hiera. This approach significantly enhances video generation quality, accelerates training, and reduces computational costs for both unconditional and class-conditional video tasks.
Video generation models have made incredible strides in recent years, allowing us to create high-resolution, photorealistic videos that can last for several minutes. These advancements are often attributed to breakthroughs in model architectures, like the shift from U-Nets to Diffusion Transformers (DiT), and innovative training methods such as flow matching. However, a recent research paper highlights a less explored area: improving the underlying feature representation power of these video diffusion models.
The paper, titled “Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders,” by Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, and Jong Chul Ye, introduces a novel framework called Align4Gen. This framework proposes that video diffusion models can significantly benefit from aligning their internal, intermediate features with the rich feature representations learned by pre-trained self-supervised vision encoders.
The Challenge of Feature Representation
While generative models are powerful, their learned features are often not as discriminative as those from self-supervised vision models like DINOv2. Interestingly, these two types of features tend to complement each other. Building on this observation, the researchers hypothesized that using pre-trained vision models as a guide during video diffusion model training could lead the model to learn more discriminative internal features and, in turn, generate higher-quality videos.
Introducing a New Metric: IICR
To identify the most suitable vision encoders for this guidance, the team developed a new metric called the Intra-Inter Consistency Ratio (IICR). This metric quantifies two crucial properties: a vision encoder’s discriminative power (how well it distinguishes between different objects) and its temporal consistency (how stable its features are for the same object across different frames in a video). A higher IICR indicates features that are both highly discriminative and temporally stable.
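The paper gives the exact formulation of IICR; the sketch below is only a rough, hypothetical illustration of the underlying idea. It assumes you already have one pooled feature vector per object per frame (for at least two objects and two frames), and computes an intra-object similarity (same object across frames) divided by an inter-object similarity (different objects). The function name and inputs are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def iicr_sketch(obj_feats):
    """Hypothetical Intra-Inter Consistency Ratio sketch.

    obj_feats: dict mapping object_id -> tensor of shape (T, D), one pooled
    encoder feature per frame for that object (T frames, D channels).
    Assumes at least two objects and two frames.
    NOTE: illustrative only; the paper's exact formulation may differ.
    """
    # L2-normalize so dot products are cosine similarities.
    feats = {k: F.normalize(v, dim=-1) for k, v in obj_feats.items()}

    # Intra: average similarity of the same object across different frames
    # (temporal consistency).
    intra = []
    for v in feats.values():
        sim = v @ v.T                                  # (T, T) cosine similarities
        off_diag = ~torch.eye(len(v), dtype=torch.bool)
        intra.append(sim[off_diag].mean())
    intra = torch.stack(intra).mean()

    # Inter: average similarity between different objects (discriminability
    # means this should be low).
    ids = list(feats.keys())
    inter = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            inter.append((feats[ids[i]] @ feats[ids[j]].T).mean())
    inter = torch.stack(inter).mean()

    return intra / inter   # higher => features are both stable and discriminative
```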
Surprisingly, their analysis revealed that image-trained encoders, such as DINOv2 and SAM2.1 Hiera, often exhibit superior discriminability and temporal consistency compared to dedicated video or 3D encoders like VideoMAE and DUSt3R. This finding was a key insight, guiding their choice of encoders for Align4Gen.
Multi-Feature Fusion for Comprehensive Understanding
Further analysis showed that different pre-trained image features capture different aspects of an image. For instance, DINOv2 primarily focuses on low-frequency, semantic structures, while SAM2.1 Hiera is more sensitive to high-frequency details. Recognizing this complementary nature, Align4Gen incorporates a multi-feature fusion strategy. It combines the strengths of multiple image encoders by concatenating their normalized feature representations, creating a richer, multi-frequency supervisory signal. This ensures that the video diffusion model learns both coarse semantic information and fine-grained details.
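As a rough illustration of what such a fusion could look like (the exact resizing and normalization used in the paper may differ), one could resize each encoder's patch feature map to a common spatial grid, normalize the channel vector at each location, and concatenate along the channel dimension. The function name and shapes below are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def fuse_features(feature_maps, target_hw):
    """Hypothetical multi-feature fusion sketch.

    feature_maps: list of tensors shaped (B, C_i, H_i, W_i), e.g. patch
    features from DINOv2 and SAM2.1 Hiera.
    NOTE: illustrative only; the paper's exact fusion may differ.
    """
    fused = []
    for fm in feature_maps:
        # Bring every encoder's features onto the same spatial grid.
        fm = F.interpolate(fm, size=target_hw, mode="bilinear",
                           align_corners=False)
        # Unit-normalize the channel vector at each spatial location so no
        # single encoder dominates the concatenated signal.
        fm = F.normalize(fm, dim=1)
        fused.append(fm)
    return torch.cat(fused, dim=1)   # (B, sum(C_i), H, W)
```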
How Align4Gen Works
At its core, Align4Gen works by enforcing consistency between the patch-level tokens of a video diffusion transformer (V-DiT) and the fused representations from the pre-trained image encoders. A lightweight neural network (a multi-layer perceptron) maps the V-DiT’s tokens into the feature space of the external encoders, and an alignment loss function minimizes the distance between them. This alignment term is added to the standard video denoising loss, balancing the generation task with the feature learning guidance.
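Conceptually, this is a representation-alignment objective added on top of the denoising objective. The following is a minimal sketch under that reading: a small MLP projects intermediate V-DiT tokens into the fused encoder feature space, and a cosine-distance term is added to the denoising loss with a weighting factor. The class/function names, hidden size, and lambda_align value are placeholders, not the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Maps V-DiT patch tokens into the fused encoder feature space."""
    def __init__(self, dit_dim, enc_dim, hidden=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dit_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, enc_dim),
        )

    def forward(self, tokens):          # (B, N, dit_dim)
        return self.mlp(tokens)         # (B, N, enc_dim)

def training_loss(denoise_loss, dit_tokens, fused_feats, head, lambda_align=0.5):
    """Total loss sketch: denoising term + feature-alignment term.

    dit_tokens:  intermediate V-DiT tokens, (B, N, dit_dim)
    fused_feats: frozen fused features from the pre-trained encoders, (B, N, enc_dim)
    NOTE: the loss form and weighting here are assumptions for illustration.
    """
    proj = head(dit_tokens)
    # Negative cosine similarity as the alignment distance.
    align = 1 - F.cosine_similarity(proj, fused_feats, dim=-1).mean()
    return denoise_loss + lambda_align * align
```

During training, only the denoising task and this lightweight head add cost; the pre-trained encoders stay frozen and are discarded at inference time.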
Impressive Results and Faster Training
The researchers rigorously evaluated Align4Gen on both unconditional and class-conditional video generation tasks. The results were compelling: Align4Gen consistently led to significant improvements in video generation quality, as measured by metrics like Fréchet Video Distance (FVD) and Fréchet Inception Distance (FID). For example, on the UCF-101 dataset, the fusion variant of Align4Gen achieved better performance at 400,000 training steps than the baseline model at 1 million steps, indicating a more than 2.5 times faster convergence rate and substantial computational savings.
Qualitative results also showed that Align4Gen produces sharper, more coherent videos with smoother motion transitions. The benefits were particularly pronounced for subject-centric datasets like FaceForensics, where the discriminability of the vision encoder plays a more critical role. The paper also includes detailed ablation studies, confirming the robustness of their design choices regarding denoising objectives, alignment layer depth, and the effectiveness of their multi-feature fusion over alternative integration methods.
Conclusion
Align4Gen represents a significant step forward in making video diffusion model training more efficient and effective. By strategically leveraging the power of pre-trained self-supervised vision encoders and a novel multi-feature fusion approach, it not only enhances video generation quality but also dramatically accelerates the training process. This work paves the way for more cost-efficient and high-performing video generation models in the future. You can read the full research paper here: Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders.