TLDR: Geometry Forcing (GF) is a novel method that enhances video diffusion models by aligning their internal representations with features from a pre-trained 3D geometric foundation model (VGGT). This approach, using Angular and Scale Alignment objectives, enables video models to internalize 3D awareness, leading to significantly improved visual quality, 3D consistency, and reduced long-term drift in generated videos. Experiments show GF outperforms baselines on various video generation tasks, making AI-generated videos more realistic and coherent.
Videos are everywhere, from social media to advanced simulations, but creating truly realistic and consistent video content, especially when it involves complex movements or changes in perspective, remains a significant challenge for AI. Current video generation models, while impressive, often struggle with a fundamental aspect: understanding the underlying 3D world that videos represent. They tend to focus on generating pixels, which can lead to visual inconsistencies and a lack of geometric coherence over time.
A new research paper titled Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling introduces an innovative approach called ‘Geometry Forcing’ (GF) that aims to bridge this gap. Developed by researchers Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian from Microsoft Research and Tsinghua University, this method encourages video diffusion models to internalize a deeper understanding of 3D space.
The Core Problem: Missing 3D Awareness
Imagine a video of a camera panning around a room. A typical video generation model might create a sequence of frames that look realistic individually, but as the camera moves, objects might subtly change shape, or the scene might not connect seamlessly when the camera returns to its starting point. This happens because these models often treat videos as a series of 2D images, without truly grasping that they are projections of a dynamic 3D environment.
The researchers observed that even advanced video diffusion models, when trained solely on raw video data, fail to encode meaningful geometric information. When they tried to reconstruct 3D depth maps from the internal features of these models, the results were often nonsensical, highlighting a critical missing piece in their understanding of the visual world.
Geometry Forcing: A 3D Compass for Video Models
To address this, Geometry Forcing introduces a clever mechanism to guide video diffusion models. The core idea is to align the intermediate representations (the ‘thoughts’ or ‘understandings’ of the video model as it processes information) with features from a pre-trained ‘geometric foundation model’ called VGGT (Visual Geometry Grounded Transformer). VGGT is specifically designed to understand 3D properties like camera poses, depth maps, and 3D point tracks from images.
Geometry Forcing uses two complementary alignment objectives:
- Angular Alignment: This ensures that the ‘direction’ or orientation of the video model’s internal features matches that of the geometric features from VGGT. It’s like teaching the model to understand the spatial relationships between objects.
- Scale Alignment: While angular alignment handles direction, scale alignment focuses on preserving the ‘size’ or magnitude of geometric information. This helps the model understand how large objects are and their distances, preventing distortions.
By combining these two objectives, Geometry Forcing provides a stable and effective way to inject 3D awareness directly into the video generation process, without needing extensive new 3D annotations for every video.
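To make the two objectives concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's exact formulation: it assumes the video model's features have already been projected to the same dimension as the VGGT features, it treats angular alignment as a cosine-similarity loss, and it treats scale alignment as a simple match of feature norms. The function names and the loss weights are hypothetical.

```python
import numpy as np

def angular_alignment_loss(video_feats, geo_feats, eps=1e-8):
    """One minus the mean cosine similarity between matching feature vectors.

    Both inputs have shape (num_tokens, dim). Only direction matters here,
    so rescaling either input leaves the loss unchanged.
    """
    v = video_feats / (np.linalg.norm(video_feats, axis=-1, keepdims=True) + eps)
    g = geo_feats / (np.linalg.norm(geo_feats, axis=-1, keepdims=True) + eps)
    return float(1.0 - np.mean(np.sum(v * g, axis=-1)))

def scale_alignment_loss(video_feats, geo_feats):
    """Mean squared difference between per-token feature norms.

    Assumption: magnitude is compared directly; the paper may define
    scale alignment differently.
    """
    v_norm = np.linalg.norm(video_feats, axis=-1)
    g_norm = np.linalg.norm(geo_feats, axis=-1)
    return float(np.mean((v_norm - g_norm) ** 2))

def combined_alignment_loss(video_feats, geo_feats, w_angular=1.0, w_scale=1.0):
    """Weighted sum of the two terms; the weights are illustrative."""
    return (w_angular * angular_alignment_loss(video_feats, geo_feats)
            + w_scale * scale_alignment_loss(video_feats, geo_feats))
```

In this sketch the angular term ignores magnitude entirely, which is why a separate scale term is needed: two feature sets that point the same way but differ in length score zero on the first loss and non-zero on the second.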
Impressive Results and Real-World Impact
The effectiveness of Geometry Forcing was rigorously tested on various video generation tasks, including camera view-conditioned and action-conditioned scenarios. The results were compelling:
- Improved Visual Quality: GF significantly reduced the Fréchet Video Distance (FVD), a key metric for video quality, from 364 to 243 on the RealEstate10K dataset for long-term video generation. This indicates more realistic and coherent videos.
- Enhanced 3D Consistency: Metrics like Reprojection Error (RPE) and Revisit Error (RVE) showed substantial improvements, confirming that the generated videos maintained better geometric accuracy and temporal stability.
- Qualitative Superiority: In visual comparisons, videos generated with GF consistently maintained scene coherence and object shapes, even during complex camera movements like a full 360-degree rotation. Unlike baseline models that often showed drift or implausible changes, GF could accurately ‘revisit’ the starting viewpoint.
- Generalizability: The method also showed strong performance when applied to out-of-domain data, such as generating videos in a Minecraft environment, demonstrating its robustness.
- Mitigating Exposure Bias: A common problem in autoregressive video generation is ‘exposure bias,’ where small errors accumulate over time. GF helps mitigate this by providing consistent 3D guidance, leading to more stable long-term video synthesis.
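For readers unfamiliar with the FVD figures quoted above: FVD is a Fréchet distance between two Gaussian distributions, one fitted to features of real videos and one to features of generated videos, where the features come from a pretrained video network. Lower values mean the two distributions are closer. The sketch below computes only the distance itself for two feature arrays; the feature extractor is omitted, and the function name is illustrative rather than taken from any particular library.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, dim) with dim >= 2.
    d^2 = ||mu_r - mu_g||^2 + Tr(C_r) + Tr(C_g) - 2 Tr((C_r C_g)^(1/2))
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((C_r C_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_r @ C_g, which avoids an explicit matrix square root.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of roughly zero, and shifting one set by a constant vector raises the distance by the squared length of that shift.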
A user study further validated these findings, with participants rating GF-generated videos higher across aspects like Camera Following, Object Consistency, and Scene Continuity.
Looking Ahead
While Geometry Forcing marks a significant step forward, the researchers acknowledge that its full potential on even larger models and more extensive datasets is yet to be explored. Future work includes scaling GF to build more robust 3D-consistent world simulators and leveraging 3D representations as a form of ‘persistent memory’ for generating ultra-long videos. This research paves the way for more immersive and geometrically accurate AI-generated visual content, bringing us closer to truly intelligent systems that understand and simulate the physical world.


