
Predicting the Future: How Frozen Video Models Learn to Forecast

TLDR: A new framework uses latent diffusion models to enable generalist forecasting with pre-trained, “frozen” video models. It finds a strong correlation between a model’s perception ability and its forecasting performance, especially for models trained with temporal video data. The method predicts future features in a latent space, then decodes them for tasks like pixel, depth, point, and object motion forecasting, highlighting the importance of temporal supervision.

Anticipating what happens next is a crucial capability for any intelligent system that needs to plan or act in the real world. This paper introduces a new approach to video forecasting, demonstrating a strong link between a vision model’s ability to understand what it sees (perception) and its skill in predicting future events over short periods.

The research, titled “Generalist Forecasting with Frozen Video Models via Latent Diffusion”, explores how existing, pre-trained video models can be repurposed for forecasting without needing to be retrained from scratch. The authors, Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, and Shiry Ginosar, developed a novel framework that uses latent diffusion models. These models learn to predict future features within the frozen representation space of a pre-trained vision model. Once these future features are predicted, lightweight, task-specific “readout heads” decode them into understandable outputs like future pixels, depth maps, or object movements.
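To make that pipeline concrete, here is a minimal sketch of the inference-time data flow in PyTorch. All module names and tensor shapes are illustrative stand-ins rather than the paper's actual code; in particular, a single recurrent layer stands in for the latent diffusion forecaster so the example stays short and runnable.

```python
import torch
import torch.nn as nn

DIM = 256                                  # frozen-feature dimension (illustrative)

# Stand-in for a pre-trained video backbone (e.g. VideoMAE), kept frozen.
encoder = nn.Linear(3 * 64 * 64, DIM)
for p in encoder.parameters():
    p.requires_grad_(False)

# Stand-in for the latent diffusion forecaster, which predicts future
# features in the frozen representation space. A real version would run
# iterative denoising; a GRU keeps this sketch compact.
forecaster = nn.GRU(DIM, DIM, batch_first=True)

# Lightweight task-specific readout head, e.g. per-frame depth decoding.
depth_head = nn.Linear(DIM, 64 * 64)

frames = torch.randn(2, 8, 3, 64, 64)      # (batch, time, C, H, W) observed clip
with torch.no_grad():
    z_obs = encoder(frames.flatten(2))     # (B, 8, DIM) frozen video features
_, h = forecaster(z_obs)                   # summarize the observed trajectory
z_future = h.transpose(0, 1).repeat(1, 4, 1)   # toy 4-step future latents
depth_pred = depth_head(z_future)          # decode latents into future depth maps
print(depth_pred.shape)                    # torch.Size([2, 4, 4096])
```

The key design choice is that only the forecaster and the readout heads are ever trained; the backbone's weights, and therefore its representation space, stay fixed throughout.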

A key insight from this study is that a model’s forecasting ability is highly correlated with its perception performance, especially for models that were originally trained with temporal video supervision. This means that models good at understanding current video content are also generally good at predicting future content. Interestingly, video synthesis models, which are designed to generate video, often match or even surpass the forecasting performance of models trained with mask-based objectives.

The paper highlights three main challenges in video forecasting: the inherent uncertainty of the future (multiple possibilities can unfold), the need to model continuous trajectories over time, and the requirement to predict at various semantic levels (from low-level pixels to high-level object abstractions). Their diffusion-based method addresses these by generating diverse samples to capture uncertainty, modeling full temporal trajectories, and enabling predictions across different targets like pixels, depth, point tracks, and bounding boxes.
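Handling the first of those challenges, uncertainty, amounts to drawing several samples from a stochastic forecaster and scoring them jointly. The snippet below sketches a best-of-N evaluation under that idea; `toy_sampler` is a hypothetical stand-in for the diffusion model's sampling loop, and the tensors are random placeholders rather than real data.

```python
import torch

def sample_futures(sampler, z_obs, num_samples=8):
    # Each call to the stochastic sampler yields a different plausible future,
    # so N calls give N candidate latent trajectories for the same clip.
    return torch.stack([sampler(z_obs) for _ in range(num_samples)])

# Hypothetical stand-in for the diffusion sampling loop: repeat the last
# observed latent over 4 future steps and perturb it with noise.
def toy_sampler(z):
    B, _, D = z.shape
    return z[:, -1:].repeat(1, 4, 1) + 0.1 * torch.randn(B, 4, D)

z_obs = torch.randn(2, 8, 256)                 # observed latents (B, T, D)
futures = sample_futures(toy_sampler, z_obs)   # (N, B, 4, D) candidate futures

# Best-of-N scoring: keep the sample closest to the ground-truth future, so a
# multi-modal forecaster is not penalized for covering several outcomes.
z_true = torch.randn(2, 4, 256)
errors = ((futures - z_true) ** 2).mean(dim=(2, 3))   # (N, B) per-sample error
print(errors.min(dim=0).values)                # best error per clip
```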

The framework operates in two stages. First, a lightweight readout head is trained for each specific task (e.g., predicting point tracks) to map the frozen video representations to the desired output. Second, a diffusion model is trained to forecast future latent trajectories directly in the space of these frozen video representations. During evaluation, these forecasted representations are then passed through the trained readout heads to assess their quality in the context of the downstream task.
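Below is a compressed sketch of those two stages, again with hypothetical shapes and a deliberately simplified noise schedule. Note that the real denoiser would also be conditioned on the observed frames' latents; that conditioning is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256

# ---- Stage 1: train a lightweight readout head on frozen features. ----
head = nn.Linear(DIM, 2)                       # e.g. an (x, y) point-track readout
opt_head = torch.optim.Adam(head.parameters(), lr=1e-3)
z = torch.randn(32, DIM)                       # precomputed frozen features
target = torch.randn(32, 2)                    # ground-truth point locations
opt_head.zero_grad()
F.mse_loss(head(z), target).backward()         # task loss; backbone untouched
opt_head.step()

# ---- Stage 2: train a denoiser over latent trajectories (DDPM-style). ----
denoiser = nn.Sequential(nn.Linear(DIM + 1, DIM), nn.GELU(), nn.Linear(DIM, DIM))
opt_diff = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

z_future = torch.randn(32, DIM)                # clean future latents
t = torch.rand(32, 1)                          # diffusion time in [0, 1]
noise = torch.randn_like(z_future)
z_noisy = (1 - t).sqrt() * z_future + t.sqrt() * noise   # toy noise schedule
eps_hat = denoiser(torch.cat([z_noisy, t], dim=1))       # timestep-conditioned
opt_diff.zero_grad()
F.mse_loss(eps_hat, noise).backward()          # standard noise-prediction loss
opt_diff.step()
```

Because the readout heads are trained once on real (not forecasted) features, they double as a fixed yardstick: any degradation at evaluation time can be attributed to the quality of the forecasted latents.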

The researchers evaluated their framework across four diverse tasks: forecasting future RGB pixels on ScanNet, predicting future depth maps on ScanNet, tracking dense visual features (point tracks) in the Perception Test dataset, and forecasting object bounding-box locations in the Waymo Open dataset. They benchmarked a wide array of image and video models, including DINOv2, SigLIP, VideoMAE, and W.A.L.T.

The results showed that models trained exclusively on static images, like DINOv2 and SigLIP, generally performed poorly in forecasting, emphasizing the critical role of temporal context in learning generalizable video representations. Language supervision alone also did not appear to significantly improve forecasting. W.A.L.T., a video synthesis model, excelled at pixel and depth forecasting, tasks closely aligned with its generative training. However, its performance on more structured tasks like point and bounding box tracking was mixed compared to masked modeling approaches of similar size.

This research provides a unified framework for evaluating forecasting capabilities in frozen video models, bridging the gap between representation learning and generative modeling. It suggests that understanding the present is a strong foundation for predicting the future in video. You can find more details in the full research paper: Generalist Forecasting with Frozen Video Models via Latent Diffusion.

