
Predicting the Future: How Frozen Video Models Learn to Forecast

TLDR: A new framework uses latent diffusion models to enable generalist forecasting with pre-trained, “frozen” video models. It finds a strong correlation between a model’s perception ability and its forecasting performance, especially for models trained with temporal video data. The method predicts future features in a latent space, then decodes them for tasks like pixel, depth, point, and object motion forecasting, highlighting the importance of temporal supervision.

Anticipating what happens next is a crucial capability for any intelligent system that needs to plan or act in the real world. This paper introduces a new approach to video forecasting, demonstrating a strong link between a vision model’s ability to understand what it sees (perception) and its skill in predicting future events over short periods.

The research, titled “Generalist Forecasting with Frozen Video Models via Latent Diffusion”, explores how existing, pre-trained video models can be repurposed for forecasting without needing to be retrained from scratch. The authors, Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, and Shiry Ginosar, developed a novel framework that uses latent diffusion models. These models learn to predict future features within the frozen representation space of a pre-trained vision model. Once these future features are predicted, lightweight, task-specific “readout heads” decode them into understandable outputs like future pixels, depth maps, or object movements.
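To make that pipeline concrete, here is a minimal sketch of the inference-time data flow in PyTorch. All module names and tensor shapes are illustrative stand-ins rather than the paper's actual code; in particular, a single recurrent layer stands in for the latent diffusion forecaster so the example stays short and runnable.

```python
import torch
import torch.nn as nn

DIM = 256                                  # frozen-feature dimension (illustrative)

# Stand-in for a pre-trained video backbone (e.g. VideoMAE), kept frozen.
encoder = nn.Linear(3 * 64 * 64, DIM)
for p in encoder.parameters():
    p.requires_grad_(False)

# Stand-in for the latent diffusion forecaster, which predicts future
# features in the frozen representation space. A real version would run
# iterative denoising; a GRU keeps this sketch compact.
forecaster = nn.GRU(DIM, DIM, batch_first=True)

# Lightweight task-specific readout head, e.g. per-frame depth decoding.
depth_head = nn.Linear(DIM, 64 * 64)

frames = torch.randn(2, 8, 3, 64, 64)      # (batch, time, C, H, W) observed clip
with torch.no_grad():
    z_obs = encoder(frames.flatten(2))     # (B, 8, DIM) frozen video features
_, h = forecaster(z_obs)                   # summarize the observed trajectory
z_future = h.transpose(0, 1).repeat(1, 4, 1)   # toy 4-step future latents
depth_pred = depth_head(z_future)          # decode latents into future depth maps
print(depth_pred.shape)                    # torch.Size([2, 4, 4096])
```

The key design choice is that only the forecaster and the readout heads are ever trained; the backbone's weights, and therefore its representation space, stay fixed throughout.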

A key insight from this study is that a model’s forecasting ability is highly correlated with its perception performance, especially for models that were originally trained with temporal video supervision. This means that models good at understanding current video content are also generally good at predicting future content. Interestingly, video synthesis models, which are designed to generate video, often match or even surpass the forecasting performance of models trained with mask-based objectives.

The paper highlights three main challenges in video forecasting: the inherent uncertainty of the future (multiple possibilities can unfold), the need to model continuous trajectories over time, and the requirement to predict at various semantic levels (from low-level pixels to high-level object abstractions). Their diffusion-based method addresses these by generating diverse samples to capture uncertainty, modeling full temporal trajectories, and enabling predictions across different targets like pixels, depth, point tracks, and bounding boxes.
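Handling the first of those challenges, uncertainty, amounts to drawing several samples from a stochastic forecaster and scoring them jointly. The snippet below sketches a best-of-N evaluation under that idea; `toy_sampler` is a hypothetical stand-in for the diffusion model's sampling loop, and the tensors are random placeholders rather than real data.

```python
import torch

def sample_futures(sampler, z_obs, num_samples=8):
    # Each call to the stochastic sampler yields a different plausible future,
    # so N calls give N candidate latent trajectories for the same clip.
    return torch.stack([sampler(z_obs) for _ in range(num_samples)])

# Hypothetical stand-in for the diffusion sampling loop: repeat the last
# observed latent over 4 future steps and perturb it with noise.
def toy_sampler(z):
    B, _, D = z.shape
    return z[:, -1:].repeat(1, 4, 1) + 0.1 * torch.randn(B, 4, D)

z_obs = torch.randn(2, 8, 256)                 # observed latents (B, T, D)
futures = sample_futures(toy_sampler, z_obs)   # (N, B, 4, D) candidate futures

# Best-of-N scoring: keep the sample closest to the ground-truth future, so a
# multi-modal forecaster is not penalized for covering several outcomes.
z_true = torch.randn(2, 4, 256)
errors = ((futures - z_true) ** 2).mean(dim=(2, 3))   # (N, B) per-sample error
print(errors.min(dim=0).values)                # best error per clip
```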

The framework operates in two stages. First, a lightweight readout head is trained for each specific task (e.g., predicting point tracks) to map the frozen video representations to the desired output. Second, a diffusion model is trained to forecast future latent trajectories directly in the space of these frozen video representations. During evaluation, these forecasted representations are then passed through the trained readout heads to assess their quality in the context of the downstream task.
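Below is a compressed sketch of those two stages, again with hypothetical shapes and a deliberately simplified noise schedule. Note that the real denoiser would also be conditioned on the observed frames' latents; that conditioning is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256

# ---- Stage 1: train a lightweight readout head on frozen features. ----
head = nn.Linear(DIM, 2)                       # e.g. an (x, y) point-track readout
opt_head = torch.optim.Adam(head.parameters(), lr=1e-3)
z = torch.randn(32, DIM)                       # precomputed frozen features
target = torch.randn(32, 2)                    # ground-truth point locations
opt_head.zero_grad()
F.mse_loss(head(z), target).backward()         # task loss; backbone untouched
opt_head.step()

# ---- Stage 2: train a denoiser over latent trajectories (DDPM-style). ----
denoiser = nn.Sequential(nn.Linear(DIM + 1, DIM), nn.GELU(), nn.Linear(DIM, DIM))
opt_diff = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

z_future = torch.randn(32, DIM)                # clean future latents
t = torch.rand(32, 1)                          # diffusion time in [0, 1]
noise = torch.randn_like(z_future)
z_noisy = (1 - t).sqrt() * z_future + t.sqrt() * noise   # toy noise schedule
eps_hat = denoiser(torch.cat([z_noisy, t], dim=1))       # timestep-conditioned
opt_diff.zero_grad()
F.mse_loss(eps_hat, noise).backward()          # standard noise-prediction loss
opt_diff.step()
```

Because the readout heads are trained once on real (not forecasted) features, they double as a fixed yardstick: any degradation at evaluation time can be attributed to the quality of the forecasted latents.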

The researchers evaluated their framework across four diverse tasks: forecasting future RGB pixels on ScanNet, predicting future depth maps on ScanNet, tracking dense visual features (point tracks) in the Perception Test dataset, and forecasting object bounding-box locations in the Waymo Open dataset. They benchmarked a wide array of image and video models, including DINOv2, SigLIP, VideoMAE, and W.A.L.T.

The results showed that models trained exclusively on static images, like DINOv2 and SigLIP, generally performed poorly in forecasting, emphasizing the critical role of temporal context in learning generalizable video representations. Language supervision alone also did not appear to significantly improve forecasting. W.A.L.T., a video synthesis model, excelled at pixel and depth forecasting, tasks closely aligned with its generative training. However, its performance on more structured tasks like point and bounding box tracking was mixed compared to masked modeling approaches of similar size.

This research provides a unified framework for evaluating forecasting capabilities in frozen video models, bridging the gap between representation learning and generative modeling. It suggests that understanding the present is a strong foundation for predicting the future in video. You can find more details in the full research paper: Generalist Forecasting with Frozen Video Models via Latent Diffusion.

