TLDR: ST-VFM is a novel framework that adapts Vision Foundation Models (VFMs), originally trained on images, for complex spatio-temporal forecasting tasks like traffic or crowd prediction. It addresses VFM limitations in temporal modeling and data compatibility through a dual-branch input (raw data and temporal flow) and a two-stage reprogramming process. This allows VFMs to effectively learn intricate spatial and temporal patterns, achieving superior accuracy across various datasets without requiring costly task-specific pre-training.
Spatio-temporal forecasting, which involves predicting future dynamics by understanding how things interact across space and time, is crucial for many real-world applications. Imagine predicting traffic jams, understanding human movement in a city, or forecasting cellular network usage. Traditionally, deep learning models have been used for these tasks, but they often need to be custom-built and extensively retrained for each specific scenario, which limits their flexibility and scalability.
Recently, large language models (LLMs) have shown promise in time-series forecasting. However, LLMs are primarily designed to understand one-dimensional sequences of text. This makes them less effective at capturing the rich, multi-dimensional spatial and temporal relationships inherent in spatio-temporal data.
This is where Vision Foundation Models (VFMs) come into play. VFMs, like those used in advanced image recognition, are incredibly powerful at understanding spatial patterns because they are trained on massive datasets of images. Given that spatio-temporal data often looks like a sequence of images (e.g., a grid of traffic data over time), VFMs seem like a natural fit. However, there are two main hurdles: VFMs aren’t inherently designed to model temporal changes, and the raw spatio-temporal data doesn’t always look like the typical images VFMs are used to seeing.
To overcome these challenges, researchers have developed a novel framework called ST-VFM. This framework systematically ‘reprograms’ existing Vision Foundation Models for general-purpose spatio-temporal forecasting. ST-VFM uses a clever dual-branch architecture. One branch processes the raw spatio-temporal data, capturing the spatial layouts. The other branch processes ‘ST Flow’ inputs, which are simplified representations of temporal differences, essentially encoding how things change over time as dynamic spatial cues.
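The article describes ST Flow as a simplified encoding of temporal differences. The paper's exact formulation isn't given here, but the simplest reading is frame-to-frame differencing over the spatio-temporal grid. A minimal sketch of that interpretation (the function name `st_flow` is ours, not the paper's):

```python
import numpy as np

def st_flow(x):
    """Temporal-difference 'ST Flow' from a spatio-temporal tensor.

    x: array of shape (T, H, W) -- T time steps over an H x W spatial grid.
    Returns an array of the same shape whose step t encodes how each cell
    changed since step t-1 (the first step has no predecessor, so it is zero).
    """
    flow = np.zeros_like(x)
    flow[1:] = x[1:] - x[:-1]  # frame-to-frame differences as dynamic cues
    return flow

# Toy example: traffic volume on a 2x2 grid over 3 time steps.
x = np.array([[[1., 2.], [3., 4.]],
              [[2., 2.], [3., 5.]],
              [[2., 1.], [4., 5.]]])
f = st_flow(x)
```

The key point is that the flow branch sees *change* rendered as a spatial map, which is exactly the kind of input an image-pretrained VFM can process.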
How ST-VFM Works: Two Stages of Reprogramming
ST-VFM employs a two-stage reprogramming strategy to adapt VFMs without modifying their core, pre-trained structure:
The first stage, called **Pre-VFM Reprogramming**, prepares the data for the VFM. It uses a ‘Temporal-Aware Token Adapter’ that takes both the raw spatio-temporal data and the ST Flow data and converts them into a format that the VFM can understand. This adapter also embeds temporal context, ensuring that the VFM can interpret changes over time as if they were spatial patterns. It also includes a learnable positional embedding to help the model understand the specific spatial layout of the spatio-temporal grids, which are often much smaller than typical image sizes.
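To make the adapter's role concrete, here is a hypothetical sketch of the steps described above: split each grid frame into patches, project them to token embeddings, and add a temporal embedding plus a learnable positional embedding. All names (`token_adapter`, `W_proj`, `t_emb`, `pos_emb`) and the random initialization are illustrative placeholders; in the actual model these parameters would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def token_adapter(frames, patch=2, d_model=16):
    """Hypothetical sketch of a Temporal-Aware Token Adapter.

    frames: (T, H, W) spatio-temporal grid. Each frame is cut into
    patch x patch patches, linearly projected to d_model, then summed with
    a temporal embedding (which time step) and a positional embedding
    (which patch in the small grid). Random placeholders stand in for
    the trained parameters of the real adapter.
    """
    T, H, W = frames.shape
    ph, pw = H // patch, W // patch
    n_patches = ph * pw
    W_proj = rng.normal(size=(patch * patch, d_model))   # patch projection
    t_emb = rng.normal(size=(T, 1, d_model))             # temporal context
    pos_emb = rng.normal(size=(1, n_patches, d_model))   # learnable spatial layout

    # (T, ph, patch, pw, patch) -> (T, ph, pw, patch, patch) -> flat patches
    patches = frames.reshape(T, ph, patch, pw, patch).transpose(0, 1, 3, 2, 4)
    patches = patches.reshape(T, n_patches, patch * patch)
    return patches @ W_proj + t_emb + pos_emb            # (T, n_patches, d_model)

tokens = token_adapter(np.zeros((3, 4, 4)))
```

Note how small the grid is compared to a typical image: with a 4x4 grid and 2x2 patches there are only four tokens per step, which is why the dedicated positional embedding matters.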
The second stage, **Post-VFM Reprogramming**, introduces a ‘Bilateral Cross-Prompt Coordination’ module. After the VFM processes the adapted inputs from both branches, this module allows the two branches to interact dynamically. Each branch uses ‘prompts’ derived from the other branch’s output to enrich its own understanding. For example, the ST Flow branch’s output can inform the raw ST branch about dynamic patterns, while the raw ST branch’s output can provide stable spatial context to the ST Flow branch. This cross-referencing helps the model better disentangle and leverage both motion and appearance information for improved forecasting.
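The paper's exact coordination module isn't spelled out in this summary, but the described behavior, each branch attending to prompts derived from the other branch's output, can be sketched as plain scaled dot-product cross-attention with a residual connection. Treat this as a minimal stand-in, not the published architecture:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_prompt(tokens_a, tokens_b):
    """One direction of a bilateral cross-prompt step: tokens_a (queries)
    attend to prompt tokens taken from the other branch's output tokens_b,
    and the attended result is added residually. Shapes: (N, d)."""
    d = tokens_a.shape[-1]
    attn = softmax(tokens_a @ tokens_b.T / np.sqrt(d))  # (Na, Nb)
    return tokens_a + attn @ tokens_b                   # residual enrichment

def bilateral_coordination(raw_tokens, flow_tokens):
    # Each branch is enriched with prompts from the other branch:
    # the raw branch picks up dynamic cues, the flow branch picks up
    # stable spatial context.
    raw_out = cross_prompt(raw_tokens, flow_tokens)
    flow_out = cross_prompt(flow_tokens, raw_tokens)
    return raw_out, flow_out
```

Because both updates read from the *original* outputs of the other branch, the exchange is symmetric, matching the "bilateral" framing in the article.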
Additionally, ST-VFM uses an auxiliary ‘flow forecasting’ objective during training. While the main task is to predict future spatio-temporal states, the model also learns to predict future temporal differences (flow maps). This extra task helps the model explicitly separate static spatial context from dynamic temporal changes, acting as a regularizer and improving overall robustness.
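A joint objective of this shape is typically a weighted sum of the main forecasting loss and the auxiliary flow loss. The weighting `lam` and the use of mean squared error below are our assumptions; the article does not state the paper's exact loss form or coefficient.

```python
import numpy as np

def training_loss(pred_state, true_state, pred_flow, true_flow, lam=0.1):
    """Sketch of the joint objective: the main spatio-temporal forecasting
    loss plus an auxiliary flow-forecasting term that nudges the model to
    separate static spatial context from dynamics. lam is a hypothetical
    weighting, not a value from the paper."""
    main = np.mean((pred_state - true_state) ** 2)  # predict future states
    aux = np.mean((pred_flow - true_flow) ** 2)     # predict future flow maps
    return main + lam * aux
```

The auxiliary term costs nothing at inference time: the flow head is only used during training, acting purely as a regularizer.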
Impressive Results and Broad Applicability
Extensive experiments were conducted on ten diverse spatio-temporal datasets, covering various domains like traffic speed, crowd flow, taxi demand, and cellular usage. ST-VFM consistently outperformed state-of-the-art baseline methods across all datasets and metrics, often by a significant margin. This demonstrates the effectiveness and robustness of the approach.
The framework also proved flexible across different VFM backbones, including DINO, CLIP, and DeiT, all of which showed strong performance. Interestingly, even though video foundation models (like VideoMAE) are designed for spatio-temporal data, ST-VFM, using image-pretrained VFMs, still outperformed them. This suggests that the rich spatial priors learned from vast image datasets, combined with ST-VFM’s temporal adaptation, are more effective for these forecasting tasks.
In conclusion, ST-VFM offers a powerful and general framework for spatio-temporal forecasting. By intelligently reprogramming existing Vision Foundation Models, it eliminates the need for costly, task-specific pre-training on spatio-temporal data, opening up new possibilities for leveraging general visual intelligence in dynamic prediction tasks. You can read the full research paper here.