TLDR: A new research paper introduces REAL-V-TSFM, a novel dataset derived from real-world video using optical flow, to benchmark Time Series Foundation Models (TSFMs). The study found that while TSFMs perform well on conventional synthetic datasets, their performance significantly degrades on this new real-world video-derived data, indicating limited generalizability. The findings highlight the need for more diverse benchmarking and model designs to improve TSFMs’ ability to forecast real-world temporal dynamics.
Recent advancements in artificial intelligence have led to the development of Time Series Foundation Models (TSFMs), powerful deep learning architectures designed to understand and predict patterns in sequential data. While these models have shown impressive capabilities, particularly in areas like finance, healthcare, and urban computing, a new study highlights a critical gap in their evaluation: a lack of testing on truly real-world data.
A research paper titled “How Far Do Time Series Foundation Models Paint the Landscape of Real-World Benchmarks ?” by Lujun Li, Lama Sleem, Yiqun Wang, Yangjie Xu, Niccolò Gentile, and Radu State, from the University of Luxembourg and Foyer S.A., addresses this very issue. The authors argue that most evaluations of TSFMs have relied heavily on synthetic benchmarks, which may not accurately reflect the complexities and nuances of real-world temporal dynamics.
Bridging the Gap with REAL-V-TSFM
To tackle this problem, the researchers propose a novel benchmarking approach that bridges the divide between synthetic and realistic data. Their innovative method involves extracting temporal signals directly from real-world video using a technique called optical flow. This process led to the creation of a new dataset named REAL-V-TSFM, designed to capture rich and diverse time series derived from everyday video content.
REAL-V-TSFM is built in a sequence of steps. The researchers start by collecting videos, primarily from sources like LaSOT, in which each clip is guaranteed to feature a main subject. The videos are then broken down into individual frames, and a foreground detection method isolates the main objects from the background. Next, corner detection with the Shi–Tomasi method identifies key points on these subjects. The core of the time series extraction is the Lucas-Kanade optical flow method, which tracks the movement of these key points across consecutive frames. The result is a set of continuous motion patterns in which each key point’s x and y coordinates form independent time series. Finally, a forward-backward consistency check is applied to ensure the reliability of the tracked trajectories.
The resulting REAL-V-TSFM dataset is notably diverse, containing over 6,000 time series from more than 600 different objects. It exhibits significant heterogeneity in length and value range, and a higher percentage of stationary series compared to conventional datasets like M4, indicating a different kind of temporal behavior. This diversity is crucial for truly testing the generalization capabilities of TSFMs.
Performance on Real-World Data
The study evaluated three state-of-the-art TSFMs: Chronos (variants like amazon/chronos-bolt-base and amazon/chronos-t5-large) and google/timesfm-2.0-500m-pytorch, alongside a Linear Regression baseline. These models were tested under zero-shot forecasting conditions on both the traditional M4 dataset and the newly proposed REAL-V-TSFM dataset, using metrics such as MAPE, sMAPE, Aggregate Relative WQL, and Aggregate Relative MASE.
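For readers unfamiliar with these metrics, the sketch below shows the standard textbook definitions of MAPE, sMAPE, and MASE in plain Python. It is an illustration rather than the paper's evaluation code: the function names are hypothetical, the sMAPE variant is the one with a factor of 200 used in the M4 competition, and the aggregation (each model's score divided by a baseline's score, combined with a geometric mean) is assumed from the convention used in the Chronos evaluations. WQL is not covered here because it requires quantile forecasts.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE, in percent (M4-style definition, range 0 to 200)."""
    return 200.0 * sum(abs(a - f) / (abs(a) + abs(f))
                       for a, f in zip(actual, forecast)) / len(actual)


def mase(history, actual, forecast, season=1):
    """Mean absolute scaled error.

    The forecast's mean absolute error is scaled by the in-sample error of a
    (seasonal) naive forecast on `history`. Values below 1 mean the forecast
    beats that naive reference.
    """
    naive_errors = [abs(history[i] - history[i - season])
                    for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


def aggregate_relative(model_scores, baseline_scores):
    """Geometric mean of per-dataset ratios model_score / baseline_score.

    Assumption: this mirrors how 'Aggregate Relative' scores are usually
    computed; the paper's exact procedure may differ.
    """
    ratios = [m / b for m, b in zip(model_scores, baseline_scores)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

Because MASE and the relative scores are scale-free, they allow comparison across series with very different value ranges, which matters for a dataset as heterogeneous as REAL-V-TSFM.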
The experimental results revealed a significant finding: while the TSFMs generally performed strongly on the conventional M4 dataset, outperforming the baseline, they exhibited a noticeable performance degradation when applied to REAL-V-TSFM. For instance, google/timesfm-2.0-500m-pytorch, which showed excellent performance on M4, saw a substantial increase in errors on the video-derived dataset. This suggests that despite their strong performance on synthetic or conventional benchmarks, these foundation models have limited generalizability to time series data extracted from real-world events.
Interestingly, the study also explored the impact of model size, finding that increasing the number of parameters in Chronos models led to only limited performance gains. Smaller models sometimes achieved comparable results to their larger counterparts, suggesting that scaling laws might not consistently apply to TSFMs in the same way they do for other foundation models. Furthermore, the performance varied considerably depending on the type of object in the video, with animal movements proving particularly challenging to forecast compared to inanimate objects.
Looking Ahead
The findings of this research underscore an urgent need for data-centric benchmarking and the development of more diverse model structures to advance TSFMs toward genuine universality. The authors emphasize that future research should explore incorporating richer data modalities and enhancing model adaptability. The video-based time series data extraction pipeline introduced in this work, detailed further in the full research paper, offers a promising avenue for generating diverse and informative time series from the vast amount of available video data, paving the way for more robust and generalizable time series forecasting models.


