Time Series Models Struggle with Real-World Video Data, Study Finds

TLDR: A new research paper introduces REAL-V-TSFM, a dataset derived from real-world video using optical flow, to benchmark Time Series Foundation Models (TSFMs). The study found that while TSFMs perform well on conventional benchmarks such as M4, their performance degrades significantly on the video-derived data, indicating limited generalizability. The findings highlight the need for more diverse benchmarks and model designs to improve TSFMs’ ability to forecast real-world temporal dynamics.

Recent advancements in artificial intelligence have led to the development of Time Series Foundation Models (TSFMs), powerful deep learning architectures designed to understand and predict patterns in sequential data. While these models have shown impressive capabilities, particularly in areas like finance, healthcare, and urban computing, a new study highlights a critical gap in their evaluation: a lack of testing on truly real-world data.

A research paper titled “How Far Do Time Series Foundation Models Paint the Landscape of Real-World Benchmarks ?” by Lujun Li, Lama Sleem, Yiqun Wang, Yangjie Xu, Niccolò Gentile, and Radu State from the University of Luxembourg and Foyer S.A., addresses this very issue. The authors argue that most evaluations of TSFMs have relied heavily on synthetic benchmarks, which may not accurately reflect the complexities and nuances of real-world temporal dynamics.

Bridging the Gap with REAL-V-TSFM

To tackle this problem, the researchers propose a benchmarking approach that bridges the divide between synthetic and realistic data. Their method extracts temporal signals directly from real-world video using a technique called optical flow, which estimates how points move from one frame to the next. The result is a new dataset named REAL-V-TSFM, designed to capture rich and diverse time series derived from everyday video content.

REAL-V-TSFM is built in several steps. Videos are collected primarily from sources such as LaSOT, where each clip is guaranteed to contain a main subject, and are then broken down into individual frames. A foreground detection method isolates the main object from the background, and corner detection with the Shi–Tomasi method identifies key points on that object. The core of the time series extraction is the Lucas–Kanade optical flow method, which tracks the movement of these key points across consecutive frames. This yields continuous motion patterns in which each key point’s x and y coordinates form independent time series. A forward-backward consistency check is also applied to ensure that the tracked trajectories are reliable.
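
The forward-backward check and the split into x and y series can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function names, the 1-pixel threshold, and the sample coordinates are assumptions made for this example. The tracker itself (Lucas–Kanade in the paper) is left out; the code only assumes that some tracker has already produced a forward position and a back-tracked position for each key point.

```python
from math import hypot


def forward_backward_filter(prev_pts, fwd_pts, back_pts, max_error=1.0):
    """Return indices of key points that pass a forward-backward check.

    prev_pts: positions in frame t.
    fwd_pts:  positions after tracking forward to frame t+1 (None if lost).
    back_pts: positions after tracking fwd_pts back to frame t (None if lost).
    A point is kept when the round trip ends within max_error pixels of
    where it started. The threshold is an assumed value, not from the paper.
    """
    keep = []
    for i, (start, fwd, back) in enumerate(zip(prev_pts, fwd_pts, back_pts)):
        if fwd is None or back is None:
            continue  # the tracker lost this point in one direction
        round_trip_error = hypot(start[0] - back[0], start[1] - back[1])
        if round_trip_error <= max_error:
            keep.append(i)
    return keep


def trajectory_to_series(trajectory):
    """Split one key point's trajectory [(x0, y0), (x1, y1), ...] into
    two independent series: the x coordinates and the y coordinates."""
    xs = [x for x, _ in trajectory]
    ys = [y for _, y in trajectory]
    return xs, ys


# Hypothetical data: two key points tracked from frame t to frame t+1 and back.
prev_pts = [(10.0, 10.0), (50.0, 20.0)]
fwd_pts = [(12.0, 11.0), (55.0, 20.0)]
back_pts = [(10.2, 10.1), (46.0, 23.0)]  # second point drifts by 5 pixels

reliable = forward_backward_filter(prev_pts, fwd_pts, back_pts)
x_series, y_series = trajectory_to_series([(10.0, 10.0), (12.0, 11.0), (14.5, 12.0)])
```

Here only the first key point survives the check, because the second one returns 5 pixels away from its starting position. Each surviving trajectory then contributes two series, one per coordinate.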

The resulting REAL-V-TSFM dataset is notably diverse, containing over 6,000 time series from more than 600 different objects. It exhibits significant heterogeneity in length and value range, and a higher percentage of stationary series compared to conventional datasets like M4, indicating a different kind of temporal behavior. This diversity is crucial for truly testing the generalization capabilities of TSFMs.

Performance on Real-World Data

The study evaluated state-of-the-art TSFMs from two model families: Chronos (variants such as amazon/chronos-bolt-base and amazon/chronos-t5-large) and TimesFM (google/timesfm-2.0-500m-pytorch), alongside a Linear Regression baseline. These models were tested under zero-shot forecasting conditions, meaning without any fine-tuning on the target data, on both the traditional M4 dataset and the newly proposed REAL-V-TSFM dataset. The metrics included mean absolute percentage error (MAPE), symmetric MAPE (sMAPE), Aggregate Relative weighted quantile loss (WQL), and Aggregate Relative mean absolute scaled error (MASE).
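
For readers unfamiliar with these metrics, the point-forecast ones can be written out in a few lines. The sketch below uses the common textbook definitions, which may differ in detail from the paper’s evaluation code. The function names and sample numbers are assumptions for illustration, and the aggregation step assumes the convention of taking a geometric mean of model-to-baseline score ratios across datasets.

```python
from math import exp, log


def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no actual value is 0."""
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE, in percent (0 to 200). Assumes |a| + |f| is never 0."""
    terms = (2.0 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))
    return 100.0 * sum(terms) / len(actual)


def mase(history, actual, forecast, season=1):
    """Mean absolute scaled error: forecast MAE divided by the in-sample MAE
    of a seasonal naive forecast on the history (season=1 is the plain naive one)."""
    naive_errors = [abs(history[i] - history[i - season]) for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


def aggregate_relative(model_scores, baseline_scores):
    """Geometric mean of per-dataset ratios model_score / baseline_score.
    Values below 1 mean the model beats the baseline on average."""
    ratios = [m / b for m, b in zip(model_scores, baseline_scores)]
    return exp(sum(log(r) for r in ratios) / len(ratios))


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
mape_value = mape([100.0, 200.0], [110.0, 180.0])             # each error is 10% of the actual
smape_value = smape([100.0, 200.0], [110.0, 180.0])
mase_value = mase([1.0, 3.0, 5.0, 7.0], [9.0, 11.0], [8.0, 13.0])  # MAE 1.5 / naive MAE 2.0
relative = aggregate_relative([0.5, 2.0], [1.0, 1.0])
```

One practical caveat: percentage-based metrics such as MAPE become unstable when actual values are close to zero, which is one reason scaled metrics like MASE are often reported alongside them.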

The experiments showed that while the TSFMs generally performed strongly on the conventional M4 dataset, outperforming the baseline, their performance degraded noticeably on REAL-V-TSFM. For instance, google/timesfm-2.0-500m-pytorch, which showed excellent performance on M4, saw a substantial increase in errors on the video-derived dataset. This suggests that despite their strong performance on synthetic or conventional benchmarks, these foundation models have limited generalizability to time series data extracted from real-world events.

Interestingly, the study also explored the impact of model size, finding that increasing the number of parameters in Chronos models led to only limited performance gains. Smaller models sometimes achieved comparable results to their larger counterparts, suggesting that scaling laws might not consistently apply to TSFMs in the same way they do for other foundation models. Furthermore, the performance varied considerably depending on the type of object in the video, with animal movements proving particularly challenging to forecast compared to inanimate objects.

Looking Ahead

The findings of this research underscore an urgent need for data-centric benchmarking and the development of more diverse model structures to advance TSFMs toward genuine universality. The authors emphasize that future research should explore incorporating richer data modalities and enhancing model adaptability. The video-based time series data extraction pipeline introduced in this work, detailed further in the full research paper, offers a promising avenue for generating diverse and informative time series from the vast amount of available video data, paving the way for more robust and generalizable time series forecasting models.
