TLDR: CAUKER is a new algorithm that generates diverse, causally coherent synthetic time series data for pretraining classification Time Series Foundation Models (TSFMs). It combines Gaussian Process kernels with Structural Causal Models. Experiments show that TSFMs pretrained solely on CAUKER’s synthetic data exhibit clear scaling laws and can achieve state-of-the-art classification performance, often matching or exceeding models trained on much larger real-world datasets, making pretraining more sample-efficient and less reliant on extensive real data collection.
In the rapidly evolving field of artificial intelligence, Time Series Foundation Models (TSFMs) have emerged as powerful tools for understanding and predicting patterns in data that changes over time. These models are crucial for applications ranging from healthcare to industrial monitoring. Traditionally, training these advanced models requires vast amounts of real-world data, which can be expensive and time-consuming to collect and prepare.
A recent research paper, titled “CAUKER: classification time series foundation models can be pretrained on synthetic data only,” introduces a groundbreaking approach that challenges this norm. Authored by Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, and Ievgen Redko, this work proposes a novel algorithm called CAUKER. The core idea behind CAUKER is to generate diverse, realistic, and causally coherent synthetic time series data, enabling the pretraining of TSFMs without relying on real-world datasets.
The Challenge of Real-World Data
Time series data, found everywhere from tracking stock prices to monitoring patient vital signs, is inherently complex. TSFMs, inspired by the success of large models in natural language processing and computer vision, aim to achieve strong performance even on data they haven’t seen before (zero-shot capabilities). However, gathering and curating the massive datasets needed for pretraining these models is a significant hurdle. Existing real-world time series classification datasets often lack the diversity required for truly robust model training, leading to irregular scaling behavior when models are trained on them.
Introducing CAUKER: Synthetic Data for Superior Training
CAUKER offers a compelling solution by generating high-quality synthetic data. This approach brings several key advantages: it eliminates the need for laborious data collection and curation, allows for the creation of arbitrarily large datasets for model scaling, and makes evaluating models on unseen data more reliable by mitigating data leakage risks. Unlike previous synthetic data generators designed for tabular data or forecasting, CAUKER specifically focuses on creating sequences with meaningful correlations and realistic temporal dependencies, crucial for classification tasks.
The CAUKER pipeline is a sophisticated blend of two powerful techniques: Gaussian Process (GP) kernel composition and Structural Causal Models (SCM). Gaussian Processes help in generating sequences with common time series patterns like trends, seasonality, and periodicity. Structural Causal Models, on the other hand, introduce rich non-linear dependencies and ensure a meaningful clustering structure within the data, which is vital for classification. By combining these, CAUKER produces data that is both temporally realistic and causally coherent, making it ideal for training classification TSFMs.
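The paper does not publish its exact generator here, but the two ingredients can be sketched with standard tools. The following minimal, illustrative Python sketch composes Gaussian Process kernels (an RBF kernel for smooth trends, a periodic kernel for seasonality) and feeds the sampled signals through a toy structural causal model with a non-linear mechanism; all kernel parameters and the causal graph are assumptions, not the paper's actual configuration:

```python
import numpy as np

def rbf_kernel(t, length_scale=0.2):
    # Squared-exponential kernel: generates smooth trend-like signals
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def periodic_kernel(t, period=0.25, length_scale=1.0):
    # Exp-sine-squared kernel: generates seasonal/periodic signals
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

def sample_gp(t, kernel_matrix, rng):
    # Draw one sample path from a zero-mean GP with the given kernel
    K = kernel_matrix + 1e-6 * np.eye(len(t))  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(t)), K)

def scm_sample(t, rng):
    # Toy structural causal model over GP-generated parent signals:
    # x1 and x2 are exogenous GP draws; x3 depends non-linearly on both.
    x1 = sample_gp(t, rbf_kernel(t), rng)
    x2 = sample_gp(t, periodic_kernel(t), rng)
    noise = 0.1 * rng.standard_normal(len(t))
    x3 = np.tanh(x1) * x2 + noise  # non-linear causal mechanism
    return x3

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 128)
series = scm_sample(t, rng)
print(series.shape)  # (128,)
```

In the actual pipeline, kernels are composed (e.g., summed or multiplied) to mix patterns, and the SCM structure is what induces the clustering needed for classification labels; this sketch only shows the mechanics of each ingredient.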
Key Findings and Impact
The researchers conducted extensive experiments using two state-of-the-art TSFMs, Mantis and MOMENT, to evaluate CAUKER’s effectiveness. Their findings are significant:
- Superior Synthetic Data Generation: CAUKER consistently outperformed other synthetic data generation methods, including those previously used for forecasting or tabular data. This highlights the importance of a classification-tailored approach to synthetic time series generation.
- Clear Scaling Laws: A remarkable discovery was that TSFMs trained on CAUKER-generated data exhibited clear and consistent scaling laws. This means that as the dataset size (from 10K to 10M samples) or model capacity (from 1M to 783M parameters) increased, the model’s accuracy steadily improved. This contrasts sharply with real-world datasets, which showed irregular or flat scaling behavior, suggesting a lack of diversity or domain mismatch.
- Competitive Performance: Perhaps the most impactful finding is that pretraining TSFMs solely on CAUKER-generated synthetic data can yield state-of-the-art classification performance. Models pretrained on substantially less CAUKER data (e.g., 100K synthetic samples for Mantis vs. 1.89M original samples, or 10M for MOMENT vs. 13M original) nearly matched or even surpassed models trained on the much larger real-world datasets, demonstrating CAUKER’s sample efficiency.
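A clean scaling law of the kind described above means error decays roughly as a power of dataset size, which appears as a straight line in log-log space. The sketch below fits such a power law with numpy; the error rates are purely illustrative placeholders, not results from the paper:

```python
import numpy as np

# Hypothetical error rates (NOT from the paper) at increasing
# pretraining dataset sizes, spanning 10K to 10M samples.
n_samples = np.array([1e4, 1e5, 1e6, 1e7])
error = np.array([0.30, 0.24, 0.19, 0.15])

# A power law error ~ a * N^(-b) is linear in log-log coordinates,
# so a least-squares line fit recovers the scaling exponent b.
slope, intercept = np.polyfit(np.log(n_samples), np.log(error), 1)
print(f"fitted scaling exponent b = {-slope:.3f}")
```

Irregular or flat scaling, as reported for real-world datasets, would show up here as a poor fit or a near-zero exponent.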
The study also delved into the internal workings of the models, showing that Mantis models pretrained on larger CAUKER datasets exhibited a clear trend in non-linearity and CKA scores, indicating that the model was effectively exploiting the increased data. This structural change was not observed when training on real-world UEA datasets.
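CKA (Centered Kernel Alignment) is a standard way to compare learned representations across models or layers. For reference, linear CKA between two representation matrices can be computed in a few lines; the random matrices here are placeholders for actual model activations:

```python
import numpy as np

def linear_cka(X, Y):
    # Linear Centered Kernel Alignment between two representation
    # matrices of shape (n_samples, n_features); 1.0 means the
    # representations span identical structure up to rotation/scale.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 32))  # placeholder activations, layer A
B = rng.standard_normal((100, 32))  # placeholder activations, layer B
print(round(linear_cka(A, A), 6))  # identical representations -> 1.0
```

Tracking how such scores shift as pretraining data grows is one way to observe the structural changes the authors report.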
This research underscores a critical insight: the quality and structure of pretraining data are paramount for the generalization performance of TSFMs. While architectural innovations are important, this work suggests that designing principled synthetic training data can yield equivalent gains. CAUKER represents a significant step towards building scalable, general-purpose time series foundation models that are less reliant on costly real-world data collection. For more in-depth details, you can refer to the full research paper available at arXiv:2508.02879.