Unlocking Time Series AI with Synthetic Series-Symbol Data

TLDR: A new research paper introduces a method to generate high-quality synthetic time series data paired with mathematical symbolic expressions, addressing the common problem of data scarcity in time series analysis. The authors also developed SymTime, a foundation model that uses this “series-symbol” data to learn robust representations. Across time series tasks such as forecasting and anomaly detection, it performs comparably to models trained on real-world data.

Foundation models are transforming artificial intelligence, but their development in time series analysis (TSA) faces a significant hurdle: a shortage of high-quality training data. Compared with the datasets available in computer vision or natural language processing, real-world time series datasets are often smaller and suffer from imbalances, particularly in critical areas like finance and healthcare. This scarcity can limit a model’s ability to generalize and perform well on diverse tasks.

A recent research paper, titled “Synthetic Series-Symbol Data Generation for Time Series Foundation Models,” introduces an innovative solution to this data challenge. Authored by Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, and Xiaoyu Zhang, the paper proposes a novel approach to generate unlimited, high-quality synthetic time series data. This method, called the series-symbol (S2) dual-modality data generation mechanism, is inspired by theories of complex dynamic systems, which suggest that time series are representations of underlying mathematical processes.

The core idea is to create time series data alongside their corresponding symbolic expressions—essentially, the mathematical formulas that describe them. By continuously constructing diverse symbolic expressions, the researchers can generate a vast array of time series with rich and varied properties. This synthetic data effectively addresses the problem of training data scarcity and imbalance that plagues real-world datasets.
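To make the idea concrete, here is a minimal sketch of how a symbolic expression and its time series could be generated together. It is not the paper's generator: the operator sets, the constant range, the tree depth, and the names `random_expression` and `sample_pair` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical operator vocabulary; the paper's actual operator set may differ.
UNARY = {"sin": np.sin, "cos": np.cos, "tanh": np.tanh}
BINARY = {"+": np.add, "*": np.multiply}


def random_expression(rng, depth=2):
    """Build a random expression of t; return (symbol string, callable)."""
    if depth == 0:
        # Leaf: a scaled copy of the time variable.
        a = round(float(rng.uniform(0.5, 3.0)), 2)
        return f"{a}*t", lambda t, a=a: a * t
    if rng.random() < 0.5:
        # Wrap a sub-expression in a unary operator.
        name = str(rng.choice(list(UNARY)))
        sub_s, sub_f = random_expression(rng, depth - 1)
        fn = UNARY[name]
        return f"{name}({sub_s})", lambda t: fn(sub_f(t))
    # Combine two sub-expressions with a binary operator.
    op = str(rng.choice(list(BINARY)))
    left_s, left_f = random_expression(rng, depth - 1)
    right_s, right_f = random_expression(rng, depth - 1)
    fn = BINARY[op]
    return f"({left_s} {op} {right_s})", lambda t: fn(left_f(t), right_f(t))


def sample_pair(seed, length=256):
    """Return one (symbol, series) pair sampled on a fixed time grid."""
    rng = np.random.default_rng(seed)
    symbol, f = random_expression(rng)
    t = np.linspace(0.0, 10.0, length)
    return symbol, f(t)


symbol, series = sample_pair(seed=0)
print(symbol, series.shape)
```

Calling `sample_pair` with different seeds yields different formula and series pairs, which is the sense in which such a generator can produce as much paired data as needed.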

To leverage this unique S2 dataset, the researchers developed SymTime, a pre-trained foundation model designed specifically for time series analysis. SymTime integrates both time series representations and symbolic semantic information. It consists of a time series encoder, a symbol encoder (based on the DistilBERT architecture), and momentum encoders. The model is pre-trained using a combination of objectives: masked time series modeling (where parts of the time series are hidden and reconstructed), masked language modeling (for the symbolic expressions), and a crucial series-symbol contrastive learning mechanism that aligns the representations of correlated time series and their symbolic counterparts.
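The contrastive objective can be sketched as a symmetric InfoNCE-style loss over a batch of paired embeddings. This is a simplified illustration under stated assumptions, not SymTime's implementation: it omits the momentum encoders, and the function names, the temperature value, and the use of plain NumPy arrays in place of encoder outputs are all hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(series_emb, symbol_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair."""
    # L2-normalize so the dot product is a cosine similarity.
    s = series_emb / np.linalg.norm(series_emb, axis=1, keepdims=True)
    y = symbol_emb / np.linalg.norm(symbol_emb, axis=1, keepdims=True)
    logits = s @ y.T / temperature
    idx = np.arange(len(s))
    # Series-to-symbol: each series should pick out its own expression.
    loss_s2y = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Symbol-to-series: each expression should pick out its own series.
    loss_y2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2y + loss_y2s)


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = contrastive_loss(emb, emb)
mismatched = contrastive_loss(emb, np.roll(emb, 1, axis=0))
print(aligned < mismatched)
```

The loss is low when each series embedding is closest to its own symbolic expression's embedding and high when the pairs are mismatched, which is the alignment behaviour the pre-training objective encourages.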

The experimental results are compelling. The S2 dataset was shown to comprehensively cover diverse time series characteristics, including stationarity, predictability, frequency, seasonality, and trend, matching and even surpassing the diversity of real-world datasets. SymTime, pre-trained on this synthetic data, demonstrated competitive performance across five major TSA tasks: long-term forecasting, short-term forecasting, classification, imputation, and anomaly detection. It rivaled foundation models pre-trained on real-world data, often with a more lightweight architecture.
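As a rough illustration of what measuring characteristics such as frequency and trend can look like, the sketch below computes two simple descriptors of a series. These are stand-ins chosen for brevity, not the statistics used in the paper, and the function names are hypothetical.

```python
import numpy as np


def dominant_frequency(x, sample_spacing=1.0):
    """Frequency (cycles per unit time) of the largest non-DC spectral peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=sample_spacing)
    # Skip index 0, the DC component.
    return freqs[np.argmax(spectrum[1:]) + 1]


def trend_slope(x):
    """Slope of a least-squares straight line fitted to the series."""
    t = np.arange(len(x))
    slope, _intercept = np.polyfit(t, x, 1)
    return slope


t = np.arange(200)
seasonal = np.sin(2 * np.pi * t / 20)  # period of 20 steps
trending = 0.3 * t + 1.0
print(dominant_frequency(seasonal), trend_slope(trending))
```

Computing descriptors like these over many generated series, and comparing their distributions against those of real datasets, is one simple way to check how much variety a synthetic corpus covers.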

Furthermore, the research confirmed that the size of the pre-training dataset directly correlates with SymTime’s performance. As the scale of the synthetic S2 dataset increased, the model’s performance on downstream tasks progressively improved, validating the effectiveness of the data generation strategy in alleviating data scarcity. Ablation studies also highlighted the importance of each pre-training objective, particularly emphasizing how symbolic information enhances the model’s ability to learn robust temporal representations.

Visualizations of the model’s internal representations showed that after pre-training, SymTime’s encoders formed clear, operator-specific clusters for both time series and symbolic expressions, indicating that the model successfully learned the semantic correspondence between them. This ability also translated into zero-shot imputation capabilities on both synthetic and real-world time series data.

This work underscores the significant potential of generating synthetic, dual-modality data for training powerful time series foundation models. The current S2 generation mechanism has some limitations, such as excluding complex differential and integral operations due to computational challenges. Future work aims to integrate these operations to further enrich the diversity of symbolic expressions. For more technical details, refer to the full research paper.
