TLDR: UniCast is a framework that improves time series forecasting by combining visual and textual context with pre-trained time series models. It adapts these models with a parameter-efficient “soft prompt tuning” method, so they can use multimodal context without extensive retraining. In the reported experiments, UniCast consistently outperforms unimodal baselines, including when training data is limited.
Time series forecasting, a critical task in fields like finance, healthcare, and climate science, traditionally relies on models that process numerical data in isolation. However, real-world time series often come with rich supplementary information, such as images and text, which current models largely ignore. This oversight can limit their ability to make accurate and robust predictions.
A new framework called UniCast addresses this limitation with a unified multimodal prompting approach for Time Series Foundation Models (TSFMs). Developed by researchers from Pohang University of Science and Technology and The University of Melbourne, UniCast uses not only time series data but also accompanying visual and textual signals to improve forecasting performance.
What is UniCast?
UniCast stands out as a novel, parameter-efficient framework that extends existing TSFMs. Unlike traditional methods that operate in a “unimodal” setting (focusing on one type of data), UniCast integrates three key modalities: time series, vision, and text. The core idea is to combine the strengths of large-scale pre-trained TSFMs with the contextual richness provided by visual and textual information.
How UniCast Works
UniCast is built around a “soft prompt tuning” mechanism. Instead of fully retraining massive foundation models, which is computationally expensive and risks overfitting, UniCast introduces small, trainable vectors called “soft prompts.” These prompts act as guides: they let the pre-trained models adapt to new, multimodal inputs while the vast majority of their parameters stay frozen.
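To make the idea concrete, here is a minimal Python sketch of soft prompting. It is not UniCast's actual implementation: it assumes the prompts are simply prepended to a sequence of input embeddings, and the function names (`make_prompts`, `prepend_prompts`, `trainable_fraction`) and all sizes are hypothetical.

```python
import random

def make_prompts(num_prompts, dim, seed=0):
    """Create `num_prompts` small random vectors of size `dim`.
    These are the only values that would be trained (hypothetical init)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]

def prepend_prompts(prompts, token_embeddings):
    """Return a new sequence: soft prompts first, then the input embeddings.
    The frozen model would then process this longer sequence as usual."""
    return [list(p) for p in prompts] + [list(t) for t in token_embeddings]

def trainable_fraction(num_prompts, dim, frozen_param_count):
    """Share of all parameters that are trainable when only prompts are tuned."""
    prompt_params = num_prompts * dim
    return prompt_params / (prompt_params + frozen_param_count)

# Toy usage: 5 input tokens of width 8, plus 4 soft prompts.
tokens = [[0.1] * 8 for _ in range(5)]
prompts = make_prompts(num_prompts=4, dim=8)
sequence = prepend_prompts(prompts, tokens)

# With a made-up frozen model of one million parameters,
# the prompts are a tiny share of the total.
fraction = trainable_fraction(num_prompts=4, dim=8, frozen_param_count=1_000_000)
```

In this toy setup only the 32 prompt values would receive gradient updates, which is why the approach is described as parameter-efficient.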
Here’s a breakdown of its components:
- Vision Prompt: UniCast takes visual representations of time series data, such as plots, and feeds them into a pre-trained Vision Encoder (like CLIP or BLIP). Soft prompts are injected into this encoder to help it understand how visual patterns relate to forecasting tasks.
- Text Prompt: Similarly, textual information, such as dataset descriptions or metadata, is processed by a pre-trained Text Encoder (like Qwen or LLaMA). Text prompts guide this encoder to extract relevant semantic context.
- Time-Series Prompt: The raw time series data is processed by the TSFM. Additional time-series prompts are introduced within the TSFM to help it effectively integrate the visual and textual embeddings alongside its own temporal analysis.
- Cross-Modality Interaction: UniCast also has to bring these different types of information together. Learnable projection layers map the outputs of the vision and text encoders into the same “embedding space” as the TSFM, so the TSFM can combine all three modalities in a single forecasting pass.
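The cross-modality step can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: it assumes each encoder yields a single embedding vector, that a plain linear layer does the projection, and that the projected vectors are placed in front of the time series tokens. The helper names `linear` and `fuse`, and all weights and dimensions, are made up for the example.

```python
def linear(vec, weight, bias):
    """Apply y = W x + b, where `weight` is a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def fuse(vision_emb, text_emb, ts_tokens, w_v, b_v, w_t, b_t):
    """Project vision and text embeddings to the TSFM width, then
    place them in front of the time series tokens (assumed layout)."""
    v = linear(vision_emb, w_v, b_v)   # vision width -> TSFM width
    t = linear(text_emb, w_t, b_t)     # text width   -> TSFM width
    return [v, t] + [list(tok) for tok in ts_tokens]

# Toy sizes: vision width 3, text width 2, TSFM width 2.
vision_emb = [1.0, 2.0, 3.0]
text_emb = [0.5, -0.5]
ts_tokens = [[0.1, 0.2], [0.3, 0.4]]

w_v = [[1, 0, 0], [0, 1, 1]]   # 2 x 3 projection (would be learned)
b_v = [0.0, 0.0]
w_t = [[2, 0], [0, 2]]         # 2 x 2 projection (would be learned)
b_t = [0.0, 1.0]

fused = fuse(vision_emb, text_emb, ts_tokens, w_v, b_v, w_t, b_t)
```

After projection, every element of `fused` has the same width, so a single model can attend over visual, textual, and temporal tokens together.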
Key Findings and Advantages
In experiments across time series forecasting benchmarks, including datasets from finance, healthcare, energy, and retail, UniCast consistently outperforms the unimodal TSFM baselines. For instance, UniCast variants using either Timer or Chronos as their backbone TSFM achieved the lowest average Mean Squared Error (MSE) across eight diverse datasets.
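For readers unfamiliar with the metric, MSE is the average squared gap between forecast and ground truth, and "average MSE" here means the mean of that score over datasets. The short Python sketch below shows the arithmetic; the function names and the numbers are illustrative placeholders, not results from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error over one forecast horizon."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def average_mse(per_dataset):
    """Mean of the per-dataset MSE values.
    `per_dataset` maps a dataset name to a (y_true, y_pred) pair."""
    scores = [mse(y_true, y_pred) for y_true, y_pred in per_dataset.values()]
    return sum(scores) / len(scores)

# Made-up toy forecasts for two hypothetical datasets.
toy = {
    "dataset_a": ([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
    "dataset_b": ([0.0, 0.0], [1.0, -1.0]),
}
score = average_mse(toy)
```

Lower is better, and because errors are squared, a few large misses weigh more heavily than many small ones.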
A key insight from the research is that incorporating both visual and textual context leads to substantial performance improvements. While each modality contributes individually, their combination yields complementary gains, suggesting they provide distinct and valuable cues for prediction. The framework also proves to be highly parameter-efficient, meaning it achieves these gains with minimal updates to the model’s parameters, making it scalable and practical for real-world deployment.
Furthermore, the study showed that injecting prompts deeper and more broadly into the model layers generally leads to better performance. UniCast also demonstrates remarkable data efficiency, achieving strong results even when trained with only a fraction of the available data, and converges rapidly within a few training epochs. This makes UniCast a robust and practical solution, especially in scenarios where training data might be limited.
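One common way to inject prompts at several depths is to give each chosen layer its own prompt vectors and swap them in before that layer runs. The Python sketch below illustrates that pattern only. Whether UniCast injects prompts exactly this way is an assumption, and `mix_layer` and `run_with_prompts` are hypothetical names, with `mix_layer` a toy stand-in for a frozen model layer.

```python
def mix_layer(seq):
    """Toy stand-in for a frozen layer: add the sequence mean to every token."""
    dim = len(seq[0])
    mean = [sum(tok[d] for tok in seq) / len(seq) for d in range(dim)]
    return [[tok[d] + mean[d] for d in range(dim)] for tok in seq]

def run_with_prompts(tokens, layers, layer_prompts):
    """Run `tokens` through `layers`. `layer_prompts` maps a layer index to
    the prompt vectors inserted before that layer; prompts from an earlier
    layer are replaced when a later layer supplies its own."""
    seq = [list(t) for t in tokens]
    n_prompt = 0  # number of prompt positions currently at the front
    for i, layer in enumerate(layers):
        if i in layer_prompts:
            seq = [list(p) for p in layer_prompts[i]] + seq[n_prompt:]
            n_prompt = len(layer_prompts[i])
        seq = layer(seq)
    return seq[n_prompt:]  # drop prompt positions, keep the real tokens

tokens = [[1.0], [3.0]]
no_prompt = run_with_prompts(tokens, [mix_layer], {})
shallow = run_with_prompts(tokens, [mix_layer], {0: [[5.0]]})
deep = run_with_prompts(tokens, [mix_layer, mix_layer], {0: [[5.0]], 1: [[0.0]]})
```

Even in this toy, the prompt changes what the frozen layer outputs for the real tokens, and prompting more layers gives the trainable vectors more places to steer the computation.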
In conclusion, UniCast represents a significant step forward in time series forecasting by effectively integrating multimodal context. This approach points toward general-purpose, context-aware forecasters that can operate more effectively in complex, real-world environments.


