TLDR: A new study reveals that pre-trained AI models for symbolic regression struggle to generalize to data outside their training distribution, performing well in familiar scenarios but failing on novel or perturbed data. This limits their real-world applicability and highlights the need for more robust, generalizable AI architectures, potentially through hybrid approaches combining AI with traditional search methods.
Symbolic regression, a field focused on discovering mathematical expressions that explain given data, has seen a significant shift with the advent of transformer-based models. These models promise a scalable approach by moving the computationally intensive search for formulas into a large-scale pre-training phase. However, a recent study delves into a critical question: how well do these pre-trained models truly generalize to problems beyond their initial training data?
The Generalization Gap Revealed
The research, titled “Analyzing Generalization in Pre-Trained Symbolic Regression” and conducted by Henrik Voigt, Paul Kahlmeyer, Kai Lawonn, Michael Habeck, and Joachim Giesen, reveals a significant limitation. While transformer-based symbolic regression models perform well on problems that are “in-distribution” (similar to their pre-training data), their performance consistently declines in “out-of-distribution” scenarios, which the study tests through its out-of-pre-training-domain (OOPD) regime. This “generalization gap” is identified as a major hurdle for practical applications.
Understanding the Testing Approach
To rigorously evaluate generalization, the researchers tested models across different data regimes:
- In-Pre-training Domain (IPD): Training data sampled from the same numerical domain used for the model’s initial pre-training.
- Within-Pre-training Domain (WIPD): Training data sampled from a smaller area completely inside the pre-training domain.
- Out-of-Pre-training Domain (OOPD): Training data sampled from an area explicitly separate from the pre-training domain, directly testing extrapolation.
They also distinguished between “in-domain” (ID) and “out-of-domain” (OOD) accuracy for testing the generated formulas, where ID tests on data from the same domain as training, and OOD tests on data from a different domain.
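To make the three regimes concrete, here is a minimal sketch of how such sampling domains could be set up for a one-dimensional problem. The interval bounds, the `make_regimes` and `sample_inputs` helpers, and the choice of the pre-training domain as the OOD test domain are all assumptions for illustration; the article does not state the study's actual ranges.

```python
import random

# Hypothetical pre-training domain; the study's real ranges are not given here.
PRETRAIN_DOMAIN = (-10.0, 10.0)

def make_regimes(pretrain=PRETRAIN_DOMAIN):
    """Return illustrative (low, high) sampling intervals for each regime."""
    lo, hi = pretrain
    width = hi - lo
    return {
        "IPD": (lo, hi),                                 # same as pre-training
        "WIPD": (lo + 0.25 * width, hi - 0.25 * width),  # smaller, fully inside
        "OOPD": (hi + 0.5 * width, hi + width),          # disjoint from pre-training
    }

def sample_inputs(domain, n, rng):
    """Draw n inputs uniformly from the given interval."""
    lo, hi = domain
    return [rng.uniform(lo, hi) for _ in range(n)]

rng = random.Random(0)
regimes = make_regimes()

# Data handed to the model in the OOPD regime.
fit_x = sample_inputs(regimes["OOPD"], 100, rng)
# ID test: same domain as the fitting data.
id_test_x = sample_inputs(regimes["OOPD"], 100, rng)
# OOD test: a different domain (assumed here to be the pre-training domain).
ood_test_x = sample_inputs(regimes["IPD"], 100, rng)
```

The regime describes where the fitting data comes from, while ID and OOD describe where the resulting formula is evaluated, so the two labels vary independently.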
Key Findings: A Tale of Two Performances
The study compared five transformer-based approaches and one hybrid model against PySR, a state-of-the-art search-based method, using the SRBench benchmark.
Formula Recovery: Search-based methods like PySR demonstrated superior and consistent ability to recover the true underlying formulas, regardless of the data domain. In contrast, transformer models showed significantly lower recovery rates, especially in OOPD scenarios. For instance, one model’s recovery rate dropped from 31% in IPD to just 4% in OOPD. This suggests that while transformers can be fast, they often struggle to identify the exact mathematical structure when faced with unfamiliar data.
Predictive Accuracy: When it came to simply predicting output values (accuracy), some transformer models (like E2E) showed high accuracy within the training data’s range (ID accuracy), even for OOPD training data. However, this accuracy often plummeted when asked to predict values outside the training data’s range (OOD accuracy). This indicates that these models might be excellent “curve-fitters” – adept at interpolating within known patterns – but they don’t necessarily learn the underlying generalizable rules.
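The gap between ID and OOD accuracy can be illustrated with a toy curve-fitter. The sketch below is not one of the models from the study: it fits a straight line to data generated by a quadratic, then scores it with R² (an assumed metric; the article does not name the study's accuracy measure) inside and outside the fitting range. The domains and the formula are hypothetical.

```python
def grid(lo, hi, n=50):
    """Evenly spaced points on [lo, hi]."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r2(y_true, y_pred):
    """Coefficient of determination; can be negative for poor fits."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def true_formula(x):
    # Hypothetical ground truth the fitter never sees in symbolic form.
    return x ** 2

fit_domain = (1.0, 2.0)    # where the fitting data lives (ID)
other_domain = (5.0, 6.0)  # a separate test range (OOD)

fit_x = grid(*fit_domain)
slope, intercept = fit_line(fit_x, [true_formula(x) for x in fit_x])

def score(domain):
    """R^2 of the fitted line against the true formula on a domain."""
    xs = grid(*domain)
    return r2([true_formula(x) for x in xs],
              [slope * x + intercept for x in xs])

id_r2 = score(fit_domain)
ood_r2 = score(other_domain)
print(f"ID R^2: {id_r2:.3f}, OOD R^2: {ood_r2:.3f}")
```

In this toy setup the in-domain score comes out close to 1 while the out-of-domain score is negative: the line tracks the data where it was fitted but has not captured the quadratic rule that generated it.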
Fragility to Real-World Perturbations: A critical finding was the extreme brittleness of pre-trained models to common data variations. Even minor shifts in input scaling, input domain, or the addition of small amounts of output noise led to a significant degradation or catastrophic failure in performance. This highlights a major challenge for deploying these models in real-world scientific applications, where data is rarely perfectly clean or perfectly aligned with pre-training conditions.
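A robustness check along these lines can be sketched as a small harness that produces perturbed copies of a dataset. The helper names, the scale factor, the shift offset, and the noise level below are all hypothetical choices, not the settings used in the study; each resulting dataset would then be passed to the model under test.

```python
import random

def scale_inputs(xs, factor):
    """Rescale every input by a constant factor."""
    return [factor * x for x in xs]

def shift_inputs(xs, offset):
    """Move the input domain by a constant offset."""
    return [x + offset for x in xs]

def add_output_noise(ys, relative_sigma, rng):
    """Add Gaussian noise whose std is a fraction of the spread of ys."""
    mean_y = sum(ys) / len(ys)
    spread = (sum((y - mean_y) ** 2 for y in ys) / len(ys)) ** 0.5
    return [y + rng.gauss(0.0, relative_sigma * spread) for y in ys]

def perturbed_datasets(formula, xs, rng):
    """Build clean and perturbed (inputs, outputs) pairs for one formula."""
    clean_y = [formula(x) for x in xs]
    scaled_x = scale_inputs(xs, 10.0)   # assumed scale factor
    shifted_x = shift_inputs(xs, 5.0)   # assumed domain shift
    return {
        "clean": (xs, clean_y),
        "scaled_inputs": (scaled_x, [formula(x) for x in scaled_x]),
        "shifted_domain": (shifted_x, [formula(x) for x in shifted_x]),
        "noisy_outputs": (xs, add_output_noise(clean_y, 0.01, rng)),  # 1% noise
    }

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
datasets = perturbed_datasets(lambda x: x ** 2 + x, xs, rng)
```

Comparing a model's recovery rate or accuracy on `"clean"` against the other three entries gives a rough picture of how sensitive it is to each kind of perturbation.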
Complexity of Solutions: The study also noted that some transformer models, particularly E2E, tended to produce highly complex symbolic expressions (averaging 20-21 operators). While these complex formulas might achieve high in-domain accuracy, their interpretability is severely limited, contrasting with search-based methods that often find simpler, more understandable expressions.
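One simple way to quantify expression complexity is to count operator nodes in the formula's syntax tree. The sketch below does this for formulas written in Python syntax using the standard `ast` module. Counting binary operators, unary operators, and function calls is an assumed convention; the article does not specify how the study counts operators.

```python
import ast

def count_operators(expr):
    """Count binary ops, unary ops, and function calls in a Python-syntax formula."""
    tree = ast.parse(expr, mode="eval")
    counted = (ast.BinOp, ast.UnaryOp, ast.Call)
    return sum(isinstance(node, counted) for node in ast.walk(tree))

# Hypothetical example formulas, not outputs of any model in the study.
compact = "x1 * x2 + sin(x1)"
nested = "exp(-x1 ** 2)"

print(count_operators(compact))
print(count_operators(nested))
```

Under this convention a bare variable has complexity 0, and an expression averaging 20 or more operators would be several times larger than either example above.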
Why the Generalization Gap Exists
The researchers propose that current transformer-based approaches primarily function as “pattern matchers” rather than learning fundamental, compositional mathematical building blocks. They appear to memorize mappings between statistical properties of data points and symbolic structures. When a new problem’s data doesn’t closely match these memorized “snapshots,” performance suffers.
Even technical fixes like input standardization, while offering some robustness against domain shifts, do not fundamentally solve this problem. Standardization merely transforms the numerical range; it doesn’t help the model reason about structural relationships it hasn’t seen before.
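A small numeric illustration of what standardization does and does not change follows. It is a sketch under assumed values, not the study's analysis: z-scoring moves out-of-range inputs into a zero-mean, unit-variance range, but the relationship the model has to discover is still there, now written in terms of the standardized variable with constants that depend on the data.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score the values; also return the mean and std that were used."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma

# Hypothetical inputs from a range far from a model's pre-training domain.
xs = [20.0 + 0.1 * i for i in range(101)]
zs, mu, sigma = standardize(xs)

# The target y = x**2 is unchanged by standardizing the inputs...
ys = [x ** 2 for x in xs]
# ...but expressed in z it reads y = sigma^2 * z^2 + 2*mu*sigma * z + mu^2.
rewritten = [sigma ** 2 * z ** 2 + 2 * mu * sigma * z + mu ** 2 for z in zs]

print(f"z range: [{min(zs):.2f}, {max(zs):.2f}]")
```

The numeric range becomes familiar, yet the symbolic task is no simpler: the model still has to identify the structure of the formula, now with data-dependent constants attached.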
Implications for Practitioners and Future Directions
For scientists and engineers, the findings offer clear guidance:
- If your data is expected to be very similar to a model’s pre-training domain, transformer-based methods can be a fast way to generate initial hypotheses.
- However, for novel problems or data that deviates from the pre-training distribution, relying solely on these models carries significant risk. The generated formulas might seem accurate locally but fail to generalize, potentially leading to misleading conclusions. In such real-world scenarios, robust search-based methods remain the more reliable choice.
Future research must prioritize developing models that can pass rigorous out-of-distribution generalization tests. A promising avenue lies in creating hybrid approaches that combine the structured exploration of search-based algorithms with the rapid inference capabilities of transformers, ensuring that the learned “prior knowledge” is robust enough to guide the search effectively in diverse scenarios.
In conclusion, while transformer models have brought impressive scalability to symbolic regression, their practical utility is currently constrained by their reliance on the distribution of their pre-training data. Bridging this generalization gap is the central challenge for the next generation of symbolic regression tools.


