Evaluating LLMs as General Predictors for Small Tabular Datasets

TLDR: An empirical study investigates whether Large Language Models (LLMs) such as GPT-5 and Gemini-2.5-Flash can act as universal predictors on small tabular datasets for classification, regression, and clustering tasks using in-context learning. The findings show that LLMs achieve strong, competitive performance in classification, establishing practical zero-training baselines. However, their performance in regression with continuous outputs and in clustering is markedly worse than that of traditional machine learning models, indicating limitations in handling numerical precision and unsupervised data structures. The research suggests LLMs are valuable for rapid data exploration and classification but require further development for other tabular tasks.

Large Language Models (LLMs), initially developed for understanding and generating human language, are increasingly being explored for their potential beyond traditional text-based tasks. A recent empirical study delves into whether these powerful models can act as “universal predictors” for structured, non-linguistic data, specifically focusing on small tabular datasets.

The research, titled “Large Language Models as Universal Predictors? An Empirical Study on Small Tabular Datasets”, was conducted by Nikolaos Pavlidis, Vasilis Perifanis, Symeon Symeonidis, and Pavlos S. Efraimidis. Their work investigates the ability of state-of-the-art LLMs, including GPT-5, GPT-4o, GPT-o3, Gemini-2.5-Flash, and DeepSeek-R1, to perform classification, regression, and clustering on small-scale tabular data. The models rely on in-context learning (ICL): labeled example rows are placed directly in the prompt, and no explicit fine-tuning is performed.
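To make the in-context learning setup concrete, here is a minimal sketch of how a few labeled rows and one unlabeled query row might be turned into a prompt. The function name `build_icl_prompt`, the instruction wording, and the Iris-style feature values are illustrative assumptions, not the prompt template used in the study, and no model is actually called.

```python
def build_icl_prompt(examples, query, target):
    """Build a few-shot prompt from labeled rows plus one unlabeled query row.

    Hypothetical helper for illustration only; the study's real prompt
    template is not described in this article.
    """
    lines = [
        f"Predict '{target}' for the last row. Answer with the label only.",
        "",
    ]
    # Each labeled row becomes one demonstration line.
    for row in examples:
        feats = ", ".join(f"{k}: {v}" for k, v in row.items() if k != target)
        lines.append(f"{feats} -> {target}: {row[target]}")
    # The query row ends with an open slot for the model to complete.
    feats = ", ".join(f"{k}: {v}" for k, v in query.items())
    lines.append(f"{feats} -> {target}:")
    return "\n".join(lines)


# Made-up Iris-like rows, used only to show the shape of the prompt.
examples = [
    {"petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"},
]
query = {"petal_length": 1.5, "petal_width": 0.3}
prompt = build_icl_prompt(examples, query, "species")
print(prompt)
```

The resulting text would be sent to an LLM as-is, and the model's completion would be read back as the predicted label. No parameters are trained at any point, which is why the article calls this a zero-training baseline.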

LLMs Excel in Classification Tasks

One of the most significant findings of the study is the strong performance of LLMs in classification tasks, particularly when data availability is limited. The models demonstrated competitive accuracy and F1-scores, often rivaling or even surpassing established machine learning (ML) baselines like Logistic Regression, Random Forest, and gradient-boosting algorithms (LightGBM, CatBoost), as well as tabular foundation models (TFMs) like TabPFN and TabICL. For instance, on the Iris dataset, many LLMs achieved accuracies above 0.96, suggesting they can serve as effective zero-training baselines for clean, low-dimensional classification problems. In the Bankrupt dataset, LLMs like GPT-5 and DeepSeek even achieved perfect accuracy, matching the best traditional models.

Struggles with Regression and Clustering

In stark contrast to their classification prowess, LLMs showed significant limitations in regression tasks, which involve predicting continuous-valued outputs. The study revealed a substantial performance gap, with LLMs often producing much higher errors (Mean Absolute Error, Mean Squared Error) and significantly lower R² scores than traditional regression algorithms. Some R² scores were even negative, meaning the model performed worse than a predictor that always outputs the mean of the targets. This suggests that LLMs, in their current form, struggle with the numerical precision required for continuous value prediction, a challenge likely stemming from their autoregressive token generation process and the lack of an explicit objective function tied to regression error.
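The meaning of a negative R² can be checked with a small worked example. The sketch below implements the standard definition, R² = 1 − SS_res / SS_tot, in plain Python. The target values and predictions are toy numbers chosen for easy arithmetic, not results from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]          # toy targets, mean is 2.5
mean_pred = [2.5, 2.5, 2.5, 2.5]       # always predict the mean
bad_pred = [4.0, 3.0, 2.0, 1.0]        # predictions in reversed order

print(r2_score(y_true, mean_pred))             # 0.0
print(r2_score(y_true, bad_pred))              # -3.0
print(mean_absolute_error(y_true, bad_pred))   # 2.0
```

The mean predictor scores exactly 0 by construction, so any model with R² below 0 has a larger squared error than simply guessing the average.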

Similarly, LLMs exhibited mixed and generally poor performance in clustering tasks, which involve grouping similar data points without predefined labels. Standard clustering algorithms consistently outperformed clustering based on LLM-derived embeddings. While some LLM configurations showed moderate success on simpler datasets like Mall, this did not generalize to more complex scenarios like Wholesale or Moon, where LLMs often produced negative silhouette scores, indicating that points were on average closer to a neighboring cluster than to their own. The researchers attribute this to LLM representations being optimized for general semantic similarity rather than the specific variance-covariance structures that clustering algorithms exploit.
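To illustrate what a negative silhouette score signals, the following sketch computes the mean silhouette coefficient for one-dimensional points in plain Python. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the score is (b − a) / max(a, b). The four points and both labelings are toy values, and treating single-member clusters as scoring 0 is a simplifying convention assumed here.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points (illustrative sketch)."""
    n = len(points)
    scores = []
    for i in range(n):
        # Group distances from point i to every other point by cluster label.
        by_cluster = {}
        for j in range(n):
            if j != i:
                dist = abs(points[i] - points[j])
                by_cluster.setdefault(labels[j], []).append(dist)
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            # Singleton cluster or only one cluster overall: score 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / n


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])  # neighbors grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # distant points grouped together
print(good, bad)  # good is close to 1, bad is negative
```

The second labeling pairs each point with one that is far away, so every point sits closer to the other cluster than to its own and the score drops below zero.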

Insights from Ablation Studies

The study also included ablation experiments to understand factors influencing LLM performance. Varying the fraction of training data showed that while LLMs can achieve high performance with few examples, more data did not always lead to consistent improvements, sometimes even causing performance degradation due to potential overfitting to noisy patterns. Tabular Foundation Models, particularly TabPFN, demonstrated greater stability across different data sizes. Additionally, the choice of serialization format for tabular data within the prompt was found to influence LLM performance, with some models like DeepSeek showing better results when data was presented in a structured JSON format compared to CSV or key:value pairs.
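The three serialization formats mentioned above can be sketched as follows. The helper names, the exact layouts (one JSON object per line, a CSV header row, comma-separated key: value pairs), and the sample row are assumptions for illustration. The article does not specify how the study formatted each variant.

```python
import csv
import io
import json


def to_json(rows):
    """One JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def to_csv(rows):
    """Header line followed by one comma-separated line per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().rstrip("\n")


def to_key_value(rows):
    """One 'key: value, key: value' line per row."""
    return "\n".join(
        ", ".join(f"{k}: {v}" for k, v in row.items()) for row in rows
    )


# Made-up row, used only to compare the three layouts.
rows = [{"sepal_length": 5.1, "petal_length": 1.4, "label": "setosa"}]
for serialize in (to_json, to_csv, to_key_value):
    print(serialize(rows))
    print("---")
```

All three carry the same information, but they differ in token count and in how explicitly each value is tied to its column name, which is one plausible reason a model's accuracy could shift between them.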

Conclusion: A Promising Start with Clear Limitations

The research concludes that while LLMs hold considerable promise as function approximators for classification tasks on tabular data, offering a rapid and low-overhead alternative to traditional ML pipelines, their current capabilities are not universal. They demonstrate clear strengths in discrete prediction problems, making them valuable for business intelligence and exploratory analytics where quick insights are needed. However, significant limitations remain in handling the numerical precision required for regression and the unsupervised pattern recognition for clustering. Further research is essential to address these shortcomings and unlock the full potential of LLMs across all types of structured data tasks.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
