Evaluating LLMs as General Predictors for Small Tabular Datasets

TLDR: An empirical study investigates whether Large Language Models (LLMs) such as GPT-5 and Gemini-2.5-Flash can act as universal predictors on small tabular datasets for classification, regression, and clustering tasks using in-context learning. The findings show that LLMs achieve strong, competitive performance in classification, establishing practical zero-training baselines. However, their performance in regression with continuous outputs and in clustering is markedly worse than that of traditional machine learning models, indicating limitations in handling numerical precision and unsupervised data structures. The research suggests LLMs are valuable for rapid data exploration and classification but require further development for other tabular tasks.

Large Language Models (LLMs), initially developed for understanding and generating human language, are increasingly being explored for their potential beyond traditional text-based tasks. A recent empirical study delves into whether these powerful models can act as “universal predictors” for structured, non-linguistic data, specifically focusing on small tabular datasets.

The research, titled “Large Language Models as Universal Predictors? An Empirical Study on Small Tabular Datasets,” was conducted by Nikolaos Pavlidis, Vasilis Perifanis, Symeon Symeonidis, and Pavlos S. Efraimidis. Their work investigates the ability of state-of-the-art LLMs, including GPT-5, GPT-4o, GPT-o3, Gemini-2.5-Flash, and DeepSeek-R1, to perform predictive tasks on small-scale tabular data for classification, regression, and clustering, leveraging their in-context learning (ICL) capabilities without explicit fine-tuning.
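
To make the in-context learning setup concrete, here is a minimal Python sketch of how labeled rows and one unlabeled query row could be packed into a single prompt. The function name `build_icl_prompt`, the instruction wording, the arrow layout, and the sample values are all assumptions made for illustration; the article does not describe the study's actual prompt template.

```python
from typing import Mapping, Sequence


def build_icl_prompt(
    examples: Sequence[tuple[Mapping[str, object], str]],
    query: Mapping[str, object],
    target_name: str,
) -> str:
    """Assemble a few-shot prompt: labeled rows first, then the unlabeled query row."""
    # Hypothetical instruction text; a real study would tune this wording.
    lines = [f"Predict {target_name} for the final row. Answer with the label only.", ""]
    for features, label in examples:
        row = ", ".join(f"{name}: {value}" for name, value in features.items())
        lines.append(f"{row} -> {target_name}: {label}")
    # The query row ends with an open slot that the model is expected to fill.
    query_row = ", ".join(f"{name}: {value}" for name, value in query.items())
    lines.append(f"{query_row} -> {target_name}:")
    return "\n".join(lines)


# Made-up rows in the style of the Iris dataset, purely for demonstration.
examples = [
    ({"sepal_length": 5.1, "petal_length": 1.4}, "setosa"),
    ({"sepal_length": 6.3, "petal_length": 4.9}, "versicolor"),
]
query = {"sepal_length": 5.0, "petal_length": 1.5}
prompt = build_icl_prompt(examples, query, "species")
print(prompt)
```

The resulting string would be sent to the model as-is; no weights are updated, which is what "without explicit fine-tuning" means in practice.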

LLMs Excel in Classification Tasks

One of the most significant findings of the study is the strong performance of LLMs in classification tasks, particularly when data availability is limited. The models demonstrated competitive accuracy and F1-scores, often rivaling or even surpassing established machine learning (ML) baselines like Logistic Regression, Random Forest, and gradient-boosting algorithms (LightGBM, CatBoost), as well as tabular foundation models (TFMs) like TabPFN and TabICL. For instance, on the Iris dataset, many LLMs achieved accuracies above 0.96, suggesting they can serve as effective zero-training baselines for clean, low-dimensional classification problems. In the Bankrupt dataset, LLMs like GPT-5 and DeepSeek even achieved perfect accuracy, matching the best traditional models.
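
For readers unfamiliar with the two metrics, the following Python sketch computes accuracy and an F1-score from predicted labels. It assumes macro-averaged F1 (an unweighted mean of per-class F1); the article does not say which averaging the study used, and the labels below are toy values, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "c", "c", "c"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

Because an LLM returns free text, a real evaluation would also need a step that maps each reply onto one of the valid class labels before these functions are applied.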

Struggles with Regression and Clustering

In stark contrast to their classification prowess, LLMs showed significant limitations in regression tasks, which involve predicting continuous-valued outputs. The study revealed a substantial performance gap, with LLMs often producing much higher error rates (Mean Absolute Error, Mean Squared Error) and significantly lower R² scores (some even negative, indicating performance worse than a simple mean predictor) compared to traditional regression algorithms. This suggests that LLMs, in their current form, struggle with the numerical precision required for continuous value prediction, a challenge likely stemming from their autoregressive token generation process and lack of an explicit objective function tied to regression error.
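
Why a negative score means "worse than a simple mean predictor" follows from the definition: the coefficient of determination is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean of the targets. The Python sketch below works through this on toy numbers that are not taken from the study.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    mean_true = sum(y_true) / n
    # Squared error of the baseline that always predicts the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2


# Toy targets and two sets of predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean
poor_model = [4.0, 1.0, 5.0, 2.0]      # errors larger than the baseline's

_, _, r2_baseline = regression_metrics(y_true, mean_baseline)
mae_poor, mse_poor, r2_poor = regression_metrics(y_true, poor_model)
print(r2_baseline, mae_poor, mse_poor, r2_poor)
```

The mean baseline scores exactly zero by construction, while the second set of predictions has a squared error 3.6 times larger than the baseline's, so its score is about -2.6.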

Similarly, LLMs exhibited mixed and generally poor performance in clustering tasks, which involve grouping similar data points without predefined labels. Standard clustering algorithms consistently outperformed LLM-derived embeddings. While some LLM configurations showed moderate success on simpler datasets like Mall, this did not generalize to more complex scenarios like Wholesale or Moon, where LLMs often produced negative silhouette scores, indicating poor cluster separation. The researchers attribute this to LLM representations being optimized for general semantic similarity rather than the specific variance-covariance structures that clustering algorithms exploit.
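
The silhouette score referred to here compares, for each point, its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). A negative average means points sit closer to a neighboring cluster than to their own. The Python sketch below is a plain implementation with Euclidean distance on toy 2-D points; it is not the study's evaluation code, and the points are not LLM outputs.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # The point's distance to itself is 0, so divide by the other members only.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two well-separated pairs of points (toy data).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_score(points, [0, 1, 0, 1])   # groups distant points together
print(good, bad)
```

With these points, the sensible assignment scores close to 1, while the assignment that pairs distant points yields a negative value, which is the pattern the study reports for LLMs on the harder datasets.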

Insights from Ablation Studies

The study also included ablation experiments to understand the factors influencing LLM performance. Varying the fraction of training data showed that, while LLMs can achieve high performance with few examples, adding more data did not consistently improve results and sometimes degraded performance, possibly because of overfitting to noisy patterns. Tabular Foundation Models, particularly TabPFN, demonstrated greater stability across different data sizes. Additionally, the choice of serialization format for tabular data within the prompt was found to influence LLM performance, with some models like DeepSeek showing better results when data was presented in a structured JSON format rather than as CSV or key:value pairs.
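
To show what the three serialization formats might look like inside a prompt, the Python sketch below renders the same rows as JSON, CSV, and key:value text using only the standard library. The helper names, column names, values, and the exact key:value layout are assumptions for illustration; the article does not reproduce the study's templates.

```python
import csv
import io
import json


def to_json(rows):
    """Serialize rows as a JSON array of objects."""
    return json.dumps(rows)


def to_csv(rows):
    """Serialize rows as CSV with a header line taken from the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_key_value(rows):
    """Serialize each row as 'name: value' pairs on one line (assumed layout)."""
    return "\n".join(
        ", ".join(f"{name}: {value}" for name, value in row.items()) for row in rows
    )


# Made-up rows, for demonstration only.
rows = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": 51, "income": 31000, "label": "no"},
]
for serialize in (to_json, to_csv, to_key_value):
    print(f"--- {serialize.__name__} ---")
    print(serialize(rows))
```

The three outputs carry identical information but differ in how often column names are repeated and how values are delimited, which is one plausible reason a model's results could vary with the format.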

Conclusion: A Promising Start with Clear Limitations

The research concludes that while LLMs hold considerable promise as function approximators for classification tasks on tabular data, offering a rapid and low-overhead alternative to traditional ML pipelines, their current capabilities are not universal. They demonstrate clear strengths in discrete prediction problems, making them valuable for business intelligence and exploratory analytics where quick insights are needed. However, significant limitations remain in handling the numerical precision required for regression and the unsupervised pattern recognition for clustering. Further research is essential to address these shortcomings and unlock the full potential of LLMs across all types of structured data tasks.
