TLDR: A new study evaluates state-of-the-art large language models (LLMs) on real-world forecasting questions from Metaculus. While frontier LLMs now surpass the accuracy of the average human crowd, they still significantly underperform human “superforecasters.” The research highlights LLMs’ improvements in predicting future events but also their current limitations compared to top human experts, suggesting a path for future development in AI forecasting.
Large language models (LLMs) have shown impressive abilities across many tasks, but their capacity to forecast future events has remained less explored. A recent study delves into this area, evaluating how well state-of-the-art LLMs perform on real-world forecasting questions compared to human experts.
The research, detailed in a paper by Janna Lu, assessed 464 forecasting questions sourced from Metaculus, a popular prediction platform. The performance of frontier LLMs was then measured against both the general human crowd and a select group of human ‘superforecasters’ – individuals with a proven track record of highly accurate predictions.
The Challenge of Forecasting for AI
Forecasting presents a unique challenge for AI. Unlike static benchmarks that can be ‘saturated’ through vast training data and memorization, predicting future events requires genuine out-of-distribution generalization. Models cannot simply recall information they’ve been trained on, because the events in question haven’t happened yet. Older LLMs from 2023, for instance, often performed no better than random chance and were frequently overconfident, a tendency that forecasting accuracy metrics like the Brier score penalize heavily.
The study highlights that while newer models excel at complex tasks like competitive coding and advanced mathematics, they can still exhibit surprising blind spots, such as difficulty describing how to machine a simple part. For forecasting, LLMs must generalize from their training data to make predictions about events that occur after their knowledge cutoff, much like a human forecaster.
How the Study Was Conducted
To evaluate the LLMs, the researcher used a dataset of 334 questions from Metaculus, collected between July and September 2024, along with an additional 130 questions from October to December 2024. This timeframe ensured that all events occurred after the models’ training cutoff dates, allowing for true out-of-sample testing.
Twelve different LLMs were tested, including models from OpenAI (GPT-4o, GPT-4.1, o3, o3-pro, o4-mini), Anthropic (Claude 3.5 Sonnet, Claude 3.6 Sonnet), Qwen (Qwen3-32B, Qwen3-235B-A22B), and Deepseek (Deepseek v3, Deepseek R1). These models were fed relevant news articles summarized by another LLM (Llama 3.1-72B) to provide context without data leakage.
Each model made five predictions for each question, and the average was used to calculate the Brier score, a standard measure of prediction accuracy where a score of 0 is perfect and 0.25 is equivalent to random guessing. The study also explored different prompting methods, including a direct prompt and a ‘narrative prompt’ (where the model was asked to write a fictional script about superforecasters discussing the event).
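The scoring procedure above is simple enough to sketch directly. The snippet below is a minimal illustration (not the paper's code): the Brier score is the squared difference between a forecast probability and the 0/1 outcome, averaged over questions, and the study's five-runs-per-question protocol is mimicked by averaging the runs before scoring. All the probabilities and outcomes here are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    0.0 is a perfect score; always guessing 0.5 yields 0.25."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Five model runs for one question, averaged before scoring (as in the study).
runs = [0.70, 0.75, 0.68, 0.72, 0.80]          # hypothetical probabilities
avg_forecast = sum(runs) / len(runs)           # 0.73

# Hypothetical averaged forecasts and resolved outcomes for three questions.
forecasts = [avg_forecast, 0.20, 0.90]
outcomes = [1, 0, 1]
print(round(brier_score(forecasts, outcomes), 3))  # → 0.041
```

Note how the metric punishes confident misses: a forecast of 0.9 on a question that resolves ‘no’ contributes 0.81 on its own, far worse than the 0.25 of a hedged 0.5 guess.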
Key Findings: LLMs Surpass the Crowd, But Not Superforecasters
The results show significant progress for LLMs in forecasting. The ‘o3’ model achieved the best Brier score among the LLMs at 0.1352. This score is better than the average human crowd forecasting score of 0.149 reported in previous research, indicating that frontier LLMs can now outperform the general public in predicting future events.
However, the gap between LLMs and human superforecasters remains substantial. The superforecasters in the study achieved an impressive Brier score of 0.0225, which is significantly lower (and thus more accurate) than even the best-performing LLM. This suggests that while LLMs are improving rapidly, they still lack some critical components of expert human judgment, such as the ability to synthesize deep domain expertise with uncertainty or to correct for their own biases.
The study also revealed interesting patterns in LLM performance across different categories. Models generally performed better on questions related to politics and governance compared to economics and business. This might be because economic questions often involve precise numbers, which LLMs tend to struggle with. Additionally, the research found that using a narrative prompt, which some believed might unlock latent knowledge, actually led to worse forecasting accuracy compared to direct prompting.
The Future of AI in Forecasting
Despite the current gap, the rapid improvement of LLMs suggests a promising future. The paper estimates that with continued linear improvement, LLMs could reach superforecaster levels for these types of questions before May 2027. The ability of LLMs to generalize beyond their training data and make accurate predictions about future events could be highly impactful, potentially increasing the accuracy and liquidity of forecasting sites and prediction market platforms.
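An extrapolation of this kind amounts to fitting a line through (date, Brier score) points and solving for where it crosses the superforecaster level. The sketch below illustrates the arithmetic only: the earlier data point is a hypothetical placeholder, and only the o3 score (0.1352) and superforecaster score (0.0225) come from the study, so the printed date should not be read as the paper's estimate.

```python
def crossing_time(t0, s0, t1, s1, target):
    """Given scores s0 at time t0 and s1 at time t1 (fractional years),
    return when a straight line through both points reaches `target`."""
    slope = (s1 - s0) / (t1 - t0)
    return t1 + (target - s1) / slope

# Hypothetical: a chance-level model (0.25) in mid-2023; o3 at 0.1352
# in mid-2025; superforecaster target of 0.0225 (the latter two from the study).
when = crossing_time(2023.5, 0.25, 2025.5, 0.1352, 0.0225)
print(round(when, 1))
```

The estimate is sensitive to which earlier model anchors the trend and to the assumption that improvement stays linear rather than flattening as scores approach zero.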
Future research could explore deploying LLMs as trading bots on platforms like Polymarket or Manifold Markets to see if they are as good at betting as they are at forecasting. Further investigation into domain-specific performance variations and new post-training methods could also help close the remaining gap with human experts. As LLMs continue to evolve, forecasting benchmarks offer a valuable way to measure progress toward AI systems that can truly understand and reason about an uncertain world. You can read the full research paper here: Evaluating LLMs on Real-World Forecasting Against Human Superforecasters.