TLDR: This research paper proposes a framework for achieving superforecaster-level event prediction using Large Language Models (LLMs) through massive training. It addresses key challenges like data noisiness, knowledge cut-off, and simple reward structures, offering solutions such as hypothetical Bayesian networks, counterfactual events, and auxiliary rewards. The paper also advocates for expanding training data beyond traditional prediction markets to include public and web-crawled datasets. It discusses the significant societal impacts of advanced forecasting AI, including expanded predictive scope and integration into AI agents, while also highlighting crucial challenges like ensuring reliability and mitigating risks such as self-fulfilling prophecies and model bias.
The ability to predict future events, from economic shifts to technological breakthroughs, holds immense value for individuals and society. Traditionally, this has been the domain of human experts, known as superforecasters, or collective intelligence gathered through prediction markets. However, a new research paper explores how Large Language Models (LLMs) are rapidly advancing in this complex field, proposing a path to achieve superforecaster-level performance through massive training.
Initially, there was significant optimism about LLMs’ forecasting capabilities, with some early studies suggesting they were nearing human expert levels. However, these reports drew criticism for methodological flaws: small data samples, questions whose answers predated the models’ knowledge cut-off (so the answer could simply be recalled rather than forecast), and contamination from information about events that had already resolved. These issues led to skepticism within the forecasting community.
Despite these early setbacks, recent advancements in LLM technology are painting a more positive picture. Newer models like GPT-4o and Claude-3.5-Sonnet are showing steady improvements, narrowing the gap with top human forecasters. Reinforcement learning (RL) has also demonstrated its ability to enhance forecasting accuracy. Furthermore, the emergence of advanced reasoning models with tool-use capabilities, often referred to as ‘Deep Research’ models, suggests that the underlying technology for significant performance gains is already in place.
Based on these promising trends, the paper argues that the time is ripe for large-scale training of LLMs specifically for event forecasting. This involves tackling unique challenges in training methodologies and expanding the scope of data acquisition.
Overcoming Training Hurdles
Training LLMs for event forecasting presents distinct difficulties. One major challenge is the ‘noisiness and sparsity’ of event outcomes: unlike labels in a standard classification task, future outcomes are inherently uncertain, and closely comparable past events are often scarce. To address this, the paper suggests using a ‘hypothetical event Bayesian network’ to model these uncertainties. It also proposes drawing on multiple reward signals during training, including the actual outcome of an event, market predictions from platforms like Polymarket, or intermediate predictions made by the model itself at later time points, which can provide a more refined signal than a single binary outcome.
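As a concrete illustration of how such mixed reward signals might be combined, here is a minimal Python sketch that scores a forecast with a Brier-style reward against whichever targets happen to be available (a resolved outcome, a market probability, or the model’s own later estimate). The function names, weights, and blending scheme are illustrative assumptions, not a formula from the paper.

```python
from typing import Optional

def brier_reward(pred: float, target: float) -> float:
    """Brier-style reward: higher is better, max 1.0 when pred == target."""
    return 1.0 - (pred - target) ** 2

def blended_reward(model_prob: float,
                   outcome: Optional[float] = None,
                   market_prob: Optional[float] = None,
                   later_self_prob: Optional[float] = None,
                   weights=(0.6, 0.3, 0.1)) -> float:
    """Combine whichever reward signals are available for a question.

    outcome         -- 1.0 / 0.0 once the event has resolved, else None
    market_prob     -- probability quoted by a prediction market, else None
    later_self_prob -- the model's own estimate at a later time point, else None
    """
    signals = [outcome, market_prob, later_self_prob]
    total, norm = 0.0, 0.0
    for weight, target in zip(weights, signals):
        if target is not None:
            total += weight * brier_reward(model_prob, target)
            norm += weight
    return total / norm if norm > 0 else 0.0
```

For example, `blended_reward(0.70, outcome=1.0, market_prob=0.65)` scores a 70% forecast against both the resolved outcome and the market price, while simply ignoring the missing self-estimate.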
Another significant hurdle is the ‘knowledge cut-off problem.’ LLMs are trained on vast amounts of data up to a certain date. If a forecasting question relates to an event that occurred before this cut-off, the model might simply recall the answer rather than performing genuine search and reasoning. This limits the usable training data. Solutions include training on events that LLMs don’t easily memorize, such as comparative outcomes between two items (e.g., which of two research ideas performed better). The paper also introduces the concept of ‘counterfactual events,’ where models are trained on scenarios with outcomes opposite to what actually happened, forcing them to reason based on retrieved information rather than memorization.
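To make the counterfactual idea concrete, here is a small sketch of how training examples might be flipped so that recalling the real-world outcome is no longer rewarded. The data fields and helpers are hypothetical; the paper describes the technique at a higher level, and in practice the retrieved context would also need to be edited to describe the counterfactual world consistently.

```python
from dataclasses import dataclass, replace
import random

@dataclass
class ForecastExample:
    question: str            # e.g. "Will candidate X win the election?"
    retrieved_context: str   # documents the model must reason over
    outcome: int             # 1 = happened, 0 = did not happen

def make_counterfactual(example: ForecastExample) -> ForecastExample:
    """Flip the resolution so the memorized real-world answer is wrong.

    The training signal then only makes sense if the model reasons from
    retrieved_context (which would also be edited to match the flipped
    world) rather than recalling the actual historical outcome.
    """
    return replace(example, outcome=1 - example.outcome)

def build_batch(examples: list[ForecastExample], cf_ratio: float = 0.5):
    """Mix real and counterfactual examples so pure memorization is not rewarded."""
    return [make_counterfactual(ex) if random.random() < cf_ratio else ex
            for ex in examples]
```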
Finally, the ‘simple reward structure problem’ arises because LLMs can sometimes earn high rewards by making extreme predictions (0% or 100% certainty) without sound underlying reasoning. To counter this, the paper advocates ‘auxiliary reward signals.’ These could involve evaluating the quality of the model’s reasoning process itself, or asking the model to predict related ‘subquestions’ that share causal factors with the main event, encouraging a more coherent and robust understanding.
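One simple way such an auxiliary signal could be instantiated is a coherence check between the main forecast and its subquestions: if the main event logically requires a precondition, the model’s probability for the main event should not exceed its probability for that precondition. The sketch below is an assumed instantiation of that idea, not the paper’s specific reward.

```python
def coherence_penalty(main_prob: float,
                      subquestion_probs: dict[str, float],
                      implications: list[tuple[str, str]]) -> float:
    """Illustrative auxiliary penalty for probabilistically incoherent forecasts.

    implications lists pairs (a, b) meaning "a happening requires b",
    so a coherent forecaster should satisfy P(a) <= P(b).
    """
    probs = {"main": main_prob, **subquestion_probs}
    violation = 0.0
    for a, b in implications:
        violation += max(0.0, probs[a] - probs[b])
    return violation

# Example: the main event requires a precondition the model rated less likely,
# so the extreme main-event forecast is penalized.
penalty = coherence_penalty(
    main_prob=0.95,
    subquestion_probs={"precondition": 0.40},
    implications=[("main", "precondition")],
)
print(penalty)  # 0.55
```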
Expanding Data Horizons
To enable large-scale training, the paper emphasizes the need for more diverse and extensive datasets. While prediction markets have been a primary source, the authors propose aggressively utilizing three main categories of data:
- Market Datasets: Data from prediction markets like Polymarket and Metaculus. The paper notes a trend towards using larger volumes of this data, even with relaxed quality filters, suggesting that quantity can sometimes outweigh strict quality criteria for performance improvement.
- Public Datasets: Structured data from public databases, such as economic indicators (GDP, FRED, DBnomics), geopolitical conflict data (ACLED), or health statistics (WHO, CDC). These sources offer a vast, untapped potential for training, though careful management of inter-event and temporal correlations is necessary to ensure diverse learning (see the sketch after this list).
- Crawling Datasets: Unstructured data collected and processed from the web, including Wikipedia articles, news reports, and academic papers (e.g., arXiv). The challenge here lies in automatically generating high-quality questions and answers from these sources and ensuring the reliability of automated pipelines.
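As an example of the public-dataset route, the sketch below turns a FRED economic time series into resolved yes/no forecasting questions. It assumes the publicly documented FRED observations endpoint and a free API key; the question template and field names are our own illustration, not the paper’s pipeline.

```python
import requests

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"

def fred_observations(series_id: str, api_key: str) -> list[dict]:
    """Fetch raw observations for one FRED series (e.g. 'GDP')."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    resp = requests.get(FRED_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["observations"]

def to_questions(series_id: str, observations: list[dict]) -> list[dict]:
    """Turn consecutive observations into resolved yes/no forecasting questions."""
    questions = []
    for prev, curr in zip(observations, observations[1:]):
        try:
            went_up = float(curr["value"]) > float(prev["value"])
        except ValueError:  # FRED marks missing values with "."
            continue
        questions.append({
            "question": f"Will {series_id} be higher on {curr['date']} than on {prev['date']}?",
            "cutoff_date": prev["date"],   # the model may only use information up to here
            "outcome": int(went_up),
        })
    return questions
```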
The paper also highlights that these large-scale data collection methods can significantly improve dynamic benchmarks, allowing for faster and more accurate evaluation of forecasting models.
Societal Implications and Future Outlook
The advancement of event forecasting AI holds profound societal implications. It could vastly expand the number of questions that can be answered, including personalized or private queries unsuitable for public markets. AI could also tackle questions without clearly defined resolution conditions by breaking them down into measurable subquestions or providing estimates based on its learned predictive capabilities.
Furthermore, integrating predictive intelligence into general AI agents could transform fields like scientific discovery, allowing AI scientists to evaluate experimental success likelihoods before allocating resources. This could lead to more principled probabilistic reasoning in AI systems, moving beyond deterministic logic.
However, the paper also addresses critical challenges and risks. Ensuring the reliability of AI predictions and effectively communicating this reliability to users is paramount. Users need interfaces that allow them to assess the AI’s performance history and compare its insights with their own. Potential risks include ‘self-fulfilling prophecies,’ where an AI’s prediction influences events in a way that makes it come true (e.g., a widely trusted recession forecast helping to trigger a recession), malicious attacks designed to manipulate AI predictions, excessive user confidence in inaccurate forecasts, and the amplification of biases already present in training data.
In conclusion, this research paper presents a compelling argument for investing in large-scale training of LLMs for event forecasting. By addressing unique training challenges and leveraging vast, diverse datasets, AI could soon reach superforecaster-level performance, offering unprecedented predictive intelligence to society. The full paper can be accessed here: Advancing Event Forecasting through Massive Training of Large Language Models.


