New Benchmark Reveals LLMs Struggle with Real-World Causal Reasoning

TLDR: A new benchmark, built from scientifically validated causal relationships in economics and finance journals, reveals significant limitations in state-of-the-art LLMs’ causal reasoning abilities. Models achieved low accuracy (best at 57.6%), with larger models not consistently outperforming smaller ones. While allowing access to pre-trained knowledge improved performance on simple tasks, complex reasoning still proved challenging, underscoring a critical gap in LLMs’ genuine causal understanding for high-stakes applications.

Large Language Models (LLMs) are increasingly integrated into high-stakes applications, from healthcare to finance. However, a fundamental question remains: can these models truly grasp cause-and-effect relationships, or do they merely excel at pattern matching? A recent study introduces a groundbreaking benchmark designed to rigorously assess LLMs’ causal reasoning abilities, drawing on relationships scientifically validated in top-tier economics and finance journals.

Addressing Limitations in Current LLM Evaluation

Existing benchmarks for evaluating LLMs’ causal reasoning often suffer from critical flaws, such as relying on synthetic data, covering narrow domains, or oversimplifying the concept of causality. These shortcomings make it difficult to ascertain whether LLMs are engaging in genuine causal inference or simply reproducing patterns observed during their extensive training. To bridge this gap, researchers have developed a novel benchmark rooted in real-world causal relationships, verified through stringent scientific methodologies like instrumental variables, difference-in-differences, and regression discontinuity designs.

Constructing a Scientifically Validated Benchmark

The benchmark’s creation involved a systematic process of extracting causal relationships from nearly 15,000 papers published between 2000 and 2025 across eight leading economics and finance journals. The team utilized GPT-5-mini to extract candidate causal triplets (cause, direction, effect) from paper abstracts, repeating the extraction five times for each paper. A robust consensus-based filtering mechanism was then applied, retaining only those relationships that appeared consistently across at least four out of five extractions and were supported by rigorous identification strategies. This meticulous process yielded 11,869 validated causal relationships.
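The voting step of this filter can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `normalize` helper, the exact-match comparison of triplets, and the placeholder variable names are all assumptions, and the additional check for a rigorous identification strategy is not modeled here.

```python
from collections import Counter

# Hypothetical representation: a triplet is (cause, direction, effect).
Triplet = tuple[str, str, str]


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and trim each field so trivially different spellings match."""
    cause, direction, effect = triplet
    return (cause.strip().lower(), direction.strip().lower(), effect.strip().lower())


def consensus_filter(runs: list[list[Triplet]], min_votes: int = 4) -> list[Triplet]:
    """Keep triplets that appear in at least `min_votes` of the extraction runs."""
    votes: Counter = Counter()
    for run in runs:
        # A set ensures each triplet counts at most once per run.
        votes.update({normalize(t) for t in run})
    return sorted(t for t, n in votes.items() if n >= min_votes)


# Made-up placeholder triplets for one paper, extracted five times.
a = ("x1", "increase", "y1")   # appears in 5 of 5 runs
b = ("x2", "decrease", "y2")   # appears in 4 of 5 runs
c = ("x3", "increase", "y3")   # appears in 3 of 5 runs
runs = [[a, b, c], [a, b, c], [a, b, c], [a, b], [a]]

kept = consensus_filter(runs)
print(kept)
```

With the four-of-five threshold described above, `a` and `b` survive and `c` is dropped. Real extractions would likely need fuzzier matching than exact string equality, which is why the normalization step is flagged as an assumption.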

To ensure the quality and reliability of these extracted relations, a systematic human assessment was conducted. Two independent annotators reviewed a random sample of 104 causal relations, evaluating the correct identification of entities (cause and effect), the accuracy of the causal direction, and whether the relation represented a genuine causal link rather than a spurious correlation. The high level of agreement among annotators underscored the robustness of the extraction pipeline.
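The article does not say which agreement statistic the study reports. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes that metric and uses made-up labels purely for illustration.

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which the annotators match.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Made-up judgments on whether four relations are genuinely causal.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "yes", "no", "yes"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5.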

The benchmark’s scope extends beyond traditional economics, encompassing a diverse range of topics including health, environmental economics, technological change, legal systems, and cultural economics, reflecting the interdisciplinary nature of real-world causal inference questions.

Diverse Tasks for Comprehensive Causal Reasoning Evaluation

To thoroughly probe LLMs’ multi-faceted causal reasoning capabilities, the benchmark incorporates five distinct question types, each designed to test different aspects of understanding:

  • Type 1: Causal Relation Identification (X-Y): This fundamental task assesses whether a given cause-effect triplet is true, evaluating the most basic ability to identify valid causal claims.
  • Type 2: Effect Variation (X-manyY): This type evaluates understanding of causal spillover effects, asking if a given cause leads to a different specified effect.
  • Type 3: Cause Variation (manyX-Y): Designed to test understanding of multiple causality, this task asks if a different specified cause leads to the same effect.
  • Type 4: Context-based Causal Inference (X-Y, X’-Y’): This task requires contextual causal reasoning and multi-hop inference, assessing whether a second causal claim is valid given a related first claim from the same research context.
  • Type 5: Causal Direction Identification (X-Y-direction): Unlike the binary true/false questions, this type asks models to predict the direction (increase, decrease, or none) of a causal effect between two given variables, crucial for practical decision-making.

After an initial filtering process to remove trivially easy questions that could be solved by simple pattern matching, the final benchmark comprises 40,379 evaluation items.
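One way to picture the resulting items is as a small typed record, with binary labels for Types 1 to 4 and a three-way label for Type 5. The class names, field names, and question wording below are hypothetical; the article does not show the benchmark's actual schema or prompts.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    RELATION = 1          # Type 1: X-Y
    EFFECT_VARIATION = 2  # Type 2: X-manyY
    CAUSE_VARIATION = 3   # Type 3: manyX-Y
    CONTEXT = 4           # Type 4: X-Y, X'-Y'
    DIRECTION = 5         # Type 5: X-Y-direction


BINARY_LABELS = ("true", "false")
DIRECTION_LABELS = ("increase", "decrease", "none")


def allowed_labels(qtype: QuestionType) -> tuple[str, ...]:
    """Type 5 is three-way; every other type is a true/false question."""
    return DIRECTION_LABELS if qtype is QuestionType.DIRECTION else BINARY_LABELS


@dataclass(frozen=True)
class EvalItem:
    qtype: QuestionType
    question: str
    gold: str

    def __post_init__(self) -> None:
        if self.gold not in allowed_labels(self.qtype):
            raise ValueError(f"{self.gold!r} is not a valid label for {self.qtype.name}")


def make_relation_item(cause: str, direction: str, effect: str, is_valid: bool) -> EvalItem:
    """Type 1: ask whether a full (cause, direction, effect) claim holds."""
    question = (f"Claim: cause='{cause}', direction='{direction}', effect='{effect}'. "
                "Is this causal claim true or false?")
    return EvalItem(QuestionType.RELATION, question, "true" if is_valid else "false")


def make_direction_item(cause: str, effect: str, direction: str) -> EvalItem:
    """Type 5: ask for the direction of the effect of `cause` on `effect`."""
    question = (f"What is the effect of '{cause}' on '{effect}'? "
                "Answer with increase, decrease, or none.")
    return EvalItem(QuestionType.DIRECTION, question, direction)


item = make_direction_item("x1", "y1", "decrease")
print(item.qtype.name, item.gold)
```

Validating the gold label against the question type at construction time catches a mislabeled item early, for example a direction label attached to a true/false question.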

LLMs Show Significant Limitations in Causal Understanding

The evaluation involved eight state-of-the-art LLMs, including prominent reasoning models like GPT-5 and DeepSeek-R1-0528, as well as non-reasoning models such as Llama-3.3-70B and Qwen3-32B. The results revealed substantial limitations across all models in causal reasoning within economics and finance domains. The best-performing model, Qwen3-32B, achieved an overall accuracy of only 57.6%, highlighting significant room for improvement.

Notably, the study found that neither model scale nor recency consistently translated into better performance. GPT-5, for instance, recorded the lowest accuracy at 29.4%, in the same range as the smaller GPT-5-mini (35.2%). All models struggled to identify causal relationships accurately even on the fundamental Type 1 (X-Y) task, averaging only 45.2% accuracy. Performance degraded further on tasks requiring more complex reasoning: Type 3 (manyX-Y) showed the poorest average accuracy, at 32.5%.

Task-wise analysis indicated that performance dropped as reasoning became more compositional or required finer discrimination, particularly for Type 4 (context-based inference) and Type 5 (directional inference). The study also observed differentiated performance across various economic subfields (JEL categories), suggesting that LLMs exhibit non-uniform expertise, performing better in fields characterized by qualitative reasoning and theoretical discourse.
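Task-wise figures like these come from grouping predictions by question type before scoring. The sketch below shows one plausible way to do that. The record layout and the choice to compute overall accuracy across all items (rather than averaging the per-type scores) are assumptions, since the article does not specify either, and the sample records are invented.

```python
from collections import defaultdict

# Hypothetical record layout: (question_type, gold_label, predicted_label).
Record = tuple[int, str, str]


def accuracy_by_type(records: list[Record]) -> dict[int, float]:
    """Share of correct predictions within each question type."""
    correct: dict[int, int] = defaultdict(int)
    total: dict[int, int] = defaultdict(int)
    for qtype, gold, pred in records:
        total[qtype] += 1
        correct[qtype] += int(gold.strip().lower() == pred.strip().lower())
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def overall_accuracy(records: list[Record]) -> float:
    """Accuracy over all items pooled together, so larger types weigh more."""
    if not records:
        raise ValueError("No records to score")
    hits = sum(gold.strip().lower() == pred.strip().lower() for _, gold, pred in records)
    return hits / len(records)


# Invented predictions, for illustration only.
records = [
    (1, "true", "true"),
    (1, "false", "true"),
    (5, "increase", "Increase"),
]
per_type = accuracy_by_type(records)
overall = overall_accuracy(records)
print(per_type, overall)
```

Pooling across items and averaging across types can give different headline numbers when the types differ in size, so it is worth checking which one a reported "overall accuracy" refers to.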

The Role of Prior Knowledge: Open-Book vs. Closed-Book Settings

To investigate why large-scale models like GPT-5 performed unexpectedly poorly, an ablation study was conducted. This experiment compared a ‘closed-book’ setting (where models were explicitly instructed to ignore external knowledge) with an ‘open-book’ setting (where models could leverage their pre-trained domain knowledge). The results showed that allowing open-book reasoning significantly improved performance, with average accuracy rising by over 12 percentage points. This improvement was most dramatic for Type 1 tasks, suggesting that models had internalized many literature-derived causal patterns during pre-training.

However, for tasks demanding more complex reasoning, such as those involving effect variation, cause variation, or contextual inference, the performance gains from prior knowledge were more modest. This indicates that while pre-trained knowledge can help clarify relationships, it does not substitute for the ability to reason faithfully within the problem’s specific context, especially when tasks require transforming or combining information.
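The two settings differ only in the instruction given to the model, and the comparison is reported in percentage points. A minimal sketch of both pieces follows. The instruction wording is hypothetical (the study's actual prompts are not reproduced in this article), and the accuracy figures are invented to show the arithmetic.

```python
CLOSED_BOOK = "closed_book"
OPEN_BOOK = "open_book"

# Hypothetical instruction wording for each setting.
INSTRUCTIONS = {
    CLOSED_BOOK: "Answer using only the information in the question. Ignore any outside knowledge.",
    OPEN_BOOK: "Answer using the question and any relevant domain knowledge you have.",
}


def build_prompt(question: str, setting: str) -> str:
    """Prefix the question with the instruction for the chosen setting."""
    if setting not in INSTRUCTIONS:
        raise ValueError(f"Unknown setting: {setting!r}")
    return f"{INSTRUCTIONS[setting]}\n\nQuestion: {question}\nAnswer:"


def gain_in_points(closed_acc: dict[str, float], open_acc: dict[str, float]) -> dict[str, float]:
    """Per-task accuracy gain in percentage points (open-book minus closed-book)."""
    return {
        task: round((open_acc[task] - closed_acc[task]) * 100, 1)
        for task in closed_acc
    }


# Invented accuracies, chosen only to illustrate the calculation.
closed = {"type_1": 0.40, "type_3": 0.30}
opened = {"type_1": 0.55, "type_3": 0.33}
gains = gain_in_points(closed, opened)
print(gains)
```

A gain of 15 percentage points on one task and 3 on another, as in this invented example, mirrors the qualitative pattern described above: large gains on simple tasks, modest gains on complex ones.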

Implications for Reliable AI Deployment

This research underscores a critical gap between the current capabilities of LLMs and the requirements for reliable causal reasoning in high-stakes applications. The findings point to the need for further advances in LLMs’ genuine causal understanding, beyond mere pattern matching. Addressing this capability gap is crucial for the responsible and effective deployment of AI in fields where understanding cause and effect is paramount. For a deeper dive into the methodology and results, see the full research paper, Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society.