New Benchmark Reveals LLMs Struggle with Real-World Causal Reasoning

TLDR: A new benchmark, built from scientifically validated causal relationships in economics and finance journals, reveals significant limitations in state-of-the-art LLMs’ causal reasoning abilities. Models achieved low accuracy (best at 57.6%), with larger models not consistently outperforming smaller ones. While allowing access to pre-trained knowledge improved performance on simple tasks, complex reasoning still proved challenging, underscoring a critical gap in LLMs’ genuine causal understanding for high-stakes applications.

Large Language Models (LLMs) are increasingly integrated into high-stakes applications, from healthcare to finance. However, a fundamental question remains: can these models truly grasp cause-and-effect relationships, or do they merely excel at pattern matching? A recent study introduces a benchmark designed to rigorously assess LLMs’ causal reasoning abilities, drawing on relationships scientifically validated in top-tier economics and finance journals.

Addressing Limitations in Current LLM Evaluation

Existing benchmarks for evaluating LLMs’ causal reasoning often suffer from critical flaws, such as relying on synthetic data, covering narrow domains, or oversimplifying the concept of causality. These shortcomings make it difficult to ascertain whether LLMs are engaging in genuine causal inference or simply reproducing patterns observed during their extensive training. To bridge this gap, researchers have developed a novel benchmark rooted in real-world causal relationships, verified through stringent scientific methodologies like instrumental variables, difference-in-differences, and regression discontinuity designs.

Constructing a Scientifically Validated Benchmark

The benchmark’s creation involved a systematic process of extracting causal relationships from nearly 15,000 papers published between 2000 and 2025 across eight leading economics and finance journals. The team utilized GPT-5-mini to extract candidate causal triplets (cause, direction, effect) from paper abstracts, repeating the extraction five times for each paper. A robust consensus-based filtering mechanism was then applied, retaining only those relationships that appeared consistently across at least four out of five extractions and were supported by rigorous identification strategies. This meticulous process yielded 11,869 validated causal relationships.
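The consensus step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Triplet` type, the `normalize` helper, and the use of exact string matching after normalization are all assumptions made here, and the check for a rigorous identification strategy is left out.

```python
from collections import Counter
from typing import NamedTuple


class Triplet(NamedTuple):
    """Hypothetical container for one extracted (cause, direction, effect)."""
    cause: str
    direction: str
    effect: str


def normalize(t: Triplet) -> Triplet:
    """Lowercase and collapse whitespace so trivially different strings match."""
    def clean(s: str) -> str:
        return " ".join(s.lower().split())
    return Triplet(clean(t.cause), clean(t.direction), clean(t.effect))


def consensus_filter(runs: list[list[Triplet]], min_votes: int = 4) -> list[Triplet]:
    """Keep triplets that appear in at least `min_votes` of the extraction runs."""
    votes: Counter = Counter()
    for run in runs:
        # A set ensures each run contributes at most one vote per triplet.
        votes.update({normalize(t) for t in run})
    return sorted(t for t, n in votes.items() if n >= min_votes)
```

With five runs per paper and `min_votes=4`, a triplet extracted in only three runs is dropped. A real pipeline would likely need fuzzier matching than string equality, since the same relationship can be phrased in several ways.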

To ensure the quality and reliability of these extracted relations, a systematic human assessment was conducted. Two independent annotators reviewed a random sample of 104 causal relations, evaluating the correct identification of entities (cause and effect), the accuracy of the causal direction, and whether the relation represented a genuine causal link rather than a spurious correlation. The high level of agreement among annotators underscored the robustness of the extraction pipeline.
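The article does not say which agreement statistic was reported. As an assumption, the sketch below shows two common choices for a two-annotator check of this kind: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and the label encoding are illustrative only.

```python
from collections import Counter


def percent_agreement(a: list, b: list) -> float:
    """Share of items on which the two annotators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, so this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Each of the three judgments described above (entities, direction, genuine causality) could be scored separately by passing the corresponding pair of label lists.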

The benchmark’s scope extends beyond traditional economics, encompassing a diverse range of topics including health, environmental economics, technological change, legal systems, and cultural economics, reflecting the interdisciplinary nature of real-world causal inference questions.

Diverse Tasks for Comprehensive Causal Reasoning Evaluation

To thoroughly probe LLMs’ multi-faceted causal reasoning capabilities, the benchmark incorporates five distinct question types, each designed to test different aspects of understanding:

Type 1: Causal Relation Identification (X-Y): This fundamental task assesses whether a given cause-effect triplet is true, evaluating the most basic ability to identify valid causal claims.
Type 2: Effect Variation (X-manyY): This type evaluates understanding of causal spillover effects, asking if a given cause leads to a different specified effect.
Type 3: Cause Variation (manyX-Y): Designed to test understanding of multiple causality, this task asks if a different specified cause leads to the same effect.
Type 4: Context-based Causal Inference (X-Y, X’-Y’): This task requires contextual causal reasoning and multi-hop inference, assessing whether a second causal claim is valid given a related first claim from the same research context.
Type 5: Causal Direction Identification (X-Y-direction): Unlike the binary true/false questions, this type asks models to predict the direction (increase, decrease, or none) of a causal effect between two given variables, crucial for practical decision-making.

After an initial filtering process to remove trivially easy questions that could be solved by simple pattern matching, the final benchmark comprises 40,379 evaluation items.
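One way to picture the five question types is as items with a type-specific answer space: true/false for Types 1 to 4 and a three-way direction label for Type 5. The sketch below is an assumed representation, not the benchmark's actual file format; `Item`, `ANSWER_SPACE`, and `accuracy_by_type` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

BINARY = ("true", "false")
DIRECTION = ("increase", "decrease", "none")

# Allowed answers per question type (1-5), following the list above.
ANSWER_SPACE = {
    1: BINARY,     # X-Y: causal relation identification
    2: BINARY,     # X-manyY: effect variation
    3: BINARY,     # manyX-Y: cause variation
    4: BINARY,     # X-Y, X'-Y': context-based inference
    5: DIRECTION,  # X-Y-direction: causal direction identification
}


@dataclass(frozen=True)
class Item:
    qtype: int
    question: str
    gold: str

    def __post_init__(self):
        # Reject unknown types and gold labels outside the type's answer space.
        if self.gold not in ANSWER_SPACE.get(self.qtype, ()):
            raise ValueError(f"invalid gold label {self.gold!r} for type {self.qtype}")


def accuracy_by_type(items: list, predictions: list) -> dict:
    """Exact-match accuracy per question type, ignoring case and whitespace."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.qtype] += 1
        correct[item.qtype] += int(pred.strip().lower() == item.gold)
    return {q: correct[q] / total[q] for q in total}
```

Scoring per type rather than only overall makes it possible to see which kinds of question drive a model's aggregate accuracy.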

LLMs Show Significant Limitations in Causal Understanding

The evaluation involved eight state-of-the-art LLMs, including prominent reasoning models like GPT-5 and DeepSeek-R1-0528, as well as non-reasoning models such as Llama-3.3-70B and Qwen3-32B. The results revealed substantial limitations across all models in causal reasoning within economics and finance domains. The best-performing model, Qwen3-32B, achieved an overall accuracy of only 57.6%, highlighting significant room for improvement.

Notably, the study found that model scale or recency did not consistently translate to superior performance. For instance, GPT-5 recorded the lowest accuracy at 29.4%, comparable to GPT-5-mini (35.2%). All models struggled with accurately identifying causal relationships even in the fundamental Type 1 (X-Y) task, averaging only 45.2% accuracy. Performance further degraded on tasks requiring more complex reasoning, such as Type 3 (manyX-Y), which showed the poorest average accuracy of 32.5%.

Task-wise analysis indicated that performance dropped as reasoning became more compositional or required finer discrimination, particularly for Type 4 (context-based inference) and Type 5 (directional inference). The study also observed differentiated performance across various economic subfields (JEL categories), suggesting that LLMs exhibit non-uniform expertise, performing better in fields characterized by qualitative reasoning and theoretical discourse.

The Role of Prior Knowledge: Open-Book vs. Closed-Book Settings

To investigate why large-scale models like GPT-5 performed unexpectedly poorly, the researchers conducted an ablation study. The experiment compared a ‘closed-book’ setting (where models were explicitly instructed to ignore external knowledge) with an ‘open-book’ setting (where models could leverage their pre-trained domain knowledge). Allowing open-book reasoning significantly improved performance, with average accuracy rising by over 12 percentage points. The improvement was most dramatic for Type 1 tasks, suggesting that models had internalized many literature-derived causal patterns during pre-training.

However, for tasks demanding more complex reasoning, such as those involving effect variation, cause variation, or contextual inference, the performance gains from prior knowledge were more modest. This indicates that while pre-trained knowledge can help clarify relationships, it does not make up for a limited ability to reason faithfully within the problem’s specific context, especially when tasks require transforming or combining information.
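The two settings differ only in the instruction given to the model, and the comparison is reported in percentage points. The sketch below illustrates both ideas under stated assumptions: the prompt wording is hypothetical (the study's actual prompts are not quoted here), and `build_prompt` and `gain_in_points` are names invented for this example.

```python
# Hypothetical instruction prefixes; the study's actual wording is not given here.
CLOSED_BOOK = (
    "Answer using only the information given in the question. "
    "Ignore any external or background knowledge.\n\n"
)
OPEN_BOOK = "Answer the question. You may draw on your own domain knowledge.\n\n"


def build_prompt(question: str, setting: str) -> str:
    """Prepend the instruction for the chosen setting ('closed' or 'open')."""
    prefixes = {"closed": CLOSED_BOOK, "open": OPEN_BOOK}
    if setting not in prefixes:
        raise ValueError(f"unknown setting: {setting!r}")
    return prefixes[setting] + question


def gain_in_points(closed_acc: dict, open_acc: dict) -> dict:
    """Accuracy change in percentage points (open minus closed) per shared key."""
    shared = closed_acc.keys() & open_acc.keys()
    return {k: round(100 * (open_acc[k] - closed_acc[k]), 1) for k in sorted(shared)}
```

Both inputs to `gain_in_points` are accuracies as fractions between 0 and 1, keyed by question type, so a per-type breakdown shows where prior knowledge helps most.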

Implications for Reliable AI Deployment

This research underscores a critical gap between the current capabilities of LLMs and the requirements for reliable causal reasoning in high-stakes applications. The findings point to the need for further advances that move LLMs’ causal understanding beyond mere pattern matching. Closing this gap is crucial for the responsible and effective deployment of AI in fields where understanding cause and effect is paramount. The methodology and full results are described in the research paper, Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships.
