TLDR: A new study reveals that leading Large Language Models (LLMs) like GPT-4o-Mini, Claude-3.5-Sonnet, and Gemini-1.5-Pro are highly susceptible to the ‘illusion of causality.’ When presented with null-contingency scenarios, particularly in medical contexts, these models systematically inferred causal relationships where none existed. This finding suggests LLMs may reproduce causal language without true comprehension, raising concerns about their reliability in critical decision-making domains requiring accurate causal reasoning.
Large Language Models (LLMs) are increasingly integrated into various aspects of our lives, but a recent study delves into a critical question: do these advanced AI systems exhibit cognitive biases similar to humans, specifically the “illusion of causality”? This phenomenon, where individuals perceive a causal link between two events without sufficient evidence, has significant implications, from social prejudice to misinformation.
The illusion of causality is a well-documented human cognitive bias. It’s the tendency to believe that one event causes another, even when there’s no real connection. Think of superstitions, like avoiding walking under a ladder because it might bring bad luck. In more critical areas like healthcare, this bias can lead to dangerous conclusions, such as believing a pill works simply because one felt better after taking it, even if the recovery was coincidental.
Researchers from multiple institutions, including the Università degli Studi di Genova and FAIR, IALAB, Universidad de Buenos Aires (UBA), investigated this bias in LLMs. Their paper, titled “Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment,” explores whether models like GPT-4o-Mini, Claude-3.5-Sonnet, and Gemini-1.5-Pro fall prey to this illusion.
To test this, the team adapted a classic cognitive science paradigm: the contingency judgment task. In this task, participants (or in this case, LLMs) are presented with scenarios involving a potential cause and an outcome. The key is to present “null-contingency” scenarios, where the probability of the outcome is the same whether the cause is present or absent. In such cases, a rational judgment would be to conclude that there is no causal relationship.
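To make the null-contingency condition concrete, here is a minimal Python sketch. It assumes the ΔP index commonly used in the contingency learning literature, ΔP = P(outcome | cause) − P(outcome | no cause); the article does not state which formula the study used, and the function names and patient counts below are purely illustrative.

```python
from fractions import Fraction


def delta_p(a: int, b: int, c: int, d: int) -> Fraction:
    """Contingency index for a 2x2 table (an assumed, standard formulation).

    a: cause present, outcome present
    b: cause present, outcome absent
    c: cause absent,  outcome present
    d: cause absent,  outcome absent
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("each group needs at least one observation")
    p_outcome_with_cause = Fraction(a, a + b)
    p_outcome_without_cause = Fraction(c, c + d)
    return p_outcome_with_cause - p_outcome_without_cause


def is_null_contingency(a: int, b: int, c: int, d: int) -> bool:
    """True when the outcome is equally likely with and without the cause."""
    return delta_p(a, b, c, d) == 0


# Illustrative counts (not taken from the study): 60 of 80 treated patients
# recover, and 15 of 20 untreated patients recover. Both rates are 3/4.
print(is_null_contingency(60, 20, 15, 5))
```

Exact fractions are used instead of floats so that equal rates compare as exactly equal. In the example, three quarters of the patients recover in both groups, so the treatment makes no difference and the rational effectiveness rating is 0.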
The researchers created a dataset of 1,000 null-contingency scenarios, primarily set within medical contexts. These scenarios involved various types of variables, from fabricated diseases and treatments to established drugs like Paracetamol. LLMs were prompted to act as doctors or medical researchers and evaluate the effectiveness of a potential cause on a scale from 0 (non-effective) to 100 (totally effective).
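The setup described above can be sketched roughly as follows. This is not the authors' code: the prompt wording, the function names, the placeholder names "Drug X" and "Syndrome Y", and the group sizes are all hypothetical, and the actual call to a model is left out. The sketch only shows how a scenario with equal recovery rates might be generated, phrased as a prompt, and how a 0 to 100 rating might be parsed from a reply.

```python
import random
import re
from fractions import Fraction


def make_null_scenario(rng: random.Random, rate: Fraction = Fraction(3, 4)) -> dict:
    """Build patient counts where the recovery rate is the same in both groups."""
    # Group sizes are multiples of the denominator so all counts are integers.
    n_treated = rate.denominator * rng.randint(5, 20)
    n_untreated = rate.denominator * rng.randint(2, 10)
    return {
        "n_treated": n_treated,
        "treated_recovered": int(n_treated * rate),
        "n_untreated": n_untreated,
        "untreated_recovered": int(n_untreated * rate),
    }


def build_prompt(s: dict, treatment: str = "Drug X", disease: str = "Syndrome Y") -> str:
    """Hypothetical prompt wording; the study's actual prompts may differ."""
    return (
        f"You are a doctor evaluating {treatment} as a treatment for {disease}.\n"
        f"Of {s['n_treated']} patients who took {treatment}, "
        f"{s['treated_recovered']} recovered.\n"
        f"Of {s['n_untreated']} patients who did not take it, "
        f"{s['untreated_recovered']} recovered.\n"
        f"On a scale from 0 (non-effective) to 100 (totally effective), "
        f"how effective is {treatment}? Answer with a single number."
    )


def parse_rating(reply: str):
    """Return the first integer in the reply if it lies in 0..100, else None."""
    match = re.search(r"\b(\d{1,3})\b", reply)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None


scenario = make_null_scenario(random.Random(0))
print(build_prompt(scenario))
```

Because both groups share the same recovery rate by construction, any rating above 0 returned for such a prompt would count as an instance of the causal illusion.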
The findings were striking. All evaluated models systematically inferred unwarranted causal relationships, demonstrating a strong susceptibility to the illusion of causality. GPT-4o-Mini showed the highest degree of this illusion, with responses centered around a mean of 75.74. Claude-3.5-Sonnet and Gemini-1.5-Pro also exhibited the bias, though to varying degrees, with Gemini-1.5-Pro showing the lowest degree of causal illusion among the three, yet still significantly above a rational zero rating.
Statistically, the study found strong evidence to reject the hypothesis that any model consistently reported no causality. Furthermore, the models did not apply a common criterion for assessing causality: in pairwise comparisons their ratings differed systematically, with some models tending to assign higher values than others.
These results raise significant concerns about the deployment of LLMs in domains where accurate causal reasoning is paramount for informed decision-making, such as healthcare, finance, or legal applications. While there’s an ongoing debate about whether LLMs truly “understand” causality or merely reproduce causal language, this research suggests the latter, as they fail to internalize contingency as a normative principle for causal inference.
The study acknowledges limitations, such as the absence of direct human baselines for comparison and the specific task design not being fully representative of real-world usage. Future work could explore different task formats, incorporate prompting techniques like chain-of-thought to guide reasoning, and test a broader range of contingency scenarios.
Ultimately, this research highlights that even advanced LLMs can exhibit biases that mirror human cognitive biases, particularly the illusion of causality. As AI systems become more sophisticated, understanding and mitigating these biases will be crucial for ensuring their responsible and reliable application.


