TLDR: A new research paper investigates ‘hallucinations’ in Large Language Models (LLMs) used for medical imaging, where AI generates confident but incorrect outputs. The study analyzes errors in both image-to-text (generating reports from scans) and text-to-image (creating images from prompts) tasks across various modalities like X-rays, CTs, and MRIs. It reveals common patterns of factual inconsistencies and anatomical inaccuracies, demonstrating how LLMs can misinterpret medical images or generate clinically implausible content. The findings highlight critical vulnerabilities in even advanced models, emphasizing the urgent need for improved reliability and safeguards to ensure patient safety and trust in AI-driven healthcare.
Large Language Models (LLMs) are rapidly transforming various fields, and their application in medical imaging is no exception. From interpreting complex scans like X-rays, CTs, and MRIs to generating synthetic medical images for training, LLMs hold immense promise for enhancing diagnostic efficiency and reducing clinician workload. However, a critical challenge remains: hallucinations. These are outputs that appear confident and fluent but are factually incorrect or unsupported by the input data, posing significant risks in high-stakes healthcare environments.
Understanding Hallucinations in Medical Imaging
This research paper, titled “Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities”, delves deep into the phenomenon of hallucinations in LLMs applied to medical imaging. The study examines hallucinations in two primary directions:
-
Image-to-Text: Where LLMs generate descriptive reports from medical scans.
-
Text-to-Image: Where models create medical images based on clinical prompts.
The authors, Anindya Bijoy Das, Shahnewaz Karim Sakib, and Shibbir Ahmed, highlight that these errors can manifest as factual inconsistencies, anatomical inaccuracies, or even the fabrication of plausible-sounding but incorrect diagnoses. For instance, an LLM misestimating a ‘midline shift’ in a brain MRI – a critical indicator of intracranial pressure – could lead to delayed or incorrect treatment. Similarly, if an LLM generates an inaccurate image of a specific condition, it could misguide medical education or diagnostic model training.
Real-World Scenarios and Model Behavior
The paper presents compelling examples to illustrate these risks. In one case, an LLM tasked with interpreting a chest X-ray might fail to accurately detect or describe ‘pleural effusion’ (fluid accumulation around the lungs), leading to a misleading report. On the generative side, when prompted to create a chest X-ray showing ‘toe fractures’ – an anatomically impossible request for a chest X-ray – some models like GPT-4o hallucinated by overlaying fractured finger bones on a chest X-ray. In contrast, Gemini-2.5 Flash handled this more responsibly by producing separate, anatomically correct images for a chest X-ray and a foot X-ray, along with a clarification.
Types of Hallucinations Explored
The study systematically investigates various forms of hallucinations:
In Image Interpretation:
LLMs were tested on tasks like classifying brain MRIs into tumor types or identifying lung cancer in chest CTs. Even with ‘few-shot’ learning (providing a small set of examples), hallucinations persisted, with models either falsely identifying conditions or missing them entirely. Detecting specific clinical events, such as ‘ascites’ (fluid in the abdominal cavity) in CT scans, also proved challenging for LLMs, especially in subtle cases, leading to missed detections or misinterpretations.
In Image Generation:
Generative models often introduce ‘unprompted and irrelevant visual elements’. For example, when asked to generate a chest X-ray with pleural effusion without specifying the side, both GPT and Gemini produced images with right-sided effusions, introducing an unintended bias. Similarly, surgical clips or staples appeared in images of post-surgical patients even when not explicitly prompted, potentially distracting from the primary pathology.
More critically, models generated ‘clinically implausible content’. An example includes an LLM overlaying a ‘radioulnar joint’ (from the forearm) onto an abdominal CT scan, an anatomically impossible combination. The study also found that subtle changes in prompt wording, such as adding a justification like “for research purposes,” could bypass model safeguards, leading to the generation of anatomically incorrect images, like brain-like structures in an abdominal ultrasound.
Also Read:
- Assessing Synthetic Chest X-rays: A Radiologist’s Perspective on GANs and Diffusion Models
- Unmasking Vulnerabilities: Adversarial Attacks Threaten AI Medical Questionnaire Systems
Numerical Insights and Future Directions
The numerical experiments confirmed these vulnerabilities. In pleural effusion detection, models like LLaVA, Gemma, and Qwen exhibited significant false negatives (missing effusions) and false positives (hallucinating their presence). For implausible content generation, GPT-4o showed a high success rate (up to 94%) in generating clinically implausible images when a justification was provided, highlighting its sensitivity to prompt variations.
This comprehensive evaluation underscores that even advanced LLMs are prone to critical vulnerabilities in medical imaging tasks. The findings emphasize the urgent need for greater reliability in clinical settings. Future work, as suggested by the authors, should focus on improving prompt robustness, developing medically grounded decoding strategies, and implementing rigorous validation processes. Efforts in hallucination detection, specialized fine-tuning, and constraint-based generation will be crucial to ensure the safety and trustworthiness of AI-driven medical imaging systems. For more details, you can read the full paper here.


