TLDR: A new benchmark, Radiology’s Last Exam (RadLE) v1, evaluated frontier multimodal AI models against human radiologists and trainees on 50 challenging diagnostic cases. Board-certified radiologists achieved 83% accuracy, significantly outperforming trainees (45%) and all AI models (GPT-5, the best AI, scored 30%). The study revealed substantial human-AI performance gaps, identified a taxonomy of visual reasoning errors in AI, and cautioned against unsupervised clinical use of current generalist AI in complex medical imaging.
A recent study, titled "Radiology’s Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology," examines how well advanced artificial intelligence (AI) interprets complex medical images. The research was conducted by a team that includes Suvrankar Datta, Divya Buchireddygari, and many other contributors from the Centre for Responsible Autonomous Systems in Healthcare (CRASH) Lab at Ashoka University, along with independent researchers. It introduces a new benchmark that pits frontier AI models against the diagnostic skill of human radiologists.
The study was motivated by the increasing use of generalist multimodal AI systems, such as large language models (LLMs) and vision language models (VLMs), by both clinicians and patients for medical image interpretation. While many reports claim expert-level performance for these systems, most evaluations rely on public datasets featuring common pathologies. Such datasets may not reflect real-world radiology, where subtle and challenging cases are common.
The RadLE Benchmark and Methodology
To address this gap, the researchers developed Radiology’s Last Exam (RadLE) v1, a pilot benchmark consisting of 50 expert-level “spot diagnosis” cases across various imaging modalities including radiography, CT, and MRI, and covering six major clinical systems. These cases were specifically curated to represent difficult diagnostic scenarios that differentiate novice from expert performance.
Five popular frontier AI models—OpenAI o3, OpenAI GPT-5, Gemini 2.5 Pro, Grok-4, and Claude Opus 4.1—were evaluated through their native web interfaces. GPT-5 was also assessed via its API across different reasoning effort levels (low, medium, high). The performance of these AI models was compared against board-certified radiologists and radiology trainees. Diagnostic accuracy was scored by blinded experts, and reproducibility was assessed across three independent runs for each AI model.
Key Findings: Humans Outperform AI
The results clearly demonstrated a significant performance gap between human experts and current AI models. Board-certified radiologists achieved the highest diagnostic accuracy at 83%, substantially outperforming radiology trainees, who scored 45%. All tested AI models underperformed compared to human benchmarks.
Among the AI models, GPT-5 showed the best performance with 30% accuracy, followed closely by Gemini 2.5 Pro at 29%. OpenAI o3 achieved 23% accuracy, Grok-4 reached 12%, and Claude Opus 4.1 performed poorly with only 1% accuracy. This indicates that even the most advanced frontier AI models fall far short of human radiologists in challenging diagnostic cases.
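To make the size of these gaps concrete, the short Python sketch below tabulates the accuracies quoted above and computes how far each group falls below the board-certified radiologists. The dictionary and the `gap_to` helper are illustrative names, not part of the study; only the percentages come from the article.

```python
# Accuracies (in percent) as quoted in this article.
reported_accuracy = {
    "Board-certified radiologists": 83,
    "Radiology trainees": 45,
    "GPT-5": 30,
    "Gemini 2.5 Pro": 29,
    "OpenAI o3": 23,
    "Grok-4": 12,
    "Claude Opus 4.1": 1,
}


def gap_to(reference, scores):
    """Return each group's shortfall (percentage points) relative to `reference`."""
    ref = scores[reference]
    return {name: ref - acc for name, acc in scores.items() if name != reference}


gaps = gap_to("Board-certified radiologists", reported_accuracy)

# Print the groups from smallest to largest shortfall.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap} percentage points below radiologists")
```

On these figures, the best model (GPT-5) sits 53 percentage points below the radiologists, a wider gap than the 38 points separating trainees from radiologists.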
Performance varied across imaging modalities, with AI models generally performing best on MRI cases (GPT-5 at 45%) compared to CT (GPT-5 at 22%) and plain radiography (GPT-5 at 31%). However, radiologists maintained superior performance across all modalities and anatomical systems.
Reasoning Modes and Latency
Adjusting GPT-5’s reasoning effort through its API (low, medium, high) made little difference: accuracy ranged only from 25% to 26%. That marginal gain came at a substantial latency cost, with high-effort mode taking 65.6 seconds per response versus 10.5 seconds in low-effort mode, more than six times as long. This suggests that current AI “reasoning” capabilities may not translate into better diagnostic performance in medical imaging, despite the additional processing time.
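The trade-off can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and it does not assume which effort level produced which end of the accuracy range.

```python
# Figures quoted in the article.
low_effort_latency_s = 10.5
high_effort_latency_s = 65.6
accuracy_range_pct = (25, 26)  # lowest and highest accuracy across effort levels

# How many times longer the high-effort mode takes.
latency_ratio = high_effort_latency_s / low_effort_latency_s

# Extra waiting time per response, in seconds.
extra_seconds = high_effort_latency_s - low_effort_latency_s

# Spread between the best and worst accuracy, in percentage points.
accuracy_spread = accuracy_range_pct[1] - accuracy_range_pct[0]

print(f"Latency ratio: {latency_ratio:.2f}x")
print(f"Extra time per response: {extra_seconds:.1f} s")
print(f"Accuracy spread: {accuracy_spread} percentage point(s)")
```

The ratio works out to roughly 6.2, so about 55 additional seconds per response buys at most one percentage point of accuracy.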
Consistency and Error Taxonomy
The study also assessed how consistent each model's outputs were across its three runs. GPT-5 was the most repeatable, with substantial agreement across runs, followed by OpenAI o3. Gemini 2.5 Pro and Grok-4 achieved moderate agreement, while Claude Opus 4.1 demonstrated poor reproducibility.
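This article does not say which agreement statistic the researchers used. As an illustration only, the sketch below computes Fleiss' kappa, a common measure of agreement among a fixed number of raters, treating each of the three runs as a rater. The labels "substantial", "moderate", and "poor" match the widely used Landis and Koch scale, so the sketch also maps a kappa value to that scale. Both choices are assumptions, and the sample data is hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list with one label per run.

    Every case must have the same number of runs (at least two).
    """
    n_cases = len(ratings)
    n_runs = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    counts = [Counter(case) for case in ratings]

    # Observed agreement: share of agreeing run pairs within each case, averaged.
    per_case = [
        (sum(c[cat] ** 2 for cat in categories) - n_runs) / (n_runs * (n_runs - 1))
        for c in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Chance agreement, from the overall proportion of each category.
    proportions = [sum(c[cat] for c in counts) / (n_cases * n_runs) for cat in categories]
    p_chance = sum(p ** 2 for p in proportions)

    if p_chance == 1.0:
        # Only one category ever appears; kappa is undefined, so we return 1.0
        # by convention (an assumption of this sketch).
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


def landis_koch_label(kappa):
    """Map kappa to the Landis and Koch descriptive scale."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical data: 1 = correct diagnosis, 0 = incorrect, one entry per run.
example_runs = [[1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 0]]
kappa = fleiss_kappa(example_runs)
print(f"kappa = {kappa:.2f} ({landis_koch_label(kappa)})")
```

Scoring each run as correct or incorrect is the simplest encoding. Comparing the actual diagnosis text across runs would need a step that normalizes equivalent diagnoses first.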
To understand the diagnostic failure modes, the researchers conducted a qualitative analysis of AI reasoning traces and proposed a taxonomy of visual reasoning errors. This taxonomy categorizes errors into three primary types:
- Perceptual Errors: These include under-detection (failing to identify visible findings), over-detection (identifying non-existent findings, akin to hallucination), and mislocalization (correctly identifying a pattern but attributing it to the wrong anatomical location).
- Interpretive Errors: This category covers misinterpretation or misattribution of findings (linking visual patterns incorrectly to pathophysiological processes) and incomplete reasoning or premature diagnostic closure (accepting initial impressions without considering alternatives).
- Communication Errors: This refers to findings-summary discordance, where detailed observations within the reasoning trace contradict the final diagnostic impression.
Additionally, cognitive bias patterns such as confirmation/anchoring bias, availability bias, inattentional bias, and framing effects were observed to influence AI diagnostic reasoning.
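For readers who want to annotate model outputs with this taxonomy, one way to encode it is as a pair of Python enums. This is a minimal sketch: the class names, member names, and `tally_by_category` helper are hypothetical, the annotations are invented for illustration, and only the category and error-type labels come from the taxonomy described above.

```python
from collections import Counter
from enum import Enum


class ErrorCategory(Enum):
    PERCEPTUAL = "perceptual"
    INTERPRETIVE = "interpretive"
    COMMUNICATION = "communication"


class ErrorType(Enum):
    # Each member stores a readable label and its parent category.
    UNDER_DETECTION = ("under-detection", ErrorCategory.PERCEPTUAL)
    OVER_DETECTION = ("over-detection", ErrorCategory.PERCEPTUAL)
    MISLOCALIZATION = ("mislocalization", ErrorCategory.PERCEPTUAL)
    MISINTERPRETATION = ("misinterpretation or misattribution", ErrorCategory.INTERPRETIVE)
    PREMATURE_CLOSURE = ("incomplete reasoning or premature closure", ErrorCategory.INTERPRETIVE)
    FINDINGS_SUMMARY_DISCORDANCE = ("findings-summary discordance", ErrorCategory.COMMUNICATION)

    def __init__(self, label, category):
        self.label = label
        self.category = category


class CognitiveBias(Enum):
    CONFIRMATION_ANCHORING = "confirmation/anchoring bias"
    AVAILABILITY = "availability bias"
    INATTENTIONAL = "inattentional bias"
    FRAMING = "framing effects"


def tally_by_category(errors):
    """Count annotated errors per top-level category."""
    return Counter(error.category for error in errors)


# Hypothetical annotations for a handful of reasoning traces.
annotations = [
    ErrorType.UNDER_DETECTION,
    ErrorType.MISLOCALIZATION,
    ErrorType.PREMATURE_CLOSURE,
    ErrorType.FINDINGS_SUMMARY_DISCORDANCE,
]

for category, count in tally_by_category(annotations).items():
    print(f"{category.value}: {count}")
```

Keeping the parent category on each error type lets a tally roll up to the three primary types without a separate lookup table.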
Implications and Future Directions
These findings highlight the present limitations of generalist AI in medical imaging and caution against unsupervised clinical use. The substantial performance gaps and the identified error patterns underscore the need for cautious deployment and expert human oversight, especially in high-stakes diagnostic scenarios. The study suggests that future development should focus on improving the detection of subtle findings, integrating imaging with clinical reasoning, enhancing response consistency, and potentially developing specialized fine-tuned models for radiological applications. The researchers also advocate for regulatory frameworks that mandate model evaluation on high-complexity cases rather than relying solely on standard datasets.


