Benchmarking AI in Radiology: A Reality Check on Diagnostic Accuracy

TLDR: A new benchmark, Radiology’s Last Exam (RadLE) v1, evaluated frontier multimodal AI models against human radiologists and trainees on 50 challenging diagnostic cases. Board-certified radiologists achieved 83% accuracy, significantly outperforming trainees (45%) and all AI models (GPT-5, the best AI, scored 30%). The study revealed substantial human-AI performance gaps, identified a taxonomy of visual reasoning errors in AI, and cautioned against unsupervised clinical use of current generalist AI in complex medical imaging.

A recent study, titled Radiology’s Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology, has shed light on the current capabilities of advanced artificial intelligence (AI) in interpreting complex medical images. The research, conducted by a team including Suvrankar Datta, Divya Buchireddygari, and many other contributors from the Centre for Responsible Autonomous Systems in Healthcare (CRASH) Lab at Ashoka University and independent researchers, introduces a new benchmark designed to challenge frontier AI models against the diagnostic prowess of human radiologists.

The study was motivated by the increasing use of generalist multimodal AI systems, such as large language models (LLMs) and vision language models (VLMs), by both clinicians and patients for medical image interpretation. Many reports claim expert-level performance for these systems, but most evaluations rely on public datasets featuring common pathologies. Such datasets may not reflect real-world radiology, where subtle and challenging cases are common.

The RadLE Benchmark and Methodology

To address this gap, the researchers developed Radiology’s Last Exam (RadLE) v1, a pilot benchmark consisting of 50 expert-level “spot diagnosis” cases across various imaging modalities including radiography, CT, and MRI, and covering six major clinical systems. These cases were specifically curated to represent difficult diagnostic scenarios that differentiate novice from expert performance.

Five popular frontier AI models—OpenAI o3, OpenAI GPT-5, Gemini 2.5 Pro, Grok-4, and Claude Opus 4.1—were evaluated through their native web interfaces. GPT-5 was also assessed via its API across different reasoning effort levels (low, medium, high). The performance of these AI models was compared against board-certified radiologists and radiology trainees. Diagnostic accuracy was scored by blinded experts, and reproducibility was assessed across three independent runs for each AI model.

Key Findings: Humans Outperform AI

The results clearly demonstrated a significant performance gap between human experts and current AI models. Board-certified radiologists achieved the highest diagnostic accuracy at 83%, substantially outperforming radiology trainees, who scored 45%. All tested AI models underperformed compared to human benchmarks.

Among the AI models, GPT-5 showed the best performance with 30% accuracy, followed closely by Gemini 2.5 Pro at 29%. OpenAI o3 achieved 23% accuracy, Grok-4 reached 12%, and Claude Opus 4.1 performed poorly with only 1% accuracy. This indicates that even the most advanced frontier AI models fall far short of human radiologists in challenging diagnostic cases.

Performance varied across imaging modalities, with AI models generally performing best on MRI cases (GPT-5 at 45%) compared to CT (GPT-5 at 22%) and plain radiography (GPT-5 at 31%). However, radiologists maintained superior performance across all modalities and anatomical systems.

Reasoning Modes and Latency

Adjusting GPT-5’s reasoning effort through its API (low, medium, high) made little difference: accuracy ranged only from 25% to 26%. That marginal gain came at a substantial cost in response time, with high-effort mode taking 65.6 seconds against 10.5 seconds in low-effort mode, more than six times longer. This suggests that current AI “reasoning” capabilities may not translate into better diagnostic performance in medical imaging, despite the extra processing time.
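
The "more than six times longer" figure can be checked directly from the two latencies quoted above. The snippet below is only a quick arithmetic sketch; the variable and function names are illustrative and do not come from the study.

```python
# Latency figures quoted above for GPT-5 via the API, in seconds.
LOW_EFFORT_SECONDS = 10.5
HIGH_EFFORT_SECONDS = 65.6

# Accuracy range quoted above, in percent. The article does not say
# which effort level produced which end of the range.
ACCURACY_MIN_PCT = 25
ACCURACY_MAX_PCT = 26


def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times longer the slow mode takes than the fast one."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return slow_seconds / fast_seconds


ratio = slowdown(HIGH_EFFORT_SECONDS, LOW_EFFORT_SECONDS)
extra_seconds = HIGH_EFFORT_SECONDS - LOW_EFFORT_SECONDS
accuracy_spread = ACCURACY_MAX_PCT - ACCURACY_MIN_PCT

print(f"slowdown: {ratio:.2f}x")
print(f"extra wait per response: {extra_seconds:.1f} s")
print(f"accuracy spread: {accuracy_spread} percentage point(s)")
```

The ratio comes out at roughly 6.25, so each response takes about 55 seconds longer for a spread of a single percentage point in accuracy.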

Consistency and Error Taxonomy

The study also assessed how consistent each model’s outputs were across repeated runs. GPT-5 showed the strongest repeatability, with substantial agreement across runs, followed by OpenAI o3. Gemini 2.5 Pro and Grok-4 achieved moderate agreement, while Claude Opus 4.1 demonstrated poor reproducibility.
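
The article does not say which agreement statistic the researchers used. As an illustration only, the sketch below computes Fleiss' kappa, one common measure of agreement when the same cases are rated several times. Labels such as "moderate" and "substantial" are often attached to kappa ranges, but that link to this study is an assumption. The function name and the toy scores are hypothetical and are not data from the benchmark.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for cases that were each rated the same number of times.

    ratings[i] holds the labels one case received, one label per run.
    """
    n_cases = len(ratings)
    if n_cases == 0:
        raise ValueError("need at least one case")
    n_runs = len(ratings[0])
    if n_runs < 2 or any(len(case) != n_runs for case in ratings):
        raise ValueError("every case needs the same number (>= 2) of runs")

    category_totals = Counter()
    observed = 0.0
    for case in ratings:
        counts = Counter(case)
        category_totals.update(counts)
        # Share of run pairs that agree on this case.
        agreeing = sum(c * c for c in counts.values()) - n_runs
        observed += agreeing / (n_runs * (n_runs - 1))
    observed /= n_cases

    total = n_cases * n_runs
    # Agreement expected by chance from the overall label frequencies.
    expected = sum((t / total) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # every rating is the same label
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 1 = correct diagnosis, 0 = incorrect, three runs per case.
toy_scores = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(round(fleiss_kappa(toy_scores), 3))
```

A model that returns the same answer on every run for every case scores 1.0 here, while a model whose answers flip between runs drifts toward 0.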

To understand the diagnostic failure modes, the researchers conducted a qualitative analysis of AI reasoning traces and proposed a taxonomy of visual reasoning errors. This taxonomy categorizes errors into three primary types:

Perceptual Errors: These include under-detection (failing to identify visible findings), over-detection (identifying non-existent findings, akin to hallucination), and mislocalization (correctly identifying a pattern but attributing it to the wrong anatomical location).
Interpretive Errors: This category covers misinterpretation or misattribution of findings (linking visual patterns incorrectly to pathophysiological processes) and incomplete reasoning or premature diagnostic closure (accepting initial impressions without considering alternatives).
Communication Errors: This refers to findings-summary discordance, where detailed observations within the reasoning trace contradict the final diagnostic impression.
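
For readers who want to annotate model outputs against this taxonomy, the three categories and their subtypes can be written down as a small lookup structure. This is a minimal sketch: the category and subtype names follow the list above, while `ErrorCategory`, `TAXONOMY`, `category_of`, and `tally` are hypothetical names chosen for illustration, not part of the study.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class ErrorCategory(Enum):
    PERCEPTUAL = "perceptual"
    INTERPRETIVE = "interpretive"
    COMMUNICATION = "communication"


# Subtypes as described in the article, grouped by primary category.
TAXONOMY = {
    ErrorCategory.PERCEPTUAL: (
        "under-detection",
        "over-detection",
        "mislocalization",
    ),
    ErrorCategory.INTERPRETIVE: (
        "misinterpretation or misattribution of findings",
        "incomplete reasoning or premature diagnostic closure",
    ),
    ErrorCategory.COMMUNICATION: (
        "findings-summary discordance",
    ),
}


def category_of(subtype: str) -> ErrorCategory:
    """Return the primary category that a subtype belongs to."""
    for category, subtypes in TAXONOMY.items():
        if subtype in subtypes:
            return category
    raise KeyError(f"unknown error subtype: {subtype!r}")


def tally(subtypes: Iterable[str]) -> Counter:
    """Count annotated errors by primary category."""
    return Counter(category_of(s) for s in subtypes)


# Hypothetical annotations for a handful of reasoning traces.
annotations = ["under-detection", "mislocalization", "findings-summary discordance"]
for category, count in tally(annotations).items():
    print(category.value, count)
```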

Additionally, cognitive bias patterns such as confirmation/anchoring bias, availability bias, inattentional bias, and framing effects were observed to influence AI diagnostic reasoning.

Implications and Future Directions

These findings highlight the present limitations of generalist AI in medical imaging and caution against unsupervised clinical use. The substantial performance gaps and the identified error patterns underscore the need for cautious deployment and expert human oversight, especially in high-stakes diagnostic scenarios. The study suggests that future development should focus on improving the detection of subtle findings, integrating imaging with clinical reasoning, enhancing response consistency, and potentially developing specialized fine-tuned models for radiological applications. The researchers also advocate for regulatory frameworks that mandate model evaluation on high-complexity cases rather than relying solely on standard datasets.
