Unpacking AI's Spatial Reasoning: How Multimodal LLMs Localize Disease in Chest X-rays

TLDR: A new study evaluated Multimodal Large Language Models (MLLMs) like GPT-4, GPT-5, and MedGemma on their ability to localize pathologies in chest radiographs using a grid-based prompting method. While GPT-5 performed best among MLLMs (49.7% hit rate), all models lagged behind a CNN baseline (59.9%) and radiologists (80.1%). GPT-5’s errors were mostly anatomically plausible, indicating a better spatial understanding than GPT-4 or MedGemma. The research highlights MLLMs’ current limitations in fine-grained spatial reasoning but also shows promising progress, suggesting a future where general-purpose MLLMs could be integrated with task-specific tools for clinical use.

Artificial Intelligence (AI) is moving quickly into healthcare, and into medical imaging in particular. Large language models (LLMs) and their multimodal counterparts (MLLMs) have shown impressive capabilities in diagnostic tasks and medical quizzes, but a recent study examines a crucial, often overlooked capability: precisely localizing pathological findings in medical images.

A new research paper, titled “Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs,” explores how well leading MLLMs can pinpoint abnormalities on chest X-rays. This goes beyond simply identifying a disease; it assesses the models’ spatial understanding of human anatomy and disease presentation, which is vital for clinical practice and medical education.

The Challenge of Localization

In medical image interpretation, diagnosis is typically a classification task (e.g., identifying a disease), while localization involves pinpointing the exact area of an abnormality. For AI systems, these are distinct challenges. An AI might be excellent at classifying a condition but struggle to show *where* it is located. This study addresses this gap by systematically evaluating the localization capabilities of three prominent MLLMs: two general-purpose models from OpenAI (GPT-4 and GPT-5) and one domain-specific model (MedGemma).

How the Study Was Conducted

The researchers used a grid-based prompting technique. They overlaid a standardized 8×8 grid (64 cells) onto chest radiograph images from the CheXlocalize dataset. The MLLMs were then prompted to identify the single grid cell where a confirmed pathology was most prominent. Because each answer names one cell, every prediction maps to a fixed region of the image, which makes spatial accuracy measurable.
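To make the grid idea concrete, here is a minimal sketch of how a model's cell answer could be turned into pixel coordinates. The label scheme (a column letter A to H followed by a row number 1 to 8) and the function names `parse_cell` and `cell_bounds` are assumptions for illustration; the article does not say how the study labeled its cells.

```python
def parse_cell(label, n=8):
    """Convert a label such as 'C4' to 0-indexed (row, col).

    Assumed (hypothetical) scheme: column letter A-H, then row number 1-8.
    """
    label = label.strip().upper()
    if len(label) < 2 or not label[0].isalpha() or not label[1:].isdigit():
        raise ValueError(f"unrecognized cell label: {label!r}")
    col = ord(label[0]) - ord("A")
    row = int(label[1:]) - 1
    if not (0 <= row < n and 0 <= col < n):
        raise ValueError(f"cell outside {n}x{n} grid: {label!r}")
    return row, col


def cell_bounds(row, col, width, height, n=8):
    """Return (x0, y0, x1, y1) pixel bounds of a grid cell.

    x1 and y1 are exclusive, so adjacent cells never share pixels.
    """
    if not (0 <= row < n and 0 <= col < n):
        raise ValueError("cell outside grid")
    x0 = col * width // n
    x1 = (col + 1) * width // n
    y0 = row * height // n
    y1 = (row + 1) * height // n
    return x0, y0, x1, y1


# Example: cell "C4" on a 1024x1024 image.
row, col = parse_cell("C4")
print(cell_bounds(row, col, width=1024, height=1024))
```

Integer division keeps the cells contiguous even when the image size is not a multiple of 8.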

Performance was measured using a ‘hit rate’ criterion: a prediction was considered a ‘hit’ if at least 50% of the predicted grid cell overlapped with the ground truth annotation provided by expert radiologists. The MLLMs’ performance was benchmarked against both a task-specific Convolutional Neural Network (CNN) baseline and an expert radiologist benchmark.
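The hit criterion described above can be sketched in a few lines. This is an illustrative reading, not the study's code: it assumes the ground truth is available as a binary mask (a list of rows of 0/1 values) and that the predicted cell is given as pixel bounds; the names `cell_overlap_fraction` and `is_hit` are hypothetical.

```python
def cell_overlap_fraction(mask, x0, y0, x1, y1):
    """Fraction of the cell's pixels that fall inside the ground-truth mask.

    mask is indexed as mask[y][x]; x1 and y1 are exclusive bounds.
    """
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        raise ValueError("cell has no area")
    inside = sum(
        1
        for y in range(y0, y1)
        for x in range(x0, x1)
        if mask[y][x]
    )
    return inside / area


def is_hit(mask, bounds, threshold=0.5):
    """A prediction is a hit if at least `threshold` of the cell is annotated."""
    return cell_overlap_fraction(mask, *bounds) >= threshold


# Toy 8x8 mask whose left half (columns 0-3) is annotated as pathology.
toy_mask = [[1 if x < 4 else 0 for x in range(8)] for _ in range(8)]
print(is_hit(toy_mask, (2, 0, 6, 4)))  # half of this cell is annotated
print(is_hit(toy_mask, (3, 0, 7, 4)))  # only a quarter of this cell is annotated
```

Note that the fraction is taken relative to the cell, not the annotation, so a small cell sitting inside a large finding counts as a hit.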

Key Findings: A Mixed Bag of Performance

Among the MLLMs, GPT-5 emerged with the highest average hit rate across nine pathologies, achieving 49.7%. GPT-4 followed with 39.1%, while MedGemma showed the lowest performance at 17.7%. However, all MLLMs underperformed both the CNN baseline (59.9%) and the human radiologist benchmark (80.1%). MedGemma’s performance was only marginally better than random chance.
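For intuition about what "random chance" means in this setup, one plausible reading is the probability that a uniformly random cell would count as a hit for a given image. The sketch below computes that quantity for a single binary mask. How the study actually defined its chance level is not stated in this article, so treat the function `random_guess_hit_rate` and its definition as assumptions.

```python
def random_guess_hit_rate(mask, n=8, threshold=0.5):
    """Share of the n*n grid cells that would count as a hit for this mask.

    mask is a list of rows of 0/1 values, indexed as mask[y][x].
    """
    height, width = len(mask), len(mask[0])
    hits = 0
    for row in range(n):
        for col in range(n):
            x0, x1 = col * width // n, (col + 1) * width // n
            y0, y1 = row * height // n, (row + 1) * height // n
            area = (x1 - x0) * (y1 - y0)
            inside = sum(
                mask[y][x] for y in range(y0, y1) for x in range(x0, x1)
            )
            if area > 0 and inside / area >= threshold:
                hits += 1
    return hits / (n * n)


# Toy 16x16 mask with a 4x4 annotated patch in the top-left corner.
toy_mask = [[1 if (x < 4 and y < 4) else 0 for x in range(16)] for y in range(16)]
print(random_guess_hit_rate(toy_mask))  # 4 of 64 cells qualify
```

Under this reading, chance level rises with the size of the annotated region, so large findings are easier to hit by luck than small ones.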

Interestingly, GPT-4 and GPT-5 actually *outperformed* the CNN baseline on specific pathologies like enlarged cardiomediastinum and cardiomegaly. These conditions tend to appear in consistent anatomical locations, suggesting MLLMs might excel when the spatial variability of a finding is low. GPT-5 also showed significant improvements over GPT-4 on pathologies with more variable spatial positions, such as atelectasis, consolidation, edema, and pleural effusion.

Understanding the Errors

To gain deeper insights, the researchers categorized the models’ errors. They found that GPT-5’s misses were largely ‘partial hits’ (some overlap) or ‘position errors’ (predicted cell was anatomically plausible but not precise). Only 6.3% of GPT-5’s predictions, on average, were ‘anatomy errors’ – meaning the prediction was in an anatomically implausible region. GPT-4 had a higher rate of anatomy errors (18.0%), and MedGemma exhibited the most (29.9%). This indicates that while MLLMs might not always be precise, GPT-5 generally demonstrated a better underlying understanding of chest anatomy.
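The error taxonomy above can be expressed as a simple decision rule. The sketch below is one interpretation of the categories as summarized here: it assumes each prediction comes with its overlap fraction and a yes/no judgment of whether the predicted cell lies in an anatomically plausible region. How the study made that plausibility judgment is not described in this article, so the `anatomically_plausible` flag is a hypothetical input.

```python
from collections import Counter


def categorize(overlap_fraction, anatomically_plausible, threshold=0.5):
    """Assign a prediction to one of the four outcome categories."""
    if overlap_fraction >= threshold:
        return "hit"
    if overlap_fraction > 0:
        return "partial hit"      # some overlap, but below the hit threshold
    if anatomically_plausible:
        return "position error"   # plausible region, no overlap
    return "anatomy error"        # implausible region


# Toy predictions: (overlap fraction, plausible region?). Not study data.
toy_predictions = [(0.8, True), (0.2, True), (0.0, True), (0.0, False)]
counts = Counter(categorize(f, p) for f, p in toy_predictions)
for category, count in counts.items():
    print(f"{category}: {count / len(toy_predictions):.1%}")
```

Checking overlap before plausibility means a prediction with any overlap is never counted as an anatomy error.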

Visualizations of prediction heatmaps further illustrated these trends. For instance, GPT-4 consistently predicted central grid cells for edema, often overlaying the heart, even though the location of edema varies more widely. GPT-5’s predictions for edema were more distributed over the lungs, aligning better with ground truth. However, even GPT-5 made dramatic errors, such as predicting the shoulder as the location of a pneumothorax (collapsed lung) in some cases.

Implications for AI in Medicine

The study highlights that while MLLMs are powerful for diagnostic tasks, they currently struggle with the fine-grained spatial reasoning required for precise pathology localization. The improvements seen in GPT-5 compared to GPT-4 are encouraging, suggesting that general-purpose MLLMs are a promising direction, potentially outperforming domain-specific models like MedGemma in generalization to novel tasks.

The authors suggest that for reliable clinical use, the best approach might involve ‘agentic strategies’ that combine the flexibility of LLMs with task-specific tools. This research underscores the need for continued systematic evaluation of foundation models as AI becomes increasingly integrated into clinical settings, so that they are applied safely and effectively.

For more details, you can read the full research paper here.
