Unpacking AI’s Spatial Reasoning: How Multimodal LLMs Localize Disease in Chest X-rays

TLDR: A new study evaluated Multimodal Large Language Models (MLLMs) like GPT-4, GPT-5, and MedGemma on their ability to localize pathologies in chest radiographs using a grid-based prompting method. While GPT-5 performed best among MLLMs (49.7% hit rate), all models lagged behind a CNN baseline and radiologists. GPT-5’s errors were mostly anatomically plausible, indicating a better spatial understanding than GPT-4 or MedGemma. The research highlights MLLMs’ current limitations in fine-grained spatial reasoning but also shows promising progress, suggesting a future where general-purpose MLLMs could be integrated with task-specific tools for clinical use.

The integration of Artificial Intelligence (AI) into healthcare, particularly in medical imaging, is a rapidly evolving field. While large language models (LLMs) and their multimodal counterparts (MLLMs) have shown impressive capabilities in diagnostic tasks and medical quizzes, a recent study delves into a crucial, yet often overlooked, aspect: their ability to precisely localize pathological findings in medical images.

A new research paper, titled “Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs,” explores how well leading MLLMs can pinpoint abnormalities on chest X-rays. This goes beyond simply identifying a disease; it assesses the models’ spatial understanding of human anatomy and disease presentation, which is vital for clinical practice and medical education.

The Challenge of Localization

In medical image interpretation, diagnosis is typically a classification task (e.g., identifying a disease), while localization involves pinpointing the exact area of an abnormality. For AI systems, these are distinct challenges: a model might classify a condition well yet struggle to show *where* it is located. This study addresses that gap by systematically evaluating the localization capabilities of three prominent MLLMs: two general-purpose models from OpenAI, GPT-4 and GPT-5, and the domain-specific model MedGemma.

How the Study Was Conducted

The researchers developed a clever prompting technique. They overlaid a standardized 8×8 grid onto chest radiograph images from the CheXlocalize dataset. The MLLMs were then prompted to identify the single grid cell where a confirmed pathology was most prominent. This method allowed for coordinate-based predictions, making it possible to evaluate spatial accuracy.
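
As a rough illustration of how a grid-cell answer can be tied back to image coordinates, here is a minimal sketch. The "C5"-style labels (letter for column, number for row), the function name cell_to_box, and the 1024-pixel image size are assumptions made for this example; the article does not say how the study labeled its cells.

```python
import string

GRID_SIZE = 8  # the article describes an 8x8 grid


def cell_to_box(label, width, height, grid_size=GRID_SIZE):
    """Map a cell label such as "C5" to a pixel box (left, top, right, bottom).

    Assumed labeling convention (hypothetical, not from the study):
    the letter selects the column (A = leftmost) and the number
    selects the row (1 = top). Right and bottom edges are exclusive.
    """
    col = string.ascii_uppercase.index(label[0].upper())
    row = int(label[1:]) - 1
    if not (0 <= col < grid_size and 0 <= row < grid_size):
        raise ValueError(f"cell {label!r} is outside the {grid_size}x{grid_size} grid")
    # Integer arithmetic keeps neighbouring cells flush, with no gaps or overlaps.
    left = col * width // grid_size
    right = (col + 1) * width // grid_size
    top = row * height // grid_size
    bottom = (row + 1) * height // grid_size
    return left, top, right, bottom


# On a 1024x1024 image, each cell is 128x128 pixels.
print(cell_to_box("C5", 1024, 1024))  # (256, 512, 384, 640)
```

Returning a pixel box rather than a point is what makes the later overlap-based scoring possible: the model's one-cell answer becomes a region that can be compared with a radiologist's annotation.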

Performance was measured using a ‘hit rate’ criterion: a prediction was considered a ‘hit’ if at least 50% of the predicted grid cell overlapped with the ground truth annotation provided by expert radiologists. The MLLMs’ performance was benchmarked against both a task-specific Convolutional Neural Network (CNN) baseline and an expert radiologist benchmark.
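
The hit criterion described above can be sketched in a few lines. This is an illustrative reading of the rule, not the study's code: the binary-mask representation of the annotation, the function names cell_coverage and is_hit, and the toy 4×4 mask are all assumptions.

```python
def cell_coverage(mask, box):
    """Fraction of the pixels in `box` that lie inside the annotated region.

    `mask` is a list of rows holding 0/1 values (1 = annotated pathology).
    `box` is (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    area = (right - left) * (bottom - top)
    if area <= 0:
        raise ValueError("box must have positive area")
    covered = sum(sum(row[left:right]) for row in mask[top:bottom])
    return covered / area


def is_hit(mask, box, threshold=0.5):
    """A prediction counts as a hit when at least half the cell is annotated."""
    return cell_coverage(mask, box) >= threshold


# Toy annotation: the pathology occupies the upper-right corner.
toy_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(cell_coverage(toy_mask, (2, 0, 4, 2)), is_hit(toy_mask, (2, 0, 4, 2)))  # 1.0 True
print(cell_coverage(toy_mask, (2, 2, 4, 4)), is_hit(toy_mask, (2, 2, 4, 4)))  # 0.25 False
```

Note that the threshold applies to the predicted cell, not to the annotation: a cell that sits entirely inside a large finding is a hit even though it covers only a small part of that finding.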

Key Findings: A Mixed Bag of Performance

Among the MLLMs, GPT-5 emerged with the highest average hit rate across nine pathologies, achieving 49.7%. GPT-4 followed with 39.1%, while MedGemma showed the lowest performance at 17.7%. However, all MLLMs underperformed both the CNN baseline (59.9%) and the human radiologist benchmark (80.1%). MedGemma’s performance was only marginally better than random chance.
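
To put "random chance" in concrete terms, one plausible reading (assumed here, not taken from the paper) is the hit rate of a model that picks one of the 64 cells uniformly at random. For a single image, that equals the share of cells that would count as hits. The function name random_hit_rate and the toy mask are hypothetical.

```python
def random_hit_rate(mask, grid_size=8, threshold=0.5):
    """Share of grid cells that would score as a hit for this annotation.

    `mask` is a list of rows holding 0/1 values (1 = annotated pathology).
    The result is the expected hit rate of a uniformly random cell guess.
    """
    height, width = len(mask), len(mask[0])
    if height < grid_size or width < grid_size:
        raise ValueError("mask must be at least grid_size pixels per side")
    hits = 0
    for row in range(grid_size):
        top = row * height // grid_size
        bottom = (row + 1) * height // grid_size
        for col in range(grid_size):
            left = col * width // grid_size
            right = (col + 1) * width // grid_size
            area = (right - left) * (bottom - top)
            covered = sum(sum(r[left:right]) for r in mask[top:bottom])
            if covered >= threshold * area:
                hits += 1
    return hits / grid_size ** 2


# Toy 16x16 annotation whose finding fills the top-left 4x4 pixels,
# i.e. exactly four of the 64 grid cells.
toy_mask = [[1 if r < 4 and c < 4 else 0 for c in range(16)] for r in range(16)]
print(random_hit_rate(toy_mask))  # 0.0625
```

Under this reading, the chance level depends on how large the annotated findings are, so it would differ from one pathology to the next.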

Interestingly, GPT-4 and GPT-5 actually *outperformed* the CNN baseline on specific pathologies like enlarged cardiomediastinum and cardiomegaly. These conditions tend to appear in consistent anatomical locations, suggesting MLLMs might excel when the spatial variability of a finding is low. GPT-5 also showed significant improvements over GPT-4 on pathologies with more variable spatial positions, such as atelectasis, consolidation, edema, and pleural effusion.

Understanding the Errors

To gain deeper insights, the researchers categorized the models’ errors. They found that GPT-5’s misses were largely ‘partial hits’ (some overlap) or ‘position errors’ (predicted cell was anatomically plausible but not precise). Only 6.3% of GPT-5’s predictions, on average, were ‘anatomy errors’ – meaning the prediction was in an anatomically implausible region. GPT-4 had a higher rate of anatomy errors (18.0%), and MedGemma exhibited the most (29.9%). This indicates that while MLLMs might not always be precise, GPT-5 generally demonstrated a better underlying understanding of chest anatomy.
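
The error taxonomy can be sketched as a small decision rule. Two things here are assumptions rather than details from the article: the order in which the categories are checked, and the `plausible` flag, which stands in for whatever anatomical judgment the researchers applied. The names categorize and category_shares are hypothetical.

```python
from collections import Counter


def categorize(coverage, plausible, threshold=0.5):
    """Assign one prediction to a category.

    coverage:  fraction of the predicted cell overlapping the annotation.
    plausible: True if the cell lies in an anatomically plausible region
               (supplied by the caller; how the study decided this is
               not described in the article).
    """
    if coverage >= threshold:
        return "hit"
    if coverage > 0:
        return "partial hit"
    if plausible:
        return "position error"
    return "anatomy error"


def category_shares(predictions):
    """Return each category's share of all predictions."""
    counts = Counter(categorize(cov, ok) for cov, ok in predictions)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Four made-up predictions, one per category.
sample = [(0.75, True), (0.25, True), (0.0, True), (0.0, False)]
print(category_shares(sample))
# {'hit': 0.25, 'partial hit': 0.25, 'position error': 0.25, 'anatomy error': 0.25}
```

Separating position errors from anatomy errors is what lets the study distinguish a model that is imprecise from one that lacks a sense of where structures sit in the chest.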

Visualizations of prediction heatmaps further illustrated these trends. For instance, GPT-4 consistently predicted central grid cells for edema, often overlying the heart, even though the location of edema varies more widely. GPT-5’s predictions for edema were more distributed over the lungs, aligning better with the ground truth. However, even GPT-5 made dramatic errors, such as predicting the shoulder as the location of a pneumothorax (collapsed lung) in some cases.

Implications for AI in Medicine

The study highlights that while MLLMs are powerful for diagnostic tasks, they currently struggle with the fine-grained spatial reasoning required for precise pathology localization. The improvements seen in GPT-5 compared to GPT-4 are encouraging, suggesting that general-purpose MLLMs are a promising direction, potentially outperforming domain-specific models like MedGemma in generalization to novel tasks.

The authors suggest that for reliable clinical use, the best strategy might involve ‘agentic strategies’ that combine the flexibility of LLMs with task-specific tools. This research underscores the critical need for continued systematic evaluation of foundation models as AI becomes increasingly integrated into clinical settings, ensuring their safe and effective application.

For more details, you can read the full research paper here.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
