TLDR: A new agentic Retrieval-Augmented Generation (RAG) framework significantly improves the diagnostic accuracy of large language models (LLMs) in radiology question answering. By enabling LLMs to autonomously decompose questions and iteratively retrieve targeted clinical evidence, the framework boosts performance, especially for small and mid-sized models, and reduces hallucinations. The approach also provides human-interpretable context, aiding expert radiologists.
Artificial intelligence, particularly large language models (LLMs), is increasingly valuable in radiology for tasks like interpreting images and assisting with clinical decisions. However, these models often rely on static training data, which can lead to incomplete or outdated information and sometimes generate incorrect or fabricated responses, known as hallucinations.
Retrieval-Augmented Generation (RAG) addresses this by connecting LLMs to external knowledge sources. While RAG helps ground responses in verified information and reduces hallucinations, existing systems typically perform only a single retrieval step. This limits their effectiveness on complex medical questions that require multiple rounds of reasoning and information gathering.
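A single-step retriever of the kind described above can be sketched roughly as follows. The tiny in-memory corpus, the keyword-overlap scoring, and the `answer_with_context` stub are illustrative stand-ins of our own, not components of any actual RAG system:

```python
# Minimal single-step RAG sketch: one retrieval pass, then one answer.
# The corpus and keyword-overlap scoring are toy stand-ins for a real
# vector store and an LLM-generated answer.

CORPUS = {
    "Pneumothorax": "air in the pleural space with a visible visceral pleural line",
    "Pulmonary embolism": "filling defect on CT pulmonary angiography",
    "Pneumonia": "focal consolidation with air bronchograms on chest radiograph",
}

def retrieve_once(question: str, k: int = 1) -> list[str]:
    """Single retrieval step: rank documents once by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [f"{title}: {text}" for title, text in ranked[:k]]

def answer_with_context(context: list[str]) -> str:
    # Stand-in for the LLM call: just report the topic of the top document.
    return context[0].split(":")[0] if context else "unknown"

context = retrieve_once("Which finding suggests air in the pleural space?")
print(answer_with_context(context))  # → Pneumothorax
```

Because retrieval happens exactly once, a question whose answer depends on a follow-up lookup can never trigger a second search; that is precisely the gap the agentic framework targets.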
A new research paper, titled “Agentic large language models improve retrieval-based radiology question answering,” introduces an agentic RAG framework that addresses this limitation. It allows LLMs to act more autonomously: breaking complex radiology questions into smaller parts, iteratively searching for specific clinical evidence from sources like Radiopaedia.org, and then combining that evidence into well-supported answers.
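The decompose-retrieve-synthesize loop described above can be sketched, under heavy simplification, as the control flow below. Every callable here (`decompose`, `retrieve`, `needs_more`, `synthesize`) is a hypothetical placeholder; in the real framework each would be an LLM call or a query against a source such as Radiopaedia.org:

```python
# Hedged sketch of an agentic RAG loop: the model decomposes the question,
# retrieves evidence for each sub-question, and may issue follow-up queries
# until it judges the evidence sufficient. All components are injected stubs.

from typing import Callable, Optional

def agentic_answer(
    question: str,
    decompose: Callable[[str], list[str]],       # LLM: split into sub-questions
    retrieve: Callable[[str], str],              # search tool, e.g. a Radiopaedia query
    needs_more: Callable[[str, list[str]], Optional[str]],  # LLM: next query, or None to stop
    synthesize: Callable[[str, list[str]], str], # LLM: compose the final grounded answer
    max_steps: int = 5,
) -> str:
    # Step 1: decompose the question and retrieve evidence for each part.
    evidence: list[str] = []
    for sub_q in decompose(question):
        evidence.append(retrieve(sub_q))
    # Step 2: iterative refinement, capped at max_steps to bound cost.
    for _ in range(max_steps):
        follow_up = needs_more(question, evidence)
        if follow_up is None:
            break
        evidence.append(retrieve(follow_up))
    # Step 3: synthesize an answer grounded in the gathered evidence.
    return synthesize(question, evidence)
```

The key difference from single-step RAG is the refinement loop: the model itself decides whether the gathered evidence suffices or whether another targeted query is needed.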
The researchers, including Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, and Soroosh Tayebi Arasteh, evaluated 24 different LLMs. These models varied widely in their architecture, size (from 0.5 billion to over 670 billion parameters), and training (general-purpose, reasoning-optimized, or clinically fine-tuned). They tested these models using 104 expert-curated radiology questions from established datasets.
The results were significant. The agentic retrieval system dramatically improved diagnostic accuracy. For instance, the average diagnostic accuracy across all LLMs increased from 64% with zero-shot prompting (no external help) and 68% with traditional RAG to 73% with the agentic framework. This shows a clear advantage of the iterative and autonomous reasoning approach.
Impact Across Model Sizes
The benefits of this agentic approach were most noticeable in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and smaller models (e.g., Qwen 2.5-7B improved from 55% to 71%). These models, while capable, often struggle to independently find and use relevant external clinical information. The agentic framework helps them by providing structured, multi-step guidance.
Interestingly, very large models (over 200 billion parameters) showed minimal improvement (less than 2%). This suggests that these massive models already possess extensive internal knowledge and strong reasoning abilities from their vast pre-training, making external retrieval less impactful for accuracy alone. However, even for these models, the agentic approach could still be valuable for increasing transparency and traceability of their answers.
Reducing Hallucinations and Improving Factual Grounding
A crucial finding was the reduction in hallucinations: with the agentic framework, the average hallucination rate fell to 9.4%, meaning models were less likely to give incorrect answers even when supplied with relevant context. The system retrieved clinically relevant context in 46% of cases, which helped ground responses in verifiable facts.
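One simple way to operationalize the two rates quoted above (this bookkeeping is our illustration, not the paper's evaluation code, and the trial values are invented):

```python
# Toy per-question records: each trial notes whether the retrieved context
# was judged clinically relevant and whether the final answer was correct.
trials = [
    {"context_relevant": True,  "answer_correct": True},
    {"context_relevant": True,  "answer_correct": False},  # wrong despite evidence
    {"context_relevant": False, "answer_correct": True},
    {"context_relevant": True,  "answer_correct": True},
]

# Share of cases where retrieval surfaced relevant context.
relevance_rate = sum(t["context_relevant"] for t in trials) / len(trials)

# Hallucination rate in this sense: wrong answers among trials
# where relevant context was available.
with_context = [t for t in trials if t["context_relevant"]]
hallucination_rate = sum(not t["answer_correct"] for t in with_context) / len(with_context)

print(f"relevant context: {relevance_rate:.0%}")        # → relevant context: 75%
print(f"hallucination rate: {hallucination_rate:.0%}")  # → hallucination rate: 33%
```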
Even clinically fine-tuned models, which are already specialized for medical applications, saw meaningful improvements. For example, MedGemma-27B improved from 71% to 81%. This indicates that agentic retrieval complements the foundational knowledge gained through fine-tuning, providing context-sensitive and up-to-date information.
Computational Considerations
While the agentic framework offers significant accuracy gains, it comes at increased computational cost. The average response time rose from 54 seconds for zero-shot prompting to 324 seconds with agentic inference, roughly a sixfold increase. This latency varies by model size and architecture, with smaller models experiencing the largest relative increases. Even so, the response times remain feasible for many clinical applications, especially non-emergent ones.
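Using the average response times quoted in this section, the relative overhead works out to a factor of six:

```python
zero_shot_s = 54   # average zero-shot response time, seconds
agentic_s = 324    # average agentic response time, seconds

slowdown = agentic_s / zero_shot_s
print(f"agentic inference is {slowdown:.1f}x slower on average")  # → 6.0x slower
```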
Supporting Human Experts
Beyond improving LLM performance, the agentic retrieval system also proved valuable as a decision-support tool for human experts. When a board-certified radiologist was given the same retrieved contextual reports as the AI system, their diagnostic accuracy significantly improved from 51% (unaided) to 68%. This demonstrates that the system successfully identifies and presents clinically meaningful information that directly aids human reasoning.
In conclusion, this research highlights the potential of agentic frameworks to enhance the accuracy, factual reliability, and interpretability of LLMs in radiology question answering. While further research is needed to optimize retrieval mechanisms and manage computational overhead, this approach represents a significant step towards more trustworthy and effective AI in clinical decision support.


