TLDR: A new agentic Retrieval-Augmented Generation (RAG) framework significantly improves the diagnostic accuracy of large language models (LLMs) in radiology question answering. By enabling LLMs to autonomously decompose questions and iteratively retrieve targeted clinical evidence, the framework boosts performance, especially for small and mid-sized models, and reduces hallucinations. The approach also provides human-interpretable context, aiding expert radiologists.
Artificial intelligence, particularly large language models (LLMs), is increasingly valuable in radiology for tasks like interpreting images and assisting with clinical decisions. However, these models often rely on static training data, which can lead to incomplete or outdated information and sometimes generate incorrect or fabricated responses, known as hallucinations.
Retrieval-Augmented Generation (RAG) addresses this by connecting LLMs to external knowledge sources. While RAG helps ground responses in verified information and reduces hallucinations, existing systems typically perform only a single retrieval step. This limits their effectiveness on complex medical questions that require multiple rounds of reasoning and information gathering.
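A single-step retriever of the kind described above can be sketched roughly as follows. The tiny in-memory corpus, the keyword-overlap scoring, and the `answer_with_context` stub are illustrative stand-ins of our own, not components of any actual RAG system:

```python
# Minimal single-step RAG sketch: one retrieval pass, then one answer.
# The corpus and keyword-overlap scoring are toy stand-ins for a real
# vector store and an LLM-generated answer.

CORPUS = {
    "Pneumothorax": "air in the pleural space with a visible visceral pleural line",
    "Pulmonary embolism": "filling defect on CT pulmonary angiography",
    "Pneumonia": "focal consolidation with air bronchograms on chest radiograph",
}

def retrieve_once(question: str, k: int = 1) -> list[str]:
    """Single retrieval step: rank documents once by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [f"{title}: {text}" for title, text in ranked[:k]]

def answer_with_context(context: list[str]) -> str:
    # Stand-in for the LLM call: just report the topic of the top document.
    return context[0].split(":")[0] if context else "unknown"

context = retrieve_once("Which finding suggests air in the pleural space?")
print(answer_with_context(context))  # → Pneumothorax
```

Because retrieval happens exactly once, a question whose answer depends on a follow-up lookup can never trigger a second search; that is precisely the gap the agentic framework targets.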
A new research paper, titled “Agentic large language models improve retrieval-based radiology question answering,” introduces an agentic RAG framework that addresses this limitation. It allows LLMs to act more autonomously: breaking complex radiology questions into smaller parts, iteratively searching for specific clinical evidence from sources like Radiopaedia.org, and then combining that evidence into well-supported answers.
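The decompose-retrieve-synthesize loop described above can be sketched, under heavy simplification, as the control flow below. Every callable here (`decompose`, `retrieve`, `needs_more`, `synthesize`) is a hypothetical placeholder; in the real framework each would be an LLM call or a query against a source such as Radiopaedia.org:

```python
# Hedged sketch of an agentic RAG loop: the model decomposes the question,
# retrieves evidence for each sub-question, and may issue follow-up queries
# until it judges the evidence sufficient. All components are injected stubs.

from typing import Callable, Optional

def agentic_answer(
    question: str,
    decompose: Callable[[str], list[str]],       # LLM: split into sub-questions
    retrieve: Callable[[str], str],              # search tool, e.g. a Radiopaedia query
    needs_more: Callable[[str, list[str]], Optional[str]],  # LLM: next query, or None to stop
    synthesize: Callable[[str, list[str]], str], # LLM: compose the final grounded answer
    max_steps: int = 5,
) -> str:
    # Step 1: decompose the question and retrieve evidence for each part.
    evidence: list[str] = []
    for sub_q in decompose(question):
        evidence.append(retrieve(sub_q))
    # Step 2: iterative refinement, capped at max_steps to bound cost.
    for _ in range(max_steps):
        follow_up = needs_more(question, evidence)
        if follow_up is None:
            break
        evidence.append(retrieve(follow_up))
    # Step 3: synthesize an answer grounded in the gathered evidence.
    return synthesize(question, evidence)
```

The key difference from single-step RAG is the refinement loop: the model itself decides whether the gathered evidence suffices or whether another targeted query is needed.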
The researchers, including Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, and Soroosh Tayebi Arasteh, evaluated 24 different LLMs. These models varied widely in their architecture, size (from 0.5 billion to over 670 billion parameters), and training (general-purpose, reasoning-optimized, or clinically fine-tuned). They tested these models using 104 expert-curated radiology questions from established datasets.
The results were significant. The agentic retrieval system dramatically improved diagnostic accuracy. For instance, the average diagnostic accuracy across all LLMs increased from 64% with zero-shot prompting (no external help) and 68% with traditional RAG to 73% with the agentic framework. This shows a clear advantage of the iterative and autonomous reasoning approach.
Impact Across Model Sizes
The benefits of this agentic approach were most noticeable in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and smaller models (e.g., Qwen 2.5-7B improved from 55% to 71%). These models, while capable, often struggle to independently find and use relevant external clinical information. The agentic framework helps them by providing structured, multi-step guidance.
Interestingly, very large models (over 200 billion parameters) showed minimal improvement (less than 2%). This suggests that these massive models already possess extensive internal knowledge and strong reasoning abilities from their vast pre-training, making external retrieval less impactful for accuracy alone. However, even for these models, the agentic approach could still be valuable for increasing transparency and traceability of their answers.
Reducing Hallucinations and Improving Factual Grounding
A crucial finding was the reduction in hallucinations: with the agentic framework, the average hallucination rate fell to 9.4%, meaning models were less likely to give incorrect answers even when supplied with relevant context. The system retrieved clinically relevant context in 46% of cases, which helped ground responses in verifiable facts.
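One simple way to operationalize the two rates quoted above (this bookkeeping is our illustration, not the paper's evaluation code, and the trial values are invented):

```python
# Toy per-question records: each trial notes whether the retrieved context
# was judged clinically relevant and whether the final answer was correct.
trials = [
    {"context_relevant": True,  "answer_correct": True},
    {"context_relevant": True,  "answer_correct": False},  # wrong despite evidence
    {"context_relevant": False, "answer_correct": True},
    {"context_relevant": True,  "answer_correct": True},
]

# Share of cases where retrieval surfaced relevant context.
relevance_rate = sum(t["context_relevant"] for t in trials) / len(trials)

# Hallucination rate in this sense: wrong answers among trials
# where relevant context was available.
with_context = [t for t in trials if t["context_relevant"]]
hallucination_rate = sum(not t["answer_correct"] for t in with_context) / len(with_context)

print(f"relevant context: {relevance_rate:.0%}")        # → relevant context: 75%
print(f"hallucination rate: {hallucination_rate:.0%}")  # → hallucination rate: 33%
```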
Even clinically fine-tuned models, which are already specialized for medical applications, saw meaningful improvements. For example, MedGemma-27B improved from 71% to 81%. This indicates that agentic retrieval complements the foundational knowledge gained through fine-tuning, providing context-sensitive and up-to-date information.
Computational Considerations
While the agentic framework offers significant accuracy gains, it comes at increased computational cost. The average response time rose from 54 seconds for zero-shot prompting to 324 seconds with agentic inference, roughly a sixfold increase. This latency varies by model size and architecture, with smaller models experiencing the largest relative increases. Even so, the response times remain feasible for many clinical applications, especially non-emergent ones.
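Using the average response times quoted in this section, the relative overhead works out to a factor of six:

```python
zero_shot_s = 54   # average zero-shot response time, seconds
agentic_s = 324    # average agentic response time, seconds

slowdown = agentic_s / zero_shot_s
print(f"agentic inference is {slowdown:.1f}x slower on average")  # → 6.0x slower
```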
Supporting Human Experts
Beyond improving LLM performance, the agentic retrieval system also proved valuable as a decision-support tool for human experts. When a board-certified radiologist was given the same retrieved contextual reports as the AI system, their diagnostic accuracy significantly improved from 51% (unaided) to 68%. This demonstrates that the system successfully identifies and presents clinically meaningful information that directly aids human reasoning.
In conclusion, this research highlights the potential of agentic frameworks to enhance the accuracy, factual reliability, and interpretability of LLMs in radiology question answering. While further research is needed to optimize retrieval mechanisms and manage computational overhead, this approach represents a significant step towards more trustworthy and effective AI in clinical decision support.


