TLDR: A scoping review of 251 studies on Retrieval-Augmented Generation (RAG) in medicine reveals that research heavily relies on public data and dense retrieval methods, often with English-centric models. Proprietary LLMs are common, but medical-specific LLMs are underutilized. Applications focus on question answering and report generation, primarily in Internal Medicine. A critical finding is the insufficient attention to ethical considerations like bias, safety, and deployment in low-resource settings, highlighting the need for clinical validation, transparency, and equitable adaptation for future implementation.
The medical field is constantly evolving, with new knowledge emerging at an unprecedented rate. This rapid expansion, coupled with the increasing complexity of patient care, presents significant challenges for healthcare professionals. Large Language Models (LLMs) have shown promise in assisting with these challenges, but they come with their own set of limitations, such as reliance on static training data, susceptibility to factual inaccuracies, limited explainability, and an inability to access private patient data.
A recent scoping review, titled “Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations,” delves into how Retrieval-Augmented Generation (RAG) technologies are being applied in medicine to overcome these LLM limitations. This comprehensive review analyzed 251 studies to map the implementation pathways, application patterns, and ethical considerations of RAG in healthcare.
Understanding RAG in Medicine
RAG enhances LLMs by allowing them to access and incorporate information from external knowledge sources during the generation process. This means LLMs can provide more up-to-date, relevant, and fact-grounded outputs. The review highlights that RAG systems typically follow an “index-retrieve-generate” pipeline, where relevant information is first retrieved from sources like research literature or clinical guidelines, and then used to augment the LLM’s response.
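The “index-retrieve-generate” pipeline can be sketched in a few lines. This is a minimal toy illustration, not a production system: the corpus snippets are hypothetical, indexing uses simple bag-of-words vectors rather than dense embeddings, and the generation step is stubbed as prompt assembly where a real system would call an LLM.

```python
import math
from collections import Counter

# Hypothetical snippets standing in for a medical knowledge source
# (e.g., guideline text or literature abstracts).
DOCUMENTS = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Hypertension guidelines recommend lifestyle changes before medication.",
    "PubMed indexes biomedical research literature.",
]

def embed(text: str) -> Counter:
    """Index step: a bag-of-words vector (real systems use dense embeddings)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Built once over the corpus, queried many times.
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Retrieve step: rank indexed documents by similarity to the query."""
    qv = embed(query)
    ranked = sorted(INDEX, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(query: str) -> str:
    """Generate step: here just prompt assembly; a real system would pass
    this augmented prompt to an LLM to produce a grounded answer."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = generate("What is the first-line treatment for type 2 diabetes?")
```

The key design point is that the knowledge base can be updated independently of the model: re-indexing new guidelines refreshes the system’s knowledge without retraining the LLM.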
Key Findings from the Review
The review uncovered several important trends in medical RAG research:
Data Sources: Most studies (over 80%) relied on publicly available data, such as biomedical scientific corpora (e.g., PubMed), clinical guidelines, and online information. Private data, like electronic health records, saw limited use due to privacy concerns and implementation complexities. This suggests that current RAG applications primarily focus on general medical knowledge rather than personalized healthcare.
Retrieval Methods: Dense retrieval methods were dominant, used in over 84% of studies. These methods often employ general or medical-specific embedding models (like BioBERT or MedCPT) to capture semantic relationships. However, a significant limitation identified was the reliance on English-centric embedding models, which restricts RAG’s effectiveness in non-English medical contexts and can exacerbate health inequities in low-resource languages.
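The advantage of dense retrieval over keyword matching is that it can connect semantically related phrases that share no words. The sketch below uses hand-assigned toy vectors purely for illustration; a real system would obtain embeddings from a model such as BioBERT or MedCPT.

```python
import math

# Hand-crafted 3-d vectors standing in for dense embeddings from a
# model like BioBERT or MedCPT (these numbers are illustrative only).
DOC_VECTORS = {
    "Myocardial infarction requires urgent reperfusion therapy.": [0.9, 0.1, 0.0],
    "Seasonal influenza vaccination is recommended annually.": [0.1, 0.9, 0.1],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dense_retrieve(query_vec: list[float], k: int = 1) -> list[str]:
    """Rank documents by embedding similarity, not lexical overlap."""
    ranked = sorted(DOC_VECTORS.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

# A query like "heart attack" shares no words with the top document, but
# its (assumed) embedding lies close to "myocardial infarction" in vector space.
heart_attack_vec = [0.85, 0.15, 0.05]
top_doc = dense_retrieve(heart_attack_vec)[0]
```

This is also where the English-centric limitation bites: if the embedding model was trained mostly on English text, queries in other languages map to poorly placed vectors, and retrieval quality degrades.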
Generative LLMs: Proprietary LLMs, mainly from OpenAI’s GPT series, were the most widely used, followed by open-weight LLMs (like DeepSeek, Gemma, LLaMA, and Qwen series). Interestingly, medical-specific LLMs were rarely applied, possibly due to limited public accessibility or slower development compared to general LLMs.
Medical Specialties and Applications: RAG applications were most concentrated in Internal Medicine, followed by Psychiatry, Neurology, and Radiology. The primary application scenario was medical question answering, supporting clinicians in evidence retrieval, diagnostic reasoning, and decision-making. Other notable applications included report generation (e.g., radiology or pathology reports), text summarization, and information extraction, all aimed at reducing clinician workload and improving information management.
Evaluation and Ethics: Evaluation methods showed a balance between automated metrics (for text generation quality and task performance) and human evaluation (for accuracy, completeness, relevance, and fluency). Crucially, the review found insufficient attention paid to ethical considerations such as bias (examined in less than 3% of studies), safety (addressed in less than 10%), and applications in low-resource settings (less than 3%). This highlights a significant gap in ensuring equitable and responsible deployment of RAG technologies.
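As one concrete example of the automated side of evaluation, token-overlap F1 (a common QA metric, used for instance in SQuAD-style benchmarks) scores a generated answer against a reference. This is a minimal sketch of that one metric, not the review’s full evaluation protocol:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("metformin is first-line",
                 "metformin is the first-line therapy")
```

Such surface metrics are cheap to run at scale but cannot judge clinical correctness or safety, which is why the review’s finding of sparse human and ethics-focused evaluation matters.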
Challenges and Future Directions
The review concludes that medical RAG is still in its early stages. To move towards real-world clinical implementation, several breakthroughs are needed. These include rigorous clinical validation to ensure factual accuracy and clinical actionability, establishing traceability and transparency mechanisms for outputs, and developing robust regulatory frameworks and ethical guidelines. Furthermore, significant progress is required in cross-linguistic and cross-cultural adaptation, as well as ensuring fairness in low-resource settings, to achieve safe, trustworthy, and responsible global use of RAG in healthcare.
The insights from this review by Rui Yang, Matthew Yu Heng Wong, Huitao Li, and their colleagues provide a critical roadmap for researchers and developers to address the current limitations and advance RAG technologies for the benefit of global healthcare.