TLDR: M3Retrieve is a new, large-scale benchmark for evaluating multimodal retrieval models in medicine. It combines text and image data across 16 medical fields and four tasks, addressing the current lack of standardized evaluation for AI systems that need to understand both text and visuals in healthcare. The benchmark reveals that while multimodal models excel in tasks requiring both data types, text-only models still lead in purely text-based medical retrieval.
As Retrieval-Augmented Generation (RAG) systems become more common, a model's ability to find and use the right information efficiently matters more than ever. This is especially true in healthcare, where accurate and timely retrieval can directly affect patient care and medical research. Medical data often comes in multiple formats, combining textual descriptions with images such as X-rays, MRIs, and histopathological slides. Until now, however, there has been no standardized way to evaluate how well multimodal retrieval models perform in realistic medical settings.
To address this gap, researchers have introduced M3Retrieve, a Multimodal Medical Retrieval Benchmark. It provides a systematic evaluation framework for models that must understand and retrieve information from both text and images in medicine. M3Retrieve spans 5 domains and 16 medical fields, and contains over 1.2 million text documents and 164,000 multimodal queries. All data was collected under approved licenses.
The Unique Challenges of Medical Retrieval
The medical field presents several unique complexities that make information retrieval particularly challenging:
- Complex Medical Terminologies: Medical language is highly specialized and often requires AI systems to interpret intricate terms and provide plain-language explanations.
- Multiple Niche Specialties: Medicine is divided into numerous specialized disciplines, each with its own specific information needs. A benchmark must assess an AI model’s ability to generalize across these diverse areas.
- Complex Image-Text Relationships: Medical images can appear similar but represent different conditions when combined with patient history. Multimodal retrieval systems must accurately interpret these combined data points.
M3Retrieve’s Comprehensive Task Suite
Guided by consultations with healthcare professionals, M3Retrieve defines four core retrieval tasks that mirror routine information-seeking workflows in medicine:
1. Visual Context Retrieval: The model is given an image and a short text or caption, and must retrieve the most relevant passage of text. For instance, the query might be an image of a specific anatomical structure paired with a question about its function.
2. Multimodal Summary Retrieval: Here, the AI receives a multimodal context, such as a clinical note and an associated medical image (e.g., an X-ray). Its goal is to retrieve the most relevant summary that integrates information from both modalities.
3. Multimodal Query-to-Image Retrieval: In this task, the AI is given a textual medical query or dialogue along with a visual context (a reference image). It then needs to retrieve the most visually similar or relevant image from a pool of candidates.
4. Case Study Retrieval: This involves presenting the AI with a multimodal clinical query, which could be a patient complaint or diagnostic note combined with a clinical image. The model must then retrieve the most relevant past case study from a documented set of medical cases.
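All four tasks share the same basic shape: a query that combines text and an image is compared against a pool of candidates, and the best matches are returned. The sketch below illustrates that shape only. It assumes that query and candidate embeddings already exist as plain lists of floats, and the element-wise averaging in `fuse` is a hypothetical fusion choice made for illustration, not the method used by M3Retrieve or by any of the evaluated models. The names `cosine`, `fuse`, and `retrieve` are likewise illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(text_vec, image_vec):
    """Combine text and image embeddings by element-wise averaging.

    Assumption: both vectors live in the same embedding space and have
    the same length. Real systems may fuse modalities very differently.
    """
    return [(t + i) / 2 for t, i in zip(text_vec, image_vec)]


def retrieve(query_text_vec, query_image_vec, corpus, k=10):
    """Return the top-k (doc_id, score) pairs for a multimodal query.

    corpus: dict mapping a candidate id to its (precomputed) embedding.
    """
    query = fuse(query_text_vec, query_image_vec)
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy example with made-up two-dimensional embeddings.
corpus = {"passage_a": [1.0, 0.0], "passage_b": [0.0, 1.0]}
top = retrieve([1.0, 0.0], [1.0, 0.2], corpus, k=1)
print(top[0][0])  # passage_a
```

In the Query-to-Image task the candidates would be image embeddings rather than passage embeddings, but the ranking step would look the same under these assumptions.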
Data Curation and Quality Control
The benchmark's data is curated from open-access sources, including Wikipedia, PubMed, open-access medical textbooks, MedPix 2.0 (a radiology teaching file), and MultiCaRe (a dataset of de-identified clinical case reports). Medical experts were involved throughout the curation process, giving feedback on data source selection and essential medical modalities, and helping establish accurate relevance mappings. To validate the dataset's reliability, two doctors reviewed a sample of queries from each task, yielding a Cohen's kappa of 0.78. A value in this range is conventionally read as substantial inter-annotator agreement, which supports the reliability of the relevance mappings.
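Cohen's kappa corrects the raw agreement rate between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement implied by each annotator's label frequencies. The following is a minimal sketch of that formula. The label lists are invented for illustration and are not taken from the M3Retrieve annotation study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical relevance judgements (1 = relevant, 0 = not relevant).
doctor_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
doctor_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(doctor_1, doctor_2), 2))  # 0.6
```

In this toy example the doctors agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.5, so kappa is about 0.6, noticeably lower than the raw agreement rate.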
Key Findings from Model Evaluation
The researchers evaluated several state-of-the-art retrieval models, including a lexical baseline (BM25), text-based dense encoders (E5, BGE, NV-Embed), CLIP-style models (MM Ret, MedImageInsight, CLIP SF), multimodal encoders (BLIP FF, MM-Embed), and late-interaction models (FLMR). The primary evaluation metric was NDCG@10 (normalized discounted cumulative gain over the top 10 results).
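NDCG@10 rewards a system for placing relevant items near the top of its first ten results, and normalizes the score by the best ordering achievable for that query. The sketch below shows one common formulation (linear gain with a log2 rank discount) for a single query. It is illustrative only: the benchmark's own evaluation code may use a different gain function or tooling, and the function names and toy relevance labels here are assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: candidate ids in the order the retriever returned them.
    qrels: dict mapping a candidate id to its relevance grade (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy example: one relevant passage, returned in second place.
qrels = {"passage_a": 1}
print(round(ndcg_at_k(["passage_b", "passage_a"], qrels), 4))  # 0.6309
```

A benchmark-level score would typically be the mean of this per-query value over all queries in a task.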
The results revealed interesting trends:
- Multimodal Advantage: Models capable of integrating both text and visual information, such as MM-Embed and MedImageInsight, showed superior performance in tasks that inherently require both modalities, like Visual Context Retrieval and Query to Image Retrieval.
- Text-Centric Dominance: For tasks that are predominantly textual, such as Summary Retrieval and Case Study Retrieval, unimodal dense retrievers like NV-Embed currently perform best. This suggests that while multimodal capabilities are advancing, text-only models remain highly effective when the relevant information is mostly textual.
M3Retrieve is a first step towards a more diverse and comprehensive multimodal medical retrieval benchmark. It highlights the strengths and weaknesses of current models in complex medical scenarios and underscores the need for further research into specialized multimodal retrieval systems for healthcare. The dataset and baseline code are openly available on GitHub. The full research paper is titled "M3Retrieve: Benchmarking Multimodal Retrieval for Medicine".


