TLDR: MedMKEB is the first comprehensive benchmark for evaluating how well medical multimodal large language models (MLLMs) can update their knowledge. It assesses five key areas: reliability, locality, generality, portability, and robustness, using a high-quality medical visual question-answering dataset. Experiments reveal that current knowledge editing methods have significant limitations in the medical domain, highlighting the need for specialized strategies to ensure these AI models can accurately and safely adapt to new medical information.
In the rapidly evolving field of medical artificial intelligence, Multimodal Large Language Models (MLLMs) are becoming indispensable tools. These advanced AI systems can understand and process both visual information, like medical images, and textual data, such as patient records or clinical guidelines. This capability allows them to assist with complex tasks like answering clinical questions and interpreting medical images, significantly enhancing medical AI applications.
However, medical knowledge is constantly changing. New discoveries, updated guidelines, and evolving best practices mean that AI models need to be able to update their understanding efficiently without being completely retrained from scratch. While the concept of ‘knowledge editing’ – modifying or updating specific information within an AI model – has been extensively studied for text-based language models, there has been a significant gap in systematic benchmarks for medical multimodal knowledge editing, which involves both images and text.
To address this critical need, researchers have introduced MedMKEB, the first comprehensive benchmark specifically designed for evaluating knowledge editing in medical MLLMs. It assesses five properties of an edit: reliability, locality, generality, portability, and robustness.
What MedMKEB Evaluates
MedMKEB provides a multi-dimensional evaluation framework focusing on five key aspects:
- Reliability: This measures whether the model accurately incorporates new knowledge and provides answers consistent with the updated medical facts.
- Locality: This assesses whether the editing process affects only the intended knowledge, leaving unrelated information and the model’s existing capabilities unchanged.
- Generality: This evaluates the model’s ability to apply the newly edited knowledge to similar but previously unseen cases, demonstrating a true understanding rather than just memorization.
- Portability: This dimension tests whether the updated knowledge can be effectively transferred and applied to related reasoning contexts or tasks, such as multi-hop reasoning chains.
- Robustness: For the first time in medical knowledge editing, MedMKEB introduces robustness, evaluating the model’s stability and accuracy when faced with adversarial prompts or subtle alterations in questions, simulating real-world clinical interferences.
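To make the five dimensions concrete, here is a minimal Python sketch of how such scores are commonly computed in knowledge editing evaluations. It is not MedMKEB's actual scoring code: the `EditCase` fields, the `score_edit` function, the exact-match comparison, and the idea of a model as a plain `(image, question) -> answer` callable are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumption: a model is any callable mapping (image_id, question) -> answer.
Model = Callable[[str, str], str]
Probe = Tuple[str, str]              # (image_id, question)
LabeledProbe = Tuple[str, str, str]  # (image_id, question, expected answer)


@dataclass
class EditCase:
    """One edit plus the probes used to score it (hypothetical schema)."""
    image: str
    question: str
    target: str                                                   # the new answer
    rephrases: List[Probe] = field(default_factory=list)          # generality
    unrelated: List[Probe] = field(default_factory=list)          # locality
    multi_hop: List[LabeledProbe] = field(default_factory=list)   # portability
    adversarial: List[Probe] = field(default_factory=list)        # robustness


def same(a: str, b: str) -> bool:
    """Case-insensitive exact match; a real benchmark may use a softer metric."""
    return a.strip().lower() == b.strip().lower()


def rate(hits: List[bool]) -> float:
    """Fraction of successful probes; 0.0 when there are no probes."""
    return sum(hits) / len(hits) if hits else 0.0


def score_edit(pre: Model, post: Model, case: EditCase) -> Dict[str, float]:
    """Score one edit given the model before (pre) and after (post) editing."""
    return {
        # Does the edited model give the new answer on the edited question?
        "reliability": rate([same(post(case.image, case.question), case.target)]),
        # Does it still give the new answer on rephrased or similar inputs?
        "generality": rate([same(post(i, q), case.target) for i, q in case.rephrases]),
        # Are answers to unrelated questions unchanged relative to the original model?
        "locality": rate([same(post(i, q), pre(i, q)) for i, q in case.unrelated]),
        # Does the new fact carry over to related, multi-hop questions?
        "portability": rate([same(post(i, q), a) for i, q, a in case.multi_hop]),
        # Does the new answer survive misleading or adversarial wording?
        "robustness": rate([same(post(i, q), case.target) for i, q in case.adversarial]),
    }
```

Note that locality is the only score that compares the edited model against the original model rather than against a target answer, which is why the sketch needs both `pre` and `post`.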
How MedMKEB Was Built
The benchmark is built upon a high-quality medical visual question-answering dataset called OmniMedVQA. Researchers meticulously constructed various editing tasks, including counterfactual corrections (changing a fact), semantic generalization (rephrasing questions or replacing images with similar ones), knowledge transfer (applying edits to related facts in a medical knowledge graph), and adversarial robustness (testing against prompt injection attacks like misleading context or symptom confusion). Crucially, all generated data underwent human expert validation to ensure its accuracy and professional relevance.
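As a rough illustration of two of these task types, the sketch below builds a counterfactual edit from a question-answer record and wraps a question in misleading context. Everything here is hypothetical: the record fields (`image`, `question`, `options`, `answer`), the function names, and the wording of the misleading note are assumptions, not the benchmark's actual construction pipeline, which also relied on human expert validation.

```python
import random
from typing import Dict


def make_counterfactual(record: Dict, rng: random.Random) -> Dict:
    """Turn a VQA record into an edit whose target is a deliberately wrong option.

    Assumes the record carries a list of answer options (hypothetical schema).
    """
    wrong = [option for option in record["options"] if option != record["answer"]]
    if not wrong:
        raise ValueError("record has no alternative option to use as an edit target")
    return {
        "image": record["image"],
        "question": record["question"],
        "original": record["answer"],   # what the model is expected to say before editing
        "target": rng.choice(wrong),    # what the edit should make it say instead
    }


def add_misleading_context(question: str, distractor: str) -> str:
    """Prefix a question with a misleading note (illustrative wording only)."""
    return f"A previous note states the finding is {distractor}. {question}"
```

Passing a seeded `random.Random` keeps the generated edits reproducible, which matters when the same cases are later reviewed by hand.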
Key Findings and Challenges
Extensive experiments were conducted on several state-of-the-art general and medical MLLMs, including BLIP2-OPT, MiniGPT-4, LLaVA, Biomed-Qwen2-VL, LLaVA-Med, and HuatuoGPT-Vision, using knowledge editing methods such as fine-tuning, MEND, SERAC, KE, and IKE. The results highlighted significant limitations of existing knowledge editing approaches in the medical domain:
- While many algorithms showed high reliability in general models, some, like SERAC, performed poorly in medical models, indicating a struggle with the specific nuances of medical counterfactuals.
- Algorithms like MEND demonstrated better locality, meaning they could update specific knowledge without broadly affecting unrelated information.
- A major challenge across all methods was portability, especially in multi-hop reasoning scenarios, where applying edited knowledge to new, related contexts proved difficult.
- Existing methods generally showed a decline in robustness when faced with adversarial prompts, suggesting a need for stronger defenses against such attacks in medical AI.
- Medical MLLMs, often fine-tuned from general models, sometimes overfit to training data, leading to reduced generalization and weakened contextual learning abilities.
- Current editing algorithms primarily focus on language model parameters, often lacking joint optimization for visual and text modules, which is crucial for multimodal medical data.
The study also analyzed the computational cost of these editing methods. While IKE was the most efficient in terms of time and memory as it doesn’t require parameter updates, methods like SERAC and KE incurred higher costs, with KE even encountering out-of-memory errors on some models.
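The reason an in-context method can avoid parameter updates is that the new fact travels in the prompt rather than in the weights. The sketch below shows that general idea only; the prompt template, the function name, and the demonstration format are assumptions and not IKE's actual implementation, and a multimodal model would additionally receive the image alongside this text.

```python
from typing import List, Tuple


def build_in_context_edit_prompt(
    new_fact: str,
    demos: List[Tuple[str, str, str]],
    question: str,
) -> str:
    """Compose a prompt that states the new fact before asking the question.

    demos holds (fact, question, answer) triples that show the model how to
    answer from a stated fact. The model's weights are never modified.
    """
    parts = [
        f"New fact: {fact}\nQuestion: {q}\nAnswer: {a}"
        for fact, q, a in demos
    ]
    # The real query goes last, with the answer left open for the model.
    parts.append(f"New fact: {new_fact}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)
```

The trade-off in this style of editing is that the edit lives only as long as the prompt does: every query that should reflect the new knowledge has to carry it, which costs context length instead of training time.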
Looking Ahead
The introduction of MedMKEB marks a significant step forward in developing trustworthy and efficient medical knowledge editing algorithms. The findings underscore the urgent need for specialized editing strategies tailored to the high precision, multimodal nature, and critical implications of medical knowledge. This benchmark will serve as a standard for future research, paving the way for safer and more adaptable AI systems in healthcare.


