TLDR: The MasonNLP system, participating in MEDIQA-WV 2025, developed a method for medical visual question answering (MedVQA) using a general-purpose large language model (LLM) augmented with a lightweight retrieval-augmented generation (RAG) framework. This approach incorporates relevant textual and visual examples from a dataset to improve the accuracy, reasoning, and structure of responses to wound-care questions based on images and patient queries, achieving a 3rd place ranking without extensive domain-specific training.
Medical Visual Question Answering, or MedVQA, is an exciting field that allows healthcare professionals and patients to ask natural language questions about medical images. Imagine being able to ask an AI system about a wound image and getting a detailed, accurate response. This technology holds immense potential for improving clinical decision-making, supporting training, and making healthcare insights more accessible.
However, MedVQA comes with its own set of challenges. Unlike general-domain visual question answering, medical images often contain subtle features that require precise interpretation, and the questions frequently demand specialized medical knowledge and logical inference. Traditional methods often rely on extensive fine-tuning or large, domain-specific datasets, which is resource-intensive and limits scalability.
The MEDIQA-WV 2025 Challenge: Wound Care VQA
The MEDIQA-WV 2025 shared task focused specifically on wound-care VQA. The goal was to develop systems that could generate both free-text responses and structured wound attributes (like wound type, thickness, and infection status) from patient queries and associated images. This dual requirement is crucial for both patient-facing guidance and for integrating data into electronic health records.
MasonNLP’s Innovative Approach: Lightweight RAG with General-Purpose LLMs
A team from George Mason University, MasonNLP, presented a highly effective system for this challenge. Their approach centered on using a general-domain, instruction-tuned large language model (specifically, Meta LLaMA-4 Scout 17B) within a Retrieval-Augmented Generation (RAG) framework. What makes this particularly noteworthy is that it achieved strong results without requiring extensive domain-specific training.
The core idea behind RAG is to “ground” the language model’s outputs in relevant examples. Instead of relying solely on its pre-trained knowledge, the system retrieves similar textual and visual examples from a dataset at the time of inference (when it’s generating an answer). These examples are then incorporated into the prompt, guiding the LLM to produce more accurate, contextually relevant, and structured responses. This “lightweight” RAG setup is minimal, adding a few relevant examples via simple indexing and fusion, without complex re-ranking or extra training.
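To make the idea of placing retrieved examples in the prompt concrete, here is a minimal sketch of how such a prompt might be assembled. It is an illustration only, not the team's actual template: the `Exemplar` class, its field names, the section markers, and the `image_ref` placeholder (standing in for however images are passed to a multimodal model) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """A retrieved training case. All field names are hypothetical."""
    query: str
    image_ref: str  # placeholder for however the image is passed to the model
    response: str


def build_rag_prompt(instruction: str, exemplars: list[Exemplar],
                     query: str, image_ref: str) -> str:
    """Place retrieved exemplars ahead of the new case in a single prompt."""
    parts = [instruction.strip(), ""]
    for i, ex in enumerate(exemplars, start=1):
        parts += [
            f"### Example {i}",
            f"Image: {ex.image_ref}",
            f"Patient query: {ex.query}",
            f"Response: {ex.response}",
            "",
        ]
    # The new case comes last and ends with an open "Response:" slot.
    parts += [
        "### New case",
        f"Image: {image_ref}",
        f"Patient query: {query}",
        "Response:",
    ]
    return "\n".join(parts)
```

With an empty exemplar list, the same function produces a zero-shot prompt, which makes it easy to compare prompting with and without retrieved examples.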
Why RAG is a Game-Changer for MedVQA
- RAG allowed a general-domain LLM to handle complex multimodal clinical tasks effectively, bypassing the need for costly and time-consuming domain-specific training.
- Retrieving examples during inference improved the model’s reasoning and made its outputs more interpretable, because they were grounded in real clinical examples from the dataset.
- RAG helped reduce “hallucinations” (where the AI generates factually incorrect information) and improved adherence to required output schemas, such as the structured wound attributes.
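Schema adherence, as mentioned in the last point, is something that can be checked mechanically. The sketch below shows one possible way to validate a model's structured output, assuming it is returned as JSON. The attribute names follow the task description (wound type, thickness, infection status), but the field keys and the allowed values are hypothetical placeholders, not the official label set of the shared task.

```python
import json

# Hypothetical schema: the keys mirror the attributes named in the task
# description; the allowed values are placeholders, not the official labels.
SCHEMA = {
    "wound_type": {"surgical", "traumatic", "pressure"},
    "wound_thickness": {"superficial", "partial", "full"},
    "infection": {"infected", "not_infected", "unclear"},
}


def validate_attributes(raw: str) -> list[str]:
    """Return a list of schema violations for a JSON output (empty if valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    for field, allowed in SCHEMA.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], str) or data[field] not in allowed:
            problems.append(f"unexpected value for {field}: {data[field]!r}")
    return problems
```

A check like this could be used to count how often each prompting strategy produces well-formed output.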
How the System Works
The MasonNLP system used the LLaMA-4 Scout 17B model. For the RAG component, the team built two FAISS indices: one over semantic text embeddings and another over vision-language embeddings from CLIP. At inference, the system retrieved the two most similar training examples based on a combined text and image similarity score. These retrieved examples, including both images and text, were then added to the prompt given to the LLaMA-4 model.
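The retrieval step can be sketched in a few lines. This is an illustration under stated assumptions, not the team's implementation: it uses a brute-force cosine search in plain Python as a stand-in for the FAISS indices, the embeddings are assumed to be precomputed, and the weighted sum controlled by `alpha` is just one possible way to combine the two scores (the article does not specify the fusion formula). The corpus layout with `id`, `text_vec`, and `image_vec` keys is hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_text_vec, query_image_vec, corpus, k=2, alpha=0.5):
    """Rank training cases by a weighted sum of text and image similarity.

    corpus: list of dicts with hypothetical keys 'id', 'text_vec', 'image_vec'.
    alpha:  assumed weight on text similarity; (1 - alpha) goes to the image.
    """
    scored = []
    for item in corpus:
        text_sim = cosine(query_text_vec, item["text_vec"])
        image_sim = cosine(query_image_vec, item["image_vec"])
        combined = alpha * text_sim + (1 - alpha) * image_sim
        scored.append((combined, item["id"]))
    # Highest combined score first; keep the k best case ids.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]
```

The default `k=2` mirrors the two retrieved examples described above. For a larger training set, the brute-force loop would typically be replaced by an approximate or exact nearest-neighbor index.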
The team explored three prompting strategies: zero-shot (no examples), few-shot (a couple of pre-selected examples), and RAG (retrieved examples). Their ablation study showed that retrieval-augmented prompting, especially with both image and text retrieval, significantly outperformed the other methods across the evaluation metrics, including dBLEU, ROUGE, BERTScore, and assessments by large language model judges such as DeepSeek-V3, Gemini-1.5-pro, and GPT-4o.
Key Findings and Implications
The MasonNLP system ranked 3rd among 19 teams and 51 submissions in the MEDIQA-WV 2025 shared task, achieving an average score of 41.37%. This competitive performance highlights the robustness of their approach. The study demonstrated a clear progression in performance: zero-shot prompting yielded very low scores, few-shot improved formatting but lacked clinical detail, and RAG with textual exemplars significantly boosted specificity and structure. Adding image retrieval further enhanced contextual grounding, particularly for wound-site descriptions and infection cues.
In essence, this research shows that combining a powerful general-purpose large language model with a simple, lightweight retrieval-augmented generation framework can yield transparent, flexible, and efficient solutions for complex clinical natural language processing and multimodal AI tasks. It moves the model's output from generic advice toward answers that are more specific, schema-consistent, and less prone to hallucination, making it a promising direction for future work in healthcare AI.
For more technical details, you can refer to the full research paper: MasonNLP at MEDIQA-WV 2025: Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA.


