Enhancing Medical Visual Question Answering with Lightweight AI

TLDR: The MasonNLP system, participating in MEDIQA-WV 2025, developed a method for medical visual question answering (MedVQA) using a general-purpose large language model (LLM) augmented with a lightweight retrieval-augmented generation (RAG) framework. This approach incorporates relevant textual and visual examples from a dataset to improve the accuracy, reasoning, and structure of responses to wound-care questions based on images and patient queries, achieving a 3rd place ranking without extensive domain-specific training.

Medical Visual Question Answering, or MedVQA, is an exciting field that allows healthcare professionals and patients to ask natural language questions about medical images. Imagine being able to ask an AI system about a wound image and getting a detailed, accurate response. This technology holds immense potential for improving clinical decision-making, supporting training, and making healthcare insights more accessible.

However, MedVQA comes with its own set of challenges. Unlike everyday photographs, medical images often contain subtle features that require precise interpretation. Questions frequently demand specialized medical knowledge and logical inference. Traditional methods often rely on extensive fine-tuning or large, domain-specific datasets, which can be resource-intensive and limit scalability.

The MEDIQA-WV 2025 Challenge: Wound Care VQA

The MEDIQA-WV 2025 shared task focused specifically on wound-care VQA. The goal was to develop systems that could generate both free-text responses and structured wound attributes (like wound type, thickness, and infection status) from patient queries and associated images. This dual requirement is crucial for both patient-facing guidance and for integrating data into electronic health records.

MasonNLP’s Innovative Approach: Lightweight RAG with General-Purpose LLMs

A team from George Mason University, MasonNLP, presented a highly effective system for this challenge. Their approach centered on using a general-domain, instruction-tuned large language model (specifically, Meta LLaMA-4 Scout 17B) within a Retrieval-Augmented Generation (RAG) framework. What makes this particularly noteworthy is that it achieved strong results without requiring extensive domain-specific training.

The core idea behind RAG is to “ground” the language model’s outputs in relevant examples. Instead of relying solely on its pre-trained knowledge, the system retrieves similar textual and visual examples from a dataset at the time of inference (when it’s generating an answer). These examples are then incorporated into the prompt, guiding the LLM to produce more accurate, contextually relevant, and structured responses. This “lightweight” RAG setup is minimal, adding a few relevant examples via simple indexing and fusion, without complex re-ranking or extra training.
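The prompt-assembly step described above can be illustrated with a minimal sketch. It assumes each retrieved case carries a patient query, a reference response, and an image reference; the `Example` class, its field names, and the prompt layout are hypothetical and are not taken from the MasonNLP system. In a real multimodal setup the images would be passed to the model as image inputs rather than as file paths in text.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One retrieved training case (field names are illustrative assumptions)."""
    query: str
    response: str
    image_path: str


def build_rag_prompt(instruction: str, retrieved: list[Example], query: str) -> str:
    """Place retrieved cases ahead of the new query as in-context exemplars."""
    parts = [instruction.strip(), ""]
    for i, ex in enumerate(retrieved, start=1):
        parts += [
            f"Example {i}",
            f"Image: {ex.image_path}",
            f"Patient query: {ex.query}",
            f"Response: {ex.response}",
            "",
        ]
    # The new case comes last so the model continues from "Response:".
    parts += ["New case", f"Patient query: {query}", "Response:"]
    return "\n".join(parts)
```

Calling `build_rag_prompt` with an empty `retrieved` list yields a zero-shot prompt, so the same function covers both the baseline and the retrieval-augmented setting.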

Why RAG is a Game-Changer for MedVQA

Retrieval augmentation helped the system in three ways. First, it allowed a general-domain LLM to handle a complex multimodal clinical task without costly and time-consuming domain-specific training.
Second, retrieving examples during inference improved the model’s reasoning and made its outputs easier to interpret, because they were grounded in real clinical cases from the dataset.
Third, it helped reduce “hallucinations” (where the AI generates factually incorrect information) and ensured better adherence to required output schemas, such as the structured wound attributes.
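To make the schema-adherence point concrete, here is a minimal sketch of a check one might run on a model's structured output. It assumes the model is asked to return a JSON object; the key names in `REQUIRED_FIELDS` are hypothetical stand-ins for the wound attributes mentioned earlier (wound type, thickness, infection status), not the official task schema.

```python
import json

# Hypothetical key names; the actual task schema may differ.
REQUIRED_FIELDS = ("wound_type", "thickness", "infection")


def check_structured_output(raw: str, required=REQUIRED_FIELDS):
    """Parse a model response and list schema problems.

    Returns (parsed, problems): parsed is the decoded object, or None if
    decoding failed; problems is a list of human-readable issues.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return None, ["top-level value is not an object"]
    problems = [f"missing field: {name}" for name in required if name not in parsed]
    return parsed, problems
```

An empty `problems` list means every required attribute is present; counting non-empty lists across a test set gives a rough measure of how often a prompting strategy breaks the schema.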

How the System Works

The MasonNLP system utilized the LLaMA-4 Scout 17B model. For the RAG component, they built two indices using FAISS: one for semantic text embeddings and another for vision-language embeddings (from CLIP). At inference, the system would retrieve the top two most similar training examples based on a combined text and image similarity score. These retrieved examples, including both images and text, were then added to the prompt given to the LLaMA-4 model.
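The retrieval step can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: brute-force NumPy cosine similarity stands in for the two FAISS indices (a flat inner-product index over L2-normalized vectors produces the same ranking), and the equal weighting `alpha = 0.5` is an assumption, since the article does not say how the text and image scores are combined.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def retrieve_top_k(text_index, image_index, text_query, image_query, k=2, alpha=0.5):
    """Return indices and scores of the k training cases with the highest fused score.

    text_index:  (n, d_text) text embeddings of the training cases
    image_index: (n, d_img)  image embeddings of the same cases, same row order
    alpha:       weight on text similarity (assumed value, not from the source)
    """
    text_sims = l2_normalize(text_index) @ l2_normalize(text_query[None, :])[0]
    image_sims = l2_normalize(image_index) @ l2_normalize(image_query[None, :])[0]
    combined = alpha * text_sims + (1.0 - alpha) * image_sims
    top = np.argsort(-combined)[:k]  # highest fused score first
    return top, combined[top]
```

The text and image embeddings may have different dimensions because each modality is scored against its own index; only the per-case similarity scores are fused.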

The team explored different prompting strategies: zero-shot (no examples), few-shot (a couple of pre-selected examples), and RAG (retrieved examples). Their ablation study showed that retrieval-augmented prompting, especially with both image and text retrieval, substantially outperformed the other methods across the evaluation metrics, including dBLEU, ROUGE, BERTScore, and judgments from large language models such as DeepSeek-V3, Gemini-1.5-pro, and GPT-4o.

Key Findings and Implications

The MasonNLP system ranked 3rd among 19 teams and 51 submissions in the MEDIQA-WV 2025 shared task, achieving an average score of 41.37%. This competitive performance highlights the robustness of their approach. The study demonstrated a clear progression in performance: zero-shot prompting yielded very low scores, few-shot improved formatting but lacked clinical detail, and RAG with textual exemplars significantly boosted specificity and structure. Adding image retrieval further enhanced contextual grounding, particularly for wound-site descriptions and infection cues.

In essence, this research shows that combining powerful general-purpose large language models with a simple, lightweight retrieval-augmented generation framework can create transparent, flexible, and efficient solutions for complex clinical natural language processing and multimodal AI tasks. It shifts AI from generic advice to more specific, schema-consistent, and less hallucinatory answers, making it a promising direction for future advancements in healthcare AI.

For more technical details, you can refer to the full research paper: MasonNLP at MEDIQA-WV 2025: Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA.
