TLDR: This research details a method to fine-tune MedGemma, a medical vision-language model, to generate highly accurate and clinically specific captions for medical images. By using a knowledge distillation pipeline with GPT-5 to create a synthetic dataset across dermatology, fundus, and chest radiography, and then applying QLoRA for efficient fine-tuning, the model significantly improved both image classification accuracy and the factual quality of its generated captions. This specialized captioning capability is crucial for enhancing multimodal Retrieval-Augmented Generation (RAG) systems that provide evidence-based guidance from Malaysian Clinical Practice Guidelines, reducing the risk of non-factual information from general models.
In the evolving landscape of healthcare, providing accurate and fact-based guidance is paramount, especially when dealing with critical medical information. Retrieval-Augmented Generation (RAG) systems have emerged as vital tools for this, particularly in contexts like accessing Malaysian Clinical Practice Guidelines (CPGs). However, a significant challenge arises when these systems need to interpret image-based queries. Generic Vision-Language Models (VLMs) often fall short, producing captions that lack the clinical specificity and factual grounding required for reliable medical decision support.
This research introduces and validates a novel framework designed to overcome this limitation. The core idea is to specialize the MedGemma model, a medically-aware VLM, to generate high-fidelity captions that can serve as superior, clinically precise queries for multimodal RAG systems. This enhancement is crucial because general VLMs, primarily trained on natural images, struggle with the intricate details of medical imaging, leading to inaccurate or clinically irrelevant descriptions.
Addressing Data Scarcity with Knowledge Distillation
A major hurdle in fine-tuning specialized medical models is the scarcity of large-scale, high-quality image-caption pairs in specific medical domains. To tackle this, the researchers employed an innovative knowledge distillation pipeline. This process leveraged the advanced capabilities of a state-of-the-art “teacher” model, GPT-5, known for its strong performance in complex multimodal clinical reasoning tasks. GPT-5 was used to create a rich, synthetic dataset by generating nuanced, clinically accurate interpretations from existing medical image classification datasets.
The synthetic dataset was meticulously crafted across three diverse medical domains: dermatology, fundus photography, and chest radiography. Images from publicly available datasets like APTOS (for diabetic retinopathy), NIH Chest X-Ray (for thoracic pathologies), and HAM10000 (for pigmented skin lesions) were systematically fed into the GPT-5 endpoint. The model was instructed to produce structured JSON outputs, including a canonical class label and a detailed description with sections like IMAGE TYPE, ANATOMICAL REGION, KEY FINDINGS, and CLINICAL SIGNIFICANCE. A rigorous filtering process ensured that only factually correct image-caption pairs, matching ground-truth labels, were retained, preventing the propagation of errors.
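To make the pipeline concrete, below is a minimal sketch of what one distillation call with label-based filtering might look like, assuming an OpenAI-style chat endpoint. The model identifier, prompt wording, and JSON keys here are illustrative stand-ins, not the paper's exact configuration.

```python
import base64
import json
from openai import OpenAI  # assumes an OpenAI-style chat endpoint

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a specialist clinician. Return a JSON object with keys "
    "'class_label' and 'description'; the description must contain the "
    "sections IMAGE TYPE, ANATOMICAL REGION, KEY FINDINGS, and "
    "CLINICAL SIGNIFICANCE."
)

def distil_caption(image_path: str, ground_truth_label: str,
                   model: str = "gpt-5"):
    """Ask the teacher model for a structured caption; keep it only if
    its predicted class matches the dataset's ground-truth label."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Interpret this medical image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    )
    record = json.loads(response.choices[0].message.content)

    # Filtering step: discard captions whose predicted label disagrees
    # with the ground truth, so teacher errors are not propagated.
    if record.get("class_label") != ground_truth_label:
        return None
    return record
```

Rejecting mismatched labels before a pair enters the corpus is what keeps the teacher's occasional misreads from being baked into the student.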
This filtering process yielded a high-quality, class-balanced fine-tuning corpus of 1,676 image-caption pairs. The dataset was then partitioned into training, validation, and test sets (70:20:10 split) to ensure a robust and unbiased evaluation of the model’s generalization capabilities.
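For reference, a 70:20:10 partition can be produced with two successive scikit-learn splits, as sketched below. Treating the split as class-stratified is an assumption made here to preserve the class balance the corpus was built with.

```python
from sklearn.model_selection import train_test_split

def split_70_20_10(pairs, labels, seed=42):
    """Stratified 70/20/10 train/val/test split of image-caption pairs."""
    # First carve off 70% for training.
    train, rest, y_train, y_rest = train_test_split(
        pairs, labels, train_size=0.70, stratify=labels, random_state=seed)
    # Split the remaining 30% into 20% validation and 10% test (a 2:1 ratio).
    val, test, _, _ = train_test_split(
        rest, y_rest, train_size=2 / 3, stratify=y_rest, random_state=seed)
    return train, val, test
```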
Fine-Tuning MedGemma with QLoRA
The base model for this study was MedGemma-4B-IT, a VLM built on the Gemma 3 architecture with a specialized vision encoder (MedSigLIP) tuned for medical data. Given the substantial computational resources required for fine-tuning a 4-billion-parameter model, a Parameter-Efficient Fine-Tuning (PEFT) strategy called Quantized Low-Rank Adaptation (QLoRA) was employed. QLoRA drastically reduces memory overhead while maintaining high model performance.
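A minimal sketch of such a QLoRA setup with Hugging Face transformers, bitsandbytes, and peft is shown below; the rank, alpha, and target modules are illustrative defaults rather than the paper's reported hyperparameters.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the frozen base weights small in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Only small low-rank adapters on the attention projections are trained;
# the quantized backbone stays frozen.
lora_config = LoraConfig(
    r=16,                       # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

With 4-bit quantization the frozen 4B-parameter backbone fits on a single modern GPU, while gradients flow only through the adapters.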
The fine-tuning utilized an instruction-based format, teaching the model to generate structured clinical captions. Each dataset sample was formatted conversationally, with a system prompt establishing MedGemma’s persona as a “specialist clinician and image interpreter,” a user prompt asking for an interpretation, and the ground-truth JSON object from the distilled dataset serving as the target assistant response. This setup guided the model to learn both the content and the precise, structured format of the clinical captions.
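A sketch of how one distilled sample might be packed into this conversational format follows; the prompt wording is paraphrased from the description above rather than taken verbatim from the paper.

```python
import json

def to_chat_sample(image, caption_json: dict) -> dict:
    """Format one distilled image-caption pair in the instruction-based
    conversational format described above (prompt wording paraphrased)."""
    return {
        "messages": [
            {"role": "system",
             "content": [{"type": "text", "text":
                 "You are a specialist clinician and image interpreter."}]},
            {"role": "user",
             "content": [
                 {"type": "image", "image": image},
                 {"type": "text", "text":
                  "Provide a structured clinical interpretation of this image."},
             ]},
            # The teacher's filtered JSON caption is the training target.
            {"role": "assistant",
             "content": [{"type": "text",
                          "text": json.dumps(caption_json)}]},
        ]
    }
```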
Rigorous Evaluation of Performance
To comprehensively assess the fine-tuned MedGemma model, a multi-faceted evaluation framework was designed. This framework evaluated two key areas: the model’s ability to correctly classify medical images (Ground-Truth Concordance) and the factual quality and visual groundedness of its generated captions (Caption Fidelity and Quality).
For classification, standard metrics such as Accuracy, Balanced Accuracy, Precision, Recall, and F1-Score were used. For caption fidelity, traditional NLP metrics like BLEU or ROUGE were deemed insufficient as they measure lexical overlap rather than clinical accuracy. Instead, the RAGAS framework was employed. In this setup, the high-quality teacher-generated description served as both the “context” and “ground_truth,” while MedGemma’s output was treated as the “answer.” RAGAS scored captions across three dimensions: Faithfulness (factual consistency with context), Answer Relevancy (how well the caption described key findings), and Answer Correctness (factual alignment with ground_truth).
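Wiring this up with the ragas library is straightforward; below is a sketch under its 0.1-style API, with placeholder strings where the real teacher and student captions would go. Note that RAGAS itself calls an LLM judge behind the scenes, so a backend API key is required.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, answer_correctness

# Placeholder strings; in practice these come from the teacher model and
# the fine-tuned MedGemma for each test image.
teacher_description = "IMAGE TYPE: Fundus photograph. KEY FINDINGS: ..."
medgemma_caption = "IMAGE TYPE: Fundus photograph. KEY FINDINGS: ..."

data = {
    "question": ["Provide a structured clinical interpretation of this image."],
    "answer": [medgemma_caption],           # student model's caption
    "contexts": [[teacher_description]],    # teacher caption as "context"
    "ground_truth": [teacher_description],  # and as the reference answer
}

scores = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, answer_correctness],
)
print(scores)  # per-metric values in [0, 1]
```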
Significant Improvements Demonstrated
The empirical results showed consistent and substantial improvements for the fine-tuned MedGemma model over its baseline counterpart. In classification performance, the dermatology dataset saw the most significant gain, with accuracy surging from a low baseline of 0.0882 to 0.4265. Similar, albeit more moderate, gains were observed in the fundus photography and chest X-ray datasets, confirming the efficacy of domain-specific adaptation.
Crucially, the RAGAS evaluation confirmed that the fine-tuned model generated substantially higher-quality clinical descriptions. Across all three domains, the fine-tuned model achieved superior scores in all evaluated metrics, with dramatic improvements in faithfulness and answer correctness. For instance, faithfulness scores for the fundus dataset increased by nearly 90%, and answer correctness in the dermatology dataset nearly doubled. These results indicate that the fine-tuning process significantly reduced model “hallucinations,” making it a more reliable and factually accurate narrator of medical findings.
Laying the Groundwork for Enhanced Clinical Decision Support
This study successfully demonstrates that targeted, domain-specific fine-tuning can significantly enhance the performance of generalist medical foundation models like MedGemma. The resulting model serves as a validated, high-quality query generator, producing captions with improved diagnostic accuracy and factual fidelity. This work lays the essential groundwork for enhancing multimodal RAG systems, enabling them to provide grounded, evidence-based clinical decision support from Malaysian CPGs based on initial image queries.
While promising, the study acknowledges limitations, such as performance being capped by the teacher model’s quality and the scope being limited to three medical domains. Future work will focus on integrating this specialized captioning model into an end-to-end multimodal RAG system to empirically measure the performance of the full pipeline. For more details, you can refer to the full research paper here.