TLDR: EMRRG is a framework for generating medical reports from X-ray images. It efficiently fine-tunes a pre-trained Mamba vision backbone and integrates a hybrid decoder into a large language model. It achieves strong performance on benchmark datasets while training only 2.3% of the parameters that full fine-tuning would update, which makes it efficient for clinical applications.
A new framework called EMRRG has been introduced to improve the generation of medical reports from X-ray images. Automated report generation matters for healthcare AI because it can lessen the diagnostic burden on clinicians and reduce patient waiting times. Current models for medical report generation (MRG) rely heavily on large language models (LLMs) but have not fully explored pre-trained vision foundation models or advanced fine-tuning techniques. Furthermore, while Transformer-based models are prevalent in vision-language tasks, non-Transformer architectures such as the Mamba network have remained largely untapped for medical report generation.
The EMRRG framework addresses these gaps by efficiently fine-tuning pre-trained Mamba networks. The process begins with an X-ray image, which is first divided into patches and converted into tokens. These tokens are then processed by a vision backbone based on the State Space Model (SSM), specifically a Mamba network, to extract essential features. The researchers found that a technique called Partial LoRA yielded the best performance for this feature extraction step.
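The patch-and-tokenize step described above can be sketched in a few lines of NumPy. This is only an illustration: the image size (224), patch size (16), embedding width (192), and the random matrix `w_embed` are assumptions chosen for the example, not values taken from the paper, and the Mamba backbone itself is not modeled.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W) grayscale image into flattened, non-overlapping patches."""
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (rows*cols, patch*patch)
    blocks = image.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return blocks.reshape(rows * cols, patch * patch)

rng = np.random.default_rng(0)
xray = rng.random((224, 224))                       # stand-in for a preprocessed X-ray
patches = patchify(xray, patch=16)                  # 14 * 14 = 196 patches of 256 pixels
w_embed = rng.normal(scale=0.02, size=(256, 192))   # hypothetical linear patch embedding
tokens = patches @ w_embed                          # token sequence handed to the backbone
print(tokens.shape)
```

Each row of `tokens` corresponds to one image patch, in row-major order across the image.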
Following feature extraction, an LLM equipped with a hybrid decoder generates the medical report. The entire framework supports end-to-end training and has shown strong results on several widely used benchmark datasets.
Efficient Fine-Tuning with Partial LoRA
One of EMRRG’s core innovations lies in its efficient fine-tuning strategy for the Mamba network. Mamba networks contain numerous intermediate features with distinct properties. Traditional fine-tuning methods often compress all these features into a single low-rank subspace, overlooking their inherent differences. EMRRG overcomes this by introducing LoRAP(X), which selectively applies LoRA adaptations to only a portion of the weights in linear layers based on the structure of the output features, allowing for more refined parameter updates. Additionally, conventional LoRA is applied to the input projection layer to improve the quality of initial image representations, strengthening the discriminative power of features processed by the selective scan mechanism.
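A minimal sketch of the partial-LoRA idea follows, assuming a linear layer whose output is later split into two feature groups. The class name `PartialLoRALinear`, the rank, the scaling, and the choice to adapt only the first half of the output columns are all assumptions for illustration; the split mirrors how a standard Mamba block divides its input projection into a main branch and a gate branch, but it is not claimed to be the exact configuration the authors use.

```python
import numpy as np

class PartialLoRALinear:
    """Frozen linear map whose low-rank update touches only some output columns."""

    def __init__(self, weight, col_slice, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight                        # frozen, shape (d_in, d_out)
        self.cols = col_slice                       # output columns that get adapted
        n_cols = len(range(*col_slice.indices(weight.shape[1])))
        self.A = rng.normal(scale=0.01, size=(weight.shape[0], rank))  # trainable
        self.B = np.zeros((rank, n_cols))           # trainable, zero-init: update starts at 0
        self.scale = alpha / rank

    def __call__(self, x):
        out = x @ self.weight                       # frozen path for every output feature
        # Low-rank path is added to the selected columns only.
        out[:, self.cols] += self.scale * (x @ self.A @ self.B)
        return out

    def trainable_params(self):
        return self.A.size + self.B.size

d_model, d_inner = 192, 384                         # illustrative sizes
rng = np.random.default_rng(1)
w_in = rng.normal(size=(d_model, 2 * d_inner))      # projection producing two feature groups
layer = PartialLoRALinear(w_in, col_slice=slice(0, d_inner))
layer.B = rng.normal(size=layer.B.shape)            # pretend some training has happened
x = rng.normal(size=(5, d_model))
delta = layer(x) - x @ w_in                         # effect of the adapter alone
print(np.abs(delta[:, :d_inner]).max() > 0, np.abs(delta[:, d_inner:]).max() == 0)
```

The adapter changes only the selected feature group, while the other group still comes from the frozen weights, so the two groups are not forced to share one low-rank subspace.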
The Hybrid Decoder Layer
Another significant component of EMRRG is the hybrid decoder layer within the LLM. This layer extends the standard decoder by integrating a cross-attention mechanism alongside self-attention. While self-attention aggregates contextual information from preceding textual tokens, cross-attention simultaneously extracts relevant visual context from the visual tokens derived from the X-ray image. This enables the model to dynamically focus on key regions within the image, such as lesion sites, leading to more accurate and clinically relevant descriptions. A dynamic gating mechanism is also incorporated to adaptively modulate the fused output, mitigating potential information interference and enhancing training stability.
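The layer described above can be sketched as follows. This is a simplified, single-head illustration: the query/key/value projections, normalization, and feed-forward sublayer are omitted, and the specific gate form (a sigmoid over the concatenated self- and cross-attention outputs, parameterized by a hypothetical `w_gate`) is an assumption, since the article does not spell out the gating formula.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention; optionally hide future positions."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def hybrid_decoder_layer(text, visual, w_gate):
    # Self-attention: each text token looks at itself and preceding text tokens.
    self_out = attention(text, text, text, causal=True)
    # Cross-attention: each text token queries the visual tokens from the image.
    cross_out = attention(text, visual, visual)
    # Assumed gate: per-feature sigmoid deciding how much visual context to admit.
    fused = np.concatenate([self_out, cross_out], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(fused @ w_gate)))
    return text + self_out + gate * cross_out

rng = np.random.default_rng(0)
d = 32                                            # illustrative hidden size
text = rng.normal(size=(6, d))                    # 6 report tokens generated so far
visual = rng.normal(size=(196, d))                # visual tokens from the backbone
w_gate = rng.normal(scale=0.1, size=(2 * d, d))   # hypothetical gate weights
out = hybrid_decoder_layer(text, visual, w_gate)
print(out.shape)
```

Because the gate lies between 0 and 1 for every feature, the layer can suppress visual context where it would interfere with the textual signal and pass it through where it helps.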
Performance and Efficiency
The EMRRG framework was rigorously evaluated on three public benchmark datasets: IU X-ray, MIMIC-CXR, and CheXpert Plus. The results showed that EMRRG achieves competitive or superior performance compared to existing state-of-the-art medical report generation algorithms across various natural language generation (NLG) and clinical evaluation (CE) metrics. Notably, on the CheXpert Plus dataset, EMRRG achieved state-of-the-art performance across nearly all evaluation metrics.
Beyond accuracy, EMRRG also stands out in terms of efficiency. The research highlights that the framework requires training only 2.3% of the parameters compared to full fine-tuning methods. This significant reduction in trainable parameters leads to substantially higher training efficiency, making EMRRG a more practical and scalable solution for real-world healthcare applications.
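To see why low-rank adaptation trains so few parameters, consider the arithmetic for a single linear layer. The helper below and its layer sizes are purely illustrative and are not intended to reproduce the 2.3% figure, which depends on the full model and on which layers the authors adapt.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a d_in x d_out weight matrix's parameters that a rank-r LoRA trains."""
    full = d_in * d_out                 # parameters updated by full fine-tuning
    lora = rank * (d_in + d_out)        # parameters in the two low-rank factors
    return lora / full

frac = lora_trainable_fraction(1024, 1024, rank=8)
print(frac)
```

For this hypothetical 1024 x 1024 layer at rank 8, the adapter holds 16,384 parameters against 1,048,576 in the frozen matrix, or about 1.6%.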
The authors, Mingzheng Zhang, Jinfeng Gao, Dan Xu, Jiangrui Yu, Yuhan Qiao, Lan Chen, Jin Tang, and Xiao Wang, have made their source code publicly available. For a deeper dive into the methodology and experimental details, see the full research paper, "EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation."
This work marks a notable advancement in medical report generation, providing an efficient and accurate way to apply cutting-edge AI models to critical healthcare tasks.


