TLDR: R2GenKG is a new framework that uses a hierarchical multi-modal knowledge graph (M3KG) to improve X-ray medical report generation by large language models. It addresses issues like hallucination and weak disease diagnosis by integrating structured medical knowledge and visual features, leading to more accurate and clinically relevant reports.
In healthcare AI, automated generation of X-ray medical reports is an important application. Large foundation models have improved the quality of these reports, but two problems persist: hallucination (the generation of inaccurate information) and limited disease diagnostic capability. A recent research paper introduces R2GenKG, a framework that addresses both by integrating a hierarchical multi-modal knowledge graph with large language models to generate more accurate and clinically relevant radiology reports.
Introducing M3KG: A Multi-modal Medical Knowledge Graph
The core of the R2GenKG framework is a newly constructed, large-scale multi-modal medical knowledge graph, termed M3KG. It is built from ground truth medical reports, with GPT-4o used during construction. For the CheXpert Plus dataset, M3KG contains 2477 entities, three types of relations, 37424 triples, and 6943 disease-aware vision tokens. Unlike previous knowledge graphs, which often relied on manual annotations or focused solely on semantic representations, M3KG also integrates visual information, which matters for a complete understanding of medical cases.
The construction of M3KG involves three main stages. Initially, GPT-4o is used to annotate a subset of radiology reports, generating training data for entities and relations. This data then trains Named Entity Recognition (NER) and relation extraction models. In the final stage, disease-aware visual patches, nodes, and edges are extracted to form the multi-modal medical knowledge graph. Entities within M3KG are rich with attributes like CUI (Concept Unique Identifier), name, definition, and aliases, and are categorized into types such as Anatomy, Disorder, Concept, Device, Procedure, and Size. Relationships between entities include ‘modify’, ‘located at’, and ‘suggestive of’.
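The entity attributes, entity types, and relation names listed above can be pictured as a small data structure. The following is a minimal sketch, not the authors' schema: the class names `Entity` and `Triple`, the field names, and the placeholder CUI strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Entity types and relation names as listed in the article.
ENTITY_TYPES = {"Anatomy", "Disorder", "Concept", "Device", "Procedure", "Size"}
RELATIONS = {"modify", "located at", "suggestive of"}


@dataclass(frozen=True)
class Entity:
    """Hypothetical node record; field names are illustrative."""
    cui: str                      # Concept Unique Identifier
    name: str
    definition: str
    entity_type: str
    aliases: Tuple[str, ...] = ()

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")


@dataclass(frozen=True)
class Triple:
    """Hypothetical edge record: (head, relation, tail)."""
    head: Entity
    relation: str
    tail: Entity

    def __post_init__(self):
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")


# Illustrative values only; the CUIs are placeholders, not real identifiers.
opacity = Entity("CUI-PLACEHOLDER-1", "opacity", "an opaque region on the image", "Disorder")
lung = Entity("CUI-PLACEHOLDER-2", "lung", "organ of respiration", "Anatomy", ("lungs",))
triple = Triple(opacity, "located at", lung)
```

Validating types and relations at construction time keeps malformed triples out of the graph early, which is one plausible design choice among several.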
R2GenKG: A Hierarchical Framework for Report Generation
Building upon the M3KG, the R2GenKG framework processes X-ray images and integrates knowledge from the graph to generate detailed medical reports. For an input X-ray image, visual features are extracted using a Swin-Transformer encoder and aligned with the large language model (LLM) using a Q-former. Crucially, R2GenKG retrieves disease-aware vision tokens from the multi-modal knowledge graph to enrich the visual representation of the input image.
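The article does not say how the disease-aware vision tokens are retrieved. As a rough illustration only, the sketch below assumes a top-k lookup by cosine similarity between an image feature vector and a bank of stored token vectors; the function names `cosine` and `retrieve_vision_tokens` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_vision_tokens(query: Sequence[float],
                           bank: List[Sequence[float]],
                           k: int) -> List[int]:
    """Return indices of the k bank vectors most similar to the query.

    Assumption: similarity-based retrieval; the paper's actual
    mechanism may differ.
    """
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]),
                    reverse=True)
    return ranked[:k]


# Toy example: three stored tokens, one image feature.
token_bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feature = [1.0, 0.1]
top_indices = retrieve_vision_tokens(image_feature, token_bank, k=2)
```

In this sketch the retrieved tokens would then be concatenated with, or attended to by, the image's own visual features.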
Simultaneously, the medical knowledge graph is sampled to obtain multi-grained semantic graphs, which are then encoded using an R-GCN encoder. This multi-granularity approach allows the model to understand knowledge at various levels of detail, from broad overviews to fine-grained specifics. The visual features and graph-enhanced tokens are then fused, undergoing cross-attention mechanisms to ensure deep interaction between vision and knowledge graph information. Finally, these integrated features are fed into a large language model, specifically Llama2-7B, to generate the medical report.
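To make the graph-encoding step concrete, here is a minimal sketch of a single R-GCN layer in plain Python. It follows the standard R-GCN update (a self-connection plus a per-relation, mean-normalized sum over incoming neighbours, followed by ReLU). The weights, the toy graph, and the function names are assumptions for illustration and say nothing about the authors' configuration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rgcn_layer(features: Dict[str, Vector],
               edges: List[Tuple[str, str, str]],
               rel_weights: Dict[str, Matrix],
               self_weight: Matrix) -> Dict[str, Vector]:
    """One R-GCN layer. Edges are (src, relation, dst); messages flow src -> dst."""
    # Group incoming neighbours by (destination node, relation).
    incoming = defaultdict(list)
    for src, rel, dst in edges:
        incoming[(dst, rel)].append(src)

    updated = {}
    for node, h in features.items():
        acc = matvec(self_weight, h)          # self-connection term
        for rel, W in rel_weights.items():
            neighbours = incoming.get((node, rel), [])
            for j in neighbours:
                msg = matvec(W, features[j])
                # Normalize by the number of neighbours under this relation.
                acc = [a + m / len(neighbours) for a, m in zip(acc, msg)]
        updated[node] = [max(0.0, a) for a in acc]  # ReLU
    return updated


# Toy graph with identity weights, so the arithmetic is easy to follow.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = {"opacity": [1.0, 0.0], "lung": [0.0, 1.0]}
edges = [("opacity", "located at", "lung")]
rel_weights = {"modify": identity, "located at": identity, "suggestive of": identity}
encoded = rgcn_layer(features, edges, rel_weights, identity)
```

With identity weights, "lung" receives the "opacity" vector along the "located at" edge and adds it to its own features, while "opacity" has no incoming edges and is unchanged.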
Key Contributions and Performance
The researchers highlight three main contributions: the M3KG construction system, the R2GenKG framework, and extensive experimental validation. According to the authors, R2GenKG uses the multi-modal and multi-granularity information in the knowledge graph to enhance visual feature representation and to improve the model’s capability for clinical disease discovery.
Extensive experiments were conducted on two widely used benchmark datasets: IU-Xray and CheXpert Plus. R2GenKG demonstrated superior performance across various natural language generation (NLG) metrics such as BLEU, ROUGE-L, METEOR, and CIDEr, as well as clinical efficacy (CE) metrics like Precision, Recall, and F1 Score. This indicates that R2GenKG generates reports that are not only linguistically coherent but also clinically accurate, effectively identifying pathological features.
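For readers unfamiliar with the clinical efficacy metrics, the sketch below shows one common way to compute Precision, Recall, and F1 over finding labels. It assumes that labels have already been extracted from the generated and reference reports as one set of findings per report, and that scores are micro-averaged; the article does not state which labeler or averaging scheme the authors used, and the function name `micro_prf` is hypothetical.

```python
from typing import List, Set, Tuple


def micro_prf(predicted: List[Set[str]],
              reference: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-report label sets."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        tp += len(pred & ref)   # findings present in both
        fp += len(pred - ref)   # findings only in the generated report
        fn += len(ref - pred)   # findings missed by the generated report
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy labels for two reports (illustrative, not real results).
generated_labels = [{"edema", "effusion"}, {"pneumonia"}]
reference_labels = [{"edema"}, {"pneumonia", "atelectasis"}]
precision, recall, f1 = micro_prf(generated_labels, reference_labels)
```

Here two findings are matched, one is spurious, and one is missed, so precision, recall, and F1 all come out to 2/3.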
Ablation studies confirmed that each component of R2GenKG contributes positively, including the Relational Graph Convolutional Network (R-GCN), Multi-scale Feature Fusion, and the Disease Visual Graph module. The studies also varied the number of entity nodes and visual features, finding that moderate settings (around 300 entities and around 500 visual features) yielded the best results, balancing information richness against noise.
Future Directions
While R2GenKG marks a significant advancement, the researchers acknowledge limitations, primarily the high computational costs associated with training and inference, which might restrict its deployment in resource-constrained clinical settings. Furthermore, there’s a recognized need for deeper alignment mechanisms between visual disease features and textual graphs to fully exploit the potential of cross-modal fusion. The source code for this paper will be released on GitHub.


