TLDR: A new “Coarse-to-Fine” AI framework uses open-source Large Language Models (LLMs) to automatically generate and personalize the “Impression” section of radiology reports. This system first creates a draft and then refines it using machine learning and human feedback to match individual radiologist styles and ensure accuracy. It aims to reduce radiologist burnout and improve reporting efficiency while maintaining high clinical precision.
Manually writing the “Impression” section of a radiology report is a significant contributor to radiologist burnout. This section summarizes the clinical findings and guides referring physicians, but writing it is complex and time-consuming, and it demands both domain-specific language and a high degree of personalization. To address this, researchers have introduced a “Coarse-to-Fine” framework that uses open-source Large Language Models (LLMs) to automate and personalize these impressions.
A New Approach to Radiology Reporting
The proposed framework aims to significantly reduce the administrative workload on radiologists and enhance reporting workflows while maintaining high standards of clinical precision. Unlike general-purpose LLMs, which often lack the specialized vocabulary, style, and clinical nuances required for medical reporting, this new system is designed for fine-grained control over content and structure, ensuring consistency and alignment with medical standards.
The Coarse-to-Fine framework operates in two main stages. It begins with a “coarse-grained” summary of the clinical findings, capturing essential information. This initial draft is then iteratively refined through a “fine-grained” customization process. This refinement incorporates patient-specific context, ensures clinical precision, and aligns the output with individual radiologists’ stylistic preferences. Reinforcement Learning from Human Feedback (RLHF) is a key component in this stage, ensuring the generated impressions are factually accurate and tailored to the needs of both clinicians and patients.
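The two-stage flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`coarse_summary`, `refine`, `coarse_to_fine`), the prompt wording, and the `generate` callable are all hypothetical, and the stub model only returns a version label so the example runs without an LLM. RLHF happens at training time and is not shown.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to generated text.
Generate = Callable[[str], str]


def coarse_summary(findings: str, generate: Generate) -> str:
    """Stage 1 (coarse): draft an impression from the findings alone."""
    return generate(f"Summarize these radiology findings as an impression:\n{findings}")


def refine(draft: str, context: str, style: str, generate: Generate) -> str:
    """Stage 2 (fine): revise the draft using patient context and a style preference."""
    prompt = (
        f"Draft impression:\n{draft}\n"
        f"Patient context:\n{context}\n"
        f"Preferred style: {style}\n"
        "Rewrite the draft to fit the context and style without adding new findings."
    )
    return generate(prompt)


def coarse_to_fine(findings: str, context: str, style: str,
                   generate: Generate, rounds: int = 1) -> str:
    """Produce a coarse draft, then refine it `rounds` times."""
    draft = coarse_summary(findings, generate)
    for _ in range(rounds):
        draft = refine(draft, context, style, generate)
    return draft


def make_stub_model() -> Tuple[Generate, List[str]]:
    """Stand-in for an LLM: records each prompt and returns a version label."""
    calls: List[str] = []

    def stub(prompt: str) -> str:
        calls.append(prompt)
        return f"impression v{len(calls)}"

    return stub, calls
```

With the stub, `coarse_to_fine(findings, context, style, stub, rounds=2)` makes three model calls: one for the draft and one per refinement round, each refinement receiving the previous output as its draft.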
Under the Hood: Models and Data
The research involved fine-tuning prominent open-source LLMs, specifically LLaMA and Mistral models, on a vast dataset of 957,134 de-identified radiology reports from the University of Chicago Medicine. This extensive dataset, curated over 12 years, provides a rich source of clinical information, detailed findings, and concise impressions, making it ideal for training LLMs for summarization tasks in the medical domain.
During the model selection phase, LLaMA-3.1-8b consistently outperformed other models like Gemma-2-9b and Mistral-7b across several automatic metrics: ROUGE and BLEU, which score n-gram overlap with the reference impression (ROUGE is recall-oriented, BLEU precision-oriented), and BERTScore, which measures semantic similarity using contextual embeddings. While Mistral-7b slightly edged out LLaMA-3.1-8b in factual consistency, LLaMA-3.1-8b demonstrated the most balanced performance overall, making it the chosen base model for the framework.
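To make the overlap-based metrics concrete, here is a simplified ROUGE-1 F1 score (unigram overlap between a reference and a generated impression). It is a sketch for intuition only: it assumes lowercase whitespace tokenization with no stemming, so its numbers will not match an official ROUGE package, and it is not the evaluation code used in the study.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # share of candidate tokens that match
    recall = overlap / sum(ref_counts.values())      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)
```

For the reference "no acute cardiopulmonary disease" and the candidate "no acute disease", all three candidate tokens match (precision 1.0) and three of four reference tokens are recovered (recall 0.75), giving an F1 of 6/7, about 0.857.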
The model’s training involved parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA), a technique that allows efficient adaptation to new tasks with minimal computational overhead. This approach, combined with Supervised Fine-Tuning (SFT), enabled the model to learn from domain-specific radiology datasets and generalize effectively even with limited examples.
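The efficiency of LoRA comes from simple arithmetic. Instead of updating a full d × k weight matrix, LoRA freezes it and trains two small matrices, B (d × r) and A (r × k), whose product is added to the frozen weights. The sketch below compares the parameter counts. The dimensions (4096 × 4096) and rank (16) are illustrative assumptions, not values reported for this study.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update: B is d x r, A is r x k."""
    return r * (d + k)


# Illustrative (assumed) sizes for a single projection matrix.
d, k, r = 4096, 4096, 16
full = full_finetune_params(d, k)   # 16,777,216
lora = lora_params(d, k, r)         # 131,072
fraction = lora / full              # 0.0078125, i.e. under 1% of the full matrix
```

Under these assumed sizes, the adapter trains 1/128 of the parameters of the matrix it modifies, which is why LoRA keeps memory and compute costs low.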
Personalization and Evaluation
A key feature of the Coarse-to-Fine framework is its ability to generate personalized impressions tailored to different target audiences. This is achieved through a sophisticated prompt engineering strategy that allows for three types of summaries:
- Brief Summarization: Simplified for non-English speakers.
- Bullet Point Summarization: Concise insights for quick review.
- Comprehensive Summarization: Detailed summaries for experts.
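One way to picture this strategy is a separate instruction template per summary type. The article does not give the actual prompts, so the template wording, the dictionary keys, and the `build_prompt` helper below are all hypothetical.

```python
# Hypothetical instruction text for each summary type described above.
PROMPT_TEMPLATES = {
    "brief": "Write a brief impression in simple, plain language.",
    "bullet": "Write the impression as concise bullet points for quick review.",
    "comprehensive": "Write a detailed impression suitable for expert readers.",
}


def build_prompt(findings: str, summary_type: str) -> str:
    """Combine the chosen instruction with the report findings."""
    if summary_type not in PROMPT_TEMPLATES:
        valid = ", ".join(sorted(PROMPT_TEMPLATES))
        raise ValueError(f"Unknown summary type {summary_type!r}; expected one of: {valid}")
    instruction = PROMPT_TEMPLATES[summary_type]
    return f"{instruction}\n\nFindings:\n{findings}\n\nImpression:"
```

Calling `build_prompt(findings, "bullet")` returns the bullet-point instruction followed by the findings, ready to send to the fine-tuned model. An unrecognized type raises a `ValueError` rather than silently falling back to a default.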
Radiologists from the University of Chicago Medicine and an independent board-certified radiologist assessed the framework's output. Of 200 generated reports, 79.5% (159) received either “neutral” or “positive” ratings, indicating that radiologists considered the AI-generated impressions at least as accurate as human-written ones. Notably, the model sometimes captured incidental findings that were omitted from the original human impressions.
Furthermore, the model demonstrated remarkable stability against real-world data entry errors, showing minimal degradation in performance even with a simulated 3% typographical error rate. Its generalizability was also validated by successfully summarizing key clinical findings from an external dataset (CheXpert Plus), with radiologists rating 80% of these generated impressions as equal to or better than the originals.
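A robustness check of this kind can be approximated by corrupting a fixed fraction of the input before summarization. The article does not describe how the typographical errors were simulated, so the scheme below (replacing a random 3% of letters with a different lowercase letter) is an assumption, and `inject_typos` is a hypothetical helper.

```python
import random
import string


def inject_typos(text: str, rate: float = 0.03, seed: int = 0) -> str:
    """Replace roughly `rate` of the letters in `text` with a different random letter."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    chars = list(text)
    letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
    n_typos = round(len(letter_positions) * rate)

    for i in rng.sample(letter_positions, n_typos):
        replacement = rng.choice(string.ascii_lowercase)
        # Re-draw until the replacement differs from the original letter.
        while replacement == chars[i].lower():
            replacement = rng.choice(string.ascii_lowercase)
        chars[i] = replacement

    return "".join(chars)
```

Running the model on both the clean and the corrupted findings, then comparing metric scores, gives a simple measure of how much performance degrades under input noise.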
Looking Ahead
This research marks a significant step towards integrating advanced AI into clinical workflows, offering a promising solution to alleviate radiologist burnout and improve the efficiency and quality of medical reporting. Future work aims to integrate visual data and explore advanced multi-modal models to further enhance clinical reasoning and support real-world diagnostic workflows.


