AI Streamlines Radiology Report Analysis for Image Classification

TLDR: A study evaluated GPT-4o’s ability to automatically extract structured diagnostic labels, including uncertainty, from free-text radiology reports for upper extremity radiographs. It found high label extraction accuracy (98.6%) and demonstrated that these labels could train competitive multi-label image classification models. Interestingly, how “uncertain” labels were handled (as positive or negative) did not significantly affect model performance, suggesting LLMs can efficiently prepare large datasets for AI development in medical imaging.

Developing artificial intelligence models for medical imaging often faces a significant hurdle: a shortage of high-quality, labeled data. Traditionally, creating these datasets involves painstaking manual annotation by experts, a process that is both time-consuming and expensive. While rule-based and conventional machine learning methods have attempted automated label extraction from radiology reports, they frequently struggle with the complex and nuanced language used by radiologists, leading to inaccuracies.

A recent study explores a promising alternative: leveraging large language models (LLMs) for automated label extraction. The research, titled “Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography,” investigates the capabilities of GPT-4o in this domain. The authors, Hanna Kreutzer, Anne-Sophie Caselitz, Thomas Dratsch, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, and Sven Nebelung, aimed to evaluate GPT-4o’s ability to extract structured diagnostic labels, including expressions of uncertainty, from free-text radiology reports. They also tested how these extracted labels would impact the development of multi-label image classification models for musculoskeletal radiographs.

The study focused on radiography series of the clavicle, elbow, and thumb. After anonymizing the reports, GPT-4o was used in a “zero-shot” manner, meaning it operated without task-specific fine-tuning or worked examples in the prompt. The LLM was instructed to fill out structured templates in JSON format, indicating imaging findings as “true” (present), “false” (absent), or “uncertain.” The “uncertain” category was crucial, designed to capture phrases like “possibly” or “suspected” that are common in radiology reports and reflect diagnostic ambiguity. To understand the impact of this uncertainty, the researchers created two versions of the training and validation datasets: an “inclusive” version in which “uncertain” labels were treated as “true,” and an “exclusive” version in which they were treated as “false.” A separate model was trained on each version.
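The inclusive and exclusive conversions described above can be illustrated with a minimal Python sketch. The finding names, the example report output, and the `binarize` helper are all hypothetical; the article does not list the study's actual template fields or code.

```python
import json

# Hypothetical finding names; the study's real template fields are not given here.
FINDINGS = ["fracture", "dislocation", "soft_tissue_swelling"]


def binarize(labels, uncertain_as_positive):
    """Map 'true'/'false'/'uncertain' strings to a 0/1 vector in FINDINGS order."""
    vector = []
    for name in FINDINGS:
        value = labels[name]
        if value == "true":
            vector.append(1)
        elif value == "false":
            vector.append(0)
        elif value == "uncertain":
            # Inclusive variant counts uncertainty as present, exclusive as absent.
            vector.append(1 if uncertain_as_positive else 0)
        else:
            raise ValueError(f"unexpected label {value!r} for {name}")
    return vector


# Invented example of a filled-in JSON template for one report.
raw = '{"fracture": "true", "dislocation": "false", "soft_tissue_swelling": "uncertain"}'
report_labels = json.loads(raw)

inclusive = binarize(report_labels, uncertain_as_positive=True)
exclusive = binarize(report_labels, uncertain_as_positive=False)
print(inclusive)  # [1, 0, 1]
print(exclusive)  # [1, 0, 0]
```

The two vectors differ only where the report was uncertain, which is why the two training sets can be built from a single extraction pass.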

These label-image pairs were then used to train multi-label classification models based on a modified ResNet50 architecture. The accuracy of the automated label extraction was manually verified on both internal and external test sets. The performance of the classification models was assessed using various metrics, including the area under the receiver operating characteristic curve (AUC), precision-recall curves, sensitivity, specificity, and accuracy.
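For readers less familiar with these metrics, here is a small sketch of how sensitivity, specificity, and accuracy are computed for a single label. It assumes the model's scores have already been thresholded into 0/1 predictions; the threshold choice and the example data are illustrative and not taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for one label's 0/1 lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against labels with no positive or no negative cases.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Invented ground truth and thresholded predictions for one finding.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

sens, spec, acc = binary_metrics(y_true, y_pred)
print(round(sens, 3), spec, acc)  # 0.667 0.8 0.75
```

In a multi-label setting, this calculation is repeated independently for each finding.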

The results were highly encouraging. GPT-4o demonstrated remarkable accuracy in label extraction, correctly identifying 98.6% of labels across the test sets. For instance, clavicle reports achieved 98.8% label-level accuracy, elbow reports 98.6%, and thumb reports 99.0%. The classification models trained with these automatically extracted labels showed competitive performance. For the elbow, the macro-averaged AUC was 0.80 for both inclusive and exclusive models, and similarly strong results were observed for the clavicle and thumb. The models also generalized well to external datasets, indicating their robustness.
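A macro-averaged AUC is the unweighted mean of the per-label AUCs, so rare findings count as much as common ones. The sketch below shows one way to compute it, using the pairwise (rank-based) definition of AUC; the label names and scores are invented, and this is not the study's evaluation code.

```python
def auc(y_true, scores):
    """AUC as the share of positive/negative pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(truth_by_label, scores_by_label):
    """Unweighted mean of per-label AUCs."""
    aucs = [auc(truth_by_label[name], scores_by_label[name]) for name in truth_by_label]
    return sum(aucs) / len(aucs)


# Hypothetical labels and model scores for four images.
truth = {"fracture": [1, 1, 0, 0], "effusion": [1, 0, 0, 0]}
scores = {"fracture": [0.9, 0.8, 0.3, 0.1], "effusion": [0.4, 0.6, 0.2, 0.1]}

result = macro_auc(truth, scores)
print(round(result, 3))  # 0.833
```

Here the well-separated label reaches an AUC of 1.0 while the rarer one reaches about 0.67, and the macro average sits between them. A similar effect may explain how strong fracture performance and weaker soft-tissue performance combine into a single summary figure.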

Interestingly, the study found that the way “uncertain” labels were handled, whether converted to “true” or “false” during training, did not significantly influence the overall performance of the classification models. This suggests that the classifiers can tolerate a modest level of label uncertainty without a measurable impact on their diagnostic capabilities. However, the LLM did not capture every expression of uncertainty: it missed some instances that manual review identified.

The research highlights the significant methodological value of using LLMs like GPT-4o for automated label extraction. It enables the rapid assembly of training data and the efficient development of reliable classifiers, even for less frequently imaged anatomies and for a broad spectrum of findings, not just common fractures. While the models performed exceptionally well on frequent, bone-related labels (with AUC values often exceeding 0.90 for fractures), performance was lower for rarer soft-tissue abnormalities, which are inherently harder to identify on radiographs and had fewer positive cases for training.

This study marks a crucial step towards converting routine clinical data, not originally intended for AI model development, into structured formats suitable for expedited and decentralized AI model development and fine-tuning. Future research will likely focus on multi-institutional collaborations to include more diverse multi-label datasets across various anatomic regions, conditions, and languages. For more details, see the full research paper.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
