
Smart Data Selection for Accurate Lung Disease Classification

TLDR: This study explores deep active learning with Bayesian Neural Networks and a weighted loss to classify lung disease severity from chest X-rays when data is limited and imbalanced. It found that Entropy Sampling for binary classification and Mean STD for multi-class classification needed only 15.4% and 23.1% of the labeled training data, respectively, while maintaining or improving diagnostic performance compared to training on the full dataset. This approach offers a more efficient and scalable way to deploy AI in medical imaging.

The rapid increase in pulmonary diseases and the ongoing shortage of radiologists have created an urgent need for efficient and accurate diagnostic tools. Chest X-rays (CXRs) are a widely used and cost-effective imaging method, but the sheer volume of images, combined with limited expert resources, often leads to delays in diagnosis and treatment. Artificial intelligence (AI) models offer a promising solution, but they typically require vast amounts of high-quality, labeled training data. Obtaining such data in the medical field is expensive and demands specialized expertise.

This is where active learning comes into play. Active learning is a machine learning technique designed to reduce the amount of labeled data needed by intelligently selecting the most informative samples for training. Instead of labeling every available image, active learning focuses on those that will most effectively improve the model’s performance. However, medical datasets often present a unique challenge: class imbalance. This means certain conditions or severity levels are underrepresented, which can lead to biased AI models that don’t perform well across all patient cases.
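The selection idea described above can be sketched as a generic pool-based loop. This is a minimal illustration and not the study's code: the one-dimensional threshold "model", the `oracle` that stands in for a radiologist, and every function name here are hypothetical and chosen only to keep the example self-contained.

```python
import random

def active_learning_loop(pool, oracle, train, score, seed_size, batch_size, rounds, rng):
    """Pool-based active learning: label a small seed set, then repeatedly
    label the unlabeled samples that the current model scores as most informative."""
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    labeled = {x: oracle(x) for x in unlabeled[:seed_size]}
    unlabeled = unlabeled[seed_size:]
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Highest score first, so the batch holds the most informative samples.
        unlabeled.sort(key=lambda x: score(model, x), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.update((x, oracle(x)) for x in batch)
        model = train(labeled)
    return model, labeled

# Hypothetical toy problem: points in [0, 1], positive when x >= 0.615.
def oracle(x):
    return 1 if x >= 0.615 else 0

def train_threshold(labeled):
    """Toy model: midpoint between the largest negative and smallest positive."""
    neg = [x for x, y in labeled.items() if y == 0]
    pos = [x for x, y in labeled.items() if y == 1]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2

def closeness_to_boundary(threshold, x):
    """Toy informativeness score: points near the decision boundary score highest."""
    return -abs(x - threshold)

pool = [i / 100 for i in range(101)]
model, labeled = active_learning_loop(
    pool, oracle, train_threshold, closeness_to_boundary,
    seed_size=4, batch_size=2, rounds=10, rng=random.Random(0),
)
print(f"labeled {len(labeled)} of {len(pool)} samples, threshold = {model:.3f}")
```

In a real setting, `train` would fit a neural network and `score` would be an uncertainty measure computed from its predictions; the loop structure stays the same.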

A recent study, titled "Deep Active Learning for Lung Disease Severity Classification from Chest X-rays: Learning with Less Data in the Presence of Class Imbalance," addresses these challenges by integrating deep active learning with a Bayesian Neural Network (BNN) approximation and a weighted loss function. The research aimed to classify the severity of lung disease from CXRs, using COVID-19 as a case study, while effectively managing class imbalance and minimizing the need for extensive data labeling.

The study used a retrospective dataset of 2,319 CXRs from 963 COVID-19 patients at Emory Healthcare-affiliated hospitals. Each CXR was independently labeled by three to six board-certified radiologists, who categorized disease severity as normal, moderate, or severe. The AI model was a ResNet50 architecture with Monte Carlo (MC) Dropout layers, which approximates a BNN. A crucial aspect of the approach was the weighted loss strategy, which penalized errors on minority classes more heavily, thereby counteracting the effects of class imbalance.
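To make the weighted loss idea concrete, here is a minimal sketch of a class-weighted cross-entropy. The article does not say how the study derived its weights, so the inverse-frequency formula below is an assumption, and the function names are hypothetical. The 14% share of normal images in the example is the figure the article reports for the binary setting.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count),
    so rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight.
    `probs` is a list of dicts mapping class name to predicted probability."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Class balance matching the article's binary setting: 14% normal, 86% diseased.
train_labels = ["normal"] * 14 + ["diseased"] * 86
weights = inverse_frequency_weights(train_labels)

# Two predictions: a poorly classified normal case and a well classified diseased case.
probs = [{"normal": 0.2, "diseased": 0.8}, {"normal": 0.1, "diseased": 0.9}]
labels = ["normal", "diseased"]

uniform = {"normal": 1.0, "diseased": 1.0}
print("unweighted loss:", weighted_cross_entropy(probs, labels, uniform))
print("weighted loss:  ", weighted_cross_entropy(probs, labels, weights))
```

With these weights, the mistake on the minority (normal) case dominates the loss, which is the effect the weighted strategy is meant to produce.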

The active learning framework iteratively expanded the training set by selecting the most informative samples from an unlabeled pool. Various acquisition functions were tested to guide this selection process, including Random Sampling, Entropy Sampling, BatchBALD, Mean STD, Least Confidence, Margin Sampling, and Variation Ratios. These functions evaluate the informativeness of unlabeled samples to decide which ones would most benefit the model if labeled.
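Several of these acquisition functions have simple textbook forms when the model produces class probabilities over multiple stochastic (MC Dropout) forward passes. The sketch below uses those common definitions; it is not taken from the study's code, the function names are hypothetical, and BatchBALD and Variation Ratios are omitted for brevity.

```python
import math
from statistics import pstdev

def mean_probs(mc_probs):
    """Average the class probabilities over T stochastic forward passes.
    `mc_probs` is a list of T probability vectors for one image."""
    n_passes, n_classes = len(mc_probs), len(mc_probs[0])
    return [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]

def entropy_score(mc_probs):
    """Entropy of the averaged prediction; higher means more uncertain."""
    return -sum(q * math.log(q) for q in mean_probs(mc_probs) if q > 0)

def least_confidence_score(mc_probs):
    """One minus the probability of the most likely class."""
    return 1.0 - max(mean_probs(mc_probs))

def margin_score(mc_probs):
    """Negated gap between the top two classes, so a small gap scores high."""
    top = sorted(mean_probs(mc_probs), reverse=True)
    return -(top[0] - top[1])

def mean_std_score(mc_probs):
    """Standard deviation of each class probability across passes, averaged over classes."""
    n_classes = len(mc_probs[0])
    return sum(pstdev([p[c] for p in mc_probs]) for c in range(n_classes)) / n_classes

def select_top_k(pool, score_fn, k):
    """Return the ids of the k highest-scoring unlabeled samples."""
    return sorted(pool, key=lambda sid: score_fn(pool[sid]), reverse=True)[:k]

# Hypothetical pool: two forward passes per image, two classes.
pool = {
    "confident": [[0.9, 0.1], [0.9, 0.1]],   # passes agree, prediction is sure
    "ambiguous": [[0.5, 0.5], [0.5, 0.5]],   # passes agree, prediction is unsure
    "unstable":  [[0.9, 0.1], [0.1, 0.9]],   # passes disagree with each other
}
print(select_top_k(pool, mean_std_score, 1))
print(select_top_k(pool, least_confidence_score, 2))
```

The toy pool shows how the functions differ: entropy, least confidence, and margin look only at the averaged prediction, so they treat the "ambiguous" and "unstable" images alike, while Mean STD responds only to disagreement between passes.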

Key Findings and Performance

The results demonstrated significant efficiency gains. For binary classification (normal vs. diseased), Entropy Sampling emerged as the most effective technique. It achieved a high accuracy of 93.7% and an Area Under the Receiver Operating Characteristic curve (AU ROC) of 0.91, using only 15.4% of the total training data. This performance not only matched but, in some cases, surpassed the baseline achieved with a much larger dataset.

In the multi-class setting (normal, moderate, severe), Mean STD sampling proved most effective. It achieved 70.3% accuracy and an AU ROC of 0.86, using just 23.1% of the labeled data. Both Entropy Sampling and Mean STD consistently outperformed more complex and computationally expensive acquisition functions, as well as simple random sampling.

A notable aspect of these optimal acquisition functions was their tendency to oversample minority classes. For instance, Entropy Sampling selected 35.3% normal samples in the binary setting (compared to the original 14% normal class), and Mean STD sampled 30.5% normal and 41.8% moderate samples in the multi-class setting, effectively mitigating class imbalance during the learning process. Furthermore, these methods maintained relatively low acquisition times, making them practical for real-world deployment.

Implications for Medical AI

This research highlights that deep active learning, especially when combined with BNN approximation and a weighted loss strategy, can substantially reduce the data labeling burden for medical image classification. By achieving competitive diagnostic accuracy with a fraction of the data, this approach offers a cost-effective and scalable solution for deploying AI in medical imaging tasks, particularly in scenarios with imbalanced datasets.

The flexibility of choosing different acquisition functions based on specific clinical priorities (e.g., optimizing for accuracy, sensitivity, or precision) further enhances the practicality of this method. This adaptability is crucial for rapid deployment of diagnostic models during pandemics or for rare diseases where labeling resources are scarce. The study paves the way for accelerating the development and deployment of AI-driven diagnostic tools, ultimately supporting radiologists by minimizing the data annotation required to achieve high performance.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
