Evaluating AI Models for Cancer Diagnosis Classification in Electronic Health Records

TLDR: This study compared the performance of four large language models (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5) and BioBERT in categorizing cancer diagnoses from electronic health records, using both structured ICD codes and unstructured free-text. BioBERT excelled with ICD codes, achieving 90.8% accuracy and an 84.2% weighted macro F1-score. For free-text diagnoses, GPT-4o outperformed BioBERT with a 71.8% weighted macro F1-score and 81.9% accuracy. The research highlights the potential of AI in healthcare data processing but also points out challenges with ambiguous clinical language and the need for human oversight for reliable clinical applications, especially for complex categories like metastasis and ‘Unknown’ diagnoses.

Electronic health records (EHRs) are a treasure trove of patient information, but much of this data is inconsistently structured or exists as free-text notes, making it challenging to use for advanced predictive health models. Manually organizing this information is time-consuming and prone to errors. This is where artificial intelligence (AI) and natural language processing (NLP) tools, especially large language models (LLMs), come into play, offering a promising path to automate diagnosis classification.

A recent study, titled “Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study,” systematically evaluated how well various LLMs and a specialized biomedical model, BioBERT, could classify cancer diagnoses from both structured (International Classification of Diseases, or ICD, codes) and unstructured (free-text) EHR data. The researchers analyzed 762 unique diagnoses from 3456 patient records and categorized them into 14 predefined cancer types, with validation from oncology experts.

Comparing the Models

The study put five models to the test: GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5, and BioBERT. Each model’s ability to accurately place diagnoses into categories like “Breast,” “Lung or Thoracic,” “Metastasis,” or “Benign” was assessed.

For structured ICD codes, BioBERT emerged as a top performer, achieving the highest weighted macro F1-score of 84.2% and matching GPT-4o’s accuracy of 90.8%. This highlights the advantage of models specifically trained on biomedical texts when dealing with standardized medical terminology.

However, free-text diagnoses changed the picture. GPT-4o outperformed BioBERT with a weighted macro F1-score of 71.8% compared to BioBERT’s 61.5%, and was marginally ahead on accuracy at 81.9% against BioBERT’s 81.6%. This suggests that while BioBERT is excellent with structured data, general-purpose LLMs like GPT-4o can cope better with the complexity and variability found in narrative clinical descriptions.
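
The gap between accuracy and F1-score in these results is easier to read with a small worked example. The sketch below is not the study's evaluation code; it is a minimal pure-Python illustration, on made-up labels, of how accuracy, support-weighted F1, and unweighted macro F1 can diverge when a rare category such as “Unknown” is consistently missed. The article does not spell out the exact averaging behind its "weighted macro F1-score", so both common variants are shown.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every category counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by how often each true label occurs."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Made-up data: eight clear breast cases and two vague ones that a
# hypothetical model forces into "Breast" instead of "Unknown".
y_true = ["Breast"] * 8 + ["Unknown"] * 2
y_pred = ["Breast"] * 10

print(accuracy(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

On this toy data the accuracy is 0.80, the support-weighted F1 is about 0.71, and the macro F1 is about 0.44, because the missed “Unknown” class scores zero. That is one plausible reason two models can post nearly identical accuracy while differing noticeably on F1.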

The other models (GPT-3.5, Gemini, and Llama) showed lower overall performance. Like GPT-4o and BioBERT, they scored worse on free-text entries than on structured ICD codes. This trend held across all five models and underscores the inherent difficulty of processing the nuanced and often informal language found in clinical notes.

Challenges and Misclassification Patterns

The study also shed light on common areas where models struggled. Misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. For instance, benign tumors were often misclassified as central nervous system tumors by several models. In free-text analysis, all models found it particularly challenging to correctly assign diagnoses as “Unknown” when information was insufficient, often attempting to force vague descriptions into specific cancer categories.

The unstructured nature of free-text entries, which often contain abbreviations, shorthand, and nonstandardized phrasing, presented significant hurdles. These factors contribute to reduced model accuracy and emphasize the need for better preprocessing or confidence-based filtering of ambiguous inputs.
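
As a rough illustration of what such preprocessing might look like, the sketch below lowercases a free-text diagnosis, strips punctuation, and expands a handful of shorthand terms before the text would be passed to a classifier. The abbreviation table and the function name `normalize_diagnosis` are hypothetical and are not taken from the study; a real deployment would need a clinically reviewed vocabulary.

```python
import re

# Hypothetical shorthand table for illustration only. Ambiguous
# abbreviations are deliberately left out so they are never expanded blindly.
ABBREVIATIONS = {
    "ca": "cancer",
    "mets": "metastases",
    "hx": "history",
    "lt": "left",
}


def normalize_diagnosis(text):
    """Lowercase, drop punctuation, and expand known shorthand tokens."""
    text = text.lower().strip()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(expanded)


print(normalize_diagnosis("Lt breast CA; hx of mets"))
```

Tokens that are not in the table pass through unchanged, so unfamiliar shorthand still reaches the model as written rather than being guessed at.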

Implications for Healthcare

The findings suggest that while current AI performance levels are sufficient for administrative and research tasks, such as automating billing and documentation, reliable clinical applications will require more robust solutions. The lower performance on free-text inputs, especially in high-stakes categories like metastasis, indicates that relying solely on automated outputs could pose clinical risks.

The researchers propose a hybrid approach: automating routine cases while flagging ambiguous ones for expert review. This would involve incorporating confidence thresholds to identify uncertain predictions, rule-based validation checks, and human-in-the-loop systems to review low-confidence cases. Such an approach would balance efficiency with the critical need for clinical reliability.
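
The routing logic behind such a hybrid workflow can be sketched in a few lines. The example below is an assumption-laden illustration, not the researchers' implementation: the `Prediction` type, the `route` function, and both threshold values are hypothetical, and the category set lists only the labels named in this article rather than all 14 used in the study.

```python
from dataclasses import dataclass

# Only the categories named in the article; the study's full list of 14
# is not reproduced here.
ALLOWED_CATEGORIES = {"Breast", "Lung or Thoracic", "Metastasis", "Benign", "Unknown"}

# Categories the article singles out as harder or higher-stakes.
HIGH_STAKES = {"Metastasis", "Unknown"}


@dataclass
class Prediction:
    diagnosis_text: str
    category: str
    confidence: float  # assumed to be a score in [0, 1] supplied by the model


def route(pred, threshold=0.85, high_stakes_threshold=0.95):
    """Return "auto" to accept a prediction or "review" to flag it for an expert.

    Both thresholds are placeholder values and would need tuning on real data.
    """
    # Rule-based validation: reject labels outside the predefined category set.
    if pred.category not in ALLOWED_CATEGORIES:
        return "review"
    # Confidence threshold: demand more certainty for high-stakes categories.
    required = high_stakes_threshold if pred.category in HIGH_STAKES else threshold
    return "auto" if pred.confidence >= required else "review"


print(route(Prediction("invasive ductal carcinoma of breast", "Breast", 0.97)))
print(route(Prediction("bone mets", "Metastasis", 0.90)))
```

Under these assumed thresholds the first case is accepted automatically, while the metastasis case goes to human review even at 0.90 confidence, reflecting the stricter bar for high-stakes categories.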

Future work should focus on improving free-text classification, refining thresholds for complex diagnoses, and validating these models across diverse, multi-institutional datasets to ensure their generalizability and robustness in various clinical settings. This research highlights both the immense promise and the significant challenges of integrating language models into automated cancer diagnosis classification, paving the way for more sophisticated and reliable tools in healthcare workflows.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media.
