Evaluating AI Models for Cancer Diagnosis Classification in Electronic Health Records

TLDR: This study compared four large language models (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5) and BioBERT on categorizing cancer diagnoses from electronic health records, using both structured ICD codes and unstructured free text. On ICD codes, BioBERT achieved the highest weighted macro F1-score (84.2%) and matched GPT-4o on accuracy (90.8%). On free-text diagnoses, GPT-4o outperformed BioBERT on weighted macro F1-score (71.8% vs. 61.5%) with slightly higher accuracy (81.9% vs. 81.6%). The research highlights the potential of AI in healthcare data processing, but also the difficulty of ambiguous clinical language and the need for human oversight in clinical applications, especially for complex categories such as metastasis and “Unknown” diagnoses.

Electronic health records (EHRs) are a treasure trove of patient information, but much of this data is inconsistently structured or exists as free-text notes, making it challenging to use for advanced predictive health models. Manually organizing this information is time-consuming and prone to errors. This is where artificial intelligence (AI) and natural language processing (NLP) tools, especially large language models (LLMs), come into play, offering a promising path to automate diagnosis classification.

A recent study, titled “Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study,” systematically evaluated how well several LLMs and a specialized biomedical model, BioBERT, classify cancer diagnoses from both structured (International Classification of Diseases, or ICD, codes) and unstructured (free-text) EHR data. The researchers analyzed 762 unique diagnoses from 3456 patient records and categorized them into 14 predefined cancer types, with validation from oncology experts.

Comparing the Models

The study put five models to the test: GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5, and BioBERT. Each model’s ability to accurately place diagnoses into categories like “Breast,” “Lung or Thoracic,” “Metastasis,” or “Benign” was assessed.

For structured ICD codes, BioBERT was the top performer, achieving the highest weighted macro F1-score (84.2%) and matching GPT-4o’s accuracy of 90.8%. This suggests an advantage for models trained specifically on biomedical text when the input uses standardized medical terminology.

With free-text diagnoses, the picture changed. GPT-4o outperformed BioBERT on weighted macro F1-score, 71.8% to 61.5%, a gap of 10.3 percentage points, while accuracy was nearly identical (81.9% for GPT-4o against 81.6% for BioBERT). This suggests that while BioBERT is strong on structured data, general-purpose LLMs like GPT-4o can effectively handle the complexity and variability found in narrative clinical descriptions.
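
Two models can post almost the same accuracy and still differ widely on F1, because F1 penalizes classes that are missed or over-predicted. The sketch below is a minimal illustration in plain Python, not the study's evaluation code. It assumes that "weighted" means per-class F1 averaged by class support (the same idea as scikit-learn's `f1_score(..., average="weighted")`); the article does not spell out the study's exact definition. The labels are invented for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support.

    Assumption: this support-weighted average stands in for the study's
    "weighted macro F1-score", whose exact definition is not given here.
    """
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Invented toy labels, not data from the study.
y_true = ["Breast", "Breast", "Lung or Thoracic", "Metastasis", "Benign", "Unknown"]
y_pred = ["Breast", "Breast", "Lung or Thoracic", "Lung or Thoracic", "Benign", "Breast"]

print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

In this toy case the two missed classes ("Metastasis" and "Unknown") each contribute an F1 of zero, so the weighted F1 falls well below the accuracy.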

Other models (GPT-3.5, Gemini 1.5, and Llama 3.2) showed lower overall performance. Like BioBERT and GPT-4o, they struggled more with free-text entries than with structured ICD codes, a trend consistent across all five models that underscores the difficulty of processing the nuanced and often informal language of clinical notes.

Challenges and Misclassification Patterns

The study also shed light on common areas where models struggled. Misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. For instance, benign tumors were often misclassified as central nervous system tumors by several models. In free-text analysis, all models found it particularly challenging to correctly assign diagnoses as “Unknown” when information was insufficient, often attempting to force vague descriptions into specific cancer categories.
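
Misclassification patterns like these are typically surfaced by counting which (reference, predicted) label pairs disagree most often. The following is a small sketch of that idea, not the study's analysis code. The function name `top_confusions`, the label string "Central Nervous System", and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (reference, predicted) pairs that disagree."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)


# Invented example labels; "Central Nervous System" is an assumed category name.
y_true = ["Benign", "Benign", "Metastasis", "Unknown", "Breast"]
y_pred = [
    "Central Nervous System",
    "Central Nervous System",
    "Central Nervous System",
    "Breast",
    "Breast",
]

for (reference, predicted), count in top_confusions(y_true, y_pred):
    print(f"{reference} -> {predicted}: {count}")
```

Sorting error pairs by frequency makes it easy to see, for example, whether "Unknown" cases are being forced into a specific category.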

The unstructured nature of free-text entries, which often contain abbreviations, shorthand, and nonstandardized phrasing, presented significant hurdles. These factors contribute to reduced model accuracy and emphasize the need for better preprocessing or confidence-based filtering of ambiguous inputs.
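
As one hedged illustration of what such preprocessing could look like, the sketch below lowercases an entry, collapses whitespace, and expands a few shorthand tokens before classification. The abbreviation table is a hypothetical example, not taken from the study, and real clinical shorthand is often ambiguous (for instance, "ca" may not always mean cancer), so a production mapping would need expert review.

```python
import re

# Hypothetical shorthand table for illustration only; not from the study.
ABBREVIATIONS = {
    "ca": "cancer",
    "mets": "metastases",
    "hx": "history of",
}


def normalize(text):
    """Lowercase, collapse whitespace, and expand known shorthand tokens."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    # Replace whole alphabetic tokens only, so "cat" is not touched by "ca".
    return re.sub(
        r"\b[a-z]+\b",
        lambda m: ABBREVIATIONS.get(m.group(0), m.group(0)),
        text,
    )


print(normalize("  Breast CA   w/ mets "))
```

Tokens that are not in the table pass through unchanged, which keeps the step conservative: it only rewrites what has been explicitly mapped.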

Implications for Healthcare

The findings suggest that while current AI performance levels are sufficient for administrative and research tasks, such as automating billing and documentation, reliable clinical applications will require more robust solutions. The lower performance on free-text inputs, especially in high-stakes categories like metastasis, indicates that relying solely on automated outputs could pose clinical risks.

The researchers propose a hybrid approach: automating routine cases while flagging ambiguous ones for expert review. This would involve incorporating confidence thresholds to identify uncertain predictions, rule-based validation checks, and human-in-the-loop systems to review low-confidence cases. Such an approach would balance efficiency with the critical need for clinical reliability.
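
A minimal sketch of how such a hybrid routing step might be wired together is shown below. It is an illustration under stated assumptions, not the researchers' implementation: the `Prediction` type, the `route` function, the category subset, and both threshold values are hypothetical, and it assumes the classifier exposes a confidence score between 0 and 1.

```python
from dataclasses import dataclass

# Illustrative subset of categories; the study used 14 predefined types.
VALID_CATEGORIES = {"Breast", "Lung or Thoracic", "Metastasis", "Benign", "Unknown"}
# Categories the article describes as harder; held to a stricter threshold here.
HIGH_STAKES = {"Metastasis", "Unknown"}


@dataclass
class Prediction:
    text: str          # original diagnosis string
    category: str      # label proposed by the model
    confidence: float  # assumed model confidence in [0, 1]


def route(pred, threshold=0.90, high_stakes_threshold=0.97):
    """Return "auto" to accept a prediction or "review" to send it to an expert.

    Both thresholds are hypothetical placeholders that would need tuning.
    """
    # Rule-based validation: reject labels outside the known category set.
    if pred.category not in VALID_CATEGORIES:
        return "review"
    # Confidence threshold, stricter for high-stakes categories.
    limit = high_stakes_threshold if pred.category in HIGH_STAKES else threshold
    return "auto" if pred.confidence >= limit else "review"


print(route(Prediction("invasive ductal carcinoma", "Breast", 0.95)))
print(route(Prediction("lesion, possible mets", "Metastasis", 0.95)))
```

Using a separate, stricter threshold for high-stakes categories is one way to send more of the risky cases to human reviewers while still automating routine ones.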

Future work should focus on improving free-text classification, refining thresholds for complex diagnoses, and validating these models across diverse, multi-institutional datasets to ensure their generalizability and robustness in various clinical settings. This research highlights both the immense promise and the significant challenges of integrating language models into automated cancer diagnosis classification, paving the way for more sophisticated and reliable tools in healthcare workflows.
