H-DDx: A Smarter Approach to Assessing AI's Diagnostic Skills in Healthcare

TLDR: The H-DDx framework introduces a novel hierarchical evaluation method for AI models in differential diagnosis. It addresses the limitations of traditional flat metrics by leveraging the ICD-10 taxonomy to credit clinically relevant near-misses. The research demonstrates that H-DDx provides a more accurate assessment of model performance, particularly highlighting the strengths of domain-specialized models and offering enhanced interpretability of AI’s diagnostic reasoning patterns.

In the complex world of patient care, an accurate differential diagnosis (DDx) is crucial. It guides treatment decisions and significantly impacts patient outcomes. Recently, Large Language Models (LLMs) have shown great promise in helping doctors by generating lists of possible diagnoses from patient case descriptions.

However, the way we currently evaluate these AI models in diagnosis has a major flaw. Most evaluations rely on simple “flat” metrics, like checking if the correct diagnosis is in the top few predictions (Top-k accuracy). The problem is, these metrics don’t differentiate between a “near-miss” that is still clinically relevant (like suggesting a common cold for influenza) and a completely irrelevant error (like suggesting a migraine for influenza). Both are counted as wrong, which doesn’t truly reflect how useful the AI’s suggestion might be to a doctor.
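To make the limitation concrete, here is a minimal sketch of a flat Top-k check. The function name and the example prediction lists are hypothetical and chosen only to mirror the influenza example above; the paper's exact Top-k definition may differ.

```python
def top_k_hit(predictions, truth, k=5):
    """Return 1.0 if the true diagnosis is among the first k predictions, else 0.0."""
    return 1.0 if truth in predictions[:k] else 0.0


truth = "influenza"
near_miss = ["common cold", "sinusitis", "bronchitis"]      # clinically related guesses
irrelevant = ["migraine", "tension headache", "vertigo"]    # unrelated guesses

# Both lists miss the exact label, so a flat metric cannot tell them apart.
print(top_k_hit(near_miss, truth), top_k_hit(irrelevant, truth))
```

Both lists receive the same score, even though the first stays within respiratory illness and the second does not.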

To solve this, researchers have introduced a new evaluation framework called H-DDx. This hierarchical framework is designed to better reflect clinical relevance, offering a more nuanced and interpretable way to assess AI’s diagnostic capabilities. The full research paper is titled “H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis.”

How H-DDx Works

H-DDx uses the International Classification of Diseases 10th Revision (ICD-10) taxonomy, which is a globally recognized system that organizes medical conditions into a four-level tree structure (Chapter, Section, Category, Subcategory). This structure allows H-DDx to measure not just if a diagnosis is correct, but how “close” an incorrect diagnosis is to the correct one. For example, conditions within the same branch of the ICD-10 tree often share similar anatomical systems or causes, making them clinically related.
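As a rough illustration of how “closeness” in the tree can be read off, the sketch below builds the path from chapter to code for a few conditions and counts how many leading levels two codes share. The lookup table is a tiny hand-written excerpt, not the full taxonomy; the code ranges reflect my understanding of ICD-10 and should be checked against an official release. The function names are hypothetical.

```python
# Toy excerpt of the ICD-10 tree: category -> (chapter range, section range).
# Illustrative only; a real system would load the full taxonomy.
TOY_PARENTS = {
    "J00": ("J00-J99", "J00-J06"),  # common cold
    "J11": ("J00-J99", "J09-J18"),  # influenza
    "G43": ("G00-G99", "G40-G47"),  # migraine
}


def ancestor_path(code):
    """Return [chapter, section, category] plus the subcategory if one is given."""
    category = code.split(".")[0]
    chapter, section = TOY_PARENTS[category]
    path = [chapter, section, category]
    if "." in code:
        path.append(code)
    return path


def shared_depth(a, b):
    """Count how many leading hierarchy levels two codes have in common."""
    depth = 0
    for x, y in zip(ancestor_path(a), ancestor_path(b)):
        if x != y:
            break
        depth += 1
    return depth


print(shared_depth("J11.1", "J00"))    # influenza vs. common cold
print(shared_depth("J11.1", "G43.9"))  # influenza vs. migraine
```

In this toy table, influenza and the common cold share the respiratory chapter, while influenza and migraine share nothing.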

The framework involves two main steps. First, it maps the free-text diagnoses generated by LLMs to standardized ICD-10 codes. This is done with a two-stage pipeline: embedding-based retrieval finds candidate codes whose descriptions are similar to the free-text diagnosis, and LLM-based reranking then picks the best match among them. The researchers report that this mapping is highly accurate, so the AI’s free-text output can be compared consistently against the structured ICD-10 system.
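The retrieve-then-rerank idea can be sketched in a few lines. Everything here is a stand-in: word-count vectors replace a real embedding model, the `rerank` function is a placeholder where an LLM call would go, and the code descriptions are paraphrased for illustration. None of the names below come from the paper.

```python
import math
from collections import Counter

# Paraphrased descriptions for a few codes (illustrative, not an official list).
ICD_DESCRIPTIONS = {
    "J00": "acute nasopharyngitis common cold",
    "J11": "influenza virus not identified",
    "J18": "pneumonia unspecified organism",
    "G43": "migraine",
}


def embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, top_n=3):
    """Stage 1: return the top_n codes whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(ICD_DESCRIPTIONS, key=lambda c: cosine(q, embed(ICD_DESCRIPTIONS[c])), reverse=True)
    return ranked[:top_n]


def rerank(query, candidates):
    """Stage 2 placeholder: a real pipeline would ask an LLM to choose among candidates."""
    return candidates[0]


def map_to_icd(query):
    """Map a free-text diagnosis to a single code via retrieve-then-rerank."""
    return rerank(query, retrieve(query))


print(map_to_icd("common cold"))
```

Splitting the work this way keeps the expensive reranking step limited to a short candidate list instead of the whole code set.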

Second, H-DDx introduces a new metric called Hierarchical DDx F1 (HDF1). Unlike flat metrics, HDF1 expands both the ground-truth (actual diagnosis) and predicted diagnoses to include all their ancestral nodes in the ICD-10 hierarchy. This means that if an AI predicts a diagnosis that is not exactly correct but is closely related within the ICD-10 tree, it receives partial credit. This approach provides a more clinically grounded assessment, acknowledging that identifying the broader medical domain is valuable even if the precise diagnosis is missed.
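Following that description, a set-based version of the metric might look like the sketch below: expand both sides with their ancestors, then compute precision, recall, and F1 on the expanded sets. This is an assumption about the general shape of HDF1, not the paper's exact formula, and the small `ANCESTORS` table is a hand-written toy excerpt.

```python
# Toy ancestor table: code -> [chapter range, section range] (illustrative only).
ANCESTORS = {
    "J11": ["J00-J99", "J09-J18"],  # influenza
    "J00": ["J00-J99", "J00-J06"],  # common cold
    "G43": ["G00-G99", "G40-G47"],  # migraine
}


def expand(codes, ancestors):
    """Return the codes together with all of their ancestor nodes."""
    expanded = set()
    for code in codes:
        expanded.add(code)
        expanded.update(ancestors[code])
    return expanded


def hdf1(predicted, truth, ancestors):
    """F1 over ancestor-expanded prediction and ground-truth sets."""
    pred_set = expand(predicted, ancestors)
    true_set = expand(truth, ancestors)
    overlap = len(pred_set & true_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(true_set)
    return 2 * precision * recall / (precision + recall)


print(hdf1(["J00"], ["J11"], ANCESTORS))  # near-miss in the same chapter: partial credit
print(hdf1(["G43"], ["J11"], ANCESTORS))  # unrelated chapter: no credit
print(hdf1(["J11"], ["J11"], ANCESTORS))  # exact match: full credit
```

The near-miss earns credit only for the shared chapter node, which is how the hierarchy turns a flat “wrong” into a graded score.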

Key Findings and Insights

The researchers benchmarked 22 leading LLMs using H-DDx and found some significant results. Conventional flat metrics often underestimate the performance of domain-specialized open-source models. For instance, a model called MediPhi, which ranked 20th in traditional Top-5 Accuracy, jumped to 2nd place when evaluated with HDF1. This shows that these specialized models, while not always pinpointing the exact diagnosis, consistently generate lists of possibilities that are clinically coherent and useful within the correct medical domain.

H-DDx also enhances the interpretability of model behavior. By analyzing performance across different levels of the ICD-10 hierarchy (Chapter, Section, Category, Subcategory), the framework revealed a “hierarchical cascade pattern.” All models showed consistent performance degradation as the specificity increased, meaning they were better at identifying the broad medical category (Chapter level) than the highly specific subcategory diagnosis. This insight is invisible to flat metrics and highlights that LLMs often grasp the correct clinical context even when they miss the exact diagnosis.
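One simple way to look at this cascade is to ask, for each level, whether any prediction matches the ground truth when both are truncated to that depth. The sketch below is a hypothetical illustration of that idea with hand-written paths; the paper's level-wise metric may be defined differently.

```python
LEVELS = ["chapter", "section", "category", "subcategory"]


def level_hits(pred_paths, truth_path):
    """For each level, report whether any prediction agrees with the truth down to that depth."""
    hits = {}
    for depth, name in enumerate(LEVELS, start=1):
        target = tuple(truth_path[:depth])
        hits[name] = any(
            tuple(path[:depth]) == target for path in pred_paths if len(path) >= depth
        )
    return hits


# Hypothetical case: truth is an influenza subcategory; predictions are close but not exact.
truth_path = ["J00-J99", "J09-J18", "J11", "J11.1"]
pred_paths = [
    ["J00-J99", "J09-J18", "J10", "J10.1"],  # same section, different category
    ["J00-J99", "J00-J06", "J00"],           # same chapter only
]

print(level_hits(pred_paths, truth_path))
```

Averaging these per-level hits over many cases would show where along the tree a model starts to lose the thread.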

Real-World Clinical Value

Case studies further illustrated the practical benefits of HDF1. In one example, a patient with bronchiectasis was correctly diagnosed by a general-purpose model (GPT-4o), giving it a perfect Top-5 Accuracy. However, a specialized model (MediPhi) missed the exact diagnosis, scoring 0.0 on Top-5 Accuracy. Yet, HDF1 assigned MediPhi a much higher score because its predicted list included clinically relevant conditions like pneumonia and bronchitis, which fall within the same respiratory system chapter as bronchiectasis. This demonstrated that HDF1 prioritizes the clinical utility of the entire list of possibilities over a single correct prediction.

Another case showed how HDF1 could differentiate between a base model and its medically fine-tuned version, even when both failed to identify the exact diagnosis by flat metrics. The fine-tuned model received a significantly higher HDF1 score because its suggestions were more clinically relevant and included critical emergencies and taxonomically adjacent diagnoses.

In conclusion, H-DDx offers a more sophisticated and clinically meaningful way to evaluate AI models in differential diagnosis. By moving beyond simple accuracy, it provides a clearer picture of an LLM’s true utility, recognizing the value of clinically relevant near-misses and offering deeper insights into their diagnostic reasoning.
