TLDR: The H-DDx framework introduces a novel hierarchical evaluation method for AI models in differential diagnosis. It addresses the limitations of traditional flat metrics by leveraging the ICD-10 taxonomy to credit clinically relevant near-misses. The research demonstrates that H-DDx provides a more accurate assessment of model performance, particularly highlighting the strengths of domain-specialized models and offering enhanced interpretability of AI’s diagnostic reasoning patterns.
In the complex world of patient care, an accurate differential diagnosis (DDx) is crucial. It guides treatment decisions and significantly impacts patient outcomes. Recently, Large Language Models (LLMs) have shown great promise in helping doctors by generating lists of possible diagnoses from patient stories.
However, the way we currently evaluate these AI models in diagnosis has a major flaw. Most evaluations rely on simple “flat” metrics, like checking if the correct diagnosis is in the top few predictions (Top-k accuracy). The problem is, these metrics don’t differentiate between a “near-miss” that is still clinically relevant (like suggesting a common cold for influenza) and a completely irrelevant error (like suggesting a migraine for influenza). Both are counted as wrong, which doesn’t truly reflect how useful the AI’s suggestion might be to a doctor.
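To make the flaw concrete, here is a minimal sketch of a flat Top-k check. The function name `top_k_hit` and the example prediction lists are illustrative assumptions, not taken from the paper.

```python
def top_k_hit(predictions, truth, k=5):
    """Return 1.0 if the exact ground-truth label is among the first k predictions, else 0.0."""
    return 1.0 if truth in predictions[:k] else 0.0


truth = "influenza"

# Hypothetical model outputs: one clinically close list, one unrelated list.
near_miss = ["common cold", "acute bronchitis", "pneumonia"]
far_miss = ["migraine", "tension headache", "ankle sprain"]

# A flat metric cannot tell these two lists apart: both score zero.
print(top_k_hit(near_miss, truth))
print(top_k_hit(far_miss, truth))
```

Both calls return the same score, even though the first list stays within respiratory infections and the second does not.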
To solve this, researchers have introduced a new evaluation framework called H-DDx. This hierarchical framework is designed to better reflect clinical relevance, offering a more nuanced and interpretable way to assess AI’s diagnostic capabilities. You can read the full research paper here: H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis.
How H-DDx Works
H-DDx uses the International Classification of Diseases 10th Revision (ICD-10) taxonomy, which is a globally recognized system that organizes medical conditions into a four-level tree structure (Chapter, Section, Category, Subcategory). This structure allows H-DDx to measure not just if a diagnosis is correct, but how “close” an incorrect diagnosis is to the correct one. For example, conditions within the same branch of the ICD-10 tree often share similar anatomical systems or causes, making them clinically related.
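As a rough illustration of what "closeness" in the tree can mean, the sketch below walks a code up to its category, section, and chapter and counts how many levels two codes share. The block tables are a tiny hand-written excerpt of ICD-10 ranges for illustration only, and the helper names (`icd10_path`, `shared_depth`) are assumptions rather than anything defined in the paper.

```python
# Hand-written excerpt of ICD-10 blocks, for illustration only (not a full taxonomy).
CHAPTERS = [("G00", "G99"), ("J00", "J99")]
SECTIONS = [("G40", "G47"), ("J00", "J06"), ("J09", "J18")]


def _block(category, blocks):
    """Find the 'START-END' range that contains a three-character category."""
    for start, end in blocks:
        if start <= category <= end:
            return f"{start}-{end}"
    raise KeyError(f"{category} is not covered by this excerpt")


def icd10_path(code):
    """Return [chapter, section, category, subcategory] for a code such as 'J11.1'."""
    category = code.split(".")[0]
    path = [_block(category, CHAPTERS), _block(category, SECTIONS), category]
    if "." in code:
        path.append(code)  # only codes with a decimal part have a subcategory level
    return path


def shared_depth(code_a, code_b):
    """Count how many levels two codes share, starting from the chapter."""
    depth = 0
    for a, b in zip(icd10_path(code_a), icd10_path(code_b)):
        if a != b:
            break
        depth += 1
    return depth


print(icd10_path("J11.1"))             # influenza
print(shared_depth("J11.1", "J18.9"))  # vs. pneumonia: same chapter and section
print(shared_depth("J11.1", "J00"))    # vs. common cold: same chapter only
print(shared_depth("J11.1", "G43.9"))  # vs. migraine: nothing shared
```

In this toy excerpt, pneumonia shares two levels with influenza, the common cold shares one, and migraine shares none, which matches the intuition that the first two are nearer misses.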
The framework involves two main steps. First, it maps the free-text diagnoses generated by LLMs to standardized ICD-10 codes. This is done with a two-stage pipeline: embedding-based retrieval narrows the full code set to a shortlist of similar codes, and LLM-based reranking then picks the best match from that shortlist. The researchers report that this mapping is highly accurate, so the AI’s free-text output can be consistently compared against the structured ICD-10 system.
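The retrieve-then-rerank idea can be sketched in a few lines. Everything here is a stand-in: the bag-of-words `embed` function replaces a real embedding model, `rerank` replaces the LLM call with a trivial rule, and the code titles are paraphrased for a four-entry toy index. It shows the shape of the pipeline, not the authors' implementation.

```python
import math
from collections import Counter

# Toy index of ICD-10 codes with paraphrased titles (illustrative excerpt only).
ICD10_TITLES = {
    "J00": "acute nasopharyngitis common cold",
    "J11.1": "influenza with other respiratory manifestations virus not identified",
    "J18.9": "pneumonia unspecified",
    "G43.9": "migraine unspecified",
}


def embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, top_n=3):
    """Stage 1: shortlist the codes whose titles are most similar to the query."""
    q = embed(query)
    ranked = sorted(ICD10_TITLES, key=lambda c: cosine(q, embed(ICD10_TITLES[c])), reverse=True)
    return ranked[:top_n]


def rerank(query, candidates):
    """Stage 2 placeholder: a real system would ask an LLM to choose among candidates."""
    return candidates[0]


def map_to_icd10(query):
    """Map a free-text diagnosis to a single ICD-10 code."""
    return rerank(query, retrieve(query))


print(map_to_icd10("common cold"))
print(map_to_icd10("migraine"))
```

Splitting the work this way keeps the expensive reranking step confined to a handful of candidates instead of the whole code set.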
Second, H-DDx introduces a new metric called Hierarchical DDx F1 (HDF1). Unlike flat metrics, HDF1 expands both the ground-truth (actual diagnosis) and predicted diagnoses to include all their ancestral nodes in the ICD-10 hierarchy. This means that if an AI predicts a diagnosis that is not exactly correct but is closely related within the ICD-10 tree, it receives partial credit. This approach provides a more clinically grounded assessment, acknowledging that identifying the broader medical domain is valuable even if the precise diagnosis is missed.
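The ancestor-expansion idea can be illustrated with a common set-based formulation of hierarchical F1: expand both sides to include ancestors, then compute precision and recall on the expanded sets. This is a sketch under that assumption; the paper's exact HDF1 definition and aggregation may differ, and the block tables below are a small hand-written ICD-10 excerpt.

```python
# Hand-written excerpt of ICD-10 blocks, for illustration only.
CHAPTERS = [("G00", "G99"), ("J00", "J99")]
SECTIONS = [("G40", "G47"), ("J00", "J06"), ("J09", "J18")]


def _block(category, blocks):
    for start, end in blocks:
        if start <= category <= end:
            return f"{start}-{end}"
    raise KeyError(f"{category} is not covered by this excerpt")


def icd10_path(code):
    """Return the code together with its ancestors: chapter, section, category, (subcategory)."""
    category = code.split(".")[0]
    path = [_block(category, CHAPTERS), _block(category, SECTIONS), category]
    if "." in code:
        path.append(code)
    return path


def hierarchical_f1(truth_codes, predicted_codes):
    """Set-based hierarchical F1 over ancestor-augmented code sets."""
    truth_aug = set().union(*(icd10_path(c) for c in truth_codes))
    pred_aug = set().union(*(icd10_path(c) for c in predicted_codes))
    overlap = len(truth_aug & pred_aug)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_aug)
    recall = overlap / len(truth_aug)
    return 2 * precision * recall / (precision + recall)


truth = ["J11.1"]                                   # influenza
print(hierarchical_f1(truth, ["J00", "J18.9"]))     # common cold + pneumonia: about 0.4
print(hierarchical_f1(truth, ["G43.9"]))            # migraine: 0.0
print(hierarchical_f1(truth, ["J11.1"]))            # exact match: 1.0
```

The near-miss list earns partial credit because it shares the respiratory chapter and the influenza-and-pneumonia section with the ground truth, while the migraine prediction shares nothing.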
Key Findings and Insights
The researchers benchmarked 22 leading LLMs using H-DDx. The central finding is that conventional flat metrics often underestimate the performance of domain-specialized open-source models. For instance, MediPhi, which ranked 20th on traditional Top-5 Accuracy, jumped to 2nd place when evaluated with HDF1. This suggests that these specialized models, while not always pinpointing the exact diagnosis, consistently generate lists of possibilities that are clinically coherent and useful within the correct medical domain.
H-DDx also enhances the interpretability of model behavior. By analyzing performance across different levels of the ICD-10 hierarchy (Chapter, Section, Category, Subcategory), the framework revealed a “hierarchical cascade pattern.” All models showed consistent performance degradation as the specificity increased, meaning they were better at identifying the broad medical category (Chapter level) than the highly specific subcategory diagnosis. This insight is invisible to flat metrics and highlights that LLMs often grasp the correct clinical context even when they miss the exact diagnosis.
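A level-by-level breakdown of this kind can be sketched as follows. The five cases are made-up toy data chosen to show the shape of the pattern, not results from the paper, and the helper names and block tables are illustrative assumptions.

```python
# Hand-written excerpt of ICD-10 blocks, for illustration only.
CHAPTERS = [("G00", "G99"), ("J00", "J99")]
SECTIONS = [("G40", "G47"), ("J00", "J06"), ("J09", "J18")]
LEVELS = ["chapter", "section", "category", "subcategory"]


def _block(category, blocks):
    for start, end in blocks:
        if start <= category <= end:
            return f"{start}-{end}"
    raise KeyError(f"{category} is not covered by this excerpt")


def icd10_path(code):
    category = code.split(".")[0]
    path = [_block(category, CHAPTERS), _block(category, SECTIONS), category]
    if "." in code:
        path.append(code)
    return path


def level_hits(truth, predictions):
    """For each level of the truth code, report whether any prediction lands on the same node."""
    pred_nodes = {node for p in predictions for node in icd10_path(p)}
    return {level: node in pred_nodes for level, node in zip(LEVELS, icd10_path(truth))}


def level_accuracy(cases):
    """Average the per-level hits over a list of (truth, predictions) pairs."""
    totals = {level: 0 for level in LEVELS}
    for truth, predictions in cases:
        for level, hit in level_hits(truth, predictions).items():
            totals[level] += hit
    return {level: totals[level] / len(cases) for level in LEVELS}


# Toy cases (truth code, predicted codes); invented for illustration.
cases = [
    ("J11.1", ["J11.1"]),  # exact match
    ("J18.9", ["J18.0"]),  # right category, wrong subcategory
    ("J11.1", ["J18.9"]),  # right section only
    ("J11.1", ["J00"]),    # right chapter only
    ("G43.9", ["J00"]),    # wrong chapter
]
print(level_accuracy(cases))
```

On this toy data the accuracy falls from 0.8 at the chapter level to 0.2 at the subcategory level, which is the kind of stepwise drop the cascade pattern describes.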
Real-World Clinical Value
Case studies further illustrated the practical benefits of HDF1. In one example, a patient with bronchiectasis was correctly diagnosed by a general-purpose model (GPT-4o), giving it a perfect Top-5 Accuracy. However, a specialized model (MediPhi) missed the exact diagnosis, scoring 0.0 on Top-5 Accuracy. Yet, HDF1 assigned MediPhi a much higher score because its predicted list included clinically relevant conditions like pneumonia and bronchitis, which fall within the same respiratory system chapter as bronchiectasis. This demonstrated that HDF1 prioritizes the clinical utility of the entire list of possibilities over a single correct prediction.
Another case showed how HDF1 could differentiate between a base model and its medically fine-tuned version, even when both failed to identify the exact diagnosis by flat metrics. The fine-tuned model received a significantly higher HDF1 score because its suggestions were more clinically relevant and included critical emergencies and taxonomically adjacent diagnoses.
In conclusion, H-DDx offers a more sophisticated and clinically meaningful way to evaluate AI models in differential diagnosis. By moving beyond simple accuracy, it provides a clearer picture of an LLM’s true utility, recognizing the value of clinically relevant near-misses and offering deeper insight into its diagnostic reasoning.


