TLDR: The paper introduces a framework, LAPD, and a metric, LDFR, for evaluating hidden diagnostic fragility in clinical LLMs. It shows that small input changes (such as masking symptoms) can shift a model’s internal representations and lead to incorrect diagnoses, even when surface-level text similarity remains high. This highlights the need for deeper, geometry-aware evaluation for safe AI in healthcare.
The paper “Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs” by Raj Krishnan Vijayaraj explores a critical issue in the application of large language models (LLMs) for clinical decision support. While these advanced AI models often show impressive accuracy on standard medical benchmarks, they can surprisingly fail when faced with subtle, yet clinically significant, changes in input. Imagine a scenario where a minor alteration, like masking a symptom or negating a finding in a patient’s notes, leads to a completely different diagnosis from the LLM. This is the “diagnostic fragility” the research addresses.
Traditional evaluation methods for LLMs, which often rely on surface-level metrics like F1 score or BERTScore, are not sensitive enough to detect these underlying issues. They can report high similarity between the original and perturbed text even when the model’s internal reasoning has fundamentally shifted and its diagnosis has become unstable. The core problem is that these metrics don’t capture shifts in the model’s “latent representations”, the internal numerical representations that the LLM uses to understand and process information.
To tackle this, the paper introduces a new evaluation framework called LAPD, which stands for Latent Agentic Perturbation Diagnostics. This framework is designed to systematically probe the hidden robustness of clinical LLMs. Within LAPD, a key new metric is proposed: Latent Diagnosis Flip Rate (LDFR). LDFR is a model-agnostic signal that measures how often a small, structured change to the input causes the LLM’s internal “embedding” (its numerical representation of the text) to cross a diagnostic decision boundary in a simplified latent space. Essentially, it checks if the model’s internal understanding of the patient’s condition “flips” even if the text looks similar.
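To make the idea of a latent "flip" concrete, here is a minimal sketch of how a flip rate of this kind could be computed. It is not the paper's implementation: the PCA-by-SVD projection, the nearest-centroid decision rule, and every function name below are assumptions chosen for illustration, and the embeddings are assumed to come from whatever encoder is being audited.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Fit PCA with an SVD; returns the feature mean and top principal directions."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Project embeddings into the reduced latent space."""
    return (X - mean) @ components.T

def fit_centroids(Z, labels):
    """One centroid per diagnosis label (assumed stand-in for a decision boundary)."""
    classes = np.unique(labels)
    centroids = np.stack([Z[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(Z, classes, centroids):
    """Assign each latent point to the diagnosis with the nearest centroid."""
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def latent_flip_rate(orig_emb, pert_emb, labels, n_components=2):
    """Fraction of notes whose latent diagnosis changes after perturbation."""
    orig_emb = np.asarray(orig_emb, dtype=float)
    pert_emb = np.asarray(pert_emb, dtype=float)
    labels = np.asarray(labels)
    # PCA and centroids are fitted on the unperturbed notes only.
    mean, comps = fit_pca(orig_emb, n_components)
    z_orig = project(orig_emb, mean, comps)
    z_pert = project(pert_emb, mean, comps)
    classes, centroids = fit_centroids(z_orig, labels)
    before = predict(z_orig, classes, centroids)
    after = predict(z_pert, classes, centroids)
    return float(np.mean(before != after))
```

Row `i` of `orig_emb` and `pert_emb` must describe the same note before and after perturbation. Fitting the projection and the centroids on the original notes only keeps the decision regions fixed, so any change in the predicted label is attributable to the perturbation rather than to a refitted boundary.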
The researchers generated clinical notes using a structured prompting pipeline, mimicking real-world diagnostic reasoning. They then intentionally perturbed these notes along four specific axes: masking (omitting information), negation (reversing a finding), synonym replacement (using different but equivalent terms), and numerical variation (altering vital signs or lab results slightly). These perturbations are not random; they simulate common ambiguities and omissions found in actual clinical documentation.
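The four perturbation axes can be illustrated with a few simple string transforms. This is only a rough sketch under stated assumptions: the function names, the `[MASK]` token, the "no <finding>" negation template, and the relative-jitter rule for numbers are all hypothetical, and the paper's actual pipeline may differ.

```python
import random
import re

def mask_entities(note, entities, mask_token="[MASK]"):
    """Masking: hide each listed clinical entity behind a placeholder token."""
    for entity in entities:
        pattern = rf"\b{re.escape(entity)}\b"
        note = re.sub(pattern, lambda m: mask_token, note, flags=re.IGNORECASE)
    return note

def negate_finding(note, finding):
    """Negation: reverse the first mention of a finding by prefixing 'no'."""
    pattern = rf"\b{re.escape(finding)}\b"
    return re.sub(pattern, lambda m: f"no {m.group(0)}", note,
                  count=1, flags=re.IGNORECASE)

def replace_synonyms(note, synonyms):
    """Synonym replacement: swap terms for clinically equivalent wording."""
    for term, alternative in synonyms.items():
        pattern = rf"\b{re.escape(term)}\b"
        note = re.sub(pattern, lambda m, alt=alternative: alt, note,
                      flags=re.IGNORECASE)
    return note

def jitter_numbers(note, rel_change=0.05, rng=None):
    """Numerical variation: scale every number by a random factor within +/- rel_change."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible

    def bump(match):
        text = match.group(0)
        new_value = float(text) * (1 + rng.uniform(-rel_change, rel_change))
        # Keep one decimal for decimals, whole numbers for integers.
        return f"{new_value:.1f}" if "." in text else str(round(new_value))

    return re.sub(r"\d+(?:\.\d+)?", bump, note)
```

For a note such as "Patient reports fever and dyspnea. Temp 38.5 C, HR 110.", `mask_entities(note, ["fever"])` hides the symptom while `negate_finding(note, "dyspnea")` turns it into "no dyspnea". A real pipeline would need proper entity recognition and negation handling rather than these regex shortcuts.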
The study applied LAPD and computed LDFR across both general-purpose foundation LLMs (like GPT-3.5, GPT-4o, LLaMA, and Mistral) and a clinical-specific LLM (MedGemma). A significant finding was that this “latent fragility” emerged even with minimal surface-level changes to the input. For instance, masking entities consistently triggered large changes in LDFR, even when surface metrics like BERTScore remained high (above 0.9). This suggests that LLMs heavily rely on explicitly stated entities for diagnosis. Negation also caused a notable drop in LDFR and accuracy, indicating sensitivity to polarity shifts. Synonym replacement and numerical changes had less impact on surface metrics but still resulted in latent shifts, especially at higher perturbation levels. Numerical perturbations were found to be the least disruptive, implying that LLMs might underutilize quantitative signals in clinical text.
The research also validated these findings on 90 real clinical notes from the DiReCT benchmark (MIMIC-IV dataset), suggesting that LDFR generalizes beyond synthetic settings. The patterns of latent fragility observed in synthetic notes were largely consistent with those in real clinical documentation, although negation led to a sharper degradation in real notes, likely due to the complexity of natural language context.
Furthermore, the study analyzed how variance redistributes across PCA dimensions under perturbation. Masked entities caused variance to concentrate sharply along a few axes, indicating semantic bottlenecks where embeddings compress into narrow subspaces. This “dimensional collapse” aligns with increased LDFR, suggesting that perturbations expose low-dimensional instability patterns that are invisible to surface-level metrics.
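One way to quantify this kind of concentration is to look at how explained variance is distributed across principal components. The sketch below is an assumption-laden illustration, not the paper's analysis: it summarizes the spectrum with the participation ratio, a common "effective dimensionality" measure that is swapped in here for convenience, and the function names are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    # Singular values of the centered data; their squares are proportional
    # to the variance along each principal direction.
    singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def participation_ratio(ratios):
    """Effective number of dimensions: 1 / sum(p_i^2) over the variance shares p_i."""
    return float(1.0 / np.sum(np.square(ratios)))
```

Comparing `participation_ratio(explained_variance_ratio(original_embeddings))` with the same quantity for perturbed embeddings gives a single number for how much the variance has collapsed: a value near the number of dimensions means variance is spread evenly, while a value near 1 means almost all of it sits on one axis.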
The implications of this research are significant for AI safety in healthcare. It reveals a persistent gap between how robust an LLM appears on the surface and its actual semantic stability. LDFR offers a valuable, geometry-aware diagnostic signal that can identify inputs where model outputs are unstable under subtle perturbations. This makes it a promising tool for auditing the reliability of clinical AI systems and ensuring their safe and interpretable deployment in real-world healthcare environments. For more in-depth technical details, refer to the full research paper, “Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs.”


