Assessing Hidden Instability in Clinical AI: A New Approach to LLM Evaluation

TLDR: A new framework, LAPD, and metric, LDFR, are introduced to evaluate clinical LLMs’ hidden diagnostic fragility. It shows that small input changes (like masking symptoms) can cause internal reasoning shifts and lead to incorrect diagnoses, even when surface-level text similarity remains high. This highlights the need for deeper, geometry-aware evaluation for safe AI in healthcare.

The paper “Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs” by Raj Krishnan Vijayaraj explores a critical issue in the application of large language models (LLMs) for clinical decision support. While these advanced AI models often show impressive accuracy on standard medical benchmarks, they can surprisingly fail when faced with subtle, yet clinically significant, changes in input. Imagine a scenario where a minor alteration, like masking a symptom or negating a finding in a patient’s notes, leads to a completely different diagnosis from the LLM. This is the “diagnostic fragility” the research addresses.

Traditional evaluation methods for LLMs, which often rely on surface-level metrics like F1 score or BERTScore, are not sensitive enough to detect these underlying issues. They can report high similarity between the original and perturbed text even when the model’s internal reasoning has fundamentally shifted and the diagnosis has become unstable. The core problem is that these metrics don’t capture shifts in the model’s “latent representations,” the internal numerical representations that the LLM uses to understand and process information.

To tackle this, the paper introduces a new evaluation framework called LAPD, which stands for Latent Agentic Perturbation Diagnostics. This framework is designed to systematically probe the hidden robustness of clinical LLMs. Within LAPD, a key new metric is proposed: Latent Diagnosis Flip Rate (LDFR). LDFR is a model-agnostic signal that measures how often a small, structured change to the input causes the LLM’s internal “embedding” (its numerical representation of the text) to cross a diagnostic decision boundary in a simplified latent space. Essentially, it checks if the model’s internal understanding of the patient’s condition “flips” even if the text looks similar.
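To make the idea of a latent flip concrete, here is a minimal sketch rather than the paper's implementation. It assumes embeddings are already available as NumPy arrays, uses PCA as the "simplified latent space," and uses a nearest-centroid rule as a stand-in for the diagnostic decision boundary. The function names (`fit_pca`, `latent_flip_rate`, and so on) are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Fit PCA via SVD on the original embeddings; return (mean, components)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Project embeddings into the low-dimensional latent space."""
    return (X - mean) @ components.T

def fit_centroids(Z, labels):
    """Assumed decision rule: one centroid per diagnosis in latent space."""
    classes = np.unique(labels)
    centroids = np.stack([Z[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(Z, classes, centroids):
    """Assign each latent point to the diagnosis with the nearest centroid."""
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def latent_flip_rate(emb_orig, emb_pert, labels, n_components=2):
    """Fraction of notes whose latent diagnosis changes after perturbation."""
    mean, comps = fit_pca(emb_orig, n_components)
    z_orig = project(emb_orig, mean, comps)
    z_pert = project(emb_pert, mean, comps)  # same basis as the originals
    classes, centroids = fit_centroids(z_orig, labels)
    pred_orig = predict(z_orig, classes, centroids)
    pred_pert = predict(z_pert, classes, centroids)
    return float(np.mean(pred_orig != pred_pert))
```

The perturbed embeddings are projected with the basis fitted on the originals, so a flip reflects movement across a fixed boundary rather than a change in the boundary itself. A different classifier could be swapped in for the centroid rule without changing the overall shape of the computation.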

The researchers generated clinical notes using a structured prompting pipeline, mimicking real-world diagnostic reasoning. They then intentionally perturbed these notes along four specific axes: masking (omitting information), negation (reversing a finding), synonym replacement (using different but equivalent terms), and numerical variation (altering vital signs or lab results slightly). These perturbations are not random; they simulate common ambiguities and omissions found in actual clinical documentation.
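The four perturbation axes can be illustrated with a few small text transforms. This is only a sketch under simple assumptions: the entity, finding, and synonym lists are supplied by the caller, and the function names are hypothetical rather than taken from the paper, whose pipeline is not described at this level of detail.

```python
import re

def _sub_term(note, term, replacement):
    """Replace whole-word, case-insensitive occurrences of a term."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.sub(pattern, lambda m: replacement, note, flags=re.IGNORECASE)

def mask_entities(note, entities, token="[MASK]"):
    """Masking: hide each listed clinical entity."""
    for ent in entities:
        note = _sub_term(note, ent, token)
    return note

def negate_findings(note, findings):
    """Negation: reverse each listed finding by prefixing 'no'."""
    for finding in findings:
        note = _sub_term(note, finding, f"no {finding}")
    return note

def replace_synonyms(note, synonyms):
    """Synonym replacement: swap terms for equivalent wording."""
    for term, alt in synonyms.items():
        note = _sub_term(note, term, alt)
    return note

def vary_numbers(note, rel_change=0.1):
    """Numerical variation: scale every number by (1 + rel_change)."""
    def bump(match):
        text = match.group(0)
        value = float(text) * (1 + rel_change)
        # Keep one decimal place for decimals, whole numbers for integers.
        return f"{value:.1f}" if "." in text else str(round(value))
    return re.sub(r"\d+(?:\.\d+)?", bump, note)

note = "Patient reports chest pain and dyspnea. HR 100, WBC 12.0."
print(mask_entities(note, ["dyspnea"]))
print(negate_findings(note, ["chest pain"]))
print(replace_synonyms(note, {"dyspnea": "shortness of breath"}))
print(vary_numbers(note, rel_change=0.1))
```

Each transform changes only a few tokens, which is why the perturbed note can stay close to the original on surface metrics while still removing or reversing clinically relevant information.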

The study applied LAPD and computed LDFR across both general-purpose foundation LLMs (like GPT-3.5, GPT-4o, LLaMA, and Mistral) and a clinical-specific LLM (MedGemma). A significant finding was that this “latent fragility” emerged even with minimal surface-level changes to the input. For instance, masking entities consistently triggered large changes in LDFR, even when surface metrics like BERTScore remained high (above 0.9). This suggests that LLMs rely heavily on explicitly stated entities for diagnosis. Negation also produced a notable shift in LDFR along with a drop in accuracy, indicating sensitivity to polarity shifts. Synonym replacement and numerical changes had less impact on surface metrics but still resulted in latent shifts, especially at higher perturbation levels. Numerical perturbations were found to be the least disruptive, implying that LLMs might underutilize quantitative signals in clinical text.
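A toy example shows why a surface metric can stay high after a clinically meaningful edit. The sketch below uses a simple token-overlap F1 as a stand-in for metrics like BERTScore, which would require a pretrained model. The function `token_f1` is a hypothetical helper, not the study's scoring code.

```python
from collections import Counter

def token_f1(reference, candidate):
    """Token-overlap F1 between two texts (lowercased, whitespace-split)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

original = "patient reports chest pain and dyspnea"
masked = "patient reports chest pain and [MASK]"
print(round(token_f1(original, masked), 3))
```

Only one of six tokens differs, so the overlap score stays high even though a symptom has been removed entirely. In a full-length note, the same single-entity edit would move the score even less.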

The research also validated these findings on 90 real clinical notes from the DiReCT benchmark (MIMIC-IV dataset), confirming that LDFR generalizes beyond synthetic settings. The patterns of latent fragility observed in synthetic notes were largely consistent with those in real clinical documentation, although negation led to a sharper degradation in real notes, likely due to the complexity of natural language context.

Furthermore, the study analyzed how variance redistributes across PCA dimensions under perturbation. Masked entities caused variance to concentrate sharply along a few axes, indicating semantic bottlenecks where embeddings compress into narrow subspaces. This “dimensional collapse” aligns with increased LDFR, suggesting that perturbations expose low-dimensional instability patterns that are invisible to surface-level metrics.
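One way to quantify this kind of variance concentration is to compare the PCA explained-variance spectrum before and after perturbation. The sketch below is illustrative only: it assumes embeddings are NumPy arrays and uses the participation ratio as a hypothetical summary of "effective dimensionality," which is not necessarily the statistic the study reports.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def effective_dim(X):
    """Participation ratio: 1 / sum(p_i^2) over variance shares p_i.

    Equals the number of dimensions when variance is spread evenly and
    approaches 1 when a single axis dominates.
    """
    p = explained_variance_ratio(X)
    return float(1.0 / np.sum(p ** 2))

# Synthetic illustration: shrink all but one axis to mimic collapse.
rng = np.random.default_rng(0)
original = rng.normal(size=(200, 4))
collapsed = original.copy()
collapsed[:, 1:] *= 0.01

print(effective_dim(original))
print(effective_dim(collapsed))
```

A sharp drop in effective dimensionality between the original and perturbed embeddings would be consistent with the "dimensional collapse" described above.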

The implications of this research are significant for AI safety in healthcare. It reveals a persistent gap between how robust an LLM appears on the surface and its actual semantic stability. LDFR offers a geometry-aware diagnostic signal that can identify inputs where model outputs are unstable under subtle perturbations. This makes it a promising tool for auditing the reliability of clinical AI systems and supporting their safe and interpretable deployment in real-world healthcare environments. For more in-depth technical details, refer to the full research paper, “Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs.”