
Evaluating AI’s Clinical Judgment: A New Benchmark for Sequential Reasoning

TL;DR: VivaBench is a novel multi-turn benchmark designed to assess the sequential clinical reasoning abilities of Large Language Models (LLMs). By simulating interactive medical examinations, the study found that although LLMs possess medical knowledge and perform well when complete information is provided upfront, they struggle to actively gather information, manage diagnostic uncertainty, and avoid common clinical reasoning errors in dynamic, multi-step scenarios. This highlights a critical gap in current AI capabilities for real-world clinical decision support.

Large Language Models (LLMs) are rapidly advancing, and their potential applications in critical fields like healthcare are immense. However, a new study highlights a significant gap between what these AI models know and how they apply that knowledge in real-world, dynamic clinical scenarios. Researchers have introduced VivaBench, a novel benchmark designed to evaluate the sequential clinical reasoning abilities of LLM agents, moving beyond traditional single-turn knowledge recall tests.

The Challenge of Clinical Reasoning for AI

Clinical reasoning in medicine is a complex, hypothesis-driven process. Physicians start with limited information and iteratively gather more data through patient history, physical examinations, and diagnostic tests to refine their diagnoses. Current medical benchmarks for LLMs, however, primarily assess knowledge recall through single-turn questions where all necessary information is provided upfront. This approach fails to capture the dynamic, iterative nature of real-world problem-solving essential in healthcare.

Introducing VivaBench: A Multi-Turn Evaluation

To address this critical gap, Christopher Chiu, Silviu Pitis, and Mihaela van der Schaar developed VivaBench. This multi-turn benchmark simulates a “viva voce” examination, an interactive oral exam used in medical training. The dataset comprises 1152 physician-curated clinical vignettes structured as interactive scenarios. In these scenarios, LLM agents must actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis.
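To make the interactive format concrete, a single vignette might be represented along the following lines. This is a minimal sketch in Python; the field names and types are illustrative assumptions, not the paper's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalVignette:
    """One interactive case (all field names are illustrative assumptions)."""
    stem: str                       # initial presentation shown to the agent
    history: dict[str, str]         # history findings, keyed by question topic
    examination: dict[str, str]     # physical exam findings, keyed by system or maneuver
    investigations: dict[str, str]  # lab and imaging results, keyed by ordered test
    diagnosis: str                  # ground-truth diagnosis (e.g. an ICD-10 code)
    differentials: list[str] = field(default_factory=list)  # acceptable alternatives
```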

How VivaBench Works

VivaBench operationalizes the viva voce concept by creating a simulated diagnostic encounter. An agent receives an initial clinical stem with limited background information. It then proceeds through two distinct phases:

  • Review Phase: The agent interviews the patient (history taking) and conducts a physical examination. After this, it provides a provisional diagnosis with an associated confidence level.
  • Investigation Phase: The agent orders laboratory tests or imaging studies to refine its diagnostic hypotheses. Once satisfied, it provides a final diagnosis.

An examiner module retrieves and presents the specifically requested clinical information, mimicking the progressive information exchange of a real medical examination. The dataset uses standardized medical terminologies like SNOMED-CT, LOINC, and ICD-10 to ensure precision.
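Putting these pieces together, the examiner-agent loop might look like the following simplified sketch, reusing the ClinicalVignette structure above. The agent and examiner interfaces here are hypothetical stand-ins, not the benchmark's actual API.

```python
def run_viva(agent, examiner, vignette: ClinicalVignette):
    """Simplified two-phase diagnostic encounter (hypothetical interfaces)."""
    context = [vignette.stem]  # the agent starts from the clinical stem alone

    # Review phase: history taking and physical examination.
    while (query := agent.next_question(context)) is not None:
        finding = examiner.lookup(vignette, query)  # reveal only what was asked for
        context.append(f"{query}: {finding}")
    provisional, confidence = agent.provisional_diagnosis(context)

    # Investigation phase: laboratory tests and imaging studies.
    while (test := agent.next_investigation(context)) is not None:
        result = examiner.lookup(vignette, test)
        context.append(f"{test}: {result}")
    final = agent.final_diagnosis(context)

    return provisional, confidence, final
```

Because information arrives only on request, the agent's accuracy depends on which questions it chooses to ask and when it decides to stop, which is precisely the behavior the benchmark isolates.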

Key Findings: LLMs Struggle with Uncertainty

The researchers evaluated several state-of-the-art LLMs, including Gemini 2.5 Pro, DeepSeek-R1, o4-mini, Llama-4 Maverick, Grok 3 Mini Beta, and Qwen 3. The results revealed a significant performance degradation when models were required to navigate diagnostic uncertainty in the interactive examination format, compared to scenarios where complete clinical information was provided upfront. While models demonstrated competence in diagnosing conditions with well-described presentations, their accuracy often doubled when given all information at once, suggesting they possess the knowledge but struggle with the process of acquiring it strategically.

Gemini 2.5 Pro consistently outperformed the other models, achieving the highest top-1 accuracy in both the final-diagnosis (35%) and full-information (69%) settings. However, even the best models showed a substantial performance gap between the interactive and full-information conditions.

Common Failure Modes Identified

The analysis identified several failure modes in LLM reasoning that mirror common issues in clinical practice:

  • Fixation on initial hypotheses: Models often exhibited anchoring bias, sticking to their first ideas.
  • Excessive investigation ordering: Agents sometimes ordered too many or inappropriate tests.
  • Premature diagnostic closure: Identifying one diagnosis and stopping further investigation, potentially missing underlying causes.
  • Missing critical conditions: Failing to consider or rule out time-sensitive diagnoses.
  • Inadequate investigations: For example, ordering a non-contrast CT for a suspected pontine stroke, a modality with low sensitivity for such lesions, and then arriving at an incorrect diagnosis.
  • Inappropriate hypothesis generation: Prioritizing unlikely conditions based on patient context, such as focusing on heart failure in an infant with feeding difficulties without sufficient supporting evidence.


Implications for Medical AI and Beyond

The findings from VivaBench highlight fundamental limitations in how current LLMs manage uncertainty and gather information sequentially. This research has significant implications for the development of medical AI, providing a standardized benchmark for evaluating conversational AI systems intended for real-world clinical decision support. Beyond medicine, it contributes to the broader field of agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex, information-gathering decision-making environments.

For more detailed information, you can read the full research paper available here.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
