
Evaluating AI’s Clinical Judgment: A New Benchmark for Sequential Reasoning

TL;DR: VivaBench is a novel multi-turn benchmark designed to assess the sequential clinical reasoning abilities of Large Language Models (LLMs). By simulating interactive medical examinations, the study found that although LLMs possess medical knowledge and perform well when complete information is provided upfront, they struggle to actively gather information, manage diagnostic uncertainty, and avoid common clinical reasoning errors in dynamic, multi-step scenarios. This highlights a critical gap in current AI capabilities for real-world clinical decision support.

Large Language Models (LLMs) are rapidly advancing, and their potential applications in critical fields like healthcare are immense. However, a new study highlights a significant gap between what these AI models know and how they apply that knowledge in real-world, dynamic clinical scenarios. Researchers have introduced VivaBench, a novel benchmark designed to evaluate the sequential clinical reasoning abilities of LLM agents, moving beyond traditional single-turn knowledge recall tests.

The Challenge of Clinical Reasoning for AI

Clinical reasoning in medicine is a complex, hypothesis-driven process. Physicians start with limited information and iteratively gather more data through patient history, physical examinations, and diagnostic tests to refine their diagnoses. Current medical benchmarks for LLMs, however, primarily assess knowledge recall through single-turn questions where all necessary information is provided upfront. This approach fails to capture the dynamic, iterative nature of real-world problem-solving essential in healthcare.

Introducing VivaBench: A Multi-Turn Evaluation

To address this critical gap, Christopher Chiu, Silviu Pitis, and Mihaela van der Schaar developed VivaBench. This multi-turn benchmark simulates a “viva voce” examination, an interactive oral exam used in medical training. The dataset comprises 1152 physician-curated clinical vignettes structured as interactive scenarios. In these scenarios, LLM agents must actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis.
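To make the interactive format concrete, a single vignette might be represented along the following lines. This is a minimal sketch in Python; the field names and types are illustrative assumptions, not the paper's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalVignette:
    """One interactive case (all field names are illustrative assumptions)."""
    stem: str                       # initial presentation shown to the agent
    history: dict[str, str]         # history findings, keyed by question topic
    examination: dict[str, str]     # physical exam findings, keyed by system or maneuver
    investigations: dict[str, str]  # lab and imaging results, keyed by ordered test
    diagnosis: str                  # ground-truth diagnosis (e.g. an ICD-10 code)
    differentials: list[str] = field(default_factory=list)  # acceptable alternatives
```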

How VivaBench Works

VivaBench operationalizes the viva voce concept by creating a simulated diagnostic encounter. An agent receives an initial clinical stem with limited background information. It then proceeds through two distinct phases:

  • Review Phase: The agent interviews the patient (history taking) and conducts a physical examination. After this, it provides a provisional diagnosis with an associated confidence level.
  • Investigation Phase: The agent orders laboratory tests or imaging studies to refine its diagnostic hypotheses. Once satisfied, it provides a final diagnosis.

An examiner module retrieves and presents the specifically requested clinical information, mimicking the progressive information exchange of a real medical examination. The dataset uses standardized medical terminologies like SNOMED-CT, LOINC, and ICD-10 to ensure precision.
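Putting these pieces together, the examiner-agent loop might look like the following simplified sketch, reusing the ClinicalVignette structure above. The agent and examiner interfaces here are hypothetical stand-ins, not the benchmark's actual API.

```python
def run_viva(agent, examiner, vignette: ClinicalVignette):
    """Simplified two-phase diagnostic encounter (hypothetical interfaces)."""
    context = [vignette.stem]  # the agent starts from the clinical stem alone

    # Review phase: history taking and physical examination.
    while (query := agent.next_question(context)) is not None:
        finding = examiner.lookup(vignette, query)  # reveal only what was asked for
        context.append(f"{query}: {finding}")
    provisional, confidence = agent.provisional_diagnosis(context)

    # Investigation phase: laboratory tests and imaging studies.
    while (test := agent.next_investigation(context)) is not None:
        result = examiner.lookup(vignette, test)
        context.append(f"{test}: {result}")
    final = agent.final_diagnosis(context)

    return provisional, confidence, final
```

Because information arrives only on request, the agent's accuracy depends on which questions it chooses to ask and when it decides to stop, which is precisely the behavior the benchmark isolates.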

Key Findings: LLMs Struggle with Uncertainty

The researchers evaluated several state-of-the-art LLMs, including Gemini 2.5 Pro, DeepSeek-R1, o4-mini, Llama-4 Maverick, Grok 3 Mini Beta, and Qwen 3. The results revealed a significant performance degradation when models were required to navigate diagnostic uncertainty in the interactive examination format, compared to scenarios where complete clinical information was provided upfront. While models demonstrated competence in diagnosing conditions with well-described presentations, their accuracy often doubled when given all information at once, suggesting they possess the knowledge but struggle with the process of acquiring it strategically.

Gemini 2.5 Pro consistently outperformed the other models, achieving the highest top-1 accuracy in both the final-diagnosis (35%) and full-information (69%) settings. However, even the best models showed a substantial performance gap between the interactive and full-information conditions.

Common Failure Modes Identified

The analysis identified several failure modes in LLM reasoning that mirror common issues in clinical practice:

  • Fixation on initial hypotheses: Models often exhibited anchoring bias, sticking to their first ideas.
  • Excessive investigation ordering: Agents sometimes ordered too many or inappropriate tests.
  • Premature diagnostic closure: Identifying one diagnosis and stopping further investigation, potentially missing underlying causes.
  • Missing critical conditions: Failing to consider or rule out time-sensitive diagnoses.
  • Inadequate investigations: For example, ordering a non-contrast CT for a suspected pontine stroke, a modality with low sensitivity for such lesions, and then arriving at an incorrect diagnosis.
  • Inappropriate hypothesis generation: Prioritizing unlikely conditions based on patient context, such as focusing on heart failure in an infant with feeding difficulties without sufficient supporting evidence.


Implications for Medical AI and Beyond

The findings from VivaBench highlight fundamental limitations in how current LLMs manage uncertainty and gather information sequentially. This research has significant implications for the development of medical AI, providing a standardized benchmark for evaluating conversational AI systems intended for real-world clinical decision support. Beyond medicine, it contributes to the broader field of agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex, information-gathering decision-making environments.

For more detailed information, you can read the full research paper available here.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
