TLDR: A new research paper details the evaluation of DR.INFO, an AI-powered clinical assistant, using OpenAI’s HealthBench benchmark. DR.INFO achieved a score of 0.51 on the challenging HealthBench Hard subset, surpassing leading frontier LLMs like GPT-5 and o3. The evaluation, which uses physician-authored rubrics across various clinical themes and behavioral axes, highlights DR.INFO’s strengths in communication, instruction following, and accuracy, while also identifying areas for improvement in context awareness and completeness. The study emphasizes the importance of behavior-level, clinically grounded benchmarks for developing reliable medical AI.
The field of artificial intelligence in healthcare is rapidly advancing, with large language models (LLMs) showing immense potential to transform clinical practice. However, accurately evaluating these sophisticated AI systems, especially in high-stakes medical scenarios, presents a unique challenge. Traditional benchmarks, often relying on multiple-choice questions, fall short in assessing crucial competencies like contextual reasoning, context-seeking, and handling of uncertainty that are vital in real-world clinical settings.
To address this gap, OpenAI introduced HealthBench, a comprehensive, rubric-driven benchmark designed to evaluate LLMs on realistic, open-ended health conversations. This innovative benchmark moves beyond simple factual recall to assess how AI models behave across complex clinical scenarios, providing a more nuanced understanding of their capabilities and limitations.
Introducing DR.INFO: A Clinical Support Assistant
At Synduct, researchers developed DR.INFO, an agentic retrieval-augmented generation (RAG)-based clinical support assistant. DR.INFO had already demonstrated strong foundational knowledge, achieving an impressive 95.4% accuracy on the USMLE benchmark. However, the team hypothesized that a standardized exam like the USMLE might not fully capture DR.INFO’s practical competencies or potential failure modes in intricate, real-world clinical scenarios. This led them to evaluate DR.INFO on the HealthBench dataset, with a particular focus on its challenging ‘Hard’ subset.
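To make the agentic RAG pattern concrete, here is a minimal sketch of such a loop: retrieve evidence, check whether it looks sufficient, and generate a grounded answer. This is purely illustrative; the corpus, retriever, and `generate()` stub are stand-ins, and nothing here reflects DR.INFO’s actual implementation.

```python
# Illustrative agentic RAG loop (hypothetical; not DR.INFO's actual code).
# The toy corpus, keyword retriever, and generate() stub stand in for a real
# clinical knowledge base, vector search, and LLM API.

from dataclasses import dataclass

@dataclass
class Document:
    source: str
    text: str

# Toy in-memory corpus standing in for a clinical knowledge base.
CORPUS = [
    Document("guideline-001", "Adults with suspected sepsis should receive antibiotics within one hour."),
    Document("guideline-002", "Chest pain with ST elevation warrants immediate emergency referral."),
]

def retrieve(query: str, k: int = 2) -> list[Document]:
    """Naive keyword-overlap retriever; a real system would use dense or hybrid search."""
    scored = sorted(CORPUS, key=lambda d: -sum(w in d.text.lower() for w in query.lower().split()))
    return scored[:k]

def generate(prompt: str) -> str:
    """Stub for an LLM call; replace with the model API of your choice."""
    return f"[model answer grounded in retrieved context]\n{prompt[:120]}..."

def answer(query: str, max_steps: int = 2) -> str:
    """Agentic loop: retrieve, assess, and re-query if the evidence looks thin."""
    context: list[Document] = []
    for _ in range(max_steps):
        context.extend(retrieve(query))
        if len(context) >= 2:  # crude stopping rule; real agents use model-driven critique
            break
    cited = "\n".join(f"[{d.source}] {d.text}" for d in context)
    return generate(f"Question: {query}\nEvidence:\n{cited}\nAnswer with citations:")

if __name__ == "__main__":
    print(answer("patient with crushing chest pain, what should I do?"))
```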
Understanding HealthBench: Themes and Behavioral Axes
HealthBench is built around 5,000 realistic clinical conversations, each evaluated using a physician-authored rubric. These rubrics employ fine-grained, clinically relevant criteria across multiple behavioral axes. The benchmark categorizes evaluations into seven high-level themes, representing common health-related tasks:
- Emergency referrals: Assessing the model’s ability to recognize urgent situations and advise appropriate care.
- Context-seeking: Evaluating if the model can identify missing clinical details and actively ask for them.
- Global health: Testing adaptability of advice to different healthcare settings worldwide.
- Health data tasks: Measuring skill in structured applications like summarizing notes or interpreting labs.
- Expertise-tailored communication: Checking if the model adjusts its language based on the user’s background (layperson or professional).
- Responding under uncertainty (Hedging): Evaluating appropriate expression of uncertainty in ambiguous situations.
- Response depth: Assessing the provision of an appropriate level of detail.
Each rubric criterion in HealthBench is also assigned to one of five behavioral axes:
- Accuracy: Factual correctness and consistency with clinical knowledge.
- Completeness: Inclusion of all essential and relevant information.
- Context awareness: Responsiveness based on contextual cues like user role or regional constraints.
- Communication quality: Clarity and effectiveness of information presentation.
- Instruction following: Compliance with specific user instructions without compromising medical safety.
This multi-dimensional framework allows for a much richer, safety-oriented assessment of model performance than conventional benchmarks.
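The sketch below shows how a rubric-graded score of this kind can be computed: each criterion carries a point value (negative for harmful behaviors) and an axis label, and the example-level score is points earned over maximum possible points, clipped to [0, 1], following the aggregation described for HealthBench. The criterion texts, point values, and grading flags here are invented for illustration; in the real benchmark a model-based grader decides whether each criterion is met.

```python
# Minimal sketch of rubric-based scoring in the style of HealthBench.
# The example rubric is hypothetical; the clipped-ratio aggregation follows
# the scheme described in the HealthBench paper.

from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int   # positive for desirable behavior, negative for harmful behavior
    axis: str     # accuracy, completeness, context awareness, communication quality, instruction following
    met: bool     # in practice, judged by a model-based grader against the response

def score(rubric: list[Criterion]) -> float:
    """Example-level score: points earned over maximum possible points, clipped to [0, 1]."""
    max_points = sum(c.points for c in rubric if c.points > 0)
    earned = sum(c.points for c in rubric if c.met)
    return max(0.0, min(1.0, earned / max_points)) if max_points else 0.0

# Hypothetical rubric for an emergency-referral conversation.
rubric = [
    Criterion("Advises immediate emergency care for red-flag symptoms", 10, "accuracy", True),
    Criterion("Asks about symptom onset and duration", 5, "context awareness", False),
    Criterion("Uses clear, layperson-appropriate language", 3, "communication quality", True),
    Criterion("Recommends an unsafe home remedy", -8, "accuracy", False),
]

print(f"HealthBench-style score: {score(rubric):.2f}")  # 0.72 for this toy rubric
```

Because every criterion also carries an axis label, the same per-example grades can be re-aggregated by axis, which is how axis-level comparisons such as context awareness or completeness are reported.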
DR.INFO’s Performance on HealthBench
The evaluation focused on HealthBench’s ‘Hard’ subset of 1,000 challenging examples, designed to stress-test current frontier models. DR.INFO achieved a HealthBench score of 0.51 on this subset, significantly outperforming leading frontier LLMs: OpenAI’s o3 model scored 0.32, and even the recently reported GPT-5 reached 0.46 in its “thinking” mode. DR.INFO scored approximately 3.5 times higher than these baselines in context awareness and exceeded the best baseline in completeness by over 15%, indicating a stronger ability to generate thorough and well-reasoned responses. It also consistently outperformed the other systems in communication, instruction following, and accuracy.
To further benchmark its performance, DR.INFO was compared against other agentic RAG-based clinical assistants, OpenEvidence and Pathway.md, on a representative subset of 100 samples. DR.INFO maintained its lead with an average HealthBench score of 0.54, compared to 0.49 for OpenEvidence and 0.48 for Pathway.md. It showed a clear advantage in instruction following and a modest edge in context awareness and completeness.
While the score differences were statistically significant at the 90% confidence level, a larger evaluation would be needed to establish significance at the 95% level, the more common standard in research. The research paper, titled “OpenAI’s HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries”, provides a detailed look at these evaluations. You can read the full paper here.
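The paper’s exact statistical procedure is not detailed here, but a common way to compare two systems graded on the same examples is a paired bootstrap over per-example scores. The sketch below uses synthetic scores purely to illustrate why a difference can clear a 90% interval yet need a larger sample at 95%.

```python
# Illustrative paired-bootstrap comparison of two assistants' per-example scores.
# Scores are synthetic; this is not the paper's actual analysis.

import random

random.seed(0)
n = 100  # size of the shared evaluation subset
scores_a = [min(1.0, max(0.0, random.gauss(0.54, 0.20))) for _ in range(n)]
scores_b = [min(1.0, max(0.0, random.gauss(0.49, 0.20))) for _ in range(n)]

def bootstrap_ci(a, b, iters=10_000, alpha=0.10):
    """Confidence interval for the mean paired difference a - b (90% CI by default)."""
    diffs = [x - y for x, y in zip(a, b)]
    means = []
    for _ in range(iters):
        sample = [random.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

lo, hi = bootstrap_ci(scores_a, scores_b)
print(f"90% CI for mean score difference: [{lo:.3f}, {hi:.3f}]")
# If the interval excludes 0, the difference is significant at that level; the wider
# 95% interval on the same sample may still include 0, motivating a larger evaluation.
```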
The Path Forward
The findings underscore the utility of behavior-level, rubric-based evaluations like HealthBench for building reliable and trustworthy AI-enabled clinical support assistants. While DR.INFO’s score of 0.51 on the hard dataset indicates substantial room for improvement, it currently represents state-of-the-art performance on the HealthBench Hard subset, outpacing both frontier LLMs and similarly designed agentic systems. This highlights the ongoing challenge of clinical deployment and the importance of continuous benchmarking and iteration in the development of medical AI.
The research also acknowledges HealthBench’s limitations, such as the potential for subjective interpretations in rubric-based assessments and its restriction to text-based interactions, not covering other critical modalities like medical imaging or genomics. Future work will involve expanding evaluations to larger datasets and diverse modalities to achieve a more comprehensive assessment of clinical readiness.


