TLDR: A new research paper details the evaluation of DR.INFO, an AI-powered clinical assistant, using OpenAI’s HealthBench benchmark. DR.INFO achieved a score of 0.51 on the challenging HealthBench Hard subset, surpassing leading frontier LLMs like GPT-5 and o3. The evaluation, which uses physician-authored rubrics across various clinical themes and behavioral axes, highlights DR.INFO’s strengths in communication, instruction following, and accuracy, while also identifying areas for improvement in context awareness and completeness. The study emphasizes the importance of behavior-level, clinically grounded benchmarks for developing reliable medical AI.
The field of artificial intelligence in healthcare is rapidly advancing, with large language models (LLMs) showing immense potential to transform clinical practice. However, accurately evaluating these sophisticated AI systems, especially in high-stakes medical scenarios, presents a unique challenge. Traditional benchmarks, often relying on multiple-choice questions, fall short in assessing crucial competencies like contextual reasoning, context-seeking, and handling of uncertainty that are vital in real-world clinical settings.
To address this gap, OpenAI introduced HealthBench, a comprehensive, rubric-driven benchmark designed to evaluate LLMs on realistic, open-ended health conversations. This innovative benchmark moves beyond simple factual recall to assess how AI models behave across complex clinical scenarios, providing a more nuanced understanding of their capabilities and limitations.
Introducing DR.INFO: A Clinical Support Assistant
At Synduct, researchers developed DR.INFO, an agentic retrieval-augmented generation (RAG)-based clinical support assistant. DR.INFO had already demonstrated strong foundational knowledge, achieving an impressive 95.4% accuracy on the USMLE benchmark. However, the team hypothesized that a standardized exam like the USMLE might not fully capture DR.INFO’s practical competencies or potential failure modes in intricate, real-world clinical scenarios. This led them to evaluate DR.INFO on the HealthBench dataset, with a particular focus on its challenging ‘Hard’ subset.
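To make the agentic RAG pattern concrete, here is a minimal sketch of such a loop: retrieve evidence, check whether it looks sufficient, and generate a grounded answer. This is purely illustrative; the corpus, retriever, and `generate()` stub are stand-ins, and nothing here reflects DR.INFO’s actual implementation.

```python
# Illustrative agentic RAG loop (hypothetical; not DR.INFO's actual code).
# The toy corpus, keyword retriever, and generate() stub stand in for a real
# clinical knowledge base, vector search, and LLM API.

from dataclasses import dataclass

@dataclass
class Document:
    source: str
    text: str

# Toy in-memory corpus standing in for a clinical knowledge base.
CORPUS = [
    Document("guideline-001", "Adults with suspected sepsis should receive antibiotics within one hour."),
    Document("guideline-002", "Chest pain with ST elevation warrants immediate emergency referral."),
]

def retrieve(query: str, k: int = 2) -> list[Document]:
    """Naive keyword-overlap retriever; a real system would use dense or hybrid search."""
    scored = sorted(CORPUS, key=lambda d: -sum(w in d.text.lower() for w in query.lower().split()))
    return scored[:k]

def generate(prompt: str) -> str:
    """Stub for an LLM call; replace with the model API of your choice."""
    return f"[model answer grounded in retrieved context]\n{prompt[:120]}..."

def answer(query: str, max_steps: int = 2) -> str:
    """Agentic loop: retrieve, assess, and re-query if the evidence looks thin."""
    context: list[Document] = []
    for _ in range(max_steps):
        context.extend(retrieve(query))
        if len(context) >= 2:  # crude stopping rule; real agents use model-driven critique
            break
    cited = "\n".join(f"[{d.source}] {d.text}" for d in context)
    return generate(f"Question: {query}\nEvidence:\n{cited}\nAnswer with citations:")

if __name__ == "__main__":
    print(answer("patient with crushing chest pain, what should I do?"))
```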
Understanding HealthBench: Themes and Behavioral Axes
HealthBench is built around 5,000 realistic clinical conversations, each evaluated using a physician-authored rubric. These rubrics employ fine-grained, clinically relevant criteria across multiple behavioral axes. The benchmark categorizes evaluations into seven high-level themes, representing common health-related tasks:
- Emergency referrals: Assessing the model’s ability to recognize urgent situations and advise appropriate care.
- Context-seeking: Evaluating if the model can identify missing clinical details and actively ask for them.
- Global health: Testing adaptability of advice to different healthcare settings worldwide.
- Health data tasks: Measuring skill in structured applications like summarizing notes or interpreting labs.
- Expertise-tailored communication: Checking if the model adjusts its language based on the user’s background (layperson or professional).
- Responding under uncertainty (Hedging): Evaluating appropriate expression of uncertainty in ambiguous situations.
- Response depth: Assessing the provision of an appropriate level of detail.
Each rubric criterion in HealthBench is also assigned to one of five behavioral axes:
- Accuracy: Factual correctness and consistency with clinical knowledge.
- Completeness: Inclusion of all essential and relevant information.
- Context awareness: Responsiveness based on contextual cues like user role or regional constraints.
- Communication quality: Clarity and effectiveness of information presentation.
- Instruction following: Compliance with specific user instructions without compromising medical safety.
This multi-dimensional framework allows for a much richer, safety-oriented assessment of model performance than conventional benchmarks.
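The sketch below shows how a rubric-graded score of this kind can be computed: each criterion carries a point value (negative for harmful behaviors) and an axis label, and the example-level score is points earned over maximum possible points, clipped to [0, 1], following the aggregation described for HealthBench. The criterion texts, point values, and grading flags here are invented for illustration; in the real benchmark a model-based grader decides whether each criterion is met.

```python
# Minimal sketch of rubric-based scoring in the style of HealthBench.
# The example rubric is hypothetical; the clipped-ratio aggregation follows
# the scheme described in the HealthBench paper.

from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int   # positive for desirable behavior, negative for harmful behavior
    axis: str     # accuracy, completeness, context awareness, communication quality, instruction following
    met: bool     # in practice, judged by a model-based grader against the response

def score(rubric: list[Criterion]) -> float:
    """Example-level score: points earned over maximum possible points, clipped to [0, 1]."""
    max_points = sum(c.points for c in rubric if c.points > 0)
    earned = sum(c.points for c in rubric if c.met)
    return max(0.0, min(1.0, earned / max_points)) if max_points else 0.0

# Hypothetical rubric for an emergency-referral conversation.
rubric = [
    Criterion("Advises immediate emergency care for red-flag symptoms", 10, "accuracy", True),
    Criterion("Asks about symptom onset and duration", 5, "context awareness", False),
    Criterion("Uses clear, layperson-appropriate language", 3, "communication quality", True),
    Criterion("Recommends an unsafe home remedy", -8, "accuracy", False),
]

print(f"HealthBench-style score: {score(rubric):.2f}")  # 0.72 for this toy rubric
```

Because every criterion also carries an axis label, the same per-example grades can be re-aggregated by axis, which is how axis-level comparisons such as context awareness or completeness are reported.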
DR.INFO’s Performance on HealthBench
The evaluation focused on HealthBench’s ‘Hard’ subset of 1,000 challenging examples, designed to stress-test current frontier models. DR.INFO achieved a HealthBench score of 0.51 on this subset, significantly outperforming leading frontier LLMs: OpenAI’s o3 model scored 0.32, and even the recently reported GPT-5 reached 0.46 in its “thinking” mode. DR.INFO scored approximately 3.5 times higher than these baselines in context awareness and exceeded the best baseline in completeness by over 15%, indicating a stronger ability to generate thorough and well-reasoned responses. It also consistently outperformed the other systems in communication, instruction following, and accuracy.
To further benchmark its performance, DR.INFO was compared against other agentic RAG-based clinical assistants, OpenEvidence and Pathway.md, on a representative subset of 100 samples. DR.INFO maintained its lead with an average HealthBench score of 0.54, compared to 0.49 for OpenEvidence and 0.48 for Pathway.md. It showed a clear advantage in instruction following and a modest edge in context awareness and completeness.
While the score differences were statistically significant at the 90% confidence level, a larger evaluation would be needed to establish significance at the 95% level, the more common standard in research. The research paper, titled “OpenAI’s HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries”, provides a detailed look at these evaluations. You can read the full paper here.
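The paper’s exact statistical procedure is not detailed here, but a common way to compare two systems graded on the same examples is a paired bootstrap over per-example scores. The sketch below uses synthetic scores purely to illustrate why a difference can clear a 90% interval yet need a larger sample at 95%.

```python
# Illustrative paired-bootstrap comparison of two assistants' per-example scores.
# Scores are synthetic; this is not the paper's actual analysis.

import random

random.seed(0)
n = 100  # size of the shared evaluation subset
scores_a = [min(1.0, max(0.0, random.gauss(0.54, 0.20))) for _ in range(n)]
scores_b = [min(1.0, max(0.0, random.gauss(0.49, 0.20))) for _ in range(n)]

def bootstrap_ci(a, b, iters=10_000, alpha=0.10):
    """Confidence interval for the mean paired difference a - b (90% CI by default)."""
    diffs = [x - y for x, y in zip(a, b)]
    means = []
    for _ in range(iters):
        sample = [random.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

lo, hi = bootstrap_ci(scores_a, scores_b)
print(f"90% CI for mean score difference: [{lo:.3f}, {hi:.3f}]")
# If the interval excludes 0, the difference is significant at that level; the wider
# 95% interval on the same sample may still include 0, motivating a larger evaluation.
```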
The Path Forward
The findings underscore the utility of behavior-level, rubric-based evaluations like HealthBench for building reliable and trustworthy AI-enabled clinical support assistants. While DR.INFO’s score of 0.51 on the hard dataset indicates substantial room for improvement, it currently represents state-of-the-art performance on the HealthBench Hard subset, outpacing both frontier LLMs and similarly designed agentic systems. This highlights the ongoing challenge of clinical deployment and the importance of continuous benchmarking and iteration in the development of medical AI.
The research also acknowledges HealthBench’s limitations, such as the potential for subjective interpretations in rubric-based assessments and its restriction to text-based interactions, not covering other critical modalities like medical imaging or genomics. Future work will involve expanding evaluations to larger datasets and diverse modalities to achieve a more comprehensive assessment of clinical readiness.


