A New Benchmark for Evaluating AI in Electronic Health Records: Introducing EHRStruct

TLDR: EHRStruct is a new benchmark framework designed to evaluate how well large language models (LLMs) perform on structured electronic health record (EHR) tasks. It includes 11 diverse clinical tasks, 2,200 evaluation samples from synthetic and real-world data, and assesses 20 LLMs and 11 enhancement methods. Findings show general LLMs often outperform medical ones, and input format and few-shot learning impact results. The paper also introduces EHRMaster, a code-augmented method that achieves state-of-the-art performance on the benchmark.

The field of artificial intelligence is making significant strides in healthcare, particularly with the advent of large language models (LLMs). These powerful AI systems are increasingly being explored for processing complex patient information stored in structured electronic health records (EHRs). Structured EHR data, which includes details like diagnoses, medications, and lab results in relational tables, is crucial for clinical decision-making. However, a major challenge has been the lack of a standardized way to evaluate how well LLMs perform on these specific healthcare tasks.

To address this critical gap, researchers Xiao Yang, Xuejiao Zhao, and Zhiqi Shen have introduced EHRStruct, a new benchmark framework designed to systematically assess and compare the performance of LLMs on structured EHR data. The benchmark defines 11 distinct tasks that cover a wide range of clinical needs. These tasks fall into six functional types: information retrieval, data aggregation, arithmetic computation, clinical identification, diagnostic assessment, and treatment planning. This grouping is meant to ensure that the evaluation reflects both the operational diversity and clinical relevance of real-world applications.

EHRStruct includes 2,200 task-specific evaluation samples. These samples are derived from two widely used EHR datasets: Synthea, a synthetic dataset that simulates realistic patient records without privacy concerns, and the eICU Collaborative Research Database, a real-world ICU dataset. This dual-source approach supports evaluation across both simulated and authentic clinical scenarios. Question-answer pairs were generated with GPT-4o and then validated by both medical and technical experts to ensure correctness and faithfulness to task objectives.

The framework was used to evaluate 20 advanced and representative LLMs, including both general-purpose and medical-specific models. The evaluation explored several factors that influence performance: input format (plain text, special character separation, graph-structured representation, and natural language description), few-shot generalization, and fine-tuning strategy. Additionally, EHRStruct compared these LLMs against 11 state-of-the-art LLM-based enhancement methods designed for structured data reasoning.
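To make the four input formats concrete, here is a minimal sketch of how the same small table might be serialized in each style. The rows, column names, and serializer functions are all hypothetical illustrations; the article does not specify the exact templates the benchmark uses.

```python
# Illustrative sketch: the rows and all four serializers are hypothetical,
# not the benchmark's actual prompt-construction code.
ROWS = [
    {"patient_id": "p1", "lab": "glucose", "value": 5.4, "unit": "mmol/L"},
    {"patient_id": "p1", "lab": "sodium", "value": 140, "unit": "mmol/L"},
]


def to_plain_text(rows):
    """Plain text: header and cells separated by single spaces."""
    cols = list(rows[0])
    lines = [" ".join(cols)]
    lines += [" ".join(str(r[c]) for c in cols) for r in rows]
    return "\n".join(lines)


def to_separated(rows, sep="|"):
    """Special character separation: cells delimited by `sep`."""
    cols = list(rows[0])
    lines = [sep.join(cols)]
    lines += [sep.join(str(r[c]) for c in cols) for r in rows]
    return "\n".join(lines)


def to_graph_triples(rows):
    """Graph-style view: one (row node, column, value) triple per cell."""
    return [
        (f"row{i}", col, str(val))
        for i, r in enumerate(rows)
        for col, val in r.items()
    ]


def to_natural_language(rows):
    """Natural language description: one templated sentence per row."""
    return " ".join(
        f"Patient {r['patient_id']} had a {r['lab']} result of "
        f"{r['value']} {r['unit']}."
        for r in rows
    )


if __name__ == "__main__":
    print(to_plain_text(ROWS))
    print(to_separated(ROWS))
    print(to_graph_triples(ROWS))
    print(to_natural_language(ROWS))
```

All four outputs carry the same information; only the surface form presented to the model changes, which is the variable the evaluation isolates.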

Key findings from the EHRStruct evaluation revealed several important insights. General LLMs, particularly closed-source commercial models like the Gemini series, consistently outperformed medical-specific models on structured EHR tasks. LLMs generally performed better on “Data-Driven” tasks (those solvable with EHR data alone) compared to “Knowledge-Driven” tasks (those requiring external medical knowledge). The input format also played a role, with natural language inputs benefiting Data-Driven reasoning tasks and graph-structured prompts helping Data-Driven understanding tasks. Few-shot prompting, especially 1-shot and 3-shot settings, improved performance, and multi-task fine-tuning proved more effective than single-task fine-tuning.
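The few-shot settings mentioned above can be pictured as prepending k solved examples to the target question. The sketch below assumes a simple question/answer template; the function name `build_prompt`, the template wording, and the example questions are invented for illustration and are not taken from the paper.

```python
# Hypothetical k-shot prompt builder; template and examples are assumptions.
def build_prompt(record_text, question, examples, k):
    """Prepend the first k solved (question, answer) pairs to the query."""
    if k < 0:
        raise ValueError("k must be non-negative")
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    parts.append(f"Record:\n{record_text}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


EXAMPLES = [
    ("How many lab results are listed?", "2"),
    ("Which lab has the highest value?", "sodium"),
    ("What unit is glucose reported in?", "mmol/L"),
]

zero_shot = build_prompt("lab|value\nglucose|5.4", "What is the glucose value?", EXAMPLES, 0)
one_shot = build_prompt("lab|value\nglucose|5.4", "What is the glucose value?", EXAMPLES, 1)
three_shot = build_prompt("lab|value\nglucose|5.4", "What is the glucose value?", EXAMPLES, 3)
```

Setting k to 0, 1, or 3 yields the zero-shot, 1-shot, and 3-shot variants of the same query.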

In response to the identified challenges and limitations of existing methods, the researchers proposed EHRMaster. This code-augmented framework is tailored for structured EHR tasks and operates in three stages: solution planning, concept alignment, and adaptive execution. EHRMaster first generates a high-level solution plan, then aligns key concepts with relevant data fields, and finally decides whether to generate executable code or proceed with direct reasoning. With this structured approach, EHRMaster achieves state-of-the-art performance on the benchmark. It is strongest on Data-Driven tasks, where it reaches perfect scores in many cases, and it remains competitive on clinically complex Knowledge-Driven tasks.
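The three stages can be sketched as a small pipeline. This is not the authors' implementation: in EHRMaster each stage is driven by an LLM, whereas the stand-ins below use keyword matching so the example runs on its own. Every function name, the `AGGREGATES` table, and the sample rows are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical mapping from question keywords to executable operations.
AGGREGATES = {"average": mean, "maximum": max, "minimum": min, "total": sum}


def plan_solution(question):
    """Stage 1 stand-in: choose an operation if the question names one."""
    q = question.lower()
    for word in AGGREGATES:
        if word in q:
            return {"operation": word}
    return {"operation": None}


def align_concepts(question, columns):
    """Stage 2 stand-in: map phrases in the question to column names."""
    q = question.lower()
    return [c for c in columns if c.replace("_", " ") in q]


def execute(plan, fields, rows):
    """Stage 3 stand-in: run code when the plan is computable, else defer."""
    op = plan["operation"]
    if op is not None and fields:
        values = [row[fields[0]] for row in rows]
        return AGGREGATES[op](values)
    return None  # a real system would fall back to direct LLM reasoning


def answer(question, rows):
    plan = plan_solution(question)
    fields = align_concepts(question, list(rows[0]))
    return execute(plan, fields, rows)


ROWS = [
    {"patient_id": "p1", "heart_rate": 80},
    {"patient_id": "p1", "heart_rate": 90},
    {"patient_id": "p1", "heart_rate": 100},
]
```

Here `answer("What is the average heart rate?", ROWS)` takes the code path, while a question such as "Is this patient likely septic?" returns `None`, marking the point where direct reasoning would take over.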

EHRStruct provides a much-needed standardized framework for evaluating LLMs in structured EHR environments, offering clear task specifications, a novel dataset, and interpretable evaluation metrics. The insights gained from this benchmark, along with the development of EHRMaster, are expected to guide future research in applying large language models to critical healthcare applications. For more details, refer to the original research paper.
