EHR-MCP: Bridging Large Language Models with Electronic Health Records

TLDR: A study evaluated EHR-MCP, a framework integrating large language models (LLMs) with hospital electronic health records (EHR) via the Model Context Protocol (MCP). Using GPT-4.1, the system autonomously retrieved clinically relevant information for infection control tasks in a real hospital setting. Simple tasks achieved near-perfect accuracy, while complex tasks showed challenges related to argument specification and interpretation of lengthy tool outputs. The research demonstrates the potential of LLMs for secure clinical data access and lays groundwork for hospital AI agents, highlighting areas for future development in reasoning and generation.

Large language models (LLMs) are rapidly advancing, showing immense potential across various fields, including medicine. However, integrating these powerful AI systems into real-world hospital environments, especially with sensitive electronic health record (EHR) systems, presents significant challenges. A recent study introduces EHR-MCP, a novel framework designed to bridge this gap by enabling LLMs to autonomously retrieve clinically relevant information from hospital EHRs.

The core idea behind EHR-MCP is the Model Context Protocol (MCP), a standardized interface that allows LLMs to interact with external tools. This protocol reduces the complexity and cost associated with integrating LLMs with diverse hospital information systems. The research aimed to evaluate the accuracy and effectiveness of an LLM, specifically GPT-4.1, connected to an EHR database via EHR-MCP in a live hospital setting.

How EHR-MCP Works

The EHR-MCP framework synchronizes data from the hospital’s EHR system to an in-hospital data warehouse once a day. Custom MCP tools, implemented in Python, provide a secure way to query this data using SQL. The LLM client, GPT-4.1 in this study, interacts with these tools through a LangGraph ReAct agent. The agent lets the LLM dynamically select and execute appropriate tools based on a user’s query, interpret the results, and then generate a final answer. This iterative process mirrors how clinicians gather information, making the AI agent more compatible with human-AI collaboration.
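To make the tool layer more concrete, here is a minimal sketch of a Python function that wraps a parameterized SQL query, plus a tiny dispatcher that runs a tool by name. It is not the study's implementation: it uses neither an MCP SDK nor LangGraph, an in-memory SQLite database stands in for the data warehouse, and the table `lab_results`, its columns, and the names `build_demo_warehouse`, `get_lab_results`, and `call_tool` are all hypothetical.

```python
import sqlite3
from datetime import date


def build_demo_warehouse() -> sqlite3.Connection:
    """Create an in-memory stand-in for the data warehouse (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lab_results ("
        "patient_id TEXT, test_name TEXT, value REAL, unit TEXT, collected_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO lab_results VALUES (?, ?, ?, ?, ?)",
        [
            ("P001", "creatinine", 1.1, "mg/dL", "2025-01-03"),
            ("P001", "creatinine", 0.9, "mg/dL", "2025-01-10"),
            ("P002", "creatinine", 2.0, "mg/dL", "2025-01-09"),
        ],
    )
    return conn


def get_lab_results(conn, patient_id: str, test_name: str, start: date, end: date):
    """Hypothetical tool: lab rows for one patient and test within [start, end]."""
    # Placeholders keep model-supplied arguments out of the SQL text itself.
    rows = conn.execute(
        "SELECT collected_on, value, unit FROM lab_results "
        "WHERE patient_id = ? AND test_name = ? "
        "AND collected_on BETWEEN ? AND ? ORDER BY collected_on",
        (patient_id, test_name, start.isoformat(), end.isoformat()),
    ).fetchall()
    return [{"collected_on": d, "value": v, "unit": u} for d, v, u in rows]


# Registry of tools the agent may call (only one in this sketch).
TOOLS = {"get_lab_results": get_lab_results}


def call_tool(conn, name: str, arguments: dict):
    """Run the tool the model selected, with the arguments it supplied."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](conn, **arguments)
```

In an agent loop, the model would choose the tool name and arguments, and the returned rows would be passed back to it as the observation for the next step.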

Evaluating Performance in a Real Hospital

The study tested EHR-MCP with six tasks derived from real-world use cases of an infection control team (ICT) at Keio University Hospital. These tasks were categorized into two types: simple tasks, requiring a single tool call (e.g., retrieving body weight or lab data), and complex tasks, demanding multi-step tool use and reasoning (e.g., calculating creatinine clearance or counting antibiotic administration days after a negative blood culture). Eight patient cases, discussed at ICT conferences, were retrospectively analyzed, and the LLM’s outputs were compared against physician-generated gold standards.
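The two complex tasks named above come down to short calculations once the right data has been retrieved. The sketch below assumes the Cockcroft-Gault equation for creatinine clearance, which is a common choice but is not named in this article, and it assumes that "after a negative blood culture" means calendar days strictly later than the culture date. The function names are hypothetical.

```python
from datetime import date


def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, is_female: bool) -> float:
    """Estimated creatinine clearance in mL/min (Cockcroft-Gault, assumed here)."""
    if serum_creatinine_mg_dl <= 0:
        raise ValueError("serum creatinine must be positive")
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    # The equation applies a 0.85 factor for female patients.
    return crcl * 0.85 if is_female else crcl


def antibiotic_days_after(admin_dates: list[date], negative_culture_date: date) -> int:
    """Count distinct administration days strictly after the culture date (assumption)."""
    return len({d for d in admin_dates if d > negative_culture_date})
```

Both inputs depend on earlier retrieval steps (body weight, the latest creatinine value, culture and administration dates), which is why these tasks need several tool calls before any arithmetic can happen.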

Key Findings

The results were promising. The LLM consistently demonstrated the ability to select and execute the correct MCP tools. For simple tasks, EHR-MCP achieved near-perfect accuracy. This indicates that LLMs can reliably retrieve straightforward clinical data when given the right tools.

However, performance was lower in complex tasks, particularly those requiring time-dependent calculations or multi-step interpretation. The study identified that most errors stemmed from two main areas: incorrect arguments passed to the tools (e.g., specifying an inappropriate data retrieval window) and misinterpretation of lengthy or complex tool results by the LLM. For instance, the model sometimes failed to restrict retrieval to the most recent results or included non-blood culture results when only blood cultures were requested.
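Both error types concern constraints that can be written down explicitly: which specimen type counts, which time window applies, and which result is the most recent. The sketch below shows what that filtering looks like in code. It illustrates the constraints only and is not something the study describes; the record fields `specimen` and `collected_on`, the seven-day default window, and the function name are assumptions.

```python
from datetime import date, timedelta


def latest_blood_culture(records: list[dict], as_of: date, lookback_days: int = 7):
    """Return the most recent blood culture within the window, or None."""
    window_start = as_of - timedelta(days=lookback_days)
    candidates = [
        r for r in records
        # Exclude urine, sputum, and other non-blood specimens.
        if r["specimen"] == "blood"
        # Keep only results inside the requested retrieval window.
        and window_start <= r["collected_on"] <= as_of
    ]
    # Pick the latest remaining result; None when nothing qualifies.
    return max(candidates, key=lambda r: r["collected_on"], default=None)
```

Returning `None` when nothing matches gives the caller an explicit "no result" signal instead of an empty or ambiguous payload.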

Despite these challenges, the responses from EHR-MCP were generally reliable. The researchers also noted that lengthy and repetitive data in tool responses sometimes risked exceeding the LLM’s context window, leading to potential degradation in response quality or increased API costs. Hallucinations were also observed when required information was unavailable, though the LLM sometimes recognized these failures.
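One generic way to limit lengthy and repetitive tool output is to drop duplicate rows and cap the size of the text handed back to the model. The sketch below is an illustration of that idea, not a mitigation reported by the researchers; `compact_tool_output` and its character budget are hypothetical, and a real system would more likely budget in tokens than in characters.

```python
def compact_tool_output(rows: list[dict], max_chars: int = 2000) -> str:
    """Serialize rows for the model, dropping duplicates and capping length."""
    seen = set()
    lines = []
    for row in rows:
        # Sort keys so identical rows always serialize to the same line.
        line = ", ".join(f"{key}={row[key]}" for key in sorted(row))
        if line not in seen:
            seen.add(line)
            lines.append(line)

    kept = []
    used = 0
    for index, line in enumerate(lines):
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            # Say how much was cut so the model knows the view is partial.
            # This marker line is not itself counted against max_chars.
            kept.append(f"[{len(lines) - index} more rows omitted]")
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)
```

Marking the omission explicitly matters here: silently truncated output could invite the kind of hallucination described above, because the model cannot tell that data is missing.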

Implications and Future Directions

This research demonstrates that LLMs, when integrated with EHRs via MCP tools, can autonomously retrieve clinically relevant information in a real hospital setting. This capability is foundational for developing advanced clinical AI agents. EHR-MCP provides a secure and consistent infrastructure for data access, which can accelerate the deployment of generative AI projects across hospital departments.

While the study focused on tool-use capability, future work will evaluate the LLM’s reasoning and generation abilities, as well as its clinical impact on patient outcomes and workflow efficiency. The goal is to move beyond simple retrieval toward more comprehensive AI agents that can support complex decision-making in specialties such as infectious disease management. The full research paper is titled “EHR-MCP: Real-world Evaluation of Clinical Information Retrieval by Large Language Models via Model Context Protocol.”
