TLDR: A new study explores the use of Large Language Model (LLM)-based multi-agent systems (MAS) for safer therapy recommendations in patients with multiple chronic conditions (multimorbidity). The research designed single-agent and multi-agent frameworks that simulate multidisciplinary team (MDT) decision-making to resolve medical conflicts. While a single LLM agent with intermediate reasoning performed comparably to the MAS in some cases, the multi-agent system showed promise in reducing conflicts and aligning with expert judgment. The study also developed new evaluation metrics focusing on clinical goals and medication burden, and provided key lessons for future AI development in healthcare, emphasizing the need for structured outputs and adaptive model selection.
Navigating the complexities of therapy recommendations for patients with multiple chronic conditions, known as multimorbidity, is a significant challenge in healthcare. Clinicians often face the risk of prescribing medications that interact negatively with each other or with existing conditions, leading to potential harm. Traditional decision support systems struggle with scalability, while newer deep learning approaches often lack transparency.
Inspired by how general practitioners (GPs) and multidisciplinary teams (MDTs) collaborate on complex cases, recent research has explored using Large Language Models (LLMs) to assist in this area. A new study delves into the feasibility and value of an LLM-based multi-agent system (MAS) designed to provide safer therapy recommendations.
Simulating Clinical Collaboration with AI
The researchers developed two main frameworks: a single-agent system, where a GP LLM makes recommendations, and a multi-agent system that simulates an MDT. In the multi-agent setup, a GP LLM first reviews a patient’s condition, identifies clinical goals, and detects potential treatment conflicts such as drug-drug interactions or contraindications. If conflicts arise, the GP convenes a virtual MDT, bringing in specialized LLM agents relevant to the specific conflicts. These specialists then engage in a collaborative discussion, proposing arguments and evaluating each other’s perspectives until a consensus is reached. If consensus isn’t reached, a mediator agent steps in. Once conflicts are resolved, the GP synthesizes the recommendations into a final treatment plan.
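To make the workflow concrete, here is a minimal sketch of how such a GP-plus-MDT loop might be orchestrated. The function names, prompts, and the `call_llm` helper are illustrative assumptions, not the paper’s actual implementation.

```python
# Minimal sketch of the MDT-style workflow described above.
# All prompts and the call_llm() helper are illustrative assumptions.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a call to an LLM acting in the given role."""
    raise NotImplementedError

def recommend_therapy(patient_record: str, max_rounds: int = 3) -> str:
    # 1. GP agent reviews the patient, listing clinical goals and conflicts.
    review = call_llm("GP", "List clinical goals and treatment conflicts "
                            f"(DDIs, contraindications):\n{patient_record}")
    if "NO_CONFLICTS" in review:
        return call_llm("GP", f"Draft a treatment plan:\n{review}")

    # 2. Convene only the specialists relevant to the detected conflicts.
    specialties = call_llm("GP", f"Which specialties should discuss these conflicts?\n{review}").split(",")

    # 3. Specialists debate until consensus or the round limit is hit.
    discussion = review
    for _ in range(max_rounds):
        for spec in specialties:
            discussion += "\n" + call_llm(spec.strip(), f"Comment on and refine:\n{discussion}")
        verdict = call_llm("GP", f"Has consensus been reached? Answer YES or NO.\n{discussion}")
        if verdict.strip().startswith("YES"):
            break
    else:
        # 4. A mediator agent resolves any remaining disagreement.
        discussion += "\n" + call_llm("Mediator", f"Resolve the open disagreements:\n{discussion}")

    # 5. GP synthesizes the agreed resolutions into a final plan.
    return call_llm("GP", f"Synthesize a final treatment plan:\n{discussion}")
```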
This multi-step workflow closely mirrors real-world clinical practice, allowing the LLMs to detect and resolve medication conflicts dynamically. The experiments used four different LLMs (GPT-4o, DeepSeek-V3, Qwen2.5-72B-Instruct, and Mistral-Small-24B-Instruct), evaluated against benchmark cases of multimorbidity patients.
Key Findings and Insights
The evaluation focused on several metrics beyond just technical accuracy, including correctness, completeness, and ratios for changes in drug-drug interactions, contraindications, met clinical goals, and medication burden. Clinical experts also assessed the system’s explainability, the reasonableness of conflict allocation, and the efficiency of the MDT process.
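As a rough illustration of what such change-based metrics look like, the sketch below compares a baseline medication list with a recommended plan. The exact formulas and counts in the paper may differ; this only shows the general idea.

```python
# Illustrative change-ratio metrics: conflict reduction, goal attainment,
# and medication burden. The numbers below are invented for demonstration.

def change_ratio(before: int, after: int) -> float:
    """Relative reduction from the baseline to the recommendation."""
    if before == 0:
        return 0.0
    return (before - after) / before

baseline = {"ddis": 3, "contraindications": 1, "goals_met": 2, "medications": 9}
recommended = {"ddis": 1, "contraindications": 0, "goals_met": 4, "medications": 8}

metrics = {
    "ddi_reduction": change_ratio(baseline["ddis"], recommended["ddis"]),
    "contraindication_reduction": change_ratio(baseline["contraindications"],
                                               recommended["contraindications"]),
    "goal_gain": recommended["goals_met"] - baseline["goals_met"],
    "medication_burden_change": recommended["medications"] - baseline["medications"],
}
print(metrics)  # e.g. {'ddi_reduction': 0.67, 'contraindication_reduction': 1.0, ...}
```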
Interestingly, the study found that with current LLMs, a single GP agent performing intermediate reasoning steps could achieve results comparable to the multi-agent MDT system. This suggests that while the MDT approach is promising, the immediate need for complex multi-agent collaboration might not be as pronounced with today’s LLM capabilities, especially when a single agent is guided through a structured reasoning process.
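The contrast between a “pure” single-shot prompt and a single agent walked through intermediate steps can be sketched as follows. The prompts and the `call_llm` stub are assumptions for illustration only, not the study’s prompts.

```python
# Sketch: single-shot prompting vs. a single GP agent with intermediate steps.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM call, as in the workflow sketch above."""
    raise NotImplementedError

def pure_recommendation(patient_record: str) -> str:
    # One prompt, no intermediate reasoning.
    return call_llm("GP", f"Recommend therapy for this patient:\n{patient_record}")

def structured_recommendation(patient_record: str) -> str:
    # Goals -> conflicts -> resolutions -> plan, all with the same single agent.
    goals = call_llm("GP", f"List the clinical goals for this patient:\n{patient_record}")
    conflicts = call_llm("GP", "List drug-drug interactions and contraindications:\n"
                               f"Goals:\n{goals}\nPatient:\n{patient_record}")
    resolutions = call_llm("GP", f"Propose how to resolve each conflict:\n{conflicts}")
    return call_llm("GP", f"Combine into a final plan:\n{goals}\n{conflicts}\n{resolutions}")
```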
However, the MAS did show promise in certain scenarios, particularly in reducing conflicts and maintaining an appropriate medication count, aligning more closely with expert judgment for multimorbidity patients. For instance, in a case involving a patient on warfarin with a urinary tract infection, both single-agent and multi-agent systems successfully proposed clinically appropriate alternatives that resolved a critical drug interaction, outperforming a ‘pure’ LLM that lacked intermediate reasoning.
The study also highlighted that no single LLM consistently outperformed others across all cases and evaluation dimensions. Performance varied depending on the clinical scenario and task type, suggesting that future systems might benefit from dynamically selecting or prioritizing models based on their strengths for specific clinical tasks.
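One way to act on this finding would be a simple routing layer that maps task types to models. The mapping below is invented purely for illustration and does not reflect the paper’s per-task results.

```python
# Hypothetical adaptive model selection: route each clinical task type to a
# preferred model, with a fallback default. The assignments are assumptions.

TASK_MODEL_ROUTING = {
    "conflict_detection": "gpt-4o",
    "specialist_discussion": "deepseek-v3",
    "plan_synthesis": "qwen2.5-72b-instruct",
}
DEFAULT_MODEL = "mistral-small-24b-instruct"

def select_model(task_type: str) -> str:
    """Pick a model for the given task, falling back to a default."""
    return TASK_MODEL_ROUTING.get(task_type, DEFAULT_MODEL)

print(select_model("conflict_detection"))  # gpt-4o
print(select_model("triage"))              # mistral-small-24b-instruct
```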
Lessons for the Future of AI in Healthcare
The research identified several crucial lessons for developing future LLM-based clinical decision support systems:
- Small, conflict-targeted MDTs are more effective and realistic than large, general MDTs, preventing unnecessary medication burden.
- While MAS can capture a broader scope of clinical information through agent-to-agent dialogue, this can sometimes introduce inaccurate details.
- Structured output formats, detailing clinical goals, actions, and rationales, are vital for interpretable evaluations.
- LLM performance is context-dependent, implying that adaptive model selection could improve reliability.
A significant limitation noted was that LLM-generated recommendations often didn’t include all valid therapeutic options, focusing instead on a single preferred solution. Future systems should aim to present a full range of appropriate options, distinguishing between preferred and less-preferred choices with rationales, to better support clinicians in personalized patient care.
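A structured output of the kind the study calls for, extended to carry preferred and less-preferred options with rationales, might look like the schema below. The schema and the placeholder drug names are assumptions for illustration, not the paper’s format.

```python
# Sketch of a structured recommendation: clinical goal, action, and a ranked
# set of therapeutic options with rationales. Field names are assumptions.

from dataclasses import dataclass, field

@dataclass
class TherapyOption:
    medication: str
    preferred: bool
    rationale: str

@dataclass
class Recommendation:
    clinical_goal: str
    action: str
    options: list[TherapyOption] = field(default_factory=list)

rec = Recommendation(
    clinical_goal="Treat infection while keeping anticoagulation stable",
    action="Switch to an antibiotic without a warfarin interaction",
    options=[
        TherapyOption("alternative antibiotic A", True,  "No relevant interaction reported"),
        TherapyOption("alternative antibiotic B", False, "Usable, but needs closer monitoring"),
    ],
)
```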
This research provides valuable insights into the potential and current limitations of LLM-based multi-agent systems for therapy recommendation. While still evolving, these AI systems hold promise for enhancing patient safety and supporting clinicians in managing complex multimorbidity cases. For more details, you can read the full research paper available at arXiv.org.