TLDR: A new study explores the use of Large Language Model (LLM)-based multi-agent systems (MAS) for safer therapy recommendations in patients with multiple chronic conditions (multimorbidity). The research designed single-agent and multi-agent frameworks that simulate multidisciplinary team (MDT) decision-making to resolve medical conflicts. While a single LLM agent with intermediate reasoning performed comparably to the MAS in some cases, the multi-agent system showed promise in reducing conflicts and aligning with expert judgment. The study also developed new evaluation metrics focusing on clinical goals and medication burden, and provided key lessons for future AI development in healthcare, emphasizing the need for structured outputs and adaptive model selection.
Navigating the complexities of therapy recommendations for patients with multiple chronic conditions, known as multimorbidity, is a significant challenge in healthcare. Clinicians often face the risk of prescribing medications that interact negatively with each other or with existing conditions, leading to potential harm. Traditional decision support systems struggle with scalability, while newer deep learning approaches often lack transparency.
Inspired by how general practitioners (GPs) and multidisciplinary teams (MDTs) collaborate on complex cases, recent research has explored using Large Language Models (LLMs) to assist in this area. A new study delves into the feasibility and value of an LLM-based multi-agent system (MAS) designed to provide safer therapy recommendations.
Simulating Clinical Collaboration with AI
The researchers developed two main frameworks: a single-agent system, where a GP LLM makes recommendations, and a multi-agent system that simulates an MDT. In the multi-agent setup, a GP LLM first reviews a patient’s condition, identifies clinical goals, and detects potential treatment conflicts such as drug-drug interactions or contraindications. If conflicts arise, the GP convenes a virtual MDT, bringing in specialized LLM agents relevant to the specific conflicts. These specialists then engage in a collaborative discussion, proposing arguments and evaluating each other’s perspectives until a consensus is reached. If consensus isn’t reached, a mediator agent steps in. Once conflicts are resolved, the GP synthesizes the recommendations into a final treatment plan.
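To make the workflow concrete, here is a minimal sketch of how such a GP-plus-MDT loop might be orchestrated. The function names, prompts, and the `call_llm` helper are illustrative assumptions, not the paper’s actual implementation.

```python
# Minimal sketch of the MDT-style workflow described above.
# All prompts and the call_llm() helper are illustrative assumptions.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a call to an LLM acting in the given role."""
    raise NotImplementedError

def recommend_therapy(patient_record: str, max_rounds: int = 3) -> str:
    # 1. GP agent reviews the patient, listing clinical goals and conflicts.
    review = call_llm("GP", "List clinical goals and treatment conflicts "
                            f"(DDIs, contraindications):\n{patient_record}")
    if "NO_CONFLICTS" in review:
        return call_llm("GP", f"Draft a treatment plan:\n{review}")

    # 2. Convene only the specialists relevant to the detected conflicts.
    specialties = call_llm("GP", f"Which specialties should discuss these conflicts?\n{review}").split(",")

    # 3. Specialists debate until consensus or the round limit is hit.
    discussion = review
    for _ in range(max_rounds):
        for spec in specialties:
            discussion += "\n" + call_llm(spec.strip(), f"Comment on and refine:\n{discussion}")
        verdict = call_llm("GP", f"Has consensus been reached? Answer YES or NO.\n{discussion}")
        if verdict.strip().startswith("YES"):
            break
    else:
        # 4. A mediator agent resolves any remaining disagreement.
        discussion += "\n" + call_llm("Mediator", f"Resolve the open disagreements:\n{discussion}")

    # 5. GP synthesizes the agreed resolutions into a final plan.
    return call_llm("GP", f"Synthesize a final treatment plan:\n{discussion}")
```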
This multi-step workflow closely mirrors real-world clinical practice, allowing the LLMs to detect and resolve medication conflicts dynamically. The experiments used four different LLMs (GPT-4o, DeepSeek-V3, Qwen2.5-72B-Instruct, and Mistral-Small-24B-Instruct), evaluated against benchmark cases of multimorbidity patients.
Key Findings and Insights
The evaluation focused on several metrics beyond just technical accuracy, including correctness, completeness, and ratios for changes in drug-drug interactions, contraindications, met clinical goals, and medication burden. Clinical experts also assessed the system’s explainability, the reasonableness of conflict allocation, and the efficiency of the MDT process.
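As a rough illustration of what such change-based metrics look like, the sketch below compares a baseline medication list with a recommended plan. The exact formulas and counts in the paper may differ; this only shows the general idea.

```python
# Illustrative change-ratio metrics: conflict reduction, goal attainment,
# and medication burden. The numbers below are invented for demonstration.

def change_ratio(before: int, after: int) -> float:
    """Relative reduction from the baseline to the recommendation."""
    if before == 0:
        return 0.0
    return (before - after) / before

baseline = {"ddis": 3, "contraindications": 1, "goals_met": 2, "medications": 9}
recommended = {"ddis": 1, "contraindications": 0, "goals_met": 4, "medications": 8}

metrics = {
    "ddi_reduction": change_ratio(baseline["ddis"], recommended["ddis"]),
    "contraindication_reduction": change_ratio(baseline["contraindications"],
                                               recommended["contraindications"]),
    "goal_gain": recommended["goals_met"] - baseline["goals_met"],
    "medication_burden_change": recommended["medications"] - baseline["medications"],
}
print(metrics)  # e.g. {'ddi_reduction': 0.67, 'contraindication_reduction': 1.0, ...}
```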
Interestingly, the study found that with current LLMs, a single GP agent performing intermediate reasoning steps could achieve results comparable to the multi-agent MDT system. This suggests that while the MDT approach is promising, the immediate need for complex multi-agent collaboration might not be as pronounced with today’s LLM capabilities, especially when a single agent is guided through a structured reasoning process.
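The contrast between a “pure” single-shot prompt and a single agent walked through intermediate steps can be sketched as follows. The prompts and the `call_llm` stub are assumptions for illustration only, not the study’s prompts.

```python
# Sketch: single-shot prompting vs. a single GP agent with intermediate steps.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM call, as in the workflow sketch above."""
    raise NotImplementedError

def pure_recommendation(patient_record: str) -> str:
    # One prompt, no intermediate reasoning.
    return call_llm("GP", f"Recommend therapy for this patient:\n{patient_record}")

def structured_recommendation(patient_record: str) -> str:
    # Goals -> conflicts -> resolutions -> plan, all with the same single agent.
    goals = call_llm("GP", f"List the clinical goals for this patient:\n{patient_record}")
    conflicts = call_llm("GP", "List drug-drug interactions and contraindications:\n"
                               f"Goals:\n{goals}\nPatient:\n{patient_record}")
    resolutions = call_llm("GP", f"Propose how to resolve each conflict:\n{conflicts}")
    return call_llm("GP", f"Combine into a final plan:\n{goals}\n{conflicts}\n{resolutions}")
```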
However, the MAS did show promise in certain scenarios, particularly in reducing conflicts and maintaining an appropriate medication count, aligning more closely with expert judgment for multimorbidity patients. For instance, in a case involving a patient on warfarin with a urinary tract infection, both single-agent and multi-agent systems successfully proposed clinically appropriate alternatives that resolved a critical drug interaction, outperforming a ‘pure’ LLM that lacked intermediate reasoning.
The study also highlighted that no single LLM consistently outperformed others across all cases and evaluation dimensions. Performance varied depending on the clinical scenario and task type, suggesting that future systems might benefit from dynamically selecting or prioritizing models based on their strengths for specific clinical tasks.
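One way to act on this finding would be a simple routing layer that maps task types to models. The mapping below is invented purely for illustration and does not reflect the paper’s per-task results.

```python
# Hypothetical adaptive model selection: route each clinical task type to a
# preferred model, with a fallback default. The assignments are assumptions.

TASK_MODEL_ROUTING = {
    "conflict_detection": "gpt-4o",
    "specialist_discussion": "deepseek-v3",
    "plan_synthesis": "qwen2.5-72b-instruct",
}
DEFAULT_MODEL = "mistral-small-24b-instruct"

def select_model(task_type: str) -> str:
    """Pick a model for the given task, falling back to a default."""
    return TASK_MODEL_ROUTING.get(task_type, DEFAULT_MODEL)

print(select_model("conflict_detection"))  # gpt-4o
print(select_model("triage"))              # mistral-small-24b-instruct
```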
Lessons for the Future of AI in Healthcare
The research identified several crucial lessons for developing future LLM-based clinical decision support systems:
- Small, conflict-targeted MDTs are more effective and realistic than large, general MDTs, preventing unnecessary medication burden.
- While MAS can capture a broader scope of clinical information through agent-to-agent dialogue, this can sometimes introduce inaccurate details.
- Structured output formats, detailing clinical goals, actions, and rationales, are vital for interpretable evaluations.
- LLM performance is context-dependent, implying that adaptive model selection could improve reliability.
A significant limitation noted was that LLM-generated recommendations often didn’t include all valid therapeutic options, focusing instead on a single preferred solution. Future systems should aim to present a full range of appropriate options, distinguishing between preferred and less-preferred choices with rationales, to better support clinicians in personalized patient care.
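A structured output of the kind the study calls for, extended to carry preferred and less-preferred options with rationales, might look like the schema below. The schema and the placeholder drug names are assumptions for illustration, not the paper’s format.

```python
# Sketch of a structured recommendation: clinical goal, action, and a ranked
# set of therapeutic options with rationales. Field names are assumptions.

from dataclasses import dataclass, field

@dataclass
class TherapyOption:
    medication: str
    preferred: bool
    rationale: str

@dataclass
class Recommendation:
    clinical_goal: str
    action: str
    options: list[TherapyOption] = field(default_factory=list)

rec = Recommendation(
    clinical_goal="Treat infection while keeping anticoagulation stable",
    action="Switch to an antibiotic without a warfarin interaction",
    options=[
        TherapyOption("alternative antibiotic A", True,  "No relevant interaction reported"),
        TherapyOption("alternative antibiotic B", False, "Usable, but needs closer monitoring"),
    ],
)
```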
This research provides valuable insights into the potential and current limitations of LLM-based multi-agent systems for therapy recommendation. While still evolving, these AI systems hold promise for enhancing patient safety and supporting clinicians in managing complex multimorbidity cases. For more details, you can read the full research paper available at arXiv.org.