TLDR: A new method called Adaptive Cluster Collaborativeness (ACC) significantly enhances Large Language Models’ (LLMs) ability to assist in medical decisions. It works by intelligently selecting LLMs with diverse outputs (Self-Diversity) and then iteratively removing those that produce inconsistent results (Cross-Consistency). This approach has been shown to outperform existing LLMs, including advanced models like GPT-4, on medical datasets, reaching physician-level passing scores while being more computationally efficient.
Large Language Models (LLMs) have shown immense potential in various fields, and healthcare is no exception. These powerful AI systems are increasingly being explored for medical decision support, aiming to assist healthcare professionals with complex tasks. A key aspect of this advancement is the ‘collaborativeness’ of LLMs, where multiple models work together to produce higher-quality outputs.
However, current collaborative LLM systems face significant challenges in medical applications. One major issue is the lack of clear rules for selecting which LLMs should be part of a collaborative group. This often requires human intervention or specific clinical validation. Furthermore, existing setups frequently rely on a fixed group of LLMs, some of which might not perform well in medical scenarios, potentially introducing errors or ‘medical misinformation’ into the collaborative process. This can lead to unreliable results and even cause the collaborative effort to perform worse than a single LLM.
To address these limitations, researchers have proposed an innovative approach called Adaptive Cluster Collaborativeness. This new methodology aims to significantly boost the medical decision support capacity of LLMs by focusing on two core mechanisms: Self-Diversity (SD) maximization and Cross-Consistency (CC) maximization.
Self-Diversity: Building a Strong Foundation
The first mechanism, Self-Diversity (SD), is about intelligently selecting the best LLMs to form the collaborative cluster. The researchers observed that LLMs that generate a wider variety of outputs for the same question tend to perform better. Think of it like a diverse team of experts – each member might approach a problem from a slightly different angle, leading to a more robust overall solution. The SD mechanism calculates a ‘fuzzy matching value’ between different outputs from the same LLM. A higher SD value indicates greater output diversity. By prioritizing LLMs with high SD values, the system can build a cluster of models that are inherently more capable and less prone to narrow, potentially incorrect, interpretations.
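The idea of ranking candidate LLMs by the variety of their own outputs can be sketched in a few lines. The paper computes a "fuzzy matching value" between outputs; the exact metric is not spelled out here, so this sketch approximates it with Python's `difflib` similarity ratio and defines self-diversity as average pairwise dissimilarity. The model names and sample outputs are purely illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations

def self_diversity(outputs):
    """Estimate an LLM's self-diversity from repeated answers to the same
    question. Approximates the paper's fuzzy matching value with difflib's
    similarity ratio (an illustrative assumption, not the exact metric):
    diversity = average pairwise dissimilarity of the outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    dissim = [1 - SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(dissim) / len(dissim)

# Hypothetical candidates: two sampled answers per model to the same question.
candidates = {
    "model_a": ["Answer: option B, because ...", "Answer: option B, since ..."],
    "model_b": ["Option B due to elevated hCG.", "Likely C given imaging findings."],
}

# Rank candidates by self-diversity; the top models form the cluster.
ranked = sorted(candidates, key=lambda m: self_diversity(candidates[m]), reverse=True)
```

Under this toy metric, `model_b`, whose two answers differ substantially, ranks above `model_a`, whose answers are near-duplicates.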
Cross-Consistency: Ensuring Cohesion and Accuracy
Once the initial cluster is formed, the Cross-Consistency (CC) mechanism comes into play. This mechanism tackles the problem of inconsistent or low-quality outputs that can degrade collaborative performance. Instead of simply aggregating all outputs from all LLMs, the CC mechanism adaptively refines the collaboration. It works by first identifying the LLM within the cluster that has the highest Self-Diversity. Then, it measures the consistency between this ‘best’ LLM’s output and the outputs of all other LLMs in the cluster. In an iterative process, the LLM with the lowest cross-consistency value (meaning its output is most inconsistent with the best LLM) is ‘masked out’ or removed from the current layer of collaboration. This process is repeated layer by layer, ensuring that only the most consistent and reliable outputs are propagated forward. This adaptive masking significantly reduces the risk of medical misinformation being amplified and improves the overall accuracy of the system.
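The layer-by-layer masking loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the consistency measure is again stood in for by `difflib`'s similarity ratio, and the stopping rule (keep at least two models) is a hypothetical choice.

```python
from difflib import SequenceMatcher

def cross_consistency(a, b):
    """Fuzzy similarity between two model outputs; a stand-in for the
    paper's cross-consistency value (assumption for illustration)."""
    return SequenceMatcher(None, a, b).ratio()

def adaptive_masking(outputs, sd_scores, min_size=2):
    """One collaboration layer per iteration: anchor on the model with the
    highest self-diversity, score every other model's output against it,
    and mask out the least consistent model. Repeats until only
    `min_size` models remain (hypothetical stopping rule)."""
    active = dict(outputs)  # model name -> current output
    while len(active) > min_size:
        anchor = max(active, key=lambda m: sd_scores[m])
        others = [m for m in active if m != anchor]
        worst = min(others,
                    key=lambda m: cross_consistency(active[anchor], active[m]))
        del active[worst]  # mask out the most inconsistent model this layer
    return active

# Toy example: m3's answer diverges from the high-diversity anchor m1.
outputs = {
    "m1": "Diagnosis: preeclampsia; start magnesium sulfate.",
    "m2": "Diagnosis: preeclampsia; begin magnesium sulfate.",
    "m3": "Patient likely has a common cold.",
}
sd_scores = {"m1": 0.8, "m2": 0.6, "m3": 0.4}
remaining = adaptive_masking(outputs, sd_scores)
```

Here `m1` acts as the anchor, `m3` is masked out in the first layer as the least consistent model, and the consistent pair `m1`/`m2` propagates forward.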
Real-World Validation and Impact
The effectiveness of this Adaptive Cluster Collaborativeness method was rigorously tested on two specialized medical datasets: NEJMQA and MMLU-Pro-health. NEJMQA, for instance, is based on Israel’s 2022 medical specialist licensing examination, where physicians must achieve a minimum passing score of 65% in each discipline. The results were highly promising: the new method reached the official passing score across all disciplines on NEJMQA. Notably, in the ‘Obstetrics and Gynecology’ discipline, it achieved an accuracy of 65.47%, significantly outperforming GPT-4’s 56.12%.
Beyond accuracy, the research also highlighted the efficiency of the proposed method. Despite using open-access LLMs with smaller parameter sizes (14B to 32B), the system surpassed the performance of much larger models (70B and 141B parameters) and even advanced closed-source models like GPT-4 and GPT-4o-mini. Furthermore, it demonstrated substantial reductions in memory usage and running time compared to other collaborative LLM frameworks, making it a more practical and cost-effective solution for real-world healthcare deployment.
This research marks a significant step forward in making LLMs more reliable and accurate for medical decision support. By intelligently selecting and adaptively refining LLM collaborations, the proposed methodology offers a path to achieving physician-level performance. It’s important to note that this technology is designed to complement, rather than replace, human physicians, especially in areas with limited access to specialized medical expertise. For more details, you can refer to the full research paper: Adaptive Cluster Collaborativeness Boosts LLMs Medical Decision Support Capacity.