TLDR: A study evaluated multi-turn multi-agent orchestration, where several LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, Claude Sonnet 4) collaborate by proposing answers and voting to reach consensus. The research found that this orchestration method performs as well as or better than the strongest individual LLM across various benchmarks. It also revealed that coordination strategies significantly impact outcomes; for instance, revealing agent identities increases self-voting, and showing ongoing votes can lead to “herding” behavior. The findings suggest substantial potential for further gains through improved coordination mechanisms.
No single large language model (LLM) consistently excels across all benchmarks. A recent study explores one response to this limitation: multi-turn multi-agent orchestration.
This research, titled “Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks,” delves into how multiple LLM agents can interact over several turns, iteratively proposing answers and casting votes until a consensus is reached. The study, conducted by a team including Aaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang, and others, highlights the potential for collective intelligence to surpass individual performance.
The researchers put this orchestration framework to the test using four prominent LLMs: Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4. They conducted two main experiments. The first benchmarked the orchestration system against these powerful single LLMs on three diverse datasets: GPQA-Diamond (graduate-level questions), IFEval (instruction-following tasks), and MuSR (narrative reasoning). The second experiment involved ablation studies on GPQA-Diamond, examining how two coordination choices affected the outcomes: whether agents knew who authored each answer, and whether they could see ongoing votes.
The findings were striking. The multi-agent orchestration system consistently matched or even slightly exceeded the performance of the strongest single LLM across the benchmarks, while significantly outperforming the weaker models. For instance, on GPQA-Diamond, orchestration achieved 87.4% accuracy, surpassing Gemini 2.5 Pro’s 85.9% and Claude Sonnet 4’s 68.2%. This suggests that by combining complementary strengths, a team of LLMs can deliver top-tier accuracy without needing prior knowledge of which individual model is best for a given task.
Interestingly, the study also revealed that there’s still considerable room for improvement. Even when at least one agent had the correct answer, the orchestration system sometimes failed to converge on it, indicating that better coordination mechanisms could unlock even higher performance.
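One way to quantify this headroom is to count the questions where some agent was right but the group still settled on a wrong answer. The sketch below is a minimal illustration under assumed inputs: the function name `missed_consensus_rate`, the record layout, and the toy data are all hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def missed_consensus_rate(records: List[Tuple[List[bool], bool]]) -> float:
    """Fraction of questions where at least one agent was correct
    but the final consensus answer was wrong.

    Each record is (per_agent_correct, consensus_correct).
    """
    if not records:
        return 0.0
    missed = sum(
        1
        for agent_correct, consensus_correct in records
        if any(agent_correct) and not consensus_correct
    )
    return missed / len(records)


# Made-up toy data for illustration only; these are NOT results from the study.
toy = [
    ([True, False, False, False], True),    # group found the right answer
    ([True, True, False, False], False),    # right answer existed, group missed it
    ([False, False, False, False], False),  # nobody had it
    ([False, True, True, True], True),
]
print(missed_consensus_rate(toy))  # 0.25
```

A nonzero rate on real logs would indicate accuracy that better coordination could, in principle, recover.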
The ablation studies provided crucial insights into how coordination strategies influence group dynamics. When agents knew the identity of each answer’s author (Identified Voting), “self-voting” (agents voting for their own answers) increased noticeably. This also led to a sharp rise in consensus ties, making it harder for the group to reach a definitive agreement. Separately, when agents could see others’ votes as they were cast (Visible Tally), a “herding” behavior emerged: agents tended to follow early majority votes, which sped up convergence but sometimes led to premature, incorrect consensus.
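Self-voting and ties can both be measured directly from a single round of votes. The following is a minimal sketch, assuming a vote log that maps each voter to an answer id and a separate map from answer id to author; these structures and the function names are hypothetical, not the study's instrumentation.

```python
from collections import Counter
from typing import Dict


def self_vote_rate(votes: Dict[str, int], authors: Dict[int, str]) -> float:
    """Share of votes cast by an agent for its own answer.

    votes:   voter name -> id of the answer it voted for
    authors: answer id  -> name of the agent that wrote it
    """
    if not votes:
        return 0.0
    own = sum(1 for voter, answer_id in votes.items() if authors[answer_id] == voter)
    return own / len(votes)


def is_tie(votes: Dict[str, int]) -> bool:
    """True when the two most-voted answers have the same vote count."""
    top_two = Counter(votes.values()).most_common(2)
    return len(top_two) == 2 and top_two[0][1] == top_two[1][1]


# Toy example: four agents, two candidate answers written by "a" and "b".
votes = {"a": 0, "b": 1, "c": 0, "d": 1}
authors = {0: "a", 1: "b"}
print(self_vote_rate(votes, authors))  # 0.5
print(is_tie(votes))                   # True
```

If every agent votes for its own answer, the self-vote rate is 1.0 and every answer gets exactly one vote, which is the kind of tie the Identified Voting condition made more common.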
The orchestration framework itself operates in three phases: Agent Action, Consensus, and Final Presentation. In the Agent Action phase, agents asynchronously propose new answers or vote. If a new answer is introduced during voting, a “dynamic restart” is triggered, allowing all agents to re-evaluate with the updated information. The Consensus phase determines the winning answer based on majority votes, and the Final Presentation phase involves the winning agent synthesizing a comprehensive final answer. For a deeper dive into the methodology, see the full research paper.
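The three phases can be sketched as a small loop. This is a simplified illustration, not the authors' implementation, and it makes several assumptions: agents are hypothetical callables rather than real LLM calls, they act sequentially instead of asynchronously, the winner is the answer with the most votes, ties are broken in favor of the earliest proposed answer, and Final Presentation simply returns the winning answer instead of synthesizing a new one.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical agent interface: given the question and the current candidate
# answers, return ("propose", answer_text) or ("vote", candidate_index).
Policy = Callable[[str, List[str]], Tuple[str, object]]


def orchestrate(question: str, agents: Dict[str, Policy],
                max_rounds: int = 20) -> Optional[Tuple[str, str]]:
    """Return (winning_agent, winning_answer), or None if no round completes."""
    answers: List[str] = []  # candidate answers, in order of proposal
    authors: List[str] = []  # authors[i] wrote answers[i]

    for _ in range(max_rounds):
        votes: Dict[str, int] = {}
        restarted = False

        # Agent Action phase (sequential here; described as asynchronous in the study).
        for name, policy in agents.items():
            kind, payload = policy(question, list(answers))
            if kind == "propose":
                answers.append(str(payload))
                authors.append(name)
                restarted = True  # dynamic restart: votes cast so far are discarded
                break
            votes[name] = int(payload)  # assumed to be a valid candidate index

        if restarted or not votes:
            continue

        # Consensus phase: most votes wins; ties go to the earliest answer (assumption).
        tally = Counter(votes.values())
        winner = min(tally, key=lambda i: (-tally[i], i))

        # Final Presentation phase (simplified): return the winner's answer as-is.
        return authors[winner], answers[winner]

    return None
```

In this sketch a round only reaches the Consensus phase once every agent has voted without proposing anything new, which mirrors the restart behavior described above; `max_rounds` is an added safeguard against agents that keep proposing indefinitely.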
In conclusion, this research underscores the immense potential of multi-turn multi-agent orchestration for enhancing LLM performance. While it demonstrates that collaborative AI can rival and even surpass the best individual models, it also highlights the critical importance of carefully designed coordination strategies to maximize collective intelligence and avoid pitfalls like self-voting biases or herding behavior.


