TL;DR: A new research paper introduces ProtocolBench, a benchmark that systematically evaluates LLM multi-agent communication protocols (A2A, ACP, ANP, Agora) across task success, latency, overhead, and robustness. It reveals that no single protocol is universally optimal, with A2A excelling in task utility and resilience, ACP in low latency, and ANP/Agora in security. The paper also proposes ProtocolRouter, a learnable system that dynamically selects the best protocol for specific scenarios or modules, demonstrating improved performance and reliability over fixed-protocol approaches.
As large language model (LLM) based multi-agent systems become more sophisticated and move from experimental prototypes to real-world applications, a critical but often overlooked factor is the communication protocol layer. This layer dictates how different AI agents talk to each other, and its choice can significantly impact a system’s overall performance and reliability. Historically, selecting a communication protocol has been based on intuition rather than systematic guidance, despite the existence of various protocols like A2A, ACP, ANP, and Agora.
A recent research paper, "Which LLM Multi-Agent Protocol to Choose?", tackles this challenge head-on. Authored by Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, and Jiaxuan You, the paper introduces a new benchmark called ProtocolBench. The benchmark is designed to systematically compare agent protocols across four measurable dimensions: task success, end-to-end latency, message or byte overhead, and robustness under failures.
ProtocolBench: A Comprehensive Evaluation
ProtocolBench evaluates protocols across four distinct scenarios, each designed to stress different aspects of the communication layer:
- GAIA Document Question Answering: This scenario focuses on hierarchical information aggregation in collaborative workflows, where agents work together to extract, summarize, and judge evidence from documents.
- Safety Tech: This assesses privacy-preserving communication in a medical Q&A setting, testing transport and session protections against various security probes.
- Streaming Queue: Designed for high-throughput API serving, this scenario evaluates how protocols handle a large volume of requests with queue-based load distribution.
- Fail-Storm Recovery: This tests a system’s resilience under cyclic node failures in a Shard-QA ring, where agents are periodically killed and must rejoin, measuring recovery time and retention of answer discovery.
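To make the four evaluation dimensions concrete, here is a minimal sketch of how a ProtocolBench-style harness might score one protocol on one scenario. The `protocol.run` interface and the result fields (`bytes_sent`, `success`, `recovered_fraction`) are illustrative assumptions, not the paper's actual implementation.

```python
import time

def evaluate(protocol, tasks):
    """Score a protocol on the four ProtocolBench-style dimensions.

    Returns (success_rate, mean_latency_s, mean_bytes, mean_robustness).
    `protocol.run(task)` is assumed to return a dict with:
      - "success": bool, whether the task was solved
      - "bytes_sent": total message/byte overhead for the task
      - "recovered_fraction": share of pre-failure capability retained
        after injected node failures (1.0 = full recovery)
    """
    successes, latencies, byte_counts, recoveries = 0, [], [], []
    for task in tasks:
        start = time.perf_counter()
        result = protocol.run(task)  # agents exchange messages here
        latencies.append(time.perf_counter() - start)   # end-to-end latency
        byte_counts.append(result["bytes_sent"])        # comms overhead
        successes += result["success"]                  # task utility
        recoveries.append(result["recovered_fraction"]) # robustness
    n = len(tasks)
    return (successes / n,
            sum(latencies) / n,
            sum(byte_counts) / n,
            sum(recoveries) / n)
```

A harness like this makes the trade-offs directly comparable: the same task set is replayed over each protocol, and only the communication layer changes between runs.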
The findings from ProtocolBench are clear: the choice of protocol profoundly influences system behavior, and no single protocol is a universal winner. Performance trade-offs are highly scenario-dependent.
Key Findings Across Protocols
The research highlights specific strengths for each protocol:
- A2A (Agent-to-Agent Protocol): This protocol excels in task utility, particularly in the GAIA scenario, achieving the highest task quality and success rates. It also demonstrates exceptional resilience in Fail-Storm Recovery, maintaining nearly 99% of its pre-failure answer discovery capability.
- ACP (Agent Communication Protocol): ACP shows superior latency characteristics in the Streaming Queue scenario, achieving the lowest mean response time and smallest variance, making it ideal for high-throughput, latency-critical applications.
- ANP (Agent Network Protocol) and Agora (Meta-Protocol): These protocols provide comprehensive security coverage, including TLS transport security, session hijacking protection, end-to-end encryption, tunnel sniffing resistance, and metadata leakage prevention. This makes them critical for scenarios demanding stringent privacy guarantees, like medical Q&A. However, this enhanced security often comes with increased latency overhead.
Introducing ProtocolRouter: Dynamic Protocol Selection
Recognizing that no single protocol dominates all scenarios, the researchers also introduce ProtocolRouter. This is a learnable protocol router that dynamically selects the most suitable protocol for a given scenario or even a specific module within a system, based on requirements and runtime signals. ProtocolRouter doesn’t modify application semantics but performs selection and composition, with stateless encode/decode bridges handling cross-protocol message translation.
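The idea can be sketched as a scoring function over requirement signals plus a stateless bridge. Everything below is a toy illustration under assumed names: the weight table loosely mirrors the paper's findings (A2A for utility/resilience, ACP for latency, ANP/Agora for security), but the paper's router is learned, not hand-coded.

```python
PROTOCOLS = ["A2A", "ACP", "ANP", "Agora"]

# Illustrative per-protocol affinities for each requirement signal.
# A learned router would fit these (or a richer model) from data.
WEIGHTS = {
    "A2A":   {"utility": 0.9, "latency": 0.5, "security": 0.4, "resilience": 0.9},
    "ACP":   {"utility": 0.6, "latency": 0.9, "security": 0.4, "resilience": 0.5},
    "ANP":   {"utility": 0.5, "latency": 0.3, "security": 0.9, "resilience": 0.6},
    "Agora": {"utility": 0.5, "latency": 0.3, "security": 0.9, "resilience": 0.6},
}

def route(requirements):
    """Pick the protocol whose affinities best match the runtime signals.

    `requirements` maps signal -> importance in [0, 1],
    e.g. {"latency": 1.0, "security": 0.2}.
    """
    def score(protocol):
        return sum(WEIGHTS[protocol][k] * v for k, v in requirements.items())
    return max(PROTOCOLS, key=score)

def bridge(message, src, dst):
    """Stateless encode/decode bridge: translate a message between
    protocols without touching application semantics (here, a plain
    re-wrapping stands in for real per-protocol encoding)."""
    if src == dst:
        return message
    return {"payload": message, "from": src, "to": dst}
```

With a table like this, a latency-dominated request (`{"latency": 1.0}`) routes to ACP, while a security-dominated one routes to ANP or Agora; the bridge lets two modules that were assigned different protocols still exchange messages.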
Experiments with ProtocolRouter show significant improvements. It can reduce Fail-Storm recovery time by up to 18.1% compared to the best single-protocol baseline and achieve higher success rates in GAIA. This demonstrates that dynamic, scenario-aware protocol selection is a practical approach to building more reliable and efficient multi-agent systems.
In conclusion, the paper underscores that protocol choice is a consequential engineering decision, not an arbitrary one. By providing a standardized evaluation benchmark and a dynamic selection mechanism, this research aims to transform protocol selection from intuition-driven to a principled engineering practice, crucial for the maturation of multi-agent systems into production-ready infrastructure.