
Evaluating AI Agents: Introducing MCP-AgentBench for Real-World Tool Use

TLDR: MCP-AgentBench is a new benchmark designed to evaluate how well language agents use tools in real-world scenarios, especially tools integrated via the Model Context Protocol (MCP). It features a testbed of 33 servers exposing 188 tools, 600 diverse queries across 6 complexity categories, and an outcome-focused evaluation method called MCP-Eval. Initial findings show that open-source models can rival proprietary ones and that performance depends heavily on the interaction framework used (ReAct vs. native Tool Calling). The benchmark also identifies common failure modes for agents in tool-use scenarios.

A new benchmark called MCP-AgentBench has been introduced to rigorously evaluate how well language agents perform in real-world situations, particularly when interacting with tools through the Model Context Protocol (MCP). The work addresses a critical gap in current evaluation methods, which often fail to capture agent capabilities within this evolving protocol.

Understanding the Model Context Protocol (MCP)

The Model Context Protocol (MCP) is an emerging open standard designed to improve how AI agents integrate and work together with various tools. It aims to enable powerful, interconnected, and genuinely useful AI systems. Unlike traditional methods where tools are treated as separate, callable functions, MCP allows agents to interact with a server as a more complete entity, potentially managing context or state across multiple operations. It also standardizes feedback from the environment, providing richer and more consistent responses, and is envisioned to support dynamic tool discovery, reducing the effort needed for custom integrations.
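To make this concrete, here is a minimal sketch of the JSON-RPC-style exchange MCP standardizes. The `tools/list` and `tools/call` methods come from the MCP specification, but the tool name, arguments, and payload details below are illustrative placeholders, not output from any real server:

```python
import json

# Discovery: the client asks the server what tools it offers (MCP's
# dynamic tool discovery), rather than hard-coding function schemas.
list_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# Invocation: the client calls a discovered tool by name. The tool
# name and arguments here are hypothetical placeholders.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",
        "arguments": {"city": "Beijing"},
    },
}

# The server replies with a standardized, structured result, giving the
# agent consistent environment feedback across different tools.
example_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {"content": [{"type": "text", "text": "22°C, clear"}]},
}

for msg in (list_request, call_request, example_response):
    print(json.dumps(msg, indent=2))
```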

The Need for a New Benchmark

Despite MCP’s growing adoption, existing benchmarks, often designed for older function-calling paradigms, don’t adequately assess agent performance in MCP-mediated interactions. This can lead to a skewed perception of an agent’s true operational value and makes it difficult to reliably distinguish between different agents’ proficiencies. For example, some models might perform well on traditional benchmarks but struggle in real-world, MCP-driven tasks.

Introducing MCP-AgentBench

Developed by Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao from the University of Science and Technology of China and Metastone Technology, Beijing, China, MCP-AgentBench is specifically engineered to bridge this evaluation gap. The benchmark’s core contributions include:

  • A Robust Testbed: It features a comprehensive MCP testbed comprising 33 operational servers that offer access to 188 distinct tools. These servers were carefully selected for their stability, stateless operation, and reliance on text-based interactions.

  • Diverse Queries: The benchmark includes 600 systematically designed queries distributed across 6 distinct categories. These categories vary in interaction complexity, from simple single-server operations to complex multi-server sequential workflows requiring sophisticated planning and information synthesis.

  • Outcome-Oriented Evaluation: MCP-AgentBench introduces MCP-Eval, a novel evaluation methodology that prioritizes real-world task success. Instead of rigidly adhering to specific execution paths, it assesses whether the agent achieves the desired outcome, recognizing that multiple valid solutions often exist (a minimal sketch of this outcome-focused check follows this list).
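The paper’s actual judging prompt and criteria are not reproduced here; the sketch below only illustrates the general shape of outcome-oriented (rather than trajectory-matching) evaluation. `llm_judge` and the query fields are hypothetical stand-ins, not the authors’ implementation:

```python
# Illustrative sketch of outcome-oriented evaluation in the spirit of
# MCP-Eval: the judge compares the agent's final answer against a
# gold-standard reference, without requiring any particular sequence
# of tool calls.

def llm_judge(question: str, reference: str, answer: str) -> bool:
    """Hypothetical LLM-as-judge call: returns True if `answer`
    satisfies the query's objective as captured by `reference`."""
    raise NotImplementedError("plug in an LLM client of your choice")

def mcp_eval(query: dict, agent_answer: str) -> bool:
    # Only the outcome is scored; the trajectory (which servers and
    # tools the agent used, and in what order) is deliberately
    # ignored, since multiple valid solution paths often exist.
    return llm_judge(
        question=query["objective"],
        reference=query["gold_reference"],
        answer=agent_answer,
    )
```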

How MCP-AgentBench Was Built

The construction of MCP-AgentBench followed a rigorous three-stage process. First, the MCP server testbed was established through meticulous curation and deployment. Second, diverse and realistic queries were generated with the assistance of a large language model (LLM) and human verification. These queries were categorized based on ‘server scope’ (single or multi-server) and ‘call dependency’ (single, parallel, or sequential calls). Each query was designed with a specific tool selection, user profile, scenario description, and a clear objective. Finally, gold-standard reference answers were annotated using a hybrid framework that combined LLM-generated execution trajectories with expert human review, ensuring high quality and addressing complex cases.
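One way to picture the resulting query records and the six-category taxonomy (2 server scopes × 3 call dependencies) is sketched below; the class and field names are our own illustration of the fields the paper describes, not its actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class ServerScope(Enum):
    SINGLE = "single-server"
    MULTI = "multi-server"

class CallDependency(Enum):
    SINGLE = "single call"
    PARALLEL = "parallel calls"      # independent calls, any order
    SEQUENTIAL = "sequential calls"  # later calls need earlier outputs

@dataclass
class BenchmarkQuery:
    """One of the 600 queries; 2 scopes x 3 dependencies = 6 categories."""
    tool_selection: list[str]  # tools the query is designed around
    user_profile: str          # who is asking
    scenario: str              # situational context for the request
    objective: str             # what a successful outcome must achieve
    scope: ServerScope
    dependency: CallDependency
```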

Key Findings from Evaluations

The researchers conducted extensive empirical evaluations of 10 leading language agents, including both proprietary systems (like Anthropic’s Claude models, OpenAI’s GPT-4o and o3-mini, and Google’s Gemini models) and prominent open-source architectures (such as Qwen, Kimi K2, and DeepSeek). The evaluations utilized both the ReAct framework and native Tool Calling (TC) modes.

A surprising finding was that leading open-source models demonstrated exceptional capabilities, in some cases even surpassing their proprietary counterparts. Notably, Qwen3-235B-A22B, when using the ReAct framework, achieved the highest overall score in the benchmark. Among proprietary models, Anthropic’s Claude 4 Sonnet performed best, especially with its native Tool Calling capabilities. In contrast, GPT-4o showed significant underperformance across all tested scenarios.

The study also highlighted that an agent’s performance is highly dependent on the interaction framework used, with no single universally superior option. For instance, Qwen3-235B-A22B excelled with ReAct but saw a drastic performance drop in TC mode, while Claude 4 Sonnet improved significantly with TC. This underscores the importance of selecting the right framework for a given model.
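Schematically, the two interaction modes differ in where tool use is expressed: ReAct interleaves reasoning and tool invocations as plain text that a harness must parse, while native Tool Calling receives structured function-call objects from the model API. The sketch below assumes hypothetical model and tool-execution callables rather than any specific SDK:

```python
from typing import Any, Callable

def react_loop(
    complete: Callable[[str], str],      # text-in/text-out model call
    parse_step: Callable[[str], tuple],  # -> (final_or_thought, action|None)
    run_tool: Callable[[Any], str],      # executes one MCP tool call
    tools_description: str,
    query: str,
) -> str:
    """ReAct: reasoning and tool use are interleaved as plain text; the
    harness parses each emitted Action and feeds the tool's output back
    into the transcript as an Observation."""
    transcript = f"Tools:\n{tools_description}\nQuestion: {query}\n"
    while True:
        step = complete(transcript)
        final, action = parse_step(step)  # action is None on "Final Answer"
        if action is None:
            return final
        transcript += f"{step}\nObservation: {run_tool(action)}\n"

def tool_calling_loop(
    chat: Callable[..., Any],            # structured chat-completion call
    run_tool: Callable[[Any], str],
    tool_schemas: list,
    query: str,
) -> str:
    """Native tool calling (TC): tool use arrives as structured
    function-call objects from the model API instead of parsed text."""
    messages = [{"role": "user", "content": query}]
    while True:
        reply = chat(messages=messages, tools=tool_schemas)
        if not reply.tool_calls:         # no calls requested: answer is final
            return reply.content
        for call in reply.tool_calls:
            messages.append({"role": "tool", "content": run_tool(call)})
```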

Furthermore, the evaluations confirmed that task difficulty generally increases with more servers involved and greater call dependency. The consistency between MCP-Eval and human evaluations was also found to be very high, validating the benchmark’s reliability.

Common Agent Failure Modes

The analysis identified several recurring error categories for LLM agents in protocol-driven, tool-use scenarios:

  • Misinterpretation of Query: Agents failing to accurately understand the user’s main objective or critical details.

  • Refusal to Use Tool: Agents improperly relying on their internal knowledge instead of invoking necessary tools for external or dynamic data.

  • Omission of Key Information: Agents providing incomplete responses or failing to synthesize essential information from tool outputs.

  • Hallucination: Agents fabricating information that is not supported by or contradicts tool outputs.

Future of AI Agents

MCP-AgentBench aims to provide the research community with a standardized and reliable framework to build, validate, and advance agents that can fully leverage the benefits of MCP. This work is a step toward truly capable and interoperable AI systems. For more details, see the full research paper.

