TL;DR: This research introduces AgentFail, a dataset of 307 annotated failure logs from platform-orchestrated agentic systems, along with a three-level taxonomy (agent, workflow, platform) for classifying root causes. It benchmarks LLMs for automated diagnosis, showing that the taxonomy significantly improves accuracy while underscoring the task's inherent difficulty. The study also provides actionable guidelines for building more reliable multi-agent AI systems.
As artificial intelligence continues to advance, systems built with multiple Large Language Model (LLM) agents are becoming increasingly common. These ‘agentic systems’ are designed to tackle complex problems by having different AI agents collaborate, using tools and structured interactions. A new trend sees these systems being built rapidly on low-code platforms like Dify and Coze, leading to what researchers call ‘platform-orchestrated agentic systems’. While these platforms make AI development more accessible, the systems they create are often prone to failures, and understanding why they fail has been a significant challenge.
A recent research paper, Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark, by Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, and Qing Wang, addresses this critical issue. The paper introduces a comprehensive approach to systematically identify the underlying reasons for failures in these complex AI setups.
The Challenge of Fragility
The inherent complexity of multi-agent systems, with their interconnected components and dependencies, makes pinpointing the exact cause of a failure incredibly difficult. Previous research often focused on where a failure occurred (e.g., a specific agent or step), but not why it happened – for instance, whether it was due to a poorly designed prompt or a logical deadlock in the workflow. This deeper understanding is crucial for effective repair and improvement.
Introducing AgentFail: A New Dataset for Diagnosis
To bridge this gap, the researchers constructed a unique dataset called AgentFail. This dataset comprises 307 failure logs collected from ten different platform-orchestrated agentic systems on Dify and Coze platforms. Each log is meticulously annotated, linking observed failures to their specific root causes. To ensure the accuracy of these annotations, the team employed a rigorous process, including multi-round expert annotation, consensus building, and cross-validation. They even used ‘counterfactual reasoning’ – essentially, testing if fixing a suspected cause actually resolves the failure – to validate their findings, showing that repairs aligned with their annotations were highly effective.
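The annotation-plus-counterfactual-validation process described above can be sketched as a small data model. Note that the field names below are illustrative assumptions, not the actual AgentFail schema, and `counterfactual_validated` is a hypothetical helper:

```python
from dataclasses import dataclass

@dataclass
class FailureLog:
    """One annotated failure log (illustrative fields, not the actual AgentFail schema)."""
    system: str            # e.g. an application built on Dify or Coze
    log_text: str          # raw execution trace of the failed run
    failure_symptom: str   # what went wrong, as observed
    root_cause: str        # taxonomy label assigned by annotators
    repair_resolves: bool  # counterfactual check: did fixing the suspected cause resolve the failure?

def counterfactual_validated(logs: list[FailureLog]) -> float:
    """Fraction of annotations confirmed by a counterfactual repair."""
    return sum(log.repair_resolves for log in logs) / len(logs)

logs = [
    FailureLog("demo-app", "...", "empty answer", "poor prompt design", True),
    FailureLog("demo-app", "...", "system crash", "service unavailability", True),
]
print(counterfactual_validated(logs))  # 1.0
```

The key idea is that an annotation is only trusted if repairing the labeled cause actually eliminates the observed failure.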
A Three-Level Taxonomy of Failure Root Causes
A key contribution of the paper is a new, fine-grained taxonomy for classifying failure root causes. This taxonomy organizes failures into three main levels:
- Agent-level Failures: These occur within a single AI agent, often due to limitations of the underlying language model or how it interacts with local tools. Examples include incorrect tool selection, invalid output formats, content deviations, knowledge gaps, poor prompt design, or language encoding issues.
- Workflow-level Failures: These arise from problems in how multiple agents coordinate or communicate. This can be linked to the overall workflow structure. Common issues here include missing input validation, unreasonable dependencies between nodes, loops or deadlocks, faulty conditional logic, improper task decomposition, context conflicts, or mismatches in cross-agent tools.
- Platform-level Failures: These are attributed to the underlying platform or runtime environment itself, such as network fluctuations, resource shortages, or service unavailability.
Insights from Failure Distribution and Impact
Analyzing the AgentFail dataset with this taxonomy revealed some interesting patterns. Agent-level failures were found to be the most dominant, particularly those related to ‘knowledge or reasoning limitations’ and ‘poor prompt design’. Workflow-level failures, while less frequent, often stemmed from ‘missing input validation’ and ‘unreasonable node dependencies’. Platform-level failures were the least common but proved to be the most destructive, frequently leading to complete system termination.
The study also examined the impact of different failure types. Agent-level issues like reasoning limitations or prompt defects often resulted in ‘suboptimal quality’ – meaning the system completed the task but with poor results. In contrast, problems like response formatting errors, language defects, or workflow deadlocks frequently led to ‘execution termination’, where the system simply stopped working.
Benchmarking Automated Diagnosis with LLMs
Recognizing the labor-intensive nature of manual diagnosis, the researchers explored using LLMs to automatically identify root causes. They tested various LLMs (including gpt-4o, LLaMA-3.1-70B, and DeepSeek-R1) in different settings, both with and without the proposed taxonomy as guidance. The results were clear: providing the taxonomy significantly improved the LLMs’ accuracy in identifying root causes, boosting performance by 15-20 percentage points. However, even with this improvement, the highest accuracy reached only 33.6%, underscoring that automated root cause identification remains a challenging task, especially with long and complex failure logs.
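The two benchmark settings, with and without taxonomy guidance, amount to prepending a taxonomy summary to the diagnosis prompt. The sketch below is illustrative only; the paper's actual prompts and taxonomy wording will differ:

```python
# Sketch of the two prompting settings compared in the benchmark
# (illustrative wording, not the paper's actual prompts).

TAXONOMY_SUMMARY = (
    "Root causes fall into three levels: agent-level (e.g. poor prompt design, "
    "reasoning limitations), workflow-level (e.g. missing input validation, "
    "deadlocks), and platform-level (e.g. service unavailability)."
)

def build_prompt(failure_log: str, with_taxonomy: bool) -> str:
    """Build a root-cause diagnosis prompt, optionally guided by the taxonomy."""
    guidance = TAXONOMY_SUMMARY + "\n\n" if with_taxonomy else ""
    return (
        f"{guidance}Identify the root cause of the failure in this "
        f"agentic-system log:\n\n{failure_log}\n\nRoot cause:"
    )

prompt = build_prompt("Node 'summarize' timed out after three retries...", with_taxonomy=True)
print("three levels" in prompt)  # True
```

Constraining the model to a fixed label space is plausibly why guidance helps: the model classifies against known categories instead of free-associating a cause.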
Actionable Guidelines for Robust Systems
Based on their findings, the paper offers practical guidelines for developers to build more reliable platform-orchestrated agentic systems:
- Clear Role Specification and Modular Prompt Design: To mitigate planning errors and response misalignments.
- Explicit Input and Output Validation: To prevent errors from malformed data from spreading.
- Comprehensive Checks and Fallback Mechanisms: To address local problems before they propagate.
- Progressive Workflow Design: Starting with simpler structures and gradually adding complexity to avoid issues like unreasonable dependencies or deadlocks.
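The validation and fallback guidelines above can be illustrated with a small wrapper around a single workflow node. The node API here is hypothetical, purely to show the pattern:

```python
# Sketch of the "explicit input validation" and "fallback mechanism" guidelines,
# applied at one workflow node. The node/handler API is hypothetical.

def validated_node(handler, fallback):
    """Wrap a node handler with input checks and a fallback path."""
    def run(payload):
        # Explicit input validation: reject malformed data before it spreads downstream.
        if not isinstance(payload, dict) or "query" not in payload:
            raise ValueError("malformed payload: missing 'query'")
        try:
            return handler(payload)
        except Exception:
            # Fallback mechanism: contain a local failure instead of letting it propagate.
            return fallback(payload)
    return run

node = validated_node(
    handler=lambda p: {"answer": p["query"].upper()},
    fallback=lambda p: {"answer": "(fallback) could not process request"},
)
print(node({"query": "hello"}))  # {'answer': 'HELLO'}
```

Catching failures at the node boundary mirrors the paper's observation that unvalidated inputs and unhandled local errors are common workflow-level root causes.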
In conclusion, this research provides invaluable resources – a reliable dataset, a comprehensive taxonomy, and a benchmark – that lay a foundation for a deeper understanding of why platform-orchestrated agentic systems fail. By offering actionable insights, it aims to support the development of more robust and dependable AI agent solutions in real-world applications.