Famas: A New Approach to Pinpointing Failures in Multi-Agent AI Systems

TLDR: Famas is the first spectrum-based method for automatically attributing failures in Large Language Model-powered Multi-Agent Systems (MASs). It addresses the challenge of identifying which specific agent actions cause system failures by replaying tasks, abstracting execution trajectories, and applying a novel spectrum analysis formula. This formula considers both agent and action behavior patterns to calculate a ‘suspiciousness score’ for each action. Evaluated on the Who&When benchmark, Famas significantly outperforms existing LLM-based and random attribution methods, demonstrating superior accuracy in identifying faulty agents and actions, especially in complex MAS environments.

Multi-Agent Systems (MASs), powered by Large Language Models (LLMs), are becoming increasingly vital for automating complex tasks, from programming to scientific discovery. These systems, where multiple AI agents collaborate, hold immense promise. However, like any sophisticated technology, MASs are not without their flaws. When a system fails, identifying the exact agent action responsible for that failure, a process known as failure attribution, is a significant challenge. This attribution is crucial for debugging and improving the system’s reliability.

Traditionally, failure attribution in MASs has been a labor-intensive and underexplored area. Manual analysis of extensive system logs is time-consuming and requires expert knowledge. More recent attempts have leveraged LLMs to diagnose failures, but these methods have shown limited success, often struggling with the sheer volume and noisy nature of MAS logs, leading to low accuracy in pinpointing the exact faulty action.

To address this critical gap, researchers have introduced Famas, a pioneering approach for automatically attributing failures in MASs. Famas is inspired by spectrum-based fault localization (SBFL), a technique commonly used in traditional software engineering. The core idea behind Famas is to estimate the likelihood that each agent action is responsible for a failure by analyzing variations across multiple repeated executions of a failed task.
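For readers unfamiliar with SBFL, the sketch below shows the classical idea on a toy example, using the standard Ochiai formula from traditional fault localization. It is not the Famas formula; the run data and the element names (`plan`, `search`, `summarize`, `verify`) are invented for illustration.

```python
import math

def ochiai(ef: int, ep: int, total_failed: int) -> float:
    """Classical Ochiai score: ef / sqrt(total_failed * (ef + ep)).

    ef = failing runs that contain the element, ep = passing runs that contain it.
    """
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0

def rank_by_suspiciousness(runs):
    """runs: list of (set_of_elements_seen, failed) pairs, one per replayed run."""
    total_failed = sum(1 for _, failed in runs if failed)
    elements = set().union(*(seen for seen, _ in runs))
    scores = {}
    for element in elements:
        ef = sum(1 for seen, failed in runs if failed and element in seen)
        ep = sum(1 for seen, failed in runs if not failed and element in seen)
        scores[element] = ochiai(ef, ep, total_failed)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented example: four replays of one task, two of which failed.
runs = [
    ({"plan", "search", "summarize"}, True),
    ({"plan", "search"}, False),
    ({"plan", "summarize"}, True),
    ({"plan", "search", "verify"}, False),
]
ranking = rank_by_suspiciousness(runs)
for element, score in ranking:
    print(f"{element}: {score:.3f}")
```

In this toy data `summarize` appears only in failing runs, so it ranks first, while `plan` appears in every run and is treated as less suspicious background behavior.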

How Famas Works: A Simplified Overview

Famas operates in two main phases: Trajectory Replay & Abstraction, and Spectrum Analysis.

First, in the **Trajectory Replay & Abstraction** phase, when a MAS fails on a task, Famas re-executes that task multiple times to collect a suite of raw execution logs. These logs, which are typically verbose and in natural language, are then processed. An LLM is used to break down these logs into manageable chunks, extracting primitive agent-action-state triples (who did what, and what was the result). A hierarchical clustering approach then refines these triples, consolidating semantically equivalent actions and states, even if described differently, to create a consistent and structured set of execution trajectories.
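As a rough illustration of what an abstracted trajectory might look like, the sketch below represents each step as an agent-action-state triple and merges differently worded descriptions of the same action. The `Triple` type, the agent names, and the hand-written `ACTION_ALIASES` table are hypothetical; the alias lookup is only a simple stand-in for the LLM extraction and hierarchical clustering described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    agent: str   # who acted
    action: str  # what they did (canonical label)
    state: str   # what the result was

# Hypothetical alias table standing in for the clustering step:
# differently worded descriptions map to one canonical action label.
ACTION_ALIASES = {
    "search the web": "web_search",
    "look up on the web": "web_search",
    "write final answer": "answer",
}

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences disappear."""
    return " ".join(text.lower().split())

def abstract(raw_triples):
    """Turn raw (agent, action, state) strings into canonical Triple objects."""
    result = []
    for agent, action, state in raw_triples:
        key = normalize(action)
        result.append(Triple(agent.strip(), ACTION_ALIASES.get(key, key), normalize(state)))
    return result

# Invented raw extraction output for one replayed run.
raw = [
    ("WebSurfer", "Search the web", "results found"),
    ("WebSurfer", "look up on the  web", "Results found"),
    ("Assistant", "Write final answer", "answer produced"),
]
trajectory = abstract(raw)
print(trajectory)
```

Because the triples are frozen dataclasses, equivalent steps compare equal and can be counted across runs, which is what the later spectrum analysis relies on.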

Second, the **Spectrum Analysis** phase takes these refined trajectories and performs a detailed analysis. Famas introduces a novel suspiciousness formula specifically tailored for MASs. This formula integrates two key groups of factors:

**Agent Behavior Group:** This considers how widely an action is distributed across an agent’s executions (Action Coverage Ratio) and how frequently a specific action appears relative to all actions performed by that agent (Action Frequency Proportion). These metrics help to fairly compare agents with different activity levels and identify core versus peripheral behaviors.

**Action Behavior Group:** This accounts for the repeatability of actions within MASs. It includes a Local Frequency Enhancement Factor, which amplifies suspicious actions that occur unusually often within a single failing trajectory, and a λ-Decay SBFL Coefficient, which captures global frequency patterns across multiple executions, distinguishing between actions consistently linked to failures and those that are merely common background operations.
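The two agent-behavior metrics can be sketched directly from their verbal descriptions. The code below is one plausible reading, not the paper's exact definitions: it assumes the coverage ratio is the share of an agent's runs that contain the action, and the frequency proportion is the action's share of all actions by that agent. The function name, the agent names, and the data are invented, and steps are simplified to (agent, action) pairs without the state.

```python
from collections import Counter

def agent_behavior_metrics(trajectories):
    """trajectories: list of runs, each a list of (agent, action) pairs."""
    pair_counts = Counter()      # (agent, action) -> total occurrences
    agent_counts = Counter()     # agent -> total actions performed
    runs_with_pair = Counter()   # (agent, action) -> runs containing the pair
    runs_with_agent = Counter()  # agent -> runs in which the agent acts
    for run in trajectories:
        for agent, action in run:
            pair_counts[(agent, action)] += 1
            agent_counts[agent] += 1
        for pair in set(run):
            runs_with_pair[pair] += 1
        for agent in {a for a, _ in run}:
            runs_with_agent[agent] += 1
    return {
        pair: {
            # Assumed reading: share of the agent's runs that include this action.
            "coverage_ratio": runs_with_pair[pair] / runs_with_agent[pair[0]],
            # Assumed reading: share of the agent's actions that are this action.
            "frequency_proportion": pair_counts[pair] / agent_counts[pair[0]],
        }
        for pair in pair_counts
    }

# Invented example: two replayed runs of the same task.
trajectories = [
    [("Planner", "plan"), ("Coder", "write_code"), ("Coder", "run_code"), ("Coder", "run_code")],
    [("Planner", "plan"), ("Coder", "write_code")],
]
metrics = agent_behavior_metrics(trajectories)
for pair, values in sorted(metrics.items()):
    print(pair, values)
```

Normalizing per agent is what lets a busy agent and a rarely active one be compared on the same scale. How these ratios are combined with the action-behavior factors into a final score is defined in the paper and is not reproduced here.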

By combining these metrics, Famas calculates a suspiciousness score for each agent-action-state triple. The triples are then ranked, with the highest-scoring one identified as the most likely root cause of the failure.

Remarkable Results and Generalizability

Famas was rigorously evaluated against 12 baseline methods on the Who&When benchmark, a comprehensive dataset of 184 failure traces from 127 MASs. The results were compelling: Famas consistently outperformed all compared methods, achieving significantly higher accuracy in failure attribution at both the agent and action levels. For instance, it improved action-level accuracy by 104.4% compared to a random approach and by 49.1% over the best LLM-based method.

Notably, Famas demonstrated strong generalizability, performing even better on more complex, handcrafted MAS logs than on simpler, algorithmically generated ones. This suggests that the richer behavioral data in complex scenarios provides more robust spectral information for Famas to analyze. While the initial trajectory replay and abstraction phase can be computationally intensive, the subsequent spectrum analysis is highly efficient, so the fully automated process remains practical for real-world applications.

This research marks a significant step forward in ensuring the reliability and debuggability of advanced multi-agent AI systems. For more in-depth information, see the full research paper.
