Assessing LLM Agent Performance Through Comprehensive Execution Path Analysis

TLDR: The CORE framework evaluates LLM agents by analyzing their entire sequence of actions, not just the final outcome. It models tasks as Deterministic Finite Automata (DFAs) and scores agents with five metrics: Path Correctness, Path Correctness – Kendall’s tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency. Together these expose problems with safety, efficiency, and intermediate correctness that final-state evaluations often miss, especially in complex real-world scenarios.

Large Language Model (LLM) agents are increasingly being deployed to solve real-world tasks by executing sequences of function calls. However, evaluating these agents has largely focused on whether the final outcome is correct, often overlooking crucial aspects like safety, efficiency, and the correctness of intermediate steps. This traditional approach can lead to a misleading assessment of an agent’s true capabilities and suitability for deployment.

Imagine a robotic arm that successfully picks up the correct object but only after colliding with other items, or a scheduling assistant that repeatedly overwrites and deletes events before arriving at the right calendar entry. Under a final-state evaluation, these agents might appear successful, yet their intermediate behaviors are problematic and could be unsafe or inefficient in practice. To address this significant gap, a new evaluation framework called CORE has been introduced.

CORE shifts the focus from just the final outcomes to the entire ‘path’ of execution. It models tasks as Deterministic Finite Automata (DFAs) over tool invocations, where each task prompt defines a set of valid reference paths that encode both correctness and safety constraints. By comparing an agent’s produced action sequence against these references, CORE provides a principled way to assess agent behavior in diverse environments.
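As a rough illustration of what a DFA over tool invocations could look like, the sketch below encodes a tiny irrigation task in which the valve must be opened before watering. The class name `ToolDFA`, the state names, and the transition table are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, Iterable, Set, Tuple


class ToolDFA:
    """A deterministic finite automaton whose alphabet is tool names (hypothetical helper)."""

    def __init__(self, start: str, accepting: Set[str],
                 transitions: Dict[Tuple[str, str], str]) -> None:
        self.start = start
        self.accepting = accepting
        self.transitions = transitions

    def accepts(self, calls: Iterable[str]) -> bool:
        """Return True if the call sequence ends in an accepting state."""
        state = self.start
        for call in calls:
            key = (state, call)
            if key not in self.transitions:
                return False  # no valid transition: treat as an out-of-policy call
            state = self.transitions[key]
        return state in self.accepting


# Assumed task definition: states and transitions are illustrative only.
irrigation = ToolDFA(
    start="idle",
    accepting={"done"},
    transitions={
        ("idle", "unlock_safety"): "unlocked",
        ("unlocked", "open_valve"): "valve_open",
        ("valve_open", "water"): "watered",
        ("watered", "log"): "done",
    },
)

ok = irrigation.accepts(["unlock_safety", "open_valve", "water", "log"])
bad = irrigation.accepts(["unlock_safety", "water", "open_valve", "log"])
print(ok, bad)  # True False
```

Here the second sequence is rejected because `water` has no transition from the `unlocked` state, which is one way a reference automaton can encode a safety precondition.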

The CORE Framework and Its Metrics

The CORE framework introduces a suite of five complementary metrics designed to quantify alignment with expected execution patterns:

Path Correctness (PC): This metric captures how well an agent’s condensed execution path aligns with a canonical ‘golden’ solution. Inspired by Levenshtein distance, it provides a graded notion of correctness, accommodating paths of unequal length and penalizing deviations like unnecessary or incorrect calls.
Path Correctness – Kendall’s tau Composite (PC-KTC): Beyond just correct actions, this metric assesses whether those actions were performed in the correct order. It integrates token-level fidelity with order-aware agreement, penalizing out-of-order execution.
Prefix Criticality: This metric evaluates not only if harmful calls occur but also when they occur. It assigns heavier penalties to early harmful calls, recognizing their greater causal impact and potential to propagate errors.
Harmful-Call Rate: This quantifies how frequently an agent attempts out-of-policy actions among its substantive steps. A high rate indicates that the agent is prone to invalid actions, undermining robustness and trustworthiness.
Efficiency: This metric measures the economy of agentic behavior, comparing the number of steps an agent used against the shortest valid way to solve the task. It penalizes excessive or wasteful steps, including redundant reads, benign writes, and harmful attempts.
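Three of these metrics can be approximated in a few lines. The sketch below is a simplification under stated assumptions: Path Correctness is stood in for by a length-normalized Levenshtein similarity, every step is treated as substantive when computing the harmful-call rate, and efficiency is the ratio of the shortest valid path length to the number of steps used. The function names and the exact normalizations are illustrative and may differ from the paper's definitions.

```python
from typing import List, Set


def levenshtein(a: List[str], b: List[str]) -> int:
    """Edit distance between two call sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def path_similarity(agent: List[str], reference: List[str]) -> float:
    """Assumed normalization: 1 minus edit distance over the longer length."""
    longest = max(len(agent), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(agent, reference) / longest


def harmful_call_rate(agent: List[str], harmful: Set[str]) -> float:
    """Fraction of steps that are out of policy (all steps counted as substantive)."""
    if not agent:
        return 0.0
    return sum(call in harmful for call in agent) / len(agent)


def step_efficiency(agent: List[str], shortest_len: int) -> float:
    """Shortest valid path length divided by steps used, capped at 1."""
    if not agent:
        return 0.0
    return min(1.0, shortest_len / len(agent))


golden = ["unlock_safety", "move", "scan", "open_valve", "water", "log"]
# Hypothetical agent trace: one redundant scan and one harmful call.
agent = ["unlock_safety", "move", "scan", "scan",
         "open_wrong_valve", "open_valve", "water", "log"]

pc = path_similarity(agent, golden)
hcr = harmful_call_rate(agent, {"open_wrong_valve"})
eff = step_efficiency(agent, len(golden))
print(pc, hcr, eff)  # 0.75 0.125 0.75
```

The trace reaches the right end state, yet all three scores register the two extra steps, which is the kind of graded signal a pass/fail check would hide.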

To illustrate, consider a farm-rover agent tasked with irrigating a plant. PC would check if the correct functions (e.g., unlock_safety, move, scan, open_valve, water, log) were called. PC-KTC would penalize if ‘water’ was called before ‘open_valve’. Prefix Criticality would heavily penalize opening the wrong valve early on. Harmful-Call Rate would count all policy violations. Efficiency would penalize redundant scans or logs, even if the plant was eventually watered correctly.
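The ordering and timing aspects of this example can be sketched as follows. Both functions are assumptions made for illustration: the order score is a plain Kendall's tau over the first occurrence of each shared call (the paper's composite also folds in token-level fidelity, which is omitted here), and the prefix penalty uses linearly decaying position weights, which is only one possible way to make early harmful calls cost more.

```python
from itertools import combinations
from typing import Dict, List, Set


def first_positions(path: List[str]) -> Dict[str, int]:
    """Map each call name to the index of its first occurrence."""
    pos: Dict[str, int] = {}
    for i, call in enumerate(path):
        pos.setdefault(call, i)
    return pos


def kendall_tau_order(agent: List[str], reference: List[str]) -> float:
    """Kendall's tau over calls shared by both paths (1.0 = same order)."""
    pos_a, pos_r = first_positions(agent), first_positions(reference)
    shared = [call for call in pos_r if call in pos_a]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    score = 0
    for x, y in pairs:
        concordant = (pos_a[x] - pos_a[y]) * (pos_r[x] - pos_r[y]) > 0
        score += 1 if concordant else -1
    return score / len(pairs)


def prefix_penalty(agent: List[str], harmful: Set[str]) -> float:
    """Assumed weighting: step i of n gets weight n - i, so early harm costs more."""
    n = len(agent)
    if n == 0:
        return 0.0
    total = n * (n + 1) / 2  # sum of weights n, n-1, ..., 1
    hit = sum(n - i for i, call in enumerate(agent) if call in harmful)
    return hit / total


reference = ["unlock_safety", "move", "scan", "open_valve", "water", "log"]
swapped = ["unlock_safety", "move", "scan", "water", "open_valve", "log"]
tau = kendall_tau_order(swapped, reference)  # 13/15: one discordant pair of 15

harmful = {"open_wrong_valve"}
early = prefix_penalty(["open_wrong_valve", "unlock_safety", "open_valve", "water"], harmful)
late = prefix_penalty(["unlock_safety", "open_valve", "water", "open_wrong_valve"], harmful)
print(round(tau, 3), early, late)
```

Calling `water` before `open_valve` flips exactly one of the fifteen call pairs, so the order score drops below 1, and the same harmful call is penalized four times as heavily at the first step as at the last under this weighting.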

The paper also introduces Harm-Local Refinement (HLR), a technique that expands the set of reference paths beyond just the ‘golden’ ones. HLR generates a small pool of task-consistent candidate references by refining only the agent’s harmful steps, ensuring that localized mistakes don’t lead to spurious penalties while still discouraging unsafe behavior.


Insights from Evaluation

The CORE framework was evaluated across 14 simulated worlds, including scenarios like Farm Rover, Robotic Arm, Navigation, and Smart Home tasks. The results, compared against existing approaches like the Berkeley Function Calling Leaderboard (BFCL), revealed significant performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.

For instance, models like GPT-o4-mini and Qwen3-8B showed strong performance across CORE metrics, indicating better alignment, temporal safety, and efficiency. Conversely, some Qwen2.5 models produced long, noisy traces with many harmful calls and low efficiency, yet BFCL’s end-state checks often reported high success rates. This highlights how final-state evaluations can overestimate quality when execution paths are inefficient or unsafe.

CORE effectively surfaces critical mid-trajectory errors that BFCL often misses, such as skipped preconditions, redundant or unsafe repetitions, and missing necessary intermediate actions. These discrepancies are particularly pronounced in ‘high path-sensitivity’ worlds, like robotic operations or compliance workflows, where the sequence of actions is paramount.

In conclusion, CORE provides a more comprehensive, deployment-oriented evaluation framework for LLM agents. By scoring the full execution path and offering a graded picture of agentic capability, it moves beyond simple pass/fail results to expose failure modes related to safety, efficiency, and reliability. That makes it useful for selecting and deploying the right agent for complex real-world tasks. The full research paper is titled ‘CORE: Full-Path Evaluation of LLM Agents Beyond Final State’.
