TLDR: The CORE framework introduces a new method for evaluating LLM agents by analyzing their entire sequence of actions, not just the final outcome. Using Deterministic Finite Automata (DFAs) and five new metrics (Path Correctness, Path Correctness – Kendall’s tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency), CORE provides a more nuanced assessment of agent behavior. It reveals problems with safety, efficiency, and intermediate correctness that traditional final-state evaluations often miss, especially in complex real-world scenarios.
Large Language Model (LLM) agents are increasingly being deployed to solve real-world tasks by executing sequences of function calls. However, evaluating these agents has largely focused on whether the final outcome is correct, often overlooking crucial aspects like safety, efficiency, and the correctness of intermediate steps. This traditional approach can lead to a misleading assessment of an agent’s true capabilities and suitability for deployment.
Imagine a robotic arm that successfully picks up the correct object but only after colliding with other items, or a scheduling assistant that repeatedly overwrites and deletes events before arriving at the right calendar entry. Under a final-state evaluation, these agents might appear successful, yet their intermediate behaviors are problematic and could be unsafe or inefficient in practice. To address this significant gap, a new evaluation framework called CORE has been introduced.
CORE shifts the focus from just the final outcomes to the entire ‘path’ of execution. It models tasks as Deterministic Finite Automata (DFAs) over tool invocations, where each task prompt defines a set of valid reference paths that encode both correctness and safety constraints. By comparing an agent’s produced action sequence against these references, CORE provides a principled way to assess agent behavior in diverse environments.
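To make the DFA idea concrete, here is a minimal sketch of a task modeled as an automaton over tool calls. The class name `ToolDFA`, the state names, and the transition table are assumptions made for illustration; only the tool names come from the article's farm-rover example, and the paper's actual encoding may differ.

```python
from dataclasses import dataclass


@dataclass
class ToolDFA:
    """A DFA whose alphabet is the set of tool names (hypothetical helper)."""
    start: str
    accepting: set[str]
    # Maps (current_state, tool_name) to the next state.
    transitions: dict[tuple[str, str], str]

    def accepts(self, calls: list[str]) -> bool:
        state = self.start
        for call in calls:
            key = (state, call)
            if key not in self.transitions:
                # No valid transition: the call violates the task's constraints.
                return False
            state = self.transitions[key]
        return state in self.accepting


# Assumed states for an irrigation task; each tool is legal in exactly one state.
irrigation = ToolDFA(
    start="locked",
    accepting={"done"},
    transitions={
        ("locked", "unlock_safety"): "ready",
        ("ready", "move"): "at_plant",
        ("at_plant", "scan"): "scanned",
        ("scanned", "open_valve"): "valve_open",
        ("valve_open", "water"): "watered",
        ("watered", "log"): "done",
    },
)

golden = ["unlock_safety", "move", "scan", "open_valve", "water", "log"]
out_of_order = ["unlock_safety", "move", "scan", "water", "open_valve", "log"]

print(irrigation.accepts(golden))
print(irrigation.accepts(out_of_order))
```

In this sketch a path is a valid reference only if every call has a transition and the run ends in an accepting state, so both ordering constraints and incomplete runs are caught by the same check.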
The CORE Framework and Its Metrics
The CORE framework introduces a suite of five complementary metrics designed to quantify alignment with expected execution patterns:
- Path Correctness (PC): This metric captures how well an agent’s condensed execution path aligns with a canonical ‘golden’ solution. Inspired by Levenshtein distance, it provides a graded notion of correctness, accommodating paths of unequal length and penalizing deviations like unnecessary or incorrect calls.
- Path Correctness – Kendall’s tau Composite (PC-KTC): Beyond just correct actions, this metric assesses whether those actions were performed in the correct order. It integrates token-level fidelity with order-aware agreement, penalizing out-of-order execution.
- Prefix Criticality: This metric evaluates not only if harmful calls occur but also when they occur. It assigns heavier penalties to early harmful calls, recognizing their greater causal impact and potential to propagate errors.
- Harmful-Call Rate: This quantifies how frequently an agent attempts out-of-policy actions among its substantive steps. A high rate indicates that the agent is prone to invalid actions, undermining robustness and trustworthiness.
- Efficiency: This metric measures the economy of agentic behavior, comparing the number of steps an agent used against the shortest valid way to solve the task. It penalizes excessive or wasteful steps, including redundant reads, benign writes, and harmful attempts.
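The first two metrics can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulas: the normalization by the longer path, the use of first occurrences when ranking calls, and the equal-weight blend in `pc_ktc` are all choices made here, and the tool names in the usage lines are illustrative.

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance between two sequences of tool names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def path_correctness(agent: list[str], golden: list[str]) -> float:
    """Assumed normalization: 1 minus edit distance over the longer length."""
    longest = max(len(agent), len(golden))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(agent, golden) / longest


def kendall_tau(agent: list[str], golden: list[str]) -> float:
    """Kendall's tau over calls present in both paths, ranked by first occurrence."""
    pos_agent: dict[str, int] = {}
    for i, call in enumerate(agent):
        pos_agent.setdefault(call, i)
    pos_golden: dict[str, int] = {}
    for i, call in enumerate(golden):
        pos_golden.setdefault(call, i)
    # Dicts preserve insertion order, so `shared` is already in golden order.
    shared = [call for call in pos_golden if call in pos_agent]
    concordant = discordant = 0
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            if pos_agent[shared[i]] < pos_agent[shared[j]]:
                concordant += 1
            else:
                discordant += 1
    total = concordant + discordant
    return 1.0 if total == 0 else (concordant - discordant) / total


def pc_ktc(agent: list[str], golden: list[str], weight: float = 0.5) -> float:
    """Assumed composite: blend PC with tau rescaled from [-1, 1] to [0, 1]."""
    tau01 = (kendall_tau(agent, golden) + 1.0) / 2.0
    return weight * path_correctness(agent, golden) + (1.0 - weight) * tau01


golden = ["unlock_safety", "move", "scan", "open_valve", "water", "log"]
swapped = ["unlock_safety", "move", "scan", "water", "open_valve", "log"]
print(path_correctness(swapped, golden), pc_ktc(swapped, golden))
```

The swapped path contains exactly the right calls, so a purely set-based check would pass it; the edit distance and the single discordant pair are what lower its scores.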
To illustrate, consider a farm-rover agent tasked with irrigating a plant. PC would check if the correct functions (e.g., unlock_safety, move, scan, open_valve, water, log) were called. PC-KTC would penalize if ‘water’ was called before ‘open_valve’. Prefix Criticality would heavily penalize opening the wrong valve early on. Harmful-Call Rate would count all policy violations. Efficiency would penalize redundant scans or logs, even if the plant was eventually watered correctly.
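Prefix Criticality, Harmful-Call Rate, and Efficiency from the rover example can be sketched in a similarly hedged way. The geometric position weights, the `decay` parameter, and the capped ratio used for efficiency are assumptions chosen to show the intended behavior; the paper's definitions may differ. Each trace is represented as a list of booleans marking which substantive steps were harmful.

```python
def harmful_call_rate(harmful: list[bool]) -> float:
    """Fraction of substantive steps that were out-of-policy."""
    if not harmful:
        return 0.0
    return sum(harmful) / len(harmful)


def prefix_criticality(harmful: list[bool], decay: float = 0.8) -> float:
    """Score in [0, 1]; earlier harmful steps carry larger weights (assumed scheme)."""
    weights = [decay ** i for i in range(len(harmful))]
    total = sum(weights)
    if total == 0:
        return 1.0
    penalty = sum(w for w, bad in zip(weights, harmful) if bad)
    return 1.0 - penalty / total


def efficiency(agent_steps: int, shortest_steps: int) -> float:
    """Ratio of the shortest valid path length to the steps actually used, capped at 1."""
    if agent_steps <= 0:
        return 1.0 if shortest_steps == 0 else 0.0
    return min(1.0, shortest_steps / agent_steps)


# Same number of harmful calls, different timing.
early_harm = [True, False, False, False]
late_harm = [False, False, False, True]

print(harmful_call_rate(early_harm), harmful_call_rate(late_harm))
print(prefix_criticality(early_harm), prefix_criticality(late_harm))
print(efficiency(agent_steps=9, shortest_steps=6))
```

Both traces have the same Harmful-Call Rate, but the early violation scores lower on Prefix Criticality, which is the distinction the metric is meant to capture.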
The paper also introduces Harm-Local Refinement (HLR), a technique that expands the set of reference paths beyond just the ‘golden’ ones. HLR generates a small pool of task-consistent candidate references by refining only the agent’s harmful steps, ensuring that localized mistakes don’t lead to spurious penalties while still discouraging unsafe behavior.
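One way to picture this refinement is sketched below. The article does not spell out the procedure, so everything here is an assumption: the function name `harm_local_candidates`, the choice of either dropping or substituting each harmful step, the `is_valid` callback (which could be a DFA acceptance check), and the pool size limit are all hypothetical.

```python
from itertools import product
from typing import Callable


def harm_local_candidates(
    path: list[str],
    harmful: list[bool],
    alternatives: list[str],
    is_valid: Callable[[list[str]], bool],
    limit: int = 8,
) -> list[list[str]]:
    """Build candidate references by editing only the steps flagged as harmful."""
    options: list[list[str | None]] = []
    for call, bad in zip(path, harmful):
        if bad:
            # A harmful step may be dropped (None) or swapped for an alternative.
            options.append([None, *alternatives])
        else:
            # Non-harmful steps are kept exactly as the agent produced them.
            options.append([call])
    pool: list[list[str]] = []
    for combo in product(*options):
        candidate = [c for c in combo if c is not None]
        if is_valid(candidate) and candidate not in pool:
            pool.append(candidate)
        if len(pool) >= limit:
            break
    return pool


golden = ["unlock_safety", "move", "scan", "open_valve", "water", "log"]
# Hypothetical trace with one harmful call ("open_wrong_valve") at index 3.
trace = ["unlock_safety", "move", "scan", "open_wrong_valve", "open_valve", "water", "log"]
flags = [False, False, False, True, False, False, False]

pool = harm_local_candidates(trace, flags, ["scan"], lambda p: p == golden)
print(pool)
```

Because only flagged steps are varied, the candidates stay close to what the agent actually did, while the validity check keeps any unsafe variant out of the reference pool.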
Insights from Evaluation
The CORE framework was evaluated across 14 simulated worlds, including scenarios like Farm Rover, Robotic Arm, Navigation, and Smart Home tasks. The results, compared against existing approaches like the Berkeley Function Calling Leaderboard (BFCL), revealed significant performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.
For instance, models like GPT-o4-mini and Qwen3-8B showed strong performance across CORE metrics, indicating better alignment, temporal safety, and efficiency. Conversely, some Qwen2.5 models produced long, noisy traces with many harmful calls and low efficiency, yet BFCL’s end-state checks often reported high success rates. This highlights how final-state evaluations can overestimate quality when execution paths are inefficient or unsafe.
CORE effectively surfaces critical mid-trajectory errors that BFCL often misses, such as skipped preconditions, redundant or unsafe repetitions, and missing necessary intermediate actions. These discrepancies are particularly pronounced in ‘high path-sensitivity’ worlds, like robotic operations or compliance workflows, where the sequence of actions is paramount.
In conclusion, CORE provides a more comprehensive, deployment-oriented evaluation framework for LLM agents. By focusing on the full execution path and offering a graded picture of agentic capability, it moves beyond simple pass/fail results to expose nuanced failure modes related to safety, efficiency, and reliability. That graded picture can inform which agent to select and deploy for complex real-world tasks. The full research paper is titled ‘CORE: Full-Path Evaluation of LLM Agents Beyond Final State’.


