Unpacking AI Evaluation: A New Approach with Measurement Trees

TLDR: This research introduces ‘measurement trees,’ a novel class of metrics for evaluating AI systems that moves beyond single scores. A measurement tree is a hierarchical, multi-level representation of AI performance that integrates diverse evidence such as agentic, business, energy-efficiency, sociotechnical, and security signals. The approach aims to enhance transparency, ease the integration of heterogeneous data, and provide a more interpretable foundation for complex AI evaluations, as demonstrated through the Contextual Robustness Index (CoRIx) use case.

A new research paper introduces an innovative method for evaluating artificial intelligence (AI) systems, moving beyond traditional single-value metrics to offer a more comprehensive and transparent assessment. Titled “Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees,” the paper proposes “measurement trees” as a novel class of metrics designed to combine various aspects of AI performance into an interpretable, multi-level representation.

Unlike conventional metrics that often provide a single score, vector, or category, measurement trees generate a hierarchical directed graph. In this structure, each node summarizes its children using user-defined aggregation methods. This approach is a direct response to recent calls for expanding the scope of AI system evaluation, aiming to enhance metric transparency and facilitate the integration of diverse evidence, including signals related to agentic behavior, business impact, energy efficiency, sociotechnical factors, and security.
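The idea of nodes that summarize their children can be sketched in a few lines of Python. This is a minimal illustration, not the paper's accompanying code: the `Node` class, its field names, the default of `mean`, and all leaf values are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List, Optional, Sequence


@dataclass
class Node:
    """One node of a hypothetical measurement tree (all names are illustrative)."""
    name: str
    value: Optional[float] = None                          # set directly on leaves
    children: List["Node"] = field(default_factory=list)
    summarize: Callable[[Sequence[float]], float] = mean   # user-defined aggregation

    def evaluate(self) -> float:
        """Return the leaf value, or the summary of the children's values."""
        if not self.children:
            if self.value is None:
                raise ValueError(f"leaf {self.name!r} has no value")
            return self.value
        self.value = self.summarize([c.evaluate() for c in self.children])
        return self.value


# Made-up signals: two security leaves and two energy leaves.
tree = Node("overall", summarize=max, children=[
    Node("security", children=[Node("s1", 2.0), Node("s2", 4.0)]),
    Node("energy", children=[Node("e1", 1.0), Node("e2", 2.0)]),
])
root_score = tree.evaluate()
print(root_score)  # 3.0: max of the two branch means (3.0 and 1.5)
```

After evaluation, every inner node keeps its own summary in `value`, so the intermediate results stay available for inspection instead of being collapsed into a single number.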

The authors, Craig Greenberg, Patrick Hall, Theodore Jensen, Kristen Greene, and Razvan Amironesei from the National Institute of Standards and Technology (NIST), highlight the increasing deployment of complex AI systems in real-world environments. This widespread usage necessitates a deeper understanding of their performance characteristics beyond controlled test suites and simulated data. Measurement trees address this need by mapping input data to these tree structures, where each node provides a summary of its descendants, offering a more detailed and nuanced view of AI performance.

The paper outlines several key contributions: the formal proposal and definition of measurement trees, an exploration of their properties for measuring constructs and subconstructs, practical demonstrations through a large-scale measurement exercise, and the provision of accompanying open-source Python code. By operationalizing a transparent approach to measuring complex constructs, this work aims to provide a principled foundation for broader and more interpretable AI evaluation.

One of the significant advantages of measurement trees is their inherent transparency. They can be interpreted as computation graphs, which means the connection between raw data, the constructs being measured, and the final metric value is explicitly built into the tree. This clarity helps mitigate misrepresentation or misinterpretation. Domain experts can directly encode their knowledge into the measurement representation by carefully selecting tree structures, constructs, and summary functions. This flexibility allows for the integration of various informative signals, such as user feedback, red teaming results, security metrics, business performance indicators, and energy efficiency metrics.
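To make the computation-graph reading concrete, the sketch below evaluates a small tree and records, for every node, which summary function was applied to which child values. The tuple layout, the `trace` helper, and the numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical tree: inner nodes are (name, summary_function, children);
# leaves are (name, raw_value). All values are made up.
tree = ("overall risk", mean, [
    ("red teaming", max, [("finding 1", 3.0), ("finding 2", 7.0)]),
    ("user feedback", mean, [("response 1", 2.0), ("response 2", 4.0)]),
])


def trace(node, depth=0):
    """Return (value, report_lines); the report shows how each value was computed."""
    indent = "  " * depth
    if len(node) == 2:                       # leaf: (name, raw_value)
        name, value = node
        return value, [f"{indent}{name} = {value}"]
    name, summarize, children = node
    child_values, child_lines = [], []
    for child in children:
        value, lines = trace(child, depth + 1)
        child_values.append(value)
        child_lines.extend(lines)
    value = summarize(child_values)
    header = f"{indent}{name} = {summarize.__name__}({child_values}) = {value}"
    return value, [header] + child_lines


score, report = trace(tree)
print("\n".join(report))
```

The first line of the report reads `overall risk = mean([7.0, 3.0]) = 5.0`, and the indented lines beneath it show the same detail for each branch down to the raw data.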

Furthermore, measurement trees offer a potential solution to challenges like data and task contamination (where models are inadvertently exposed to evaluation data during training) and gamification (where optimizing for a metric distorts its intended purpose). By promoting transparency and enabling direct assessment of real-world phenomena, measurement trees can help reduce these risks.

However, the paper also acknowledges practical limitations. As a new concept, measurement trees will require time and resources before they are widely adopted. Constructing them demands deep domain expertise, and the choice of structure and summarization functions can significantly influence measurement outcomes. Until methods for capturing measurement uncertainty are developed, the trees are better suited to characterizing individual AI models than to comparing or ranking them directly. Implementing them may also involve substantial upfront investment in human subjects studies and other sociotechnical evaluations.
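A small example shows how much the choice of summarization function can matter. The item scores below are invented, and the three candidate functions are simply common choices, not ones the paper prescribes.

```python
from statistics import mean, median

# Made-up leaf values on a 0-10 risk scale: mostly low, with one severe outlier.
item_scores = [1.0, 1.0, 2.0, 9.0]

# Summarize the same data three ways.
summaries = {fn.__name__: fn(item_scores) for fn in (mean, median, max)}
print(summaries)  # {'mean': 3.25, 'median': 1.5, 'max': 9.0}
```

The same four values yield a node score anywhere from 1.5 to 9.0, which is why the summary function at each node is a design decision that calls for domain judgment.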

The Contextual Robustness Index (CoRIx) Use Case

To demonstrate the practical utility of measurement trees, the paper presents initial findings from the Contextual Robustness Index (CoRIx). CoRIx is a measurement instrument that uses measurement trees to integrate diverse evidence from benchmarking, red teaming (adversarial testing by humans), and field testing (human interactions with AI systems with subsequent surveys). It is designed to capture how AI systems perform across various operating domains and stakeholder perspectives, ultimately producing a validity and reliability risk score.

In the pilot version of CoRIx, input signals are derived from user perceptions and expert annotations of Large Language Model (LLM) output, collected through structured questionnaires and annotation protocols. The CoRIx trees are structured across five levels, progressing from high-level constructs to individual data points:

Level 1: Represents overall validity and reliability risks.
Level 2: Breaks down risks by testing level (model testing, red teaming, field testing).
Level 3: Aggregates responses for annotator labels or user feedback.
Level 4: Summarizes results for individual questionnaire or annotation items (e.g., guardrail violations, out-of-date information).
Level 5: Consists of individual questionnaire responses and annotation labels.
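The five levels above can be mirrored with nested Python containers. In this sketch the item names come from the examples in the list, but the scores are invented and the use of a plain mean at every level is an assumption; the article does not state which summary functions CoRIx actually uses.

```python
from statistics import mean

# Level 1 is the outer dict, Level 2 its keys' values (testing levels),
# Level 3 the label/feedback groups, Level 4 the items, and Level 5 the
# individual made-up responses on a 0-10 risk scale.
corix_like = {
    "model testing": {
        "annotator labels": {
            "guardrail violations": [2.0, 4.0],
            "out-of-date information": [6.0, 8.0],
        },
    },
    "red teaming": {
        "annotator labels": {
            "guardrail violations": [6.0, 8.0],
        },
    },
    "field testing": {
        "user feedback": {
            "guardrail violations": [1.0, 3.0],
            "out-of-date information": [3.0, 5.0],
        },
    },
}


def summarize(node):
    """Summarize a node with a plain mean (an assumption for this sketch)."""
    if isinstance(node, list):               # Level 4 item over Level 5 responses
        return mean(node)
    return mean(summarize(child) for child in node.values())


def depth(node):
    """Count levels: a list covers an item node plus its leaf responses."""
    if isinstance(node, list):
        return 2
    return 1 + max(depth(child) for child in node.values())


level_2 = {name: summarize(sub) for name, sub in corix_like.items()}
overall = summarize(corix_like)
print(level_2)   # {'model testing': 5.0, 'red teaming': 7.0, 'field testing': 3.0}
print(overall)   # 5.0
```

Reading the Level 2 scores alongside the overall score shows which testing level contributes most to the Level 1 value, which is the kind of drill-down the tree structure is meant to support.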

The paper provides pilot CoRIx results for three distinct LLM-task combinations:

Model–Task A: Proprietary LLM—Travel Planning
This combination yielded a low CoRIx risk score of 2.88 out of 10, indicating low validity and reliability risks. However, the tree highlighted specific areas for improvement, such as the need for better guardrails, more natural dialogue, and more focused system responses.

Model–Task B: Open Source LLM—TV Summarization with Spoiler Guardrails
This model-task combination resulted in a moderate CoRIx score of 4.29 out of 10, suggesting potential for moderate validity and reliability risk. The analysis indicated that improvements could be made in natural dialogue, the currency and relevance of information, the robustness of guardrails, and the overall user experience.

Model–Task C: Fine-tuned Open Source LLM—Meal Planning
This combination showed a Level 1 risk score of 6.30 out of 10, the highest of the three, suggesting moderate validity and reliability risks. A notable finding was a gap between model testing annotations and real-world user perceptions. The annotations pointed to potential issues with basic functionality, response quality, guardrail violations, out-of-date or irrelevant information, and unnatural dialogue, especially in single-turn automated prompting scenarios.

Overall, CoRIx helps to pinpoint specific sources of performance variation, enabling transparent communication of benefits, risks, and trade-offs, and informing ongoing system refinement. Future work on measurement trees includes representing uncertainty, advancing their mathematical foundations, and developing more sophisticated summarization functions.

For more detail, see the full research paper, “Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees.”
