TLDR: A new research paper reviews 65 user studies on Explainable AI (XAI) systems, revealing a critical gap: current evaluations are often too technical and don’t adequately focus on human user needs. The study proposes a human-centered framework for XAI design and evaluation, categorizing metrics into core system (affection, cognition, usability, interpretability), explanation, and user levels. It expands design goals for AI novices (responsible use, acceptance, user experience) and data experts (human-AI collaboration, task performance), and offers guidelines for improving evaluation practices through validation, holistic assessment, and clear methodology. The aim is to ensure AI systems are not just performant but also understandable and trustworthy for all users.
As Artificial Intelligence (AI) becomes an increasingly common part of our daily lives, there’s a growing need for these intelligent systems to be not only powerful but also easy to understand. This is where Explainable AI (XAI) comes in, aiming to provide clear explanations for how AI makes its decisions and predictions. However, a recent comprehensive review highlights that the way we currently evaluate these XAI systems often focuses too much on technical aspects and not enough on the actual needs of the people using them.
A new research paper, titled On the Design and Evaluation of Human-Centered Explainable AI Systems: A Systematic Review and Taxonomy, by Aline Mangold, Juliane Zietz, Susanne Weinhold, and Sebastian Pannasch, delves into this issue. The authors conducted a thorough review of 65 user studies that evaluated XAI systems across various fields and applications. Their goal was to create a holistic guide for XAI developers, offering an overview of XAI system properties and evaluation metrics that are truly human-centered.
Understanding XAI Systems and Their Explanations
The review found that most XAI systems are designed for analysis, helping users understand data in different domains, from assessing social media content to image classification. These systems often provide explanations that are graphical (like charts and graphs) or textual (written or spoken). Many explanations focus on specific outputs, known as ‘local’ explanations, and frequently use ‘feature-based’ approaches, which highlight the most important factors influencing an AI’s prediction.
A key finding from the paper is the distinction between the ‘core system’ (the AI itself) and the ‘XAI explanation’ (how the AI’s decisions are communicated). Both are crucial for a complete understanding. Explanations can also be interactive, allowing users to ask for more details, or static, simply providing information without further interaction.
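To make the idea of a 'local', 'feature-based' explanation concrete, here is a minimal sketch for the simplest possible case, a linear model, where each feature's contribution to one prediction is just its weight times its value. The feature names, weights, and example inputs are illustrative and not taken from the paper.

```python
# A minimal sketch of a local, feature-based explanation for a linear model.
# All names and numbers below are illustrative, not from the reviewed studies.

def local_feature_attribution(weights, instance, feature_names):
    """Return each feature's contribution to one prediction (weight * value),
    ranked by absolute size -- the 'most important factors' for this output."""
    contributions = {
        name: w * x for name, w, x in zip(feature_names, weights, instance)
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Example: a toy credit-scoring model with three features.
weights = [0.8, -0.5, 0.1]
instance = [2.0, 1.0, 4.0]
names = ["income", "debt", "age"]
print(local_feature_attribution(weights, instance, names))
# [('income', 1.6), ('debt', -0.5), ('age', 0.4)]
```

Real XAI toolkits apply the same idea to non-linear models with more elaborate attribution methods, but the output shape is the same: a per-feature breakdown of a single prediction.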
Human-Centered Evaluation: Beyond Technical Metrics
The researchers categorized human-centered evaluation metrics into four main areas for the core system: Affection, Cognition, Usability, and Interpretability. They also looked at metrics specifically for explanations and user characteristics.
- Affection: This refers to users’ emotional engagement and feelings towards the system. ‘Trust’ was the most frequently measured metric here, as it’s vital for users to rely appropriately on AI. The paper points out a lack of validated questionnaires for measuring trust, suggesting the use of established scales like the Trust Scale for Explainable AI (TXAI).
- Cognition: This covers the mental processes users employ to understand the system. ‘Understandability’ was a key metric here, capturing how easy users find the system to understand.

- Usability: This assesses how effectively and intuitively users can interact with the system. ‘Usefulness’ and ‘intention to use’ were highlighted as important indicators of whether users will adopt the system.
- Interpretability: The paper redefined ‘transparency’ to focus on users’ perceived clarity of how specific features influence AI decisions, rather than just technical details.
Surprisingly, fewer than half of the reviewed papers evaluated the XAI explanation itself, focusing instead on the core system. The authors emphasize that ‘explanation usefulness’ and ‘explanation satisfaction’ are crucial metrics that deserve more attention.
User characteristics, such as ‘domain expertise’ (familiarity with the task or field the AI addresses), were also frequently assessed, as they significantly impact how users interact with and perceive XAI systems.
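The three evaluation levels described above can be summarized as a small lookup structure. The category names follow the article; the nesting and the helper function are an illustrative arrangement, not an API from the paper.

```python
# A compact sketch of the review's metric taxonomy as a Python mapping.
# Category names follow the article; the nesting itself is illustrative.

XAI_EVALUATION_TAXONOMY = {
    "core_system": {
        "affection": ["trust"],
        "cognition": ["understandability"],
        "usability": ["usefulness", "intention to use"],
        "interpretability": ["perceived transparency"],
    },
    "explanation": ["explanation usefulness", "explanation satisfaction"],
    "user": ["domain expertise"],
}

def metrics_for(level, dimension=None):
    """Look up the metrics recorded at a given evaluation level
    (and, for the core system, at a given dimension)."""
    entry = XAI_EVALUATION_TAXONOMY[level]
    return entry[dimension] if dimension else entry

print(metrics_for("core_system", "affection"))  # ['trust']
print(metrics_for("user"))                      # ['domain expertise']
```

Organizing the metrics this way makes the paper's main critique easy to see: a study that only samples from the "core_system" branch has left the "explanation" and "user" levels unmeasured.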
Extended Design Goals for Different User Groups
The paper extends existing design goals for two main user groups:
- AI Novices: For everyday users with little AI experience, the design goals include promoting ‘responsible use’ (understanding AI capabilities and risks), improving ‘acceptance’ (showing how the system aligns with user goals), and enhancing ‘user experience’ (making the system intuitive and easy to navigate).
- Data Experts: For professionals who use AI for analysis and decision-making, the goals are more performance-oriented. These include enhancing ‘human-AI collaboration’ (a bidirectional flow of information where humans and AI learn from each other) and improving ‘system and user task performance’ (equipping users with knowledge for optimal system use).
Guidelines for Better XAI Evaluation
Based on their findings, the authors provide several guidelines:
- Validation and Standardization: Use validated questionnaires and standardize user testing procedures to ensure accuracy and comparability of results.
- Presentation of Study Methodology: Clearly describe methods and constructs in research papers to enhance transparency and reproducibility.
- Holistic Evaluation: Consider all dimensions—affection, cognition, usability, interpretability, explanation, user characteristics, and user interaction behavior—for a comprehensive assessment.
- Evaluation of Explanations: Always integrate measures related to explanations to avoid incomplete evaluations.
- Consideration of Behavioral Intentions: Measure ‘intention to use’ in applied XAI studies, as it’s a direct predictor of actual usage behavior.
This research offers a valuable framework for developing and evaluating XAI systems from a human-centered perspective, ensuring that AI is not only intelligent but also understandable and trustworthy for all users.


