TLDR: This research paper, “Measuring What Matters: Connecting AI Ethics Evaluations to System Attributes, Hazards, and Harms,” critically examines existing AI ethics evaluation practices. It finds that current measures are fragmented, often lack validity, and focus on isolated system components rather than the AI system as a whole. The authors propose a system safety engineering framework to link measures to observable system attributes, potential hazards, and real-world harms. Their analysis of nearly 800 measures reveals a disproportionate focus on fairness, transparency, privacy, and trust, primarily assessing models and outputs, and an uneven representation of harm types. The paper concludes by emphasizing the need for a more holistic, systems-level approach to AI ethics evaluation to better identify, prevent, and respond to sociotechnical harms.
As Artificial Intelligence (AI) systems become increasingly integrated into our daily lives, especially in critical areas, the question of whether these systems adhere to ethical principles and how we evaluate their potential for harm has become paramount. A recent research paper titled “Measuring What Matters: Connecting AI Ethics Evaluations to System Attributes, Hazards, and Harms” delves into the current state of AI ethics evaluations, revealing significant gaps and proposing a more holistic approach.
The authors, Shalaleh Rismani, Renee Shelby, Leah Davis, Negar Rostamzadeh, and AJung Moon, highlight that while a vast ecosystem of measures has emerged to assess the social and ethical implications of AI, these measures are often developed and used in fragmented ways. They frequently fail to adequately consider how they fit within the broader AI system, leading to an incomplete understanding of potential harms.
Current Evaluation Shortcomings
The paper identifies two primary limitations in current AI ethics evaluation practices. Firstly, many existing measures lack what is known as ‘construct validity’ (whether a measure accurately captures the ethical quality it is meant to assess) and ‘reliability’ (whether it does so consistently). Measures that fall short on either count can produce misleading insights about an AI system’s behavior and performance.
Secondly, most measures focus on isolated components of an AI system, such as the model or the dataset, rather than evaluating the system as a whole or considering how different components interact. This is a critical oversight, as harms often emerge from the complex interactions between technical, human, and organizational elements within a system, even if each individual component appears to function correctly.
A System Safety Perspective
To address these shortcomings, the researchers advocate for a system safety engineering framework. This approach views AI systems as interconnected networks of components and actors. Within this framework, a ‘measure’ quantifies an ‘attribute’ (an observable property of a system component). When an attribute deviates from expected parameters, it signals a ‘hazard’ – a condition that creates the potential for a ‘harm’. Harm is defined as the adverse real-world experiences resulting from a system’s deployment and operation.
The paper emphasizes that measures should act as feedback mechanisms, signaling when system elements are not functioning within acceptable bounds, allowing for timely interventions to prevent harm.
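To make the measure, attribute, hazard, and harm chain concrete, here is a minimal Python sketch. It is an illustration of the vocabulary described above, not the authors' implementation: the class names, the `check` function, the example measure, and the acceptable bounds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measure:
    """A measure quantifies one observable attribute of a system component."""
    name: str
    component: str    # e.g. "model", "output", "data/input"
    attribute: str    # the observable property being quantified
    lower: float      # hypothetical acceptable bounds, not taken from the paper
    upper: float
    linked_harm: str  # harm type this attribute is assumed to relate to


@dataclass(frozen=True)
class Hazard:
    """A condition that creates the potential for harm."""
    measure: str
    value: float
    potential_harm: str


def check(measure: Measure, value: float) -> Optional[Hazard]:
    """Return a Hazard when the value falls outside the bounds, else None."""
    if measure.lower <= value <= measure.upper:
        return None
    return Hazard(measure.name, value, measure.linked_harm)


# Hypothetical example measure; the 0.05 bound is an arbitrary illustration.
approval_gap = Measure(
    name="approval_rate_gap",
    component="output",
    attribute="difference in approval rates between two groups",
    lower=0.0,
    upper=0.05,
    linked_harm="allocative",
)

within_bounds = check(approval_gap, 0.02)  # inside the bounds: no hazard
out_of_bounds = check(approval_gap, 0.12)  # outside the bounds: hazard signalled
print(within_bounds)
print(out_of_bounds)
```

The point of the sketch is the explicit link: every measure names the component it observes and the harm it is assumed to relate to, so an out-of-bounds reading can be traced to a reason for intervening.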
Key Findings from the Review
The authors conducted a comprehensive review of nearly 800 measures from 257 academic articles. Their analysis revealed several key patterns:
- Uneven Focus on Principles: A disproportionate number of measures (around 90%) concentrate on just four AI ethics principles: fairness, transparency, privacy, and trust. Principles like dignity, responsibility, and sustainability are significantly underrepresented.
- Component-Level Assessment: The majority of measures assess either the model or the output of an AI system, with far fewer looking at data/input or user-output interactions. This reinforces the problem of fragmented evaluation.
- Varied Harm Representation: While measures address all five types of sociotechnical harm (representational, allocative, quality of service, interpersonal, and social system harms), their distribution is uneven. For example, fairness measures often link to representational, allocative, or quality of service harms, while privacy measures are almost exclusively tied to interpersonal harms.
The paper details each of these five harm types:
- Representational Harm: Occurs when AI systems reinforce the subordination of social groups, often through imbalanced data representation or stereotypical associations in models.
- Allocative Harm: Happens when a system’s distribution of resources or opportunities adversely affects marginalized groups, such as disparate loan approvals or unequal error rates.
- Quality of Service Harm: Refers to AI systems disproportionately underperforming for certain groups, leading to less useful or satisfactory user experiences.
- Interpersonal Harm: Involves AI systems adversely shaping relations between people or communities, often through privacy violations, loss of agency, or diminished well-being.
- Social System Harm: Reflects macro-level adverse effects, such as systematizing bias, inequality, or excessive resource consumption (e.g., energy and carbon emissions).
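As a rough illustration of what a component-level measure tied to allocative harm can look like, the sketch below computes per-group approval rates and the gap between them for a small set of decisions. The function names and the data are made up for this example, and the approval-rate gap is just one commonly used fairness-style quantity, not a measure singled out by the paper.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Compute the approval rate per group.

    decisions: iterable of (group, approved) pairs, where approved is a bool.
    """
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Largest difference in approval rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Made-up loan decisions for two hypothetical groups.
decisions = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

rates = approval_rates(decisions)
gap = max_rate_gap(rates)
print(rates, gap)
```

A number like this describes the system's output only. Whether it reflects an actual allocative harm depends on context the calculation does not see, which is the fragmentation problem the paper describes.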
Challenges and Future Directions
The research highlights several critical challenges. Measures are often fragmented, taken far from where actual harm is experienced, and frequently lack clear criteria or thresholds for identifying hazards. Furthermore, most measures are taken at a single point in time, overlooking how harms can accumulate gradually. The role of perception-based measures (self-reported user experiences) also remains unclear, as they are subjective but crucial for understanding user well-being.
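The gap between point-in-time checks and gradual accumulation can be sketched in a few lines of Python. In this hypothetical example every individual reading stays under a per-reading threshold, yet a running total of the excess over a baseline eventually crosses a separate budget. The readings, the baseline, both thresholds, and the accumulation rule are assumptions chosen for illustration; the paper does not prescribe this method.

```python
def cumulative_drift(readings, baseline, slack=0.0):
    """One-sided running sum of how far readings exceed baseline + slack.

    The running total never drops below zero, so readings under the
    baseline reduce accumulated drift but cannot build up "credit".
    """
    total = 0.0
    trace = []
    for reading in readings:
        total = max(0.0, total + (reading - baseline - slack))
        trace.append(total)
    return trace


# Hypothetical periodic readings of some measure, e.g. a rate gap.
readings = [0.02, 0.03, 0.04, 0.04, 0.04]
POINT_THRESHOLD = 0.05  # assumed per-reading limit
DRIFT_BUDGET = 0.05     # assumed limit on accumulated excess

point_alerts = [r for r in readings if r > POINT_THRESHOLD]
trace = cumulative_drift(readings, baseline=0.02)
drift_alert = trace[-1] > DRIFT_BUDGET

print(point_alerts)  # no single reading breaches the per-reading limit
print(drift_alert)   # the accumulated excess does breach the budget
```

A single-snapshot evaluation would pass each of these readings, while the accumulated view flags the trend. Choosing the baseline and budget is itself a judgment call, which echoes the paper's point about missing criteria and thresholds.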
This work underscores the urgent need for the AI community to adopt more temporally responsive, systems-level approaches to evaluation. By explicitly connecting what is being measured to why it matters and how it relates to potential harms, evaluation practices can become more traceable and accountable. To support this effort, the authors provide a dataset and interactive visualization alongside the research paper.
Ultimately, the paper calls for a shift in how we approach AI ethics evaluations, moving beyond isolated checks to a comprehensive, integrated, and system-aware understanding of how harms emerge across complex sociotechnical systems.


