TLDR: A new research paper by Brett Reynolds proposes a novel framework for evaluating Artificial General Intelligence (AGI), moving beyond traditional snapshot scores and equal weighting of abilities. The paper introduces the concept of AGI as a ‘homeostatic property cluster,’ where intelligence is defined by a set of abilities and the mechanisms that maintain them under perturbation. It advocates for ‘centrality-prior scores’ that weight cognitive domains by their causal importance and a ‘Cluster Stability Index’ (CSI) family to measure profile persistence, durable learning, and error correction. This approach aims to distinguish genuine, robust AGI capabilities from brittle, scaffold-dependent performances, offering a more nuanced and harder-to-game evaluation system with significant implications for AGI governance.
Evaluating Artificial General Intelligence (AGI) is a complex challenge, and current methods, while advanced, still face significant limitations. A new research paper, titled “From Checklists to Clusters: A Homeostatic Account of AGI Evaluation,” by Brett Reynolds, proposes a fresh perspective that could revolutionize how we assess the true capabilities and robustness of AGI systems. The paper argues that AGI should be understood not just as a collection of abilities, but as a stabilized cluster of properties, much like how human intelligence functions.
Current AGI evaluations often treat all cognitive domains as equally important and rely on single-moment, or ‘snapshot,’ performance scores. This approach, while a step up from narrow benchmarks, creates two major issues. First, the ‘Equal-Weighting Problem’ assumes that all abilities, from processing speed to long-term memory, contribute symmetrically to general intelligence. However, human intelligence research suggests that some abilities are more foundational than others. For instance, a system with zero long-term memory might be fast, but can it truly be called generally intelligent if it can’t retain new information across sessions?
The Snapshot Problem: Beyond Temporary Performance
The second issue is the ‘Snapshot Problem.’ Imagine an AI scores high on a reasoning task, but when retested after a short delay without continuous context, its performance plummets. Was the initial high score a genuine capability, or merely a temporary performance boost from cached information or external scaffolds? This paper highlights that brittle, cached, or contorted capabilities can inflate AGI scores without demonstrating true general intelligence. Examples include models relying on massive context windows to ‘remember’ preferences or RAG (Retrieval Augmented Generation) systems that perform well only when the database is enabled, failing catastrophically without it.
Intelligence as a Homeostatic Property Cluster
To address these problems, Reynolds introduces the concept of a ‘homeostatic property cluster’ (HPC) to AGI evaluation. An HPC is a set of properties that tend to co-occur and are kept co-present by causal mechanisms. Biological species are a good example: they share properties because of mechanisms like reproduction and shared ancestry. Human intelligence fits this pattern, where abilities like reasoning, memory, and perception cluster within a person, maintained by neural development, synaptic consolidation, and metacognitive strategies.
If AGI is an HPC, then evaluation needs to test two things: the abilities themselves (can it reason, remember, perceive?) and the stability mechanisms (do those abilities persist under stress? Can the system compensate when one ability is degraded?). Current evaluations largely focus only on the first part.
Causal Centrality: Why Some Abilities Matter More
The paper argues that not all abilities contribute equally to the integrity of this intelligence cluster. Some abilities are ‘upstream’ (e.g., long-term storage enables cumulative learning), while others are ‘downstream’ (e.g., processing speed helps but isn’t constitutive). To reflect this, the paper proposes a ‘centrality-prior score’ that weights domains based on their causal importance, drawing insights from Cattell–Horn–Carroll (CHC) theory of human cognitive abilities. This means abilities like reasoning, comprehension-knowledge, and working memory would receive higher weights than, say, processing speed, reflecting their greater influence on overall intelligence.
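To make the idea concrete, here is a minimal sketch of how a centrality-prior score could differ from equal weighting. The domain names and weight values are illustrative assumptions, not the paper's actual figures:

```python
# Hypothetical CHC-inspired centrality weights: upstream abilities
# (reasoning, knowledge, memory) weigh more than downstream ones (speed).
CENTRALITY_WEIGHTS = {
    "fluid_reasoning": 0.25,
    "comprehension_knowledge": 0.20,
    "working_memory": 0.20,
    "long_term_storage": 0.20,
    "visual_processing": 0.10,
    "processing_speed": 0.05,
}

def centrality_prior_score(domain_scores: dict[str, float]) -> float:
    """Average domain scores, weighted by assumed causal centrality."""
    total = sum(CENTRALITY_WEIGHTS.values())
    return sum(CENTRALITY_WEIGHTS[d] * s for d, s in domain_scores.items()) / total

def equal_weight_score(domain_scores: dict[str, float]) -> float:
    """The baseline the paper critiques: every domain counts equally."""
    return sum(domain_scores.values()) / len(domain_scores)
```

Under this weighting, a system that is fast but cannot retain information scores noticeably lower on the centrality-prior metric than on the equal-weight one, which is exactly the corrective the paper is after.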
Measuring Stability: The Cluster Stability Index (CSI)
To tackle the snapshot problem, the paper introduces a family of ‘Cluster Stability Indices’ (CSI) that measure different aspects of persistence:
- Profile Stability Index (pCSI): Measures if the overall pattern of strengths and weaknesses across domains remains consistent even when testing conditions are perturbed (e.g., delays, scaffold removal, distribution shifts).
- Durable Learning Index (dCSI): Assesses whether newly taught information (facts, procedures, concepts) is retained across sessions and delays without external support.
- Error-Decay Index (eCSI): Evaluates the system’s ability to self-correct and reduce mistakes through iteration when given feedback or multiple attempts.
These three indices are combined using a geometric mean to create a single, comprehensive CSI. This multiplicative aggregation penalizes weaknesses in any single stability component more heavily, reflecting that all forms of stability are crucial for robust general intelligence.
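The geometric-mean aggregation described above can be sketched in a few lines. The assumption here is that each index is normalized to [0, 1]; the paper's exact normalization may differ:

```python
import math

def cluster_stability_index(p_csi: float, d_csi: float, e_csi: float) -> float:
    """Combine profile stability, durable learning, and error decay.

    A geometric mean drags the composite down sharply when any one
    component is weak, unlike an arithmetic mean.
    """
    return (p_csi * d_csi * e_csi) ** (1 / 3)
```

For example, a system scoring 0.9, 0.9, and 0.1 on the three indices gets a composite well below the arithmetic mean of roughly 0.63, reflecting that one brittle stability component undermines the whole cluster.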
Contortion vs. Compensation: The Role of Scaffolds
A critical distinction made in the paper is between ‘contorted’ and ‘compensatory’ scaffolds. A contorted scaffold (like an overly large context window that merely caches information) inflates snapshot scores but leads to catastrophic failure when removed. A compensatory scaffold (like a well-integrated RAG system that enhances existing retrieval strategies) genuinely extends capabilities, degrading gracefully rather than collapsing when the scaffold is weakened or removed. The framework proposes operational tests to distinguish between these, ensuring that AGI evaluations measure the system’s inherent capabilities, not just its reliance on brittle external supports.
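One way such an operational test might look: compare performance with the scaffold enabled, degraded, and removed, and classify the failure mode. The classification logic and thresholds below are illustrative assumptions, not the paper's protocol:

```python
def classify_scaffold(score_with: float, score_degraded: float,
                      score_without: float, collapse_ratio: float = 0.5) -> str:
    """Classify a scaffold by how performance changes as it is withdrawn.

    A contorted scaffold collapses on removal (score falls below an
    assumed collapse_ratio of the scaffolded score); a compensatory
    scaffold shows monotone, graceful degradation instead.
    """
    if score_without < collapse_ratio * score_with:
        return "contorted"
    if score_with >= score_degraded >= score_without:
        return "compensatory"
    return "inconclusive"
```

A RAG system that drops from 0.9 to 0.2 with the database disabled would be flagged as contorted, while one that eases from 0.9 to 0.6 would count as compensatory under these assumed thresholds.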
Operationalizing AGI Evaluation
The paper outlines concrete protocols for implementing this framework, even for black-box systems. This includes reporting both equal-weight and centrality-prior scores with sensitivity analysis, and a ‘stability battery’ of tests involving randomized perturbations, delayed retests for durable learning, and error-decay trials. This lottery-style sampling of perturbations is designed to prevent gaming, forcing systems to develop genuine robustness rather than optimizing for specific test instances.
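The lottery-style sampling could be as simple as drawing a random subset of perturbations per evaluation run, so a system cannot overfit to a fixed battery. The perturbation names here are illustrative; the paper defines its own battery:

```python
import random

# Hypothetical perturbation menu for the stability battery.
PERTURBATIONS = [
    "session_delay",
    "scaffold_removal",
    "distribution_shift",
    "paraphrased_prompt",
    "context_truncation",
]

def draw_stability_battery(seed: int, k: int = 3) -> list[str]:
    """Sample k perturbations from the menu, seeded per evaluation run."""
    rng = random.Random(seed)
    return rng.sample(PERTURBATIONS, k)
```

Seeding per run keeps individual batteries reproducible for auditing while the overall lottery stays unpredictable to the system being evaluated.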
Implications for AGI Governance
This multi-dimensional evaluation approach has significant implications for AGI governance. Instead of a single AGI score threshold, the framework suggests ‘multi-dimensional triggers’ that require both capability breadth (centrality-prior score) and stability depth (CSI and its components). This makes it much harder to game the system, as achieving high governance-relevant scores would necessitate broad, causally important capabilities that are robust and persistent under various stresses.
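A multi-dimensional trigger of this kind is essentially a conjunction of thresholds rather than a single cutoff. The threshold values below are placeholder assumptions for illustration only:

```python
def governance_trigger(centrality_score: float, csi: float,
                       breadth_min: float = 0.8, stability_min: float = 0.7) -> bool:
    """Fire only when BOTH capability breadth and stability depth clear
    their (assumed) thresholds; a high snapshot score alone is not enough."""
    return centrality_score >= breadth_min and csi >= stability_min
```

A system with impressive breadth but a low stability index would not trip the trigger, which is what makes this scheme harder to game than a single-score threshold.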
The framework presented in this paper offers a rigorous, transparent, and defensible way to evaluate AGI, moving beyond simple checklists to assess the true integrity and robustness of intelligent systems. For more details, you can read the full research paper here.