TLDR: A new research paper proposes a “coherence-based measure” for Artificial General Intelligence (AGI), moving beyond simple arithmetic averages. This new metric, based on the Area Under the Curve (AUC) of generalized means, emphasizes balanced competence across all cognitive domains rather than allowing strengths to compensate for weaknesses. Applying it to models like GPT-4 and GPT-5 reveals that despite high average scores, these systems still exhibit significant imbalances and are further from true general intelligence than previously thought, offering a stricter and more accurate way to track AGI progress.
The quest for Artificial General Intelligence (AGI) – systems with human-like intelligence – has long been a central goal in AI research. However, defining and measuring progress towards AGI has proven challenging. Recent efforts, notably by Hendrycks et al. in 2025, proposed a framework that evaluates AGI by averaging proficiencies across ten cognitive domains, inspired by the Cattell–Horn–Carroll (CHC) model of human cognition.
While this arithmetic mean approach is straightforward, it operates on a crucial assumption: compensability. This means that exceptional performance in some areas can offset significant weaknesses in others. For instance, an AI could score highly overall even if it completely fails in critical domains like reasoning or memory. This perspective, however, clashes with how human intelligence and complex systems generally function, where abilities are interdependent and a single weak link can limit overall capability.
A new research paper, titled “A Coherence-Based Measure of AGI” by Fares Fourati, introduces a novel way to measure AGI that addresses this limitation. The paper argues that true general intelligence should reflect “coherent sufficiency” – a balanced competence across all essential domains, without catastrophic failures in any. This principle aligns with psychometric evidence, which shows that extreme imbalances in human cognitive faculties are often associated with functional impairment, not high general intelligence.
To formalize this idea, the paper proposes a coherence-aware measure of AGI based on the integral of generalized means over a continuum of compensability exponents. This might sound technical, but it essentially means moving beyond a simple average. The generalized mean is a mathematical tool tuned by an exponent ‘p’. When ‘p’ is 1, it is the familiar arithmetic mean; as ‘p’ decreases, the measure increasingly penalizes imbalance among cognitive domains, and in the limit it converges to the minimum score – a fully non-compensatory evaluation in which the weakest domain determines the result.
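To make the idea concrete, here is a minimal sketch of the generalized (power) mean in Python. The domain scores below are made up for illustration, not taken from the paper:

```python
import math

def generalized_mean(scores, p):
    """Power mean M_p of non-negative scores.

    p = 1     -> arithmetic mean (fully compensatory)
    p = 0     -> geometric mean (limit case)
    p -> -inf -> minimum (fully non-compensatory)
    """
    if p <= 0 and any(s == 0.0 for s in scores):
        return 0.0  # a single zero score dominates for non-positive p
    if p == 0:
        # geometric mean: the p -> 0 limit of the power mean
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

# Hypothetical profile: strong in most domains, weak in one.
scores = [0.9, 0.8, 0.7, 0.1]

print(generalized_mean(scores, 1))   # arithmetic mean: 0.625
print(generalized_mean(scores, 0))   # geometric mean: ~0.47
print(generalized_mean(scores, -2))  # ~0.20: the weak domain dominates
```

The lower the exponent, the less a strong domain can compensate for a weak one – exactly the non-compensatory behavior the paper argues a general-intelligence measure should capture.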
The core of this new metric is the Area Under the Curve (AUC) of these generalized means. This AUC score quantifies the robustness of an AI system’s performance under varying assumptions about how much one ability can compensate for another. Unlike the arithmetic mean, which can reward specialization, the AUC penalizes imbalance and captures the crucial inter-domain dependency.
When applied to published CHC-based domain scores for advanced AI models like GPT-4 and GPT-5, the coherence-adjusted AUC reveals a significant gap. While GPT-5 achieves a high arithmetic mean score (about 58%), its coherence-adjusted AUC is much lower (about 24%). This indicates that despite impressive individual strengths, these systems still exhibit persistent structural imbalances and remain far from genuine general competence. Domains like long-term memory storage or adaptive reasoning, for example, often remain at zero, acting as bottlenecks that the arithmetic mean easily overlooks.
The paper also highlights that this new coherence-based measure aligns more closely with challenging external reasoning benchmarks like ARC-AGI-2 and BIG-Bench Extra Hard. These benchmarks, which test out-of-distribution reasoning and cross-domain abstraction, show scores for GPT-4 and GPT-5 that are much closer to their coherence-adjusted AUC values than their arithmetic means. This suggests that the AUC measure more faithfully captures “functional coherence” – the ability to sustain competence across diverse cognitive domains.
In essence, this research suggests a shift in how we evaluate AGI progress. Instead of merely summing up capabilities, we should focus on ensuring their coherent sufficiency. Genuine progress towards AGI will not just be about higher average scores, but about achieving a flatter, consistently high performance across all cognitive abilities, indicating resilience and integrated capabilities. This stricter, more interpretable framework aims to guide AI development towards truly general and balanced intelligence. You can read the full paper here: A Coherence-Based Measure of AGI.