AI Benchmarks Fall Short: A Deep Dive into the EU AI Act Compliance Gap

TLDR: A new study, Bench-2-CoP, reveals a critical gap between current AI evaluation benchmarks and the systemic risk requirements of the EU AI Act. Researchers found that existing benchmarks heavily focus on issues like hallucination and bias, while critically neglecting capabilities related to AI autonomy, self-replication, and evading human oversight. This profound misalignment means current evaluation tools are inadequate for assessing the most severe systemic risks, highlighting an urgent need for new, regulatory-aligned evaluation frameworks to ensure AI safety and compliance.

The rapid growth of powerful Artificial Intelligence (AI) models, particularly General Purpose AI (GPAI), has brought about an urgent need for robust ways to evaluate them. This is especially true with new regulations like the EU AI Act and its accompanying Code of Practice (CoP) coming into effect. While current AI evaluation heavily relies on established benchmarks, a new study reveals a significant problem: these existing tools were not designed to measure the systemic risks that are the primary focus of the new regulatory landscape.

The EU AI Act and the Need for New Evaluation

The EU AI Act aims to ensure that AI systems are safe and respect fundamental rights, with strict requirements for GPAI models that pose “systemic risks.” The Code of Practice further details how these risks should be managed through a continuous, lifecycle-based approach, including risk identification, assessment, mitigation, monitoring, and incident reporting.

A central part of the CoP’s approach is a three-part classification of model characteristics: Capabilities (what a model can do, like writing code), Propensities (its behavioral tendencies, like hallucinating or showing bias), and Affordances (how its use context enables certain outcomes). The Act requires evaluation against a detailed list of these capabilities and propensities to assess systemic risk.

Bench-2-CoP: Bridging the Gap

This research introduces a novel framework called Bench-2-CoP, designed to systematically quantify the “benchmark-regulation gap.” The framework uses a validated AI-as-judge analysis to map nearly 200,000 questions from widely-used benchmarks against the EU AI Act’s taxonomy of model capabilities and propensities. This comprehensive analysis provides the first quantitative look at how well current evaluation practices align with regulatory demands. The full research paper is titled “Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?”
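To make the mapping step concrete, here is a minimal sketch of how benchmark questions could be labelled against a taxonomy and tallied into coverage percentages. It is not the study's implementation: the names `TAXONOMY`, `keyword_judge`, `coverage`, and `SAMPLE_QUESTIONS` are hypothetical, the taxonomy is a four-label subset using names that appear in this article, and the crude keyword matcher merely stands in for the AI judge, which in practice would be a model call.

```python
from collections import Counter
from typing import Callable, Iterable

# Assumption: a tiny illustrative subset of labels, named after terms in the article.
TAXONOMY = [
    "Tendency to hallucinate",
    "Discriminatory bias",
    "Evading human oversight",
    "Self-replication",
]

# A judge takes one benchmark question and returns zero or more taxonomy labels.
Judge = Callable[[str], list[str]]


def keyword_judge(question: str) -> list[str]:
    """Toy stand-in for an AI judge: keyword matching, NOT the study's method."""
    hints = {
        "Tendency to hallucinate": ("true or false", "fact"),
        "Discriminatory bias": ("stereotype", "gender"),
        "Evading human oversight": ("oversight", "monitoring"),
        "Self-replication": ("copy itself", "replicate"),
    }
    text = question.lower()
    return [label for label in TAXONOMY if any(h in text for h in hints[label])]


def coverage(questions: Iterable[str], judge: Judge) -> dict[str, float]:
    """Percentage of questions assigned to each label (a question may get several)."""
    questions = list(questions)
    counts = Counter(label for q in questions for label in judge(q))
    total = len(questions)
    return {
        label: (100.0 * counts[label] / total) if total else 0.0
        for label in TAXONOMY
    }


# Made-up example questions, for illustration only.
SAMPLE_QUESTIONS = [
    "True or false: the Eiffel Tower is in Berlin.",
    "Which stereotype does this sentence rely on?",
    "Is this historical fact correct?",
    "What gender is the nurse likely to be?",
]

if __name__ == "__main__":
    for label, pct in coverage(SAMPLE_QUESTIONS, keyword_judge).items():
        print(f"{label}: {pct:.1f}%")
```

Passing the judge in as a plain callable keeps the tallying logic independent of how labels are produced, so the keyword stub could be swapped for a model-backed judge without touching `coverage`.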

Key Findings: Where Benchmarks Fall Short

The study’s findings reveal a profound misalignment. The current AI evaluation ecosystem is overwhelmingly focused on a narrow set of behavioral propensities, such as a ‘Tendency to hallucinate’ (53.7% of the questions) and ‘Discriminatory bias’ (28.9%). While these are important, this focus means that evaluation efforts are primarily directed at managing observable “symptoms” of model failure rather than assessing the underlying functionalities that could lead to more novel or systemic harms.

In stark contrast, critical functional capabilities are dangerously neglected. Capabilities central to “loss-of-control” scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. At the level of systemic risks, this translates to a near-total evaluation gap for ‘Loss of Control’ (0.4% coverage) and ‘Cyber Offence’ (0.8% coverage). Even seemingly well-covered capabilities, like ‘Process multiple modalities,’ can be misleading: many questions only reference non-textual concepts in a textual format and so fail to test true multimodal processing risks.
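A short sketch can show how coverage figures like these might be screened for gaps. The four percentages are the ones quoted in this article; the function name `find_gaps` and the 1% cut-off are assumptions made purely for illustration, and the dictionary mixes propensities with systemic risks only to keep the example small.

```python
def find_gaps(coverage_pct: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Return categories whose coverage is below `threshold` percent, lowest first."""
    low = [(pct, name) for name, pct in coverage_pct.items() if pct < threshold]
    return [name for pct, name in sorted(low)]


# Coverage percentages as reported in the article (propensities and risks mixed
# here for brevity; the 1% threshold is an assumption, not a regulatory value).
REPORTED = {
    "Tendency to hallucinate": 53.7,
    "Discriminatory bias": 28.9,
    "Loss of Control": 0.4,
    "Cyber Offence": 0.8,
}

gaps = find_gaps(REPORTED)
print(gaps)  # ['Loss of Control', 'Cyber Offence']
```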

The analysis also shows that while some benchmarks, like BBQ, are highly specialized and effective at detecting specific issues like bias, this specialization often comes at the cost of broader safety coverage. Other large benchmarks, like MMLU, provide a massive number of questions but are designed for knowledge assessment, not nuanced safety evaluation, so strong results on them can create a false sense of security.

What This Means for AI Development and Regulation

The systematic gaps identified demonstrate that existing benchmarks are structurally inadequate for assessing systemic risks as defined by regulators. This means that organizations aiming to comply with the EU AI Act cannot rely solely on current public benchmarks to provide sufficient evidence of comprehensive systemic risk assessment. The absence of established evaluation frameworks for risks like Cyber Offence and CBRN misuse leaves organizations without standardized methods to assess and attest to the safety of their models in these critical areas.

Looking Ahead: Building Safer AI

The research calls for coordinated action. For the technical community, the priority is to develop new evaluation paradigms that move beyond static question-answering, creating interactive and dynamic testing environments to assess emergent and autonomous behaviors. Benchmark developers should focus on capabilities that are currently underrepresented.

Regulatory bodies should encourage a diverse “portfolio of evidence” for safety assessment, including red-teaming and runtime monitoring, rather than over-relying on benchmarks alone. For AI developers, the message is clear: current public benchmarks are insufficient. They must proactively invest in developing internal evaluation frameworks for the identified underassessed risks and collaborate to build the next generation of public evaluation tools.

The study concludes that the evaluation landscape is reactive, focusing on the problems of yesterday’s models, while the regulatory landscape is proactive, looking toward the risks of tomorrow’s. Closing this gap is crucial for ensuring both innovation and safety in the age of advanced AI.
