Unsupervised Evaluation: How Disagreeing Experts Can Enhance AI Safety

TLDR: This research paper introduces a novel logic for unsupervised evaluation of classifiers, particularly LLMs-as-Judges, without needing a ground truth. It leverages the logical consistency of agreement and disagreement patterns among experts, formalized through Linear Programming and universal axioms. The framework enables ‘no-knowledge alarms’ to detect when experts violate minimum performance thresholds, as demonstrated with the MT-Bench dataset. It also offers a method to critically assess ground truths established by disagreeing human experts, providing a minimum competency bound based on observed disagreements, thereby enhancing AI safety and evaluation rigor.

In the rapidly evolving landscape of artificial intelligence, a fundamental challenge persists: how do we evaluate AI systems, especially large language models (LLMs) acting as judges, when we lack a definitive ‘ground truth’ or answer key for their decisions? This age-old problem, often referred to as the principal/agent problem, is exacerbated by AI, as we delegate increasingly complex tasks to machines without always knowing if their judgments are truly reliable.

A recent research paper, Logical Consistency Between Disagreeing Experts And Its Role In AI Safety, introduces a novel approach to this unsupervised evaluation dilemma. The core idea revolves around leveraging the logical consistency (or inconsistency) observed when multiple experts, whether human or AI, agree or disagree on a given test. The paper highlights a crucial asymmetry: while complete agreement among experts doesn’t exclude any possible evaluation, disagreement immediately tells us that not all experts can be entirely correct.

The Logic of Unsupervised Evaluation

The paper formalizes a logic for unsupervised evaluation specifically for classifiers, that is, systems that assign items to one of a finite set of labels. Instead of relying on a ground truth, the method computes the set of ‘group evaluations’ that are logically consistent with how experts are observed to agree and disagree. Statistical summaries of the experts’ aligned decisions are fed into a Linear Programming problem. Simple logical constraints, such as the number of correct responses not exceeding the total observed responses, enter as inequalities. The framework also introduces ‘axioms’: linear equalities that hold for every finite test.
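
As a minimal sketch of this idea, consider the simplest case: one classifier, two labels, and a test of known size. The Python below enumerates by brute force, rather than by Linear Programming, every evaluation that is consistent with the observed response counts. The function name, the labels ‘a’ and ‘b’, and the counting identity used as the equality are illustrative assumptions, not the paper’s implementation.

```python
def consistent_evaluations(q_total, r_a):
    """List every (q_a, c_a, c_b) consistent with one binary classifier
    that answered 'a' r_a times on a test of q_total items.

    q_a : assumed number of items whose true label is 'a'
    c_a : number of 'a' items the classifier labelled correctly
    c_b : number of 'b' items the classifier labelled correctly
    """
    r_b = q_total - r_a
    evaluations = []
    for q_a in range(q_total + 1):
        q_b = q_total - q_a
        # Inequalities: correct 'a' answers cannot exceed the 'a' items
        # or the 'a' responses actually observed.
        for c_a in range(min(q_a, r_a) + 1):
            # Equality (a counting identity, assumed here for illustration):
            # every 'a' response is a correct 'a' or a mislabelled 'b',
            # so r_a = c_a + (q_b - c_b).
            c_b = c_a + q_b - r_a
            if 0 <= c_b <= min(q_b, r_b):
                evaluations.append((q_a, c_a, c_b))
    return evaluations


if __name__ == "__main__":
    for q_a, c_a, c_b in consistent_evaluations(4, 2):
        print(f"q_a={q_a} correct_a={c_a} correct_b={c_b}")
```

Even in this toy setting the observed responses alone leave many evaluations open. A classifier that answers ‘a’ on all four items is consistent with anything from zero to four correct answers, depending on the unknown answer key.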

This approach treats classifiers as ‘black-boxes,’ only considering summaries of their decisions without needing domain-specific or semantic information. This makes the logic universally applicable, particularly useful for evaluating LLMs-as-Judges, which often perform classification tasks, such as deciding preferences between two other LLM outputs (e.g., ‘model a,’ ‘model b,’ or ‘tie’).

No-Knowledge Alarms for Misaligned Classifiers

One of the immediate practical applications of this logical consistency framework is the creation of ‘no-knowledge alarms.’ These alarms can detect when one or more LLMs-as-Judges violate a minimum grading threshold specified by a user, even without knowing the true answers. The alarms are triggered when the observed patterns of agreement and disagreement among experts make it logically impossible for all experts to meet a certain performance standard.

The paper illustrates this with an example from the MT-Bench benchmark, which evaluates LLMs-as-Judges in pair comparisons. By analyzing the decisions of two graders (the ‘authors’ of the MT-Bench dataset and ‘gpt4’) on 25 pair comparisons, the framework can identify situations where, at certain assumed answer key summaries, one or both classifiers cannot possibly be performing above a certain accuracy threshold. For instance, in the example, an alarm set to trigger if any label performance was less than 50% correct would have been activated for this specific test, indicating a misalignment with human expert judgments on certain labels.

It’s important to note that these alarms are predicated on an accuracy threshold that is independent of specific labels or classifiers. The logic, by its nature, does not privilege any particular label or expert. It establishes that, given the observed data, there is no possible answer key that would allow all classifiers to meet the specified performance criteria simultaneously.
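
The alarm logic can be sketched for two binary classifiers as follows. This is a brute-force stand-in for the paper’s Linear Programming formulation: it tries every possible answer key summary and reports an alarm only when none of them lets both classifiers reach the threshold on both labels. The function name, the input layout, and the choice to treat a label with no items as trivially satisfied are assumptions made for this illustration.

```python
from itertools import product

LABELS = ("a", "b")


def alarm_triggers(pair_counts, threshold):
    """Return True if no answer key lets both classifiers score at least
    `threshold` on every label.

    pair_counts maps (vote_of_classifier_1, vote_of_classifier_2) to the
    number of test items on which that pair of votes was observed.
    """
    patterns = list(pair_counts)
    ranges = [range(pair_counts[p] + 1) for p in patterns]
    # For each voting pattern, try every count of items that are truly 'a'.
    for true_a_counts in product(*ranges):
        total = {"a": 0, "b": 0}
        correct = [{"a": 0, "b": 0}, {"a": 0, "b": 0}]
        for pattern, k in zip(patterns, true_a_counts):
            n = pair_counts[pattern]
            total["a"] += k
            total["b"] += n - k
            for i, vote in enumerate(pattern):
                # A vote of 'a' is right on the k true-'a' items,
                # a vote of 'b' is right on the remaining n - k items.
                if vote == "a":
                    correct[i]["a"] += k
                else:
                    correct[i]["b"] += n - k
        # Assumption: a label with zero items passes trivially (0 >= 0).
        if all(correct[i][lab] >= threshold * total[lab]
               for i in range(2) for lab in LABELS):
            return False  # some answer key keeps everyone above threshold
    return True  # logically impossible for all to meet the threshold


if __name__ == "__main__":
    observed = {("a", "a"): 6, ("a", "b"): 4}
    print(alarm_triggers(observed, 0.5))
    print(alarm_triggers(observed, 0.9))
```

Complete agreement never trips this alarm, because the answer key that matches the shared votes makes everyone fully correct. Disagreement is what shrinks the set of consistent evaluations.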

Questioning the Ground Truth

The research also delves into a profound implication: what if the problem isn’t the experts, but the assumed ‘ground truth’ itself? Logical consistency cannot magically reveal an unknown answer key or validate the scientific correctness of a test. However, it can provide a powerful tool for scrutinizing ground truths established by disagreeing human experts.

For example, if human experts agree only 80% of the time on a benchmark like MT-Bench, this logic can quantify what that disagreement implies for their minimum competency. It supports statements of the form: ‘The ground truth established by these experts is logically consistent with them being x% or more correct on all the labels.’ This ‘inversion’ of the problem provides a minimum standard for scientific work that relies on answer keys derived from disagreeing experts, offering a surrogate for the ‘error’ inherent in such constructed ground truths.
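
A simplified calculation shows how observed disagreement constrains competency. It is not the paper’s per-label bound: it assumes only two experts, looks at overall accuracy, and uses symbols introduced here for illustration (N items, agreement rate g, correct counts C_1 and C_2). On an item where the experts agree, at most both are correct; on an item where they disagree, at most one is.

```latex
C_1 + C_2 \;\le\; 2gN + (1-g)N \;=\; (1+g)N
\quad\Longrightarrow\quad
\min\!\left(\frac{C_1}{N}, \frac{C_2}{N}\right) \;\le\; \frac{1+g}{2},
\qquad
g = 0.8 \;\Rightarrow\; \frac{1+g}{2} = 0.9 .
```

Under these assumptions, two experts who agree on 80% of the items cannot both be more than 90% correct overall, whatever the true answer key is.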

Conclusion

While logical consistency alone cannot determine the cause of a problem or how to fix it, it offers a powerful, universally applicable method to detect, with logical certainty, that something is wrong. These ‘no-knowledge alarms’ serve as simple yet crucial components within larger monitoring systems, enhancing the safety and reliability of decisions made using the judgments of disagreeing experts, whether human or AI.
