TLDR: This research paper introduces a novel logic for unsupervised evaluation of classifiers, particularly LLMs-as-Judges, without needing a ground truth. It leverages the logical consistency of agreement and disagreement patterns among experts, formalized through Linear Programming and universal axioms. The framework enables ‘no-knowledge alarms’ to detect when experts violate minimum performance thresholds, as demonstrated with the MT-Bench dataset. It also offers a method to critically assess ground truths established by disagreeing human experts, providing a minimum competency bound based on observed disagreements, thereby enhancing AI safety and evaluation rigor.
In the rapidly evolving landscape of artificial intelligence, a fundamental challenge persists: how do we evaluate AI systems, especially large language models (LLMs) acting as judges, when we lack a definitive ‘ground truth’ or answer key for their decisions? This age-old problem, often referred to as the principal/agent problem, is exacerbated by AI, as we delegate increasingly complex tasks to machines without always knowing if their judgments are truly reliable.
A recent research paper, ‘Logical Consistency Between Disagreeing Experts And Its Role In AI Safety’, introduces a novel approach to this unsupervised evaluation dilemma. The core idea is to exploit the logical consistency (or inconsistency) observed when multiple experts, whether human or AI, agree or disagree on a given test. The paper highlights a crucial asymmetry: complete agreement among experts excludes no possible evaluation, since they could all be right or all be wrong together, whereas any disagreement immediately tells us that not all experts can be entirely correct.
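The asymmetry can be made concrete with a little counting. The sketch below is an illustration rather than code from the paper; the function name `joint_accuracy_ceiling` and the sample decisions are invented for the example. On every item where two experts disagree, at most one of them can be right, so the number of disagreements caps their combined accuracy whatever the answer key turns out to be.

```python
from fractions import Fraction

def joint_accuracy_ceiling(decisions_1, decisions_2):
    """Upper bound on the average accuracy of two experts, valid for any answer key."""
    if len(decisions_1) != len(decisions_2) or not decisions_1:
        raise ValueError("need two non-empty decision lists of equal length")
    n = len(decisions_1)
    disagreements = sum(x != y for x, y in zip(decisions_1, decisions_2))
    # Each disagreement removes at least one of the 2*n possible correct answers.
    return Fraction(2 * n - disagreements, 2 * n)

# Hypothetical decisions on four items (not taken from any benchmark).
judge_1 = ["a", "b", "tie", "a"]
judge_2 = ["a", "a", "tie", "b"]
ceiling = joint_accuracy_ceiling(judge_1, judge_2)
print(ceiling)  # 3/4
```

Because the average accuracy cannot exceed this ceiling, at least one of the two experts must score at or below it. With zero disagreements the ceiling is 1, which rules nothing out.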
The Logic of Unsupervised Evaluation
The paper formalizes a logic for unsupervised evaluation specifically for classifiers, that is, systems that assign items to a finite set of labels. Instead of relying on a ground truth, the method computes the set of ‘group evaluations’ that are logically consistent with how the experts are observed to agree and disagree. Statistical summaries of their aligned decisions become the inputs to a Linear Programming problem. Simple logical constraints, such as the number of correct responses not exceeding the total number of observed responses, enter as inequalities. In addition, the framework introduces ‘axioms’: linear equalities that hold for every finite test.
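To give a feel for how inequalities and equalities together prune the space of evaluations, here is a minimal sketch for the simplest case: a single classifier on a binary test with labels ‘a’ and ‘b’. It uses brute-force enumeration over integers in place of Linear Programming, and the symbols `q_a`, `r_a`, `r_b` and the function name are choices made for this example. The equality used is one that holds on any finite binary test (every ‘a’ response is either a correct ‘a’ or a wrong ‘b’); the paper's own axioms may be stated differently.

```python
def consistent_evaluations(n_a, n_b):
    """All (q_a, r_a, r_b) consistent with seeing n_a 'a' responses and n_b 'b' responses.

    q_a: assumed number of items whose true label is 'a'
    r_a: assumed number of 'a' items the classifier got right
    r_b: assumed number of 'b' items the classifier got right
    """
    total = n_a + n_b
    found = []
    for q_a in range(total + 1):
        q_b = total - q_a
        for r_a in range(q_a + 1):        # inequality: 0 <= r_a <= q_a
            for r_b in range(q_b + 1):    # inequality: 0 <= r_b <= q_b
                # Equality: 'a' responses = correct 'a' items + wrong 'b' items.
                if r_a + (q_b - r_b) == n_a:
                    found.append((q_a, r_a, r_b))
    return found

# Hypothetical observation: the classifier answered 'a' twice and 'b' once.
evals = consistent_evaluations(n_a=2, n_b=1)
for q_a, r_a, r_b in evals:
    print(f"q_a={q_a} r_a={r_a} r_b={r_b}")
```

Even a single classifier's responses exclude some evaluations: having answered ‘b’ once, it cannot be perfectly correct on a test whose items are all ‘a’.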
This approach treats classifiers as ‘black-boxes,’ only considering summaries of their decisions without needing domain-specific or semantic information. This makes the logic universally applicable, particularly useful for evaluating LLMs-as-Judges, which often perform classification tasks, such as deciding preferences between two other LLM outputs (e.g., ‘model a,’ ‘model b,’ or ‘tie’).
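A black-box summary of this kind can be as simple as a table counting how often each pair of decisions occurs. The sketch below assumes two judges and the three labels mentioned above; the function name `pair_counts` and the sample decisions are hypothetical, not taken from the paper or from MT-Bench.

```python
from collections import Counter

LABELS = ("model a", "model b", "tie")

def pair_counts(decisions_1, decisions_2):
    """Count how often judge 1 said x while judge 2 said y, for every label pair."""
    if len(decisions_1) != len(decisions_2):
        raise ValueError("decision lists must be aligned item by item")
    unknown = set(decisions_1) | set(decisions_2)
    unknown -= set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(zip(decisions_1, decisions_2))
    # Fill in zero counts so every one of the 9 pairs is present.
    return {(x, y): counts[(x, y)] for x in LABELS for y in LABELS}

# Hypothetical aligned decisions on five pair comparisons.
judge_1 = ["model a", "model a", "tie", "model b", "model a"]
judge_2 = ["model a", "model b", "tie", "model b", "model a"]
summary = pair_counts(judge_1, judge_2)
print(summary[("model a", "model a")], summary[("model a", "model b")])
```

Nothing about the prompts or the compared outputs enters the summary; only the counts do, which is what lets the same logic apply in any domain.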
No-Knowledge Alarms for Misaligned Classifiers
One of the immediate practical applications of this logical consistency framework is the creation of ‘no-knowledge alarms.’ These alarms can detect when one or more LLMs-as-Judges violate a minimum grading threshold specified by a user, even without knowing the true answers. The alarms are triggered when the observed patterns of agreement and disagreement among experts make it logically impossible for all experts to meet a certain performance standard.
The paper illustrates this with an example from the MT-Bench benchmark, which evaluates LLMs-as-Judges on pair comparisons. By analyzing the decisions of two graders (the ‘authors’ of the MT-Bench dataset and ‘gpt4’) on 25 pair comparisons, the framework identifies assumed answer key summaries under which one or both classifiers cannot possibly be performing above a given accuracy threshold. In this example, an alarm set to trigger whenever performance on any label fell below 50% correct would have been activated for this specific test, indicating a misalignment with human expert judgments on certain labels.
These alarms are predicated on an accuracy threshold that does not depend on specific labels or classifiers. The logic, by its nature, does not privilege any particular label or expert. When an alarm fires, it establishes that, given the observed data, no possible answer key would allow all classifiers to meet the specified performance criteria simultaneously.
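The alarm logic of this section can be sketched for the simplest setting: two judges and a binary test with labels ‘a’ and ‘b’. This is an illustrative brute-force search over answer keys, not the paper's Linear Programming formulation, and it is not the three-label MT-Bench computation. The function name `alarm_fires`, the count arguments, and the convention that a label absent from the assumed key passes trivially are all assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

def alarm_fires(n_aa, n_ab, n_ba, n_bb, threshold):
    """True if NO answer key lets both judges reach `threshold` on both labels.

    n_xy is the number of items judge 1 labeled x and judge 2 labeled y.
    """
    n = (n_aa, n_ab, n_ba, n_bb)
    # t[k] = how many items in decision pair k are assumed to be truly 'a'.
    for t in product(*(range(c + 1) for c in n)):
        q_a = sum(t)
        q_b = sum(n) - q_a
        checks = [
            (t[0] + t[1], q_a),                  # judge 1 correct on 'a'
            (t[0] + t[2], q_a),                  # judge 2 correct on 'a'
            (n[2] - t[2] + n[3] - t[3], q_b),    # judge 1 correct on 'b'
            (n[1] - t[1] + n[3] - t[3], q_b),    # judge 2 correct on 'b'
        ]
        # Assumption: a label absent from the key (q == 0) passes trivially.
        if all(r >= threshold * q for r, q in checks):
            return False  # some key lets everyone meet the threshold
    return True

# Hypothetical counts: the judges disagree on all ten items.
print(alarm_fires(0, 5, 5, 0, Fraction(3, 5)))  # True
print(alarm_fires(0, 5, 5, 0, Fraction(1, 2)))  # False
# Full agreement never fires, even at a 100% threshold.
print(alarm_fires(5, 0, 0, 5, 1))               # False
```

The search treats both judges and both labels symmetrically, so a firing alarm says only that someone falls below the threshold somewhere, not who or on which label.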
Questioning the Ground Truth
The research also delves into a profound implication: what if the problem isn’t the experts, but the assumed ‘ground truth’ itself? Logical consistency cannot magically reveal an unknown answer key or validate the scientific correctness of a test. However, it can provide a powerful tool for scrutinizing ground truths established by disagreeing human experts.
For example, if human experts agree only 80% of the time on a benchmark like MT-Bench, this logic can quantify what that disagreement implies for their minimum competency. It allows us to make statements of the form: ‘The ground truth established by these experts is logically consistent with them being x% or more correct on all the labels.’ This ‘inversion’ of the problem provides a minimum standard for scientific work that relies on answer keys derived from disagreeing experts, offering a surrogate for the ‘error’ inherent in such constructed ground truths.
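One way to read this inversion is as a search for the largest threshold x at which some answer key still lets every expert score at least x on every label. The sketch below computes that value by brute force for two experts on a binary test; it is an interpretation for illustration, not the paper's procedure, and the function name, the counts, and the handling of labels absent from the assumed key are assumptions.

```python
from fractions import Fraction
from itertools import product

def largest_consistent_threshold(n_aa, n_ab, n_ba, n_bb):
    """Largest x such that some answer key has both experts >= x correct on every label.

    n_xy is the number of items expert 1 labeled x and expert 2 labeled y.
    """
    n = (n_aa, n_ab, n_ba, n_bb)
    total = sum(n)
    if total == 0:
        raise ValueError("need at least one item")
    best = Fraction(0)
    # t[k] = how many items in decision pair k are assumed to be truly 'a'.
    for t in product(*(range(c + 1) for c in n)):
        q_a = sum(t)
        q_b = total - q_a
        accuracies = []
        if q_a:  # labels absent from the assumed key are skipped
            accuracies += [Fraction(t[0] + t[1], q_a), Fraction(t[0] + t[2], q_a)]
        if q_b:
            accuracies += [Fraction(n[2] - t[2] + n[3] - t[3], q_b),
                           Fraction(n[1] - t[1] + n[3] - t[3], q_b)]
        best = max(best, min(accuracies))
    return best

# Hypothetical counts: 10 items, the experts agree on 8 of them (80%).
x = largest_consistent_threshold(4, 1, 1, 4)
print(x)  # 5/6
```

In this made-up case, 80% agreement means no answer key can make both experts more than 5/6 correct on every label, so a ground truth built from their judgments cannot be claimed to rest on better-than-5/6 graders.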
Conclusion
While logical consistency alone cannot determine the cause of a problem or how to fix it, it offers a powerful, universally applicable way to detect, with logical certainty, that something is wrong. These ‘no-knowledge alarms’ serve as simple yet crucial components within larger monitoring systems, enhancing the safety and reliability of decisions that rest on the judgments of disagreeing experts, whether human or AI.


