TLDR: A new research paper introduces ‘e-scores’ as a novel method for assessing the correctness of generative model outputs, particularly from large language models (LLMs). Unlike traditional p-value-based methods, e-scores allow users to adaptively choose tolerance levels after observing the results, while still maintaining strong statistical guarantees against errors. This flexibility is achieved by controlling a post-hoc error notion called ‘size distortion.’ The paper demonstrates e-scores’ efficacy in evaluating mathematical reasoning and adherence to property constraints, offering a more robust and user-friendly approach to ensuring the reliability of AI-generated content.
Generative models, particularly large language models (LLMs), have become an integral part of our daily lives, powering everything from content creation to complex problem-solving. However, a significant challenge remains: reliably assessing the correctness of their outputs. LLMs are known to sometimes generate incorrect information, often referred to as ‘hallucinations,’ which necessitates robust mechanisms for evaluation.
Traditional methods for assessing LLM outputs often rely on a framework called conformal prediction, which uses p-values to construct sets of responses where the probability of including an incorrect answer is kept below a user-defined tolerance level. While effective, these p-value-based methods have a notable drawback: they are susceptible to ‘p-hacking.’ This means that if a user decides to adjust their tolerance level after already seeing the results, the statistical guarantees of the assessment can be invalidated. This limitation restricts the practical flexibility users often desire when interacting with generative models.
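The p-hacking problem can be illustrated with a small simulation. The sketch below is not from the paper: it assumes an idealized p-value that is uniform on (0, 1] whenever the response is actually incorrect, and the function name `simulate` is hypothetical. It compares a tolerance level fixed in advance with one chosen after seeing the p-value.

```python
import random


def simulate(num_trials: int = 100_000, fixed_alpha: float = 0.1, seed: int = 0):
    """Compare a tolerance fixed in advance with a post-hoc one.

    Assumption: every response is truly incorrect and its p-value is
    uniform on (0, 1], so each inclusion of a response is an error.
    """
    rng = random.Random(seed)
    fixed_errors = 0
    posthoc_ratio_sum = 0.0
    for _ in range(num_trials):
        p = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        # Tolerance fixed before looking: the incorrect response is
        # included (an error) only when p <= fixed_alpha.
        fixed_errors += p <= fixed_alpha
        # Tolerance chosen after looking: the user picks alpha = p, the
        # smallest level that still includes the response, so an error
        # always happens and the error-to-tolerance ratio is 1 / p.
        posthoc_ratio_sum += 1.0 / p
    return fixed_errors / num_trials, posthoc_ratio_sum / num_trials


fixed_rate, posthoc_ratio = simulate()
print(f"error rate with fixed alpha=0.1: {fixed_rate:.3f}")
print(f"mean error/tolerance ratio with post-hoc alpha: {posthoc_ratio:.1f}")
```

With the tolerance fixed in advance, the error rate stays close to 0.1. With the post-hoc choice, the error-to-tolerance ratio averages far above one, and for a uniform p-value its expectation is unbounded.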
Introducing E-Scores for Flexible Assessment
A new research paper, E-Scores for (In)Correctness Assessment of Generative Model Outputs, proposes an innovative solution to this problem by introducing ‘e-scores.’ Developed by Guneet S. Dhillon, Javier González, Teodora Pandeva, and Alicia Curth from the University of Oxford and Microsoft Research, e-scores leverage the concept of e-values to provide a more flexible and statistically sound measure of incorrectness for generative model outputs.
The core idea behind e-scores is to offer the same strong statistical guarantees as p-scores, but with the crucial added benefit of allowing users to adaptively choose their tolerance levels even after observing the e-scores themselves. This is achieved by bounding a post-hoc notion of error called ‘size distortion,’ which quantifies the discrepancy between an observed error and the user’s chosen tolerance level.
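Why a bound of this kind can survive post-hoc choices can be sketched with a standard e-value argument. The notation here is an assumption for illustration and need not match the paper's: E is a non-negative e-value whose expectation is at most one when the response is incorrect, the e-score is taken to be its reciprocal 1/E, and \hat\alpha is any positive tolerance level that may depend on the data.

```latex
\mathbb{E}\!\left[\frac{\mathbf{1}\{1/E \le \hat\alpha\}}{\hat\alpha}\right]
\;\le\;
\mathbb{E}\!\left[\sup_{\alpha > 0} \frac{\mathbf{1}\{E \ge 1/\alpha\}}{\alpha}\right]
\;=\;
\mathbb{E}[E]
\;\le\; 1 .
```

The supremum is attained at \alpha = 1/E, where the ratio equals E. So even the most favourable post-hoc choice of tolerance cannot push the expected error-to-tolerance ratio above one.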
How E-Scores Work
E-scores are designed to be low for correct responses and high for incorrect ones. They are calculated by comparing a test response’s value (derived from an ‘oracle estimator’ that predicts correctness) against the values of incorrect responses observed in a calibration dataset. The paper outlines several ways to transform the oracle estimator’s output, each yielding an e-score with a different range, and also provides a method to combine multiple e-scores for a more robust assessment.
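The comparison described above can be sketched with one standard construction from the e-value literature. This is an illustration under stated assumptions, not the paper's exact definition: the function name `e_score`, the example values, and the choice to score a response by the reciprocal of an exchangeability-based e-value are all assumptions. It also assumes the transformed oracle output is non-negative, with larger values meaning "more likely correct".

```python
import math
from typing import Sequence


def e_score(test_value: float, calib_incorrect_values: Sequence[float]) -> float:
    """Reciprocal of an exchangeability-based e-value (illustrative only).

    Assumptions (not taken from the paper):
      * values are non-negative and larger means "more likely correct";
      * if the test response is incorrect, its value is exchangeable with
        the values of the incorrect calibration responses.
    """
    if test_value < 0 or any(v < 0 for v in calib_incorrect_values):
        raise ValueError("values must be non-negative")
    if test_value == 0:
        return math.inf  # no evidence of correctness at all
    n = len(calib_incorrect_values)
    total = test_value + sum(calib_incorrect_values)
    # The e-value (n + 1) * test_value / total has mean 1 under
    # exchangeability. The score returned here is its reciprocal, so it is
    # low when the test value stands out above the incorrect calibration
    # values and high when it looks like them.
    return total / ((n + 1) * test_value)


calib = [0.1, 0.2, 0.3]  # hypothetical values of incorrect calibration responses
print(e_score(0.9, calib))  # test value well above calibration: lower score
print(e_score(0.1, calib))  # test value similar to calibration: higher score
```

Under this sketch, a response would be kept whenever its score is at most the tolerance level the user settles on.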
A significant advantage of e-scores is their computational efficiency. Computing a p-score for each individual test response requires memory and time that grow linearly with the size of the calibration data. E-scores depend on the calibration data only through a sum, which takes constant memory and linear time to compute once and can then be reused, so the cost is amortized across multiple test responses.
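The amortization argument can be made concrete with a minimal sketch. It assumes, for illustration only, that the e-score is the reciprocal of the exchangeability-based e-value (n + 1) * v / (v + sum of calibration values), and contrasts it with a naive conformal-style p-value. The names `RunningEScorer` and `p_score` are hypothetical.

```python
from typing import Iterable, List


class RunningEScorer:
    """Keeps only a count and a sum of the calibration values.

    Assumption: the e-score is the reciprocal of the e-value
    (n + 1) * v / (v + sum of calibration values). This is an
    illustrative choice, not necessarily the paper's definition.
    """

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def fit(self, calib_incorrect_values: Iterable[float]) -> "RunningEScorer":
        # One linear pass; the values themselves are never stored.
        for v in calib_incorrect_values:
            self.count += 1
            self.total += v
        return self

    def score(self, test_value: float) -> float:
        # Constant time per test response once the sum is known.
        if test_value <= 0:
            return float("inf")
        return (test_value + self.total) / ((self.count + 1) * test_value)


def p_score(test_value: float, calib_incorrect_values: List[float]) -> float:
    # Conformal-style p-value for contrast: in this naive form it revisits
    # every stored calibration value for each test response.
    at_least = sum(v >= test_value for v in calib_incorrect_values)
    return (1 + at_least) / (len(calib_incorrect_values) + 1)


calib = [0.1, 0.2, 0.3]  # hypothetical incorrect calibration values
scorer = RunningEScorer().fit(calib)
for v in (0.9, 0.25, 0.05):
    print(v, scorer.score(v), p_score(v, calib))
```

The scorer holds two numbers regardless of how large the calibration set is, whereas the p-value function needs the full list on every call.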
Experimental Validation
The researchers demonstrated the effectiveness of e-scores across two key experimental settings:
- Mathematical Factuality: Using the ProcessBench benchmark, which evaluates LLMs on mathematical reasoning, e-scores were applied to assess the correctness of individual steps in an LLM’s solution. This setting highlighted how e-scores can identify specific points of error in complex reasoning chains.
- Property Constraints Satisfaction: In this scenario, using the UltraFeedback dataset, e-scores were used to determine if LLM responses satisfied desirable properties like instruction-following, helpfulness, truthfulness, and honesty. This is crucial for ensuring that generative models produce outputs aligned with specific user requirements.
Across these experiments, e-scores upheld their theoretical guarantee: the ‘size distortion’ stayed bounded by one. The mean error was also consistently lower than or approximately equal to the mean tolerance level. E-scores maintained high precision and produced precision-recall curves comparable to those of p-scores, indicating that they can identify incorrect responses without excessively filtering out correct ones.
Broader Implications
The theory behind e-scores applies to any generative model and covers a broader range of response sets than previously considered. This opens up possibilities for more diverse applications and use cases where flexible, statistically guaranteed assessment of generative model outputs is critical.
In conclusion, e-scores represent a significant step forward in the reliable assessment of generative models. By providing statistical guarantees that hold even when users adapt their tolerance levels post-hoc, e-scores offer a powerful and flexible tool for ensuring the quality and trustworthiness of AI-generated content.


