TLDR: This research paper identifies critical flaws in how we currently evaluate Large Language Models’ (LLMs) ability to estimate their own uncertainty, especially concerning factual errors or “hallucinations.” Existing evaluation methods, which rely heavily on approximate correctness functions in question-answering tasks, are shown to be biased and inconsistent. The authors propose a suite of more robust evaluation techniques: using “exact correctness” for structured tasks like code generation, employing Selective Prediction with a Mixture of Judges and Instructions (SP-MoJI) to average assessments from multiple LLM judges, and incorporating out-of-distribution and perturbation detection tasks. To synthesize results across varied experiments, they introduce an Elo rating system, offering a more objective and comprehensive ranking of uncertainty estimation methods. The paper concludes that these improved evaluation practices are essential for developing more reliable and trustworthy LLMs.
Large Language Models (LLMs) have become incredibly powerful, but they often suffer from a significant problem: generating text that sounds plausible but is factually incorrect or unsupported. These are often called hallucinations, and a specific type, known as confabulations, arises from the LLM’s predictive uncertainty. To make LLMs more reliable, it’s crucial to accurately estimate when they are uncertain about their generated text.
However, a recent research paper, “Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation” by Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, and Sepp Hochreiter, highlights significant flaws in how these uncertainty estimation methods are currently evaluated. Traditionally, these methods are tested by correlating uncertainty estimates with the correctness of generated text, primarily using question-answering (QA) datasets. The problem lies with the “approximate correctness functions” used in these evaluations, such as ROUGE, BLEU, or even other LLMs acting as judges. These functions often disagree substantially with each other, leading to inconsistent rankings of uncertainty estimation methods and potentially inflating their apparent performance.
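Concretely, most of these evaluations reduce to a ranking problem: score each generation with an uncertainty estimate, label it correct or incorrect with some correctness function, and report how well the scores separate the two groups, typically via AUROC. The sketch below, with made-up numbers, illustrates this standard protocol; it is not code from the paper.

```python
# Minimal sketch of the common evaluation protocol: does the uncertainty
# score rank correct answers above incorrect ones? Data is illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-example uncertainty estimates (higher = less confident)
uncertainty = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8])

# Binary correctness labels produced by some correctness function (1 = correct)
correct = np.array([1, 0, 1, 0, 1, 1])

# A good estimator assigns low uncertainty to correct answers, so we measure
# how well negative uncertainty ranks correct answers above incorrect ones.
auroc = roc_auc_score(correct, -uncertainty)
print(f"AUROC of uncertainty vs. correctness: {auroc:.3f}")
```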
The Problem with Current Evaluation
The authors point out that the correctness functions used in Natural Language Generation (NLG) are far more complex than those in simpler classification tasks. They are parametric: metrics like ROUGE and BLEU depend on decision thresholds and n-gram settings, while LLM-as-a-judge depends on the choice of judge model, prompt, and sampling parameters. This parametric nature introduces bias and variance into the correctness labels. For instance, different ROUGE variants show high agreement on short answers, but overall, n-gram-based metrics and LLM-as-a-judge often disagree. This inconsistency means that the reported performance of an uncertainty estimation method can be “hacked” simply by selecting a favorable correctness function, making true progress difficult to ascertain.
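To see how parameter choices leak into the labels, consider a toy ROUGE-1-style overlap score: the same three answers receive different correctness labels depending on an arbitrary threshold, so anything computed on top of those labels shifts as well. This is an illustrative sketch, not the paper’s implementation.

```python
# Illustrative only: the same generations get different correctness labels
# depending on an arbitrary threshold over an n-gram overlap score.
def unigram_f1(prediction: str, reference: str) -> float:
    """Toy ROUGE-1-style F1 over whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = len(set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

pairs = [
    ("Paris is the capital", "Paris"),          # correct, but short reference -> modest overlap
    ("It might be Lyon", "Paris"),              # incorrect
    ("The French capital is Paris", "The capital of France is Paris"),  # correct, high overlap
]
scores = [unigram_f1(p, r) for p, r in pairs]

# Two "reasonable" thresholds produce two different label sets for the same answers.
for threshold in (0.3, 0.5):
    labels = [int(s >= threshold) for s in scores]
    print(f"threshold={threshold}: scores={[round(s, 2) for s in scores]} -> labels={labels}")
```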
Proposing More Robust Evaluation Methods
To address these pitfalls, the researchers propose several improvements:
1. Exact Correctness for Structured Tasks: Instead of relying on approximate metrics, they suggest using tasks where correctness can be verified deterministically and non-parametrically. Examples include code completion (verified by unit tests) and constrained text generation, where the output must adhere to specific rules. This eliminates the ambiguity of approximate correctness functions.
2. Selective Prediction using Mixture of Judges and Instructions (SP-MoJI): For tasks where approximate correctness is unavoidable (like QA), the paper introduces SP-MoJI. This involves making multiple calls to different LLM-as-a-judge variants (using different models, prompts, and sampling parameters) and averaging their correctness assessments (see the sketch after this list). This marginalization significantly reduces the evaluation biases and variability inherent in a single judge’s assessment. The study shows that using even just four judges can halve the standard deviation of performance estimates.
3. Out-of-Distribution (OOD) and Perturbation Detection: Drawing inspiration from other machine learning fields, the authors advocate for evaluating uncertainty methods on OOD detection (identifying inputs outside the model’s training data distribution) and perturbation detection (identifying inputs that have been corrupted). These tasks provide robust risk indicators, as a good uncertainty method should signal higher uncertainty for such inputs.
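The sketch below illustrates the averaging behind SP-MoJI (item 2 above): several judge configurations each return a correctness verdict, and the final label is their mean, so no single judge’s biases dominate. The judge functions here are hypothetical placeholders for real LLM calls; this is not the authors’ code.

```python
# Hedged sketch of the "mixture of judges" averaging, with stand-in judges.
from statistics import mean
from typing import Callable, Sequence

def mixture_of_judges_correctness(
    question: str,
    answer: str,
    judges: Sequence[Callable[[str, str], float]],
) -> float:
    """Average correctness verdicts (each in [0, 1]) over multiple judge configurations."""
    return mean(judge(question, answer) for judge in judges)

# Hypothetical judge configurations; in practice each would prompt a different
# LLM, or the same LLM with a different instruction or temperature.
judges = [
    lambda q, a: 1.0,   # judge A: marks the answer correct
    lambda q, a: 0.0,   # judge B: disagrees
    lambda q, a: 1.0,   # judge C: agrees with A
    lambda q, a: 1.0,   # judge D: agrees with A
]

score = mixture_of_judges_correctness("What is the capital of France?", "Paris", judges)
print(f"Marginalized correctness estimate: {score:.2f}")  # 0.75 instead of an all-or-nothing label
```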
Aggregating Results with Elo Rating
To provide a more objective and comprehensive summary of method performance across diverse experimental setups, the paper proposes using the Elo rating system, similar to how chess players are ranked. Each experiment (a specific model, dataset, and risk indicator) is treated as a “game” where methods compete. The Elo rating system iteratively updates scores based on pairwise comparisons, allowing for indirect comparisons and a unified view of performance, even when methods are not evaluated on all the same tasks. This helps to overcome the challenge of contradictory assessments often found in large tables of experimental results.
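As a rough illustration, the standard Elo update works as follows: each experiment where one method outperforms another counts as a game, the expected outcome is computed from the current ratings, and both ratings move toward the observed result. The method names and outcomes below are placeholders, not results from the paper.

```python
# Standard Elo update, applied to uncertainty estimation methods as "players".
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the method rated r_a beats the method rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, a: str, b: str, result_a: float, k: float = 16.0) -> None:
    """result_a is 1.0 if method a won the experiment, 0.0 if it lost, 0.5 for a tie."""
    exp_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (result_a - exp_a)
    ratings[b] += k * ((1.0 - result_a) - (1.0 - exp_a))

# Placeholder methods and game outcomes; each game is one experiment
# (model x dataset x risk indicator) won by the better-performing method.
ratings = {"method_A": 1000.0, "method_B": 1000.0, "method_C": 1000.0}
games = [("method_A", "method_B", 1.0), ("method_C", "method_A", 0.0), ("method_B", "method_C", 1.0)]
for a, b, result in games:
    update(ratings, a, b, result)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```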
Key Insights and Future Directions
The research confirms that there is no “one-size-fits-all” uncertainty estimation method for NLG; different tasks benefit from different approaches. Simple heuristic methods can be surprisingly competitive in many settings, especially outside the traditional QA domain. The study also notes that length normalization, a common practice, can be detrimental in most scenarios except for perturbation detection. The findings underscore the importance of robust evaluation protocols to truly advance uncertainty estimation in NLG, guiding the field toward more reliable and trustworthy LLM-based systems.
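For readers unfamiliar with the term, length normalization here means dividing a sequence-level uncertainty score, such as the summed negative token log-likelihood, by the number of generated tokens, so longer outputs are not automatically treated as more uncertain. A minimal sketch with made-up numbers:

```python
# Illustrative comparison of an unnormalized vs. length-normalized sequence score.
token_logprobs = [-0.2, -0.1, -1.5, -0.3, -0.4]  # hypothetical per-token log-probabilities

unnormalized_nll = -sum(token_logprobs)                          # raw sequence-level uncertainty
length_normalized_nll = unnormalized_nll / len(token_logprobs)   # length-normalized variant

print(f"sum NLL: {unnormalized_nll:.2f}, mean NLL: {length_normalized_nll:.2f}")
```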


