TLDR: This research introduces a frequency-based Predictive Entropy (PE) method combined with Conformal Prediction to quantify uncertainty in Large Language Models (LLMs) for multiple-choice question answering, especially in black-box settings. By repeatedly sampling LLM outputs and treating the most frequent answer as a reference, the method effectively distinguishes correct from incorrect predictions and keeps miscoverage rates within user-specified levels, demonstrating that sampling frequency can serve as a reliable substitute for inaccessible logit-based probabilities and thereby enhance LLM trustworthiness.
Large Language Models, or LLMs, have made incredible strides in answering multiple-choice questions. However, their widespread use in critical areas like healthcare and finance is often limited by their tendency to “hallucinate” – generating plausible but incorrect information – and their overconfidence in wrong answers. This inherent unreliability poses a significant challenge for deploying LLMs in high-stakes environments.
To tackle this, a new research paper introduces a method for quantifying uncertainty in LLMs even when their internal workings are hidden, a scenario often referred to as a “black-box” setting. The approach leverages a statistical framework called Conformal Prediction (CP), which turns a model’s scores into prediction sets with reliable confidence guarantees. What makes CP particularly valuable is that it requires no specific assumptions about the data’s distribution, guarantees user-specified coverage probabilities, and can wrap around any pre-trained model.
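For context, the coverage guarantee at the heart of CP can be stated in one line. The notation below is a generic textbook formulation, not the paper’s own:

```latex
% For a user-chosen risk level \alpha, the conformal set \mathcal{C}(x) satisfies
\Pr\!\left( y_{\text{test}} \in \mathcal{C}(x_{\text{test}}) \right) \ge 1 - \alpha
% under the standard assumption that calibration and test data are exchangeable.
```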
The core of this new method, detailed in the paper “Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees”, is a frequency-based Predictive Entropy (PE). Instead of relying on internal “logit” scores, which are often inaccessible in black-box LLMs, the technique samples the model’s output repeatedly for each question. The answer that appears most frequently across these samples serves as the reference prediction, and the spread of the samples yields an entropy score. The intuition is simple: if the outputs are highly consistent and cluster around one answer, confidence is high (low entropy, low uncertainty); if the outputs are widely dispersed, uncertainty is high.
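To make this concrete, here is a minimal Python sketch of the frequency-based PE idea under our own assumptions: `sample_answer` is a hypothetical callable wrapping a single black-box LLM query, and `m` is the number of repeated samples. The paper may differ in details such as the entropy’s base or the sampling temperature.

```python
import math
from collections import Counter

def frequency_based_pe(sample_answer, question, options, m=20):
    """Estimate predictive entropy from repeated black-box samples.

    sample_answer: hypothetical callable that queries the LLM once and
    returns one option label (e.g. "A"-"D"); only the model's text
    output is needed, never its logits.
    """
    counts = Counter(sample_answer(question, options) for _ in range(m))
    freqs = [c / m for c in counts.values()]
    # Entropy of the empirical answer distribution: low = consistent/confident.
    entropy = -sum(p * math.log(p) for p in freqs)
    # Most frequent answer acts as the reference prediction.
    reference = counts.most_common(1)[0][0]
    return reference, entropy
```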
The researchers conducted extensive experiments across six different LLMs, including Vicuna and Qwen models, and four diverse datasets: MedMCQA, MedQA, MMLU, and MMLU-Pro. These datasets cover a range of subjects from medical questions to general knowledge, providing a robust testing ground for the proposed method.
The results were compelling. The frequency-based PE consistently outperformed the traditional logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC (Area Under the Receiver Operating Characteristic Curve). For instance, on the MedMCQA dataset, the frequency-based method showed a 2% higher AUROC value with the Qwen2.5-3B-Instruct model compared to the logit-based method. Furthermore, the method effectively controlled the empirical miscoverage rate – the proportion of times the prediction set failed to include the correct answer – keeping it within user-specified risk levels. This crucial finding validates that sampling frequency can indeed serve as a reliable substitute for logit-based probabilities in scenarios where LLMs are treated as black boxes.
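For illustration, the coverage-control step might look like the following split-conformal sketch. The nonconformity score (one minus an option’s sampled frequency) and every variable name here are assumptions made for the example, not the paper’s exact recipe:

```python
import numpy as np

def conformal_sets(cal_scores, test_scores, alpha=0.1):
    """Build conformal prediction sets at a user-specified risk level alpha.

    cal_scores:  shape (n_cal,), nonconformity score of the TRUE answer
                 on each calibration question, e.g. 1 - sampled frequency.
    test_scores: shape (n_test, n_options), the same score for every
                 candidate option of each test question.
    """
    n = len(cal_scores)
    # Finite-sample correction: take the ceil((n+1)(1-alpha))/n quantile.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    # An option enters the set whenever its score is below the threshold.
    return test_scores <= q_hat

# Hypothetical usage: labels holds the index of the true option per question.
# sets = conformal_sets(cal_scores, test_scores, alpha=0.1)
# miscoverage = 1.0 - sets[np.arange(len(labels)), labels].mean()
# The guarantee says miscoverage should stay at or below alpha (here 0.1).
```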
In essence, this work provides a robust and model-agnostic framework for quantifying uncertainty in multiple-choice question answering, even when the internal mechanisms of LLMs are hidden. By enhancing the trustworthiness of LLMs through provable coverage guarantees, this research paves the way for their safer and more reliable application in practical, high-risk domains.


