TLDR: This research introduces a frequency-based Predictive Entropy (PE) method combined with Conformal Prediction to quantify uncertainty in Large Language Models (LLMs) for multiple-choice question answering, especially in black-box settings. By repeatedly sampling LLM outputs and treating the most frequent answer as a reference, the method effectively distinguishes correct from incorrect predictions and keeps miscoverage rates within user-specified levels, demonstrating that sampling frequency can serve as a reliable substitute for inaccessible logit-based probabilities and thereby enhance LLM trustworthiness.
Large Language Models, or LLMs, have made incredible strides in answering multiple-choice questions. However, their widespread use in critical areas like healthcare and finance is often limited by their tendency to “hallucinate” – generating plausible but incorrect information – and their overconfidence in wrong answers. This inherent unreliability poses a significant challenge for deploying LLMs in high-stakes environments.
To tackle this, a new research paper introduces a method for quantifying uncertainty in LLMs even when their internal workings are hidden, a scenario often referred to as a “black-box” setting. The approach leverages a statistical framework called Conformal Prediction (CP), which turns a model’s scores into prediction sets with reliable confidence guarantees. What makes CP particularly valuable is that it requires no specific assumptions about the data’s distribution, guarantees user-specified coverage probabilities, and can wrap around any pre-trained model.
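For context, the coverage guarantee at the heart of CP can be stated in one line. The notation below is a generic textbook formulation, not the paper’s own:

```latex
% For a user-chosen risk level \alpha, the conformal set \mathcal{C}(x) satisfies
\Pr\!\left( y_{\text{test}} \in \mathcal{C}(x_{\text{test}}) \right) \ge 1 - \alpha
% under the standard assumption that calibration and test data are exchangeable.
```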
The core of this new method, detailed in the paper “Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees”, is a frequency-based Predictive Entropy (PE). Instead of relying on internal “logit” scores, which are often inaccessible in black-box LLMs, the technique samples the model’s output repeatedly for each question. The answer that appears most frequently across these samples serves as the reference prediction, and the spread of the samples yields an entropy score. The intuition is simple: if the outputs are highly consistent and cluster around one answer, confidence is high (low entropy, low uncertainty); if the outputs are widely dispersed, uncertainty is high.
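To make this concrete, here is a minimal Python sketch of the frequency-based PE idea under our own assumptions: `sample_answer` is a hypothetical callable wrapping a single black-box LLM query, and `m` is the number of repeated samples. The paper may differ in details such as the entropy’s base or the sampling temperature.

```python
import math
from collections import Counter

def frequency_based_pe(sample_answer, question, options, m=20):
    """Estimate predictive entropy from repeated black-box samples.

    sample_answer: hypothetical callable that queries the LLM once and
    returns one option label (e.g. "A"-"D"); only the model's text
    output is needed, never its logits.
    """
    counts = Counter(sample_answer(question, options) for _ in range(m))
    freqs = [c / m for c in counts.values()]
    # Entropy of the empirical answer distribution: low = consistent/confident.
    entropy = -sum(p * math.log(p) for p in freqs)
    # Most frequent answer acts as the reference prediction.
    reference = counts.most_common(1)[0][0]
    return reference, entropy
```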
The researchers conducted extensive experiments across six different LLMs, including Vicuna and Qwen models, and four diverse datasets: MedMCQA, MedQA, MMLU, and MMLU-Pro. These datasets cover a range of subjects from medical questions to general knowledge, providing a robust testing ground for the proposed method.
The results were compelling. The frequency-based PE consistently outperformed the traditional logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC (Area Under the Receiver Operating Characteristic Curve). For instance, on the MedMCQA dataset, the frequency-based method showed a 2% higher AUROC value with the Qwen2.5-3B-Instruct model compared to the logit-based method. Furthermore, the method effectively controlled the empirical miscoverage rate – the proportion of times the prediction set failed to include the correct answer – keeping it within user-specified risk levels. This crucial finding validates that sampling frequency can indeed serve as a reliable substitute for logit-based probabilities in scenarios where LLMs are treated as black boxes.
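For illustration, the coverage-control step might look like the following split-conformal sketch. The nonconformity score (one minus an option’s sampled frequency) and every variable name here are assumptions made for the example, not the paper’s exact recipe:

```python
import numpy as np

def conformal_sets(cal_scores, test_scores, alpha=0.1):
    """Build conformal prediction sets at a user-specified risk level alpha.

    cal_scores:  shape (n_cal,), nonconformity score of the TRUE answer
                 on each calibration question, e.g. 1 - sampled frequency.
    test_scores: shape (n_test, n_options), the same score for every
                 candidate option of each test question.
    """
    n = len(cal_scores)
    # Finite-sample correction: take the ceil((n+1)(1-alpha))/n quantile.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    # An option enters the set whenever its score is below the threshold.
    return test_scores <= q_hat

# Hypothetical usage: labels holds the index of the true option per question.
# sets = conformal_sets(cal_scores, test_scores, alpha=0.1)
# miscoverage = 1.0 - sets[np.arange(len(labels)), labels].mean()
# The guarantee says miscoverage should stay at or below alpha (here 0.1).
```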
In essence, this work provides a robust and model-agnostic framework for quantifying uncertainty in multiple-choice question answering, even when the internal mechanisms of LLMs are hidden. By enhancing the trustworthiness of LLMs through provable coverage guarantees, this research paves the way for their safer and more reliable application in practical, high-risk domains.


