A New Approach to Quantifying Uncertainty in Large Language Models for Medical Diagnosis

TLDR: Researchers propose applying Approximate Bayesian Computation (ABC) to help Large Language Models (LLMs) express their uncertainty more reliably, especially in critical applications like clinical diagnosis. Unlike existing methods, which often produce overconfident and poorly calibrated predictions, the approach treats the LLM as a simulator and infers a probability distribution over candidate labels. Tested on clinical datasets, it improves accuracy, reduces prediction error, and yields better-calibrated predictions, even when dealing with unfamiliar or quantized data.

Large Language Models (LLMs) are becoming increasingly prevalent in high-stakes fields like clinical decision-making. However, a significant challenge remains: they cannot reliably express uncertainty. This becomes a problem when an LLM confidently provides an incorrect diagnosis, potentially leading to serious consequences. Traditional methods for quantifying LLM uncertainty, such as relying on model logits (the raw scores from which output probabilities are computed) or asking the model to self-report its confidence, often result in overconfident and poorly calibrated predictions.
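To make the logit baseline concrete, here is a minimal sketch of how raw logits are usually turned into a confidence value with a softmax. The logit values are invented for illustration and do not come from the paper.

```python
import math

def softmax(logits):
    """Convert unnormalized logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate diagnoses (illustrative values only).
logits = [6.0, 2.0, 1.0]
probs = softmax(logits)
top_confidence = max(probs)
print(probs, top_confidence)
```

A modest gap between logits already yields a top probability close to 1, which is one reason logit-based confidence can look more certain than the model's actual accuracy warrants.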

Introducing Approximate Bayesian Computation (ABC)

A recent research paper, titled “Uncertainty Quantification of Large Language Models Using Approximate Bayesian Computation,” by Mridul Sharma, Adeetya Patel, Zaneta D’souza, Samira Abbasgholizadeh Rahimi, Siva Reddy, and Sreenath Madathil, proposes a novel solution: Approximate Bayesian Computation (ABC). This approach offers a principled Bayesian framework for quantifying the uncertainty in LLM predictions without access to the model’s internal workings or gradients. The core idea is to treat the LLM as a ‘stochastic simulator’ that generates text conditioned on a given hypothesis.

How ABC Works for Text Classification

In a typical text classification task, an LLM directly predicts a class label (e.g., a diagnosis) from an input text (e.g., patient symptoms). The ABC framework re-frames this. Instead of direct prediction, it asks: “How likely is a hypothesized health condition to have produced symptoms similar to those observed in a patient?”
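In symbols, this re-framing can be written roughly as follows. The notation is an assumption made here for illustration, not taken from the paper: y is a candidate label, x_obs the observed description, e(·) an embedding function, d a distance, and ε an acceptance tolerance.

```latex
\begin{aligned}
p(y \mid x_{\mathrm{obs}}) &\propto p(x_{\mathrm{obs}} \mid y)\, p(y), \\
p_{\epsilon}(y \mid x_{\mathrm{obs}}) &\propto p(y)\,
  \Pr_{\tilde{x} \sim p_{\mathrm{LLM}}(\cdot \mid y)}
  \!\left[ d\big(e(\tilde{x}),\, e(x_{\mathrm{obs}})\big) \le \epsilon \right].
\end{aligned}
```

The first line is Bayes' rule with an intractable likelihood. The second replaces that likelihood with the probability that a description simulated from the LLM lands within ε of the observed one, which can be estimated by sampling alone.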

The process involves several steps:

First, a candidate class label (a potential diagnosis) is sampled from a prior distribution.
Next, the LLM is prompted to generate a text description (simulated symptoms) conditioned on this candidate label.
Both the generated description and the actual patient’s description are then converted into numerical representations (embeddings).
The semantic similarity between these two embeddings is measured using a distance metric.
If the simulated description is sufficiently close to the observed one, the candidate label is ‘accepted’ as plausible.
This process is repeated many times, building an approximate posterior distribution over all possible class labels. This distribution then reflects the model’s uncertainty about the correct diagnosis.
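The steps above can be sketched as a small rejection sampler. This is a toy illustration under stated assumptions: `simulate_description` is a hypothetical stand-in for the LLM call, `embed` is a bag-of-words stand-in for a sentence encoder, and the symptom lists, tolerance, and labels are invented for the example.

```python
import math
import random
from collections import Counter

# Hypothetical symptom vocabulary per label; an LLM would generate text instead.
SYMPTOMS = {
    "flu": ["fever", "cough", "body aches", "fatigue", "chills"],
    "allergy": ["sneezing", "itchy eyes", "runny nose", "cough", "congestion"],
    "migraine": ["headache", "nausea", "light sensitivity", "aura", "fatigue"],
}

def simulate_description(label, rng, k=3):
    """Toy simulator: sample k symptoms for a label (an LLM call in practice)."""
    return " ".join(rng.sample(SYMPTOMS[label], k))

def embed(text):
    """Toy embedding: bag-of-words counts (a sentence encoder in practice)."""
    return Counter(text.split())

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def abc_rejection(observed, prior, n_draws=5000, epsilon=0.45, seed=0):
    rng = random.Random(seed)
    labels = list(prior)
    weights = [prior[l] for l in labels]
    obs_vec = embed(observed)
    accepted = Counter()
    for _ in range(n_draws):
        label = rng.choices(labels, weights=weights)[0]   # 1. sample a label from the prior
        simulated = simulate_description(label, rng)       # 2. simulate a description
        distance = cosine_distance(embed(simulated), obs_vec)  # 3-4. embed and compare
        if distance <= epsilon:                             # 5. accept if close enough
            accepted[label] += 1
    total = sum(accepted.values())
    # 6. accepted counts, normalized, approximate the posterior over labels
    return {l: accepted[l] / total for l in labels} if total else dict(prior)

prior = {"flu": 1 / 3, "allergy": 1 / 3, "migraine": 1 / 3}
posterior = abc_rejection("fever cough fatigue congestion", prior)
print(posterior)
```

In this toy setup most of the accepted draws come from "flu", a smaller share from "allergy", and none from "migraine", so the resulting distribution expresses residual uncertainty rather than a single hard label.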

The researchers utilized both a basic ABC rejection sampling method and a more advanced Sequential Monte Carlo ABC (SMC-ABC) to improve efficiency and refine the posterior distribution iteratively.
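The article does not describe the paper's SMC-ABC kernel or weighting, so the following is only a simplified population-based sketch in the spirit of SMC-ABC: the tolerance shrinks over rounds, each round proposes labels from the previous round's estimate, and importance weights correct for not proposing from the prior. `simulate_distance` and its per-label distances are hypothetical stand-ins for LLM generation, embedding, and the distance computation.

```python
import random

# Hypothetical typical distances between simulated and observed text per label.
MEAN_DISTANCE = {"flu": 0.30, "allergy": 0.50, "migraine": 0.80}

def simulate_distance(label, rng):
    """Stand-in for: generate text with the LLM, embed it, measure the distance."""
    return min(1.0, max(0.0, rng.gauss(MEAN_DISTANCE[label], 0.15)))

def smc_abc(prior, tolerances, n_particles=500, mix=0.1, seed=0):
    rng = random.Random(seed)
    labels = list(prior)
    proposal = dict(prior)  # the first round proposes from the prior
    history = []
    for eps in tolerances:  # tolerances should be decreasing
        weight_sums = {l: 0.0 for l in labels}
        accepted = 0
        while accepted < n_particles:
            label = rng.choices(labels, weights=[proposal[l] for l in labels])[0]
            if simulate_distance(label, rng) <= eps:
                # importance weight: prior over proposal
                weight_sums[label] += prior[label] / proposal[label]
                accepted += 1
        total = sum(weight_sums.values())
        posterior = {l: weight_sums[l] / total for l in labels}
        history.append(posterior)
        # next proposal: current estimate mixed with uniform so no label is starved
        proposal = {l: (1 - mix) * posterior[l] + mix / len(labels) for l in labels}
    return history

prior = {"flu": 1 / 3, "allergy": 1 / 3, "migraine": 1 / 3}
history = smc_abc(prior, tolerances=[0.6, 0.45, 0.2])
for eps, post in zip([0.6, 0.45, 0.2], history):
    print(eps, post)
```

The efficiency argument is that later rounds spend most of their simulations on labels that earlier rounds found plausible, instead of drawing repeatedly from the prior at a tight tolerance.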

Significant Improvements in Clinical Benchmarks

The ABC approach was rigorously evaluated on two clinically relevant datasets: a synthetic oral lesion diagnosis dataset and the publicly available GretelAI Symptom-to-Diagnosis dataset. These datasets represent different levels of complexity and noise in clinical scenarios. The experiments involved several widely used LLMs, including Mistral-7B-Instruct-V3, Llama-3.1-8B-Instruct, and domain-specific models like Llama3-Med42-8B.

Compared to standard baselines (model logits and elicited probabilities), the ABC approach demonstrated remarkable improvements:

Accuracy increased by up to 46.9%.
Brier scores (a measure of prediction error) were reduced by 74.4%.
Calibration, as measured by Expected Calibration Error (ECE), improved significantly, with reductions of up to 87.9%.
The method also led to sharper and more confident predictive distributions, indicated by lower entropy levels.
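For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It uses the multiclass Brier score summed over classes and an equal-width-bin ECE; the paper's exact conventions (bin count, normalization) are not stated in this article, and the example predictions are invented.

```python
import math

def brier_score(probs, labels):
    """Mean squared error between predicted distributions and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between confidence and accuracy across confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / len(probs) * abs(avg_conf - accuracy)
    return ece

def mean_entropy(probs):
    """Average Shannon entropy (in nats); lower means sharper distributions."""
    return sum(-sum(q * math.log(q) for q in p if q > 0) for p in probs) / len(probs)

# Illustrative two-class predictions: right half the time in both cases.
labels = [0, 0, 1, 1]
overconfident = [[0.99, 0.01]] * 4
hedged = [[0.5, 0.5]] * 4
for name, preds in [("overconfident", overconfident), ("hedged", hedged)]:
    print(name, brier_score(preds, labels),
          expected_calibration_error(preds, labels), mean_entropy(preds))
```

The overconfident predictor has low entropy but a large Brier score and ECE, which is why sharpness is only a gain when it comes together with good calibration.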

These gains were consistent across both general-purpose and specialized medical LLMs, highlighting the robustness of the ABC framework. Furthermore, the ABC methods proved resilient to out-of-distribution (OOD) samples (unfamiliar cases) and variations in sampling temperature, expressing appropriate uncertainty where baselines often failed.

Addressing Computational Challenges and Limitations

While powerful, the ABC framework does come with a computational cost. It requires multiple LLM queries per instance, which can increase inference time. To mitigate this, the researchers proposed using a simpler ABC rejection sampling variant and a vector database approach where pre-generated class descriptions are stored as embeddings for efficient retrieval, transforming a generative task into a retrieval one.
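The retrieval idea can be sketched as follows. A linear scan over an in-memory list stands in for a real vector database, the bag-of-words `embed` stands in for a sentence encoder, and the stored descriptions are hypothetical examples of pre-generated LLM output. Each label gets the same number of stored descriptions here, which corresponds to a uniform prior.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (a sentence encoder in practice)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

# Built once, offline: hypothetical pre-generated descriptions, two per label.
PREGENERATED = [
    ("flu", "fever cough chills"),
    ("flu", "fever fatigue cough"),
    ("allergy", "sneezing cough congestion"),
    ("allergy", "sneezing congestion itchy eyes"),
    ("migraine", "headache nausea aura"),
    ("migraine", "headache light sensitivity"),
]
INDEX = [(label, embed(text)) for label, text in PREGENERATED]

def retrieve_posterior(observed, epsilon=0.45):
    """Count stored descriptions within epsilon of the query, per label."""
    obs = embed(observed)
    hits = Counter(label for label, vec in INDEX if cosine_distance(vec, obs) <= epsilon)
    total = sum(hits.values())
    return {label: count / total for label, count in hits.items()} if total else {}

posterior = retrieve_posterior("fever cough fatigue congestion")
print(posterior)
```

No LLM call happens at query time: the cost of generation is paid once when the index is built, and inference reduces to one embedding plus a nearest-neighbor style lookup.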

The paper also acknowledges an edge case where ABC might struggle: when two conditions share many clinical features but differ by only a few rare, critical symptoms. In such scenarios, LLM-generated descriptions might omit these crucial discriminative features. Proposed solutions include increasing sampling diversity and re-framing prompts to encourage the LLM to generate comprehensive lists of clinical signs and symptoms.

A Path Towards More Trustworthy AI in Healthcare

This research marks a significant step towards making LLMs more reliable and trustworthy in critical applications like clinical diagnostics. By providing a principled way to quantify predictive uncertainty, the Approximate Bayesian Computation framework enables LLMs not only to make predictions but also to express how confident they are in those predictions, which is crucial for human decision-makers. For more details, see the full research paper.
