
A New Approach to Quantifying Uncertainty in Large Language Models for Medical Diagnosis

TLDR: A new method called Approximate Bayesian Computation (ABC) is proposed to help Large Language Models (LLMs) better express their uncertainty, especially in critical applications like clinical diagnosis. Unlike existing methods that often produce overconfident and poorly calibrated predictions, ABC treats LLMs as simulators to infer robust probability distributions. Tested on clinical datasets, this approach significantly improves accuracy, reduces prediction errors, and enhances the reliability of LLM predictions, even when dealing with unfamiliar or quantized data.

Large Language Models (LLMs) are becoming increasingly prevalent in high-stakes fields like clinical decision-making. However, a significant challenge remains: their inability to reliably express uncertainty. This is dangerous when an LLM confidently provides an incorrect diagnosis, potentially leading to serious consequences. Traditional methods for quantifying LLM uncertainty, such as relying on model logits (raw output scores that are normalized into probabilities) or asking the model to self-report its confidence, often result in overconfident and poorly calibrated predictions.

Introducing Approximate Bayesian Computation (ABC)

A recent research paper, titled “UNCERTAINTY QUANTIFICATION OF LARGE LANGUAGE MODELS USING APPROXIMATE BAYESIAN COMPUTATION,” by Mridul Sharma, Adeetya Patel, Zaneta D’souza, Samira Abbasgholizadeh Rahimi, Siva Reddy, and Sreenath Madathil, proposes a novel solution: Approximate Bayesian Computation (ABC). This approach offers a principled Bayesian framework for understanding and quantifying the uncertainty in LLM predictions, even without needing to access the model’s internal workings or gradients. The core idea is to treat the LLM as a ‘stochastic simulator’ that can generate text based on a given hypothesis.

How ABC Works for Text Classification

In a typical text classification task, an LLM directly predicts a class label (e.g., a diagnosis) from an input text (e.g., patient symptoms). The ABC framework re-frames this. Instead of direct prediction, it asks: “How likely is a hypothesized health condition to have produced symptoms similar to those observed in a patient?”

The process involves several steps:

  • First, a candidate class label (a potential diagnosis) is sampled from a prior distribution.
  • Next, the LLM is prompted to generate a text description (simulated symptoms) conditioned on this candidate label.
  • Both the generated description and the actual patient’s description are then converted into numerical representations (embeddings).
  • The semantic similarity between these two embeddings is measured using a distance metric.
  • If the simulated description is sufficiently close to the observed one, the candidate label is ‘accepted’ as plausible.
  • This process is repeated many times, building an approximate posterior distribution over all possible class labels. This distribution then reflects the model’s uncertainty about the correct diagnosis.
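
The steps above can be sketched in a few lines of Python. This is a minimal illustration of ABC rejection sampling, not the paper's implementation. Everything named here is an assumption made for the example: the `SYMPTOMS` vocabulary, the `simulate_description` function that stands in for the LLM call, the bag-of-words `embed` function that stands in for a sentence encoder, and the tolerance `epsilon`.

```python
import math
import random
from collections import Counter

# Hypothetical symptom vocabularies. In the real setting an LLM would
# generate free-text descriptions conditioned on the candidate label.
SYMPTOMS = {
    "flu": ["fever", "cough", "fatigue", "headache", "chills"],
    "migraine": ["headache", "nausea", "aura", "photophobia", "fatigue"],
    "allergy": ["sneezing", "itching", "rash", "congestion", "cough"],
}

def simulate_description(label, rng, n_words=4):
    """Stand-in for the LLM simulator: sample a few symptoms for a label."""
    return " ".join(rng.sample(SYMPTOMS[label], n_words))

def embed(text):
    """Toy bag-of-words embedding (a real system would use a sentence encoder)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def abc_rejection(observed, prior, epsilon=0.3, n_draws=5000, seed=0):
    """Approximate the posterior over labels by keeping draws whose
    simulated description lands within `epsilon` of the observation."""
    rng = random.Random(seed)
    labels, weights = zip(*prior.items())
    obs_vec = embed(observed)
    accepted = Counter()
    for _ in range(n_draws):
        label = rng.choices(labels, weights=weights)[0]   # sample from the prior
        simulated = simulate_description(label, rng)      # simulate symptoms
        if cosine_distance(embed(simulated), obs_vec) <= epsilon:
            accepted[label] += 1                          # accept plausible label
    total = sum(accepted.values())
    if total == 0:
        return {label: 0.0 for label in labels}           # nothing was accepted
    return {label: accepted[label] / total for label in labels}

prior = {"flu": 0.5, "migraine": 0.3, "allergy": 0.2}
posterior = abc_rejection("headache fatigue nausea fever", prior)
print(posterior)
```

In this toy setup the observed description overlaps with both the flu and migraine vocabularies, so the accepted labels split between those two classes while allergy is never accepted. The spread of the resulting distribution is what expresses uncertainty.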

The researchers utilized both a basic ABC rejection sampling method and a more advanced Sequential Monte Carlo ABC (SMC-ABC) to improve efficiency and refine the posterior distribution iteratively.
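
To give a feel for the sequential variant, here is a rough sketch of a textbook-style SMC-ABC loop over discrete labels. It is not taken from the paper. The tolerance schedule `epsilons`, the `jump` probability of the perturbation kernel, the `SYMPTOMS` vocabulary, and the `simulate_distance` stand-in for "LLM generation plus embedding distance" are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical symptom vocabularies standing in for LLM generations.
SYMPTOMS = {
    "flu": ["fever", "cough", "fatigue", "headache", "chills"],
    "migraine": ["headache", "nausea", "aura", "photophobia", "fatigue"],
    "allergy": ["sneezing", "itching", "rash", "congestion", "cough"],
}

def simulate_distance(label, obs_vec, rng):
    """Simulate a description for `label`; return its cosine distance to the observation."""
    sim = Counter(rng.sample(SYMPTOMS[label], 4))
    dot = sum(sim[w] * obs_vec[w] for w in sim)
    norm = math.sqrt(sum(v * v for v in sim.values())) * math.sqrt(sum(v * v for v in obs_vec.values()))
    return 1.0 - dot / norm if norm else 1.0

def kernel_prob(new, old, n_labels, jump):
    """Perturbation kernel: keep the label with prob 1 - jump, else move uniformly."""
    return 1.0 - jump if new == old else jump / (n_labels - 1)

def smc_abc(observed, prior, epsilons=(0.6, 0.3), n_particles=1000,
            jump=0.2, seed=0, max_tries=200_000):
    rng = random.Random(seed)
    labels = list(prior)
    obs_vec = Counter(observed.lower().split())
    prev_mass = None  # weighted label distribution from the previous round
    for t, eps in enumerate(epsilons):
        new_mass = {label: 0.0 for label in labels}
        accepted = tries = 0
        while accepted < n_particles:
            tries += 1
            if tries > max_tries:
                raise RuntimeError("tolerance too strict: too few acceptances")
            if t == 0:
                # First round: propose directly from the prior.
                cand = rng.choices(labels, weights=[prior[l] for l in labels])[0]
            else:
                # Later rounds: resample a previous particle, then perturb it.
                parent = rng.choices(labels, weights=[prev_mass[l] for l in labels])[0]
                cand = parent
                if rng.random() < jump:
                    cand = rng.choice([l for l in labels if l != parent])
            if prior[cand] == 0 or simulate_distance(cand, obs_vec, rng) > eps:
                continue  # rejected at this tolerance
            if t == 0:
                weight = 1.0
            else:
                # Importance weight: prior over the proposal probability.
                proposal = sum(prev_mass[l] * kernel_prob(cand, l, len(labels), jump)
                               for l in labels)
                weight = prior[cand] / proposal
            new_mass[cand] += weight
            accepted += 1
        total = sum(new_mass.values())
        prev_mass = {l: m / total for l, m in new_mass.items()}
    return prev_mass

prior = {"flu": 0.5, "migraine": 0.3, "allergy": 0.2}
posterior = smc_abc("headache fatigue nausea fever", prior)
print(posterior)
```

The design point is that each round starts from labels that already passed a looser tolerance, so fewer simulations are wasted on implausible candidates, while the importance weights correct for the fact that proposals no longer come straight from the prior.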

Significant Improvements in Clinical Benchmarks

The ABC approach was evaluated on two clinically relevant datasets: a synthetic oral lesion diagnosis dataset and the publicly available GretelAI Symptom-to-Diagnosis dataset. These datasets represent different levels of complexity and noise in clinical scenarios. The experiments involved several widely used LLMs, including Mistral-7B-Instruct-V3, Llama-3.1-8B-Instruct, and domain-specific models like Llama3-Med42-8B.

Compared to standard baselines (model logits and elicited probabilities), the ABC approach demonstrated remarkable improvements:

  • Accuracy increased by up to 46.9%.
  • Brier scores (a measure of prediction error) were reduced by 74.4%.
  • Calibration, as measured by Expected Calibration Error (ECE), improved significantly, with reductions of up to 87.9%.
  • The method also led to sharper and more confident predictive distributions, indicated by lower entropy levels.
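
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. The definitions are standard textbook forms and may differ in detail from the paper's (for example, this Brier score sums squared errors over all classes, and ECE uses ten equal-width confidence bins). The `probs` and `labels` values are made-up illustration data, not results from the study.

```python
import math

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between confidence and accuracy across confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            ece += len(b) / len(probs) * abs(avg_conf - accuracy)
    return ece

def mean_entropy(probs):
    """Average Shannon entropy (in nats) of the predictive distributions."""
    return sum(-sum(p_k * math.log(p_k) for p_k in p if p_k > 0) for p in probs) / len(probs)

# Made-up predictions over two classes, with the true class index for each case.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
labels = [0, 1, 1, 0]

print(brier_score(probs, labels))
print(expected_calibration_error(probs, labels))
print(mean_entropy(probs))
```

Lower is better for the Brier score and ECE. Lower entropy means a sharper distribution, which is only desirable when calibration is also good.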

These gains were consistent across both general-purpose and specialized medical LLMs, highlighting the robustness of the ABC framework. Furthermore, the ABC methods proved resilient to out-of-distribution (OOD) samples (unfamiliar cases) and variations in sampling temperature, expressing appropriate uncertainty where baselines often failed.

Addressing Computational Challenges and Limitations

While powerful, the ABC framework does come with a computational cost. It requires multiple LLM queries per instance, which can increase inference time. To mitigate this, the researchers proposed using a simpler ABC rejection sampling variant and a vector database approach where pre-generated class descriptions are stored as embeddings for efficient retrieval, transforming a generative task into a retrieval one.
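
The retrieval idea can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' system: an in-memory list (`INDEX`) stands in for the vector database, the `PREGENERATED` descriptions are hypothetical, the bag-of-words `embed` function replaces a real sentence encoder, and scoring each class by its prior times the fraction of its stored descriptions within `epsilon` is one plausible reading of the approach.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (a real system would use a sentence encoder)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical descriptions generated once, offline, for each class.
PREGENERATED = [
    ("flu", "fever cough fatigue chills"),
    ("flu", "fever headache fatigue cough"),
    ("migraine", "headache nausea aura photophobia"),
    ("migraine", "headache nausea fatigue photophobia"),
    ("allergy", "sneezing itching rash congestion"),
    ("allergy", "sneezing congestion cough itching"),
]

# In-memory stand-in for a vector database: (label, embedding) pairs.
INDEX = [(label, embed(text)) for label, text in PREGENERATED]

def retrieval_posterior(observed, prior, epsilon=0.3):
    """Score each label by prior times the share of its stored
    descriptions that fall within `epsilon` of the observation."""
    obs_vec = embed(observed)
    stored = Counter(label for label, _ in INDEX)
    accepted = Counter(label for label, vec in INDEX
                       if cosine_distance(vec, obs_vec) <= epsilon)
    scores = {
        label: prior[label] * accepted[label] / stored[label] if stored[label] else 0.0
        for label in prior
    }
    total = sum(scores.values())
    return {label: (s / total if total else 0.0) for label, s in scores.items()}

prior = {"flu": 0.5, "migraine": 0.3, "allergy": 0.2}
posterior = retrieval_posterior("headache fatigue nausea fever", prior)
print(posterior)
```

Because the descriptions are embedded ahead of time, inference needs no LLM calls at all, only one embedding of the patient text and a nearest-neighbor style lookup.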

The paper also acknowledges an edge case where ABC might struggle: when two conditions share many clinical features but differ by only a few rare, critical symptoms. In such scenarios, LLM-generated descriptions might omit these crucial discriminative features. Proposed solutions include increasing sampling diversity and re-framing prompts to encourage the LLM to generate comprehensive lists of clinical signs and symptoms.

A Path Towards More Trustworthy AI in Healthcare

This research marks a significant step towards making LLMs more reliable and trustworthy in critical applications like clinical diagnostics. By providing a principled way to quantify predictive uncertainty, the Approximate Bayesian Computation framework enables LLMs not only to make predictions but also to express how confident they are in those predictions, which is crucial for human decision-makers. For more details, see the full research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society.
