TLDR: A new research paper introduces an enhanced ensembling technique for Large Language Models (LLMs) that substantially increases the trustworthiness of their responses. By allowing LLM ensembles to ‘abstain’ from answering when the dominant response does not reach a set voting threshold, the method significantly reduces hallucinations. Validated on arithmetic and clinical data tasks, the approach trades a modest reduction in response yield and overall accuracy for a marked improvement in the reliability of the answers it does provide, making LLMs more suitable for high-stakes applications like healthcare and data annotation.
Large Language Models (LLMs) have made incredible strides, but a persistent challenge remains: their tendency to confidently produce incorrect information, a phenomenon known as hallucination. This makes them difficult to trust in critical applications like healthcare, where accuracy is paramount. A new research paper, “Increasing LLM Response Trustworthiness Using Voting Ensembles”, introduces an innovative approach to tackle this issue by leveraging the power of collective intelligence through voting ensembles.
The core idea isn’t entirely new; ensembling, which involves gathering multiple responses and selecting the most frequent one, has been a known method to elicit more accurate answers. However, this paper expands on traditional ensembling by introducing a variable voting threshold. This means that an ensemble of LLMs can choose to “abstain” from providing an answer if the most dominant response doesn’t meet a predefined level of agreement among the agents. This simple yet powerful modification dramatically increases the trustworthiness of the answers that are ultimately provided.
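In code, the mechanism reduces to an ordinary majority vote that withholds its answer unless the leading response clears the threshold. Here is a minimal Python sketch; the function name and the convention that `None` signals abstention are our own, not the paper’s.

```python
from collections import Counter

def ensemble_vote(responses, threshold=0.5):
    """Majority vote with abstention: return the most common response
    only if its share of the ensemble meets the voting threshold."""
    counts = Counter(responses)
    answer, votes = counts.most_common(1)[0]
    if votes / len(responses) >= threshold:
        return answer
    return None  # abstain: no sufficiently dominant response

# Example: five agents answer the same question.
print(ensemble_vote(["42", "42", "42", "41", "42"], threshold=0.8))  # "42" (4/5 agree)
print(ensemble_vote(["42", "41", "42", "40", "42"], threshold=0.8))  # None (only 3/5 agree)
```

A threshold of 0.5 recovers plain majority voting; raising it toward 1.0 makes the ensemble increasingly willing to abstain rather than return a contested answer.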
To understand how this works, the researchers developed a theoretical framework that characterizes questions based on two key factors: deceptiveness and bewilderment. Deceptiveness refers to a question’s tendency to mislead an agent into choosing a plausible but incorrect answer. Bewilderment, on the other hand, quantifies how much a question forces an agent to guess randomly. By understanding these characteristics, the framework can predict how an ensemble will behave under different voting strategies.
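To make the two quantities concrete, one can model each agent’s answer as a draw from a simple distribution: with probability d the agent is misled into the deceptive distractor, with probability b it guesses uniformly among the options, and otherwise it answers correctly. This toy model is our simplified reading of the framework, not the paper’s exact formalization.

```python
import random

def simulate_agent(d: float, b: float, k: int = 4) -> str:
    """One agent's answer to a question with deceptiveness d and
    bewilderment b (requires d + b <= 1). Returns the correct answer,
    the deceptive distractor, or a uniform guess over k options."""
    options = ["correct", "deceptive"] + [f"other_{i}" for i in range(k - 2)]
    r = random.random()
    if r < d:
        return "deceptive"            # misled into the plausible-but-wrong answer
    if r < d + b:
        return random.choice(options) # bewildered: pure random guess
    return "correct"
```

Under this model, a highly deceptive question concentrates wrong answers on a single option, which is precisely where naive majority voting can confidently agree on a wrong answer and where abstention pays off.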
The performance of these voting ensembles is evaluated based on three criteria: accuracy (the probability of a correct response), trust (the probability that any consensus answer is correct), and yield (the probability that a question receives a consensus answer). The findings show a clear trade-off: while increasing the voting threshold (making the ensemble more restrictive) can lead to a slight reduction in overall response yield and accuracy, it significantly boosts the trustworthiness of the answers that are given. This makes the approach particularly valuable for fields requiring a high degree of certainty, even if it means not every question receives an automated answer.
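A short simulation, built on the two sketches above, makes the trade-off visible when sweeping the threshold (the parameter values here are illustrative assumptions, not the paper’s experimental settings).

```python
def evaluate(thresholds, n_agents=7, n_questions=2000, d=0.25, b=0.2):
    """Estimate accuracy, trust, and yield at each voting threshold
    using the toy agent model and voting function defined above."""
    for t in thresholds:
        answered = correct = 0
        for _ in range(n_questions):
            responses = [simulate_agent(d, b) for _ in range(n_agents)]
            consensus = ensemble_vote(responses, threshold=t)
            if consensus is not None:
                answered += 1
                correct += (consensus == "correct")
        accuracy = correct / n_questions                           # P(correct response)
        trust = correct / answered if answered else float("nan")   # P(correct | answered)
        coverage = answered / n_questions                          # yield: P(consensus given)
        print(f"threshold={t:.1f}  accuracy={accuracy:.2f}  "
              f"trust={trust:.2f}  yield={coverage:.2f}")

evaluate([0.3, 0.5, 0.7, 0.9])
```

Raising the threshold should push trust toward 1 while yield, and hence overall accuracy (which counts abstentions as non-answers), falls, mirroring the trade-off the paper reports.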
The theoretical results were validated through experiments in two distinct domains: arithmetic problem-solving and clinical-note question-answering. In arithmetic, ensembles of the Llama3-70B-instruct model were tasked with multi-digit multiplication and order-of-operations problems. The results consistently demonstrated that while ensembling didn’t always improve accuracy over a single model, it substantially increased trust, especially with more restrictive voting. Even when using Chain-of-Thought (CoT) prompting, which generally improves LLM performance, the benefits of the ensembling approach for trust remained evident.
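Applying the same vote to free-form CoT output requires one extra step: extracting a canonical final answer from each completion before counting votes. A hedged sketch follows, where `generate` is a hypothetical placeholder for any temperature-sampled call to the model, not an API from the paper, and the regex extraction is deliberately naive.

```python
import re

def cot_ensemble_answer(question, generate, n=7, threshold=0.7):
    """Sample n Chain-of-Thought completions, extract each final numeric
    answer with a simple regex, and apply the thresholded vote."""
    prompt = (f"{question}\nThink step by step, then state the final "
              f"answer on its own line as 'Answer: <number>'.")
    answers = []
    for _ in range(n):
        match = re.search(r"Answer:\s*(-?\d+)", generate(prompt))
        if match:
            answers.append(match.group(1))
    return ensemble_vote(answers, threshold=threshold) if answers else None
```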
For clinical applications, Llama3-8B-instruct models were used to extract specific patient characteristics from echocardiogram reports. Here too, ensembling kept accuracy stable or marginally improved it, and, most notably, significantly increased trust across all extracted features. For instance, under more restrictive voting schemes, trust for Left Ventricular Ejection Fraction (LVEF) increased from 0.94 to 0.98, and for Mitral Regurgitation (MR) from 0.70 to 0.93.
This work highlights that voting ensembles offer a practical and effective method for quantifying and mitigating uncertainty in LLM responses. By allowing models to abstain when consensus is low, the approach provides a valuable tool for deploying LLMs in high-stakes environments where reliability is essential, helping to bridge the gap between advanced AI capabilities and real-world trust requirements.


