TLDR: A research paper by IBM Research Brazil investigates the consistency of small Large Language Models (LLMs, 2B-8B parameters) when answering the same multiple-choice questions repeatedly. The study, using MMLU-Redux and MedQA benchmarks, found that small LLMs typically provide consistent answers for only 50-80% of questions at low inference temperatures, while larger models (50B-80B) show much higher consistency (over 95%). The paper introduces a “consistency plot” for visualization and highlights the importance of consistency for non-creative LLM applications like customer service and medical diagnosis.
A recent study from IBM Research Brazil delves into a critical, yet often overlooked, aspect of Large Language Models (LLMs): their consistency in providing answers. Titled “The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks,” the paper by Claudio Pinhanez, Paulo Cavalin, Cassia Sanctos, Marcelo Grave, and Yago Primerano sheds light on the unpredictable nature of smaller LLMs, particularly those with 2 to 8 billion parameters.
The core of this research addresses the non-deterministic behavior inherent in most Transformer-based LLMs, which stems largely from the random sampling used during inference. While this non-determinism can be beneficial for creative applications, it poses significant challenges for non-creative uses where consistent, reliable responses are paramount. Imagine a customer service chatbot giving different answers to the same question, or an AI radiologist providing contradictory diagnoses for identical X-rays; such scenarios could lead to legal issues, erode trust, and even compromise safety.
Defining and Measuring Consistency
To quantify this phenomenon, the researchers introduced a clear definition of “answer consistency” by comparing an LLM’s performance to an “oracle machine” guessing at a certain success rate. For their experiments, a model was deemed 0.99-consistent if it answered at least 9 out of 10 repetitions of a multiple-choice question with the same choice. This objective criterion was applied to questions from standard benchmarks like MMLU-Redux (general knowledge) and MedQA (medical expertise).
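The 9-of-10 criterion can be sketched in a few lines of Python. This is a minimal illustration of the agreement check, not the authors' code; the function name and threshold parameter are assumptions for the example.

```python
from collections import Counter

def is_consistent(answers, min_agree=9):
    """Return True if the most frequent choice appears at least
    `min_agree` times among the repeated answers (e.g. 9 of 10,
    matching the paper's 0.99-consistency criterion)."""
    if not answers:
        return False
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count >= min_agree

print(is_consistent(["B"] * 9 + ["C"]))      # True: 9 of 10 agree
print(is_consistent(["B"] * 7 + ["C"] * 3))  # False: only 7 agree
```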
The methodology involved repeatedly asking the same multiple-choice questions (10 times) to various open-source LLMs. They explored different inference temperatures (0.3, 0.7, and 1.0), compared small models (2B-8B parameters) with medium models (50B-80B parameters), and examined the impact of finetuning versus using base models. The study also introduced new analytical and graphical tools, notably the “consistency plot,” to visualize and compare consistency levels.
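The repetition protocol itself is straightforward to sketch: ask the same question a fixed number of times at a fixed temperature and collect the answers. The `ask_model` callable below is a hypothetical stand-in for whatever inference stack is in use; the toy model is only there to make the sketch runnable.

```python
import random

def repetition_trial(ask_model, question, n=10, temperature=0.3):
    """Ask the same multiple-choice question `n` times at a fixed
    temperature and collect the model's answers (one per repetition)."""
    return [ask_model(question, temperature=temperature) for _ in range(n)]

# Toy stand-in model (an assumption, not from the paper): higher
# temperature makes it more likely to pick a random choice instead of "A".
def fake_model(question, temperature):
    return "A" if random.random() > temperature / 2 else random.choice("ABCD")

random.seed(0)  # make the sketch reproducible
answers = repetition_trial(fake_model, "Which option is correct?", temperature=1.0)
print(answers)
```

The answer lists produced this way are the raw material for the consistency criterion and the S/T and RWS metrics discussed below.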
Key Findings: Small vs. Medium Models
The results from the MMLU-Redux benchmark revealed a stark difference between small and medium-sized LLMs. Small models typically produced consistent answers for only 50% to 80% of questions, even at low inference temperatures (0.3). Interestingly, the accuracy of these consistent answers (Right When SURE, or RWS) generally correlated well with the model’s overall average accuracy. As the inference temperature increased, the percentage of consistently answered questions (SURE/Total, or S/T) tended to decrease, while the accuracy among the consistent answers (RWS) often improved.
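The two metrics above (S/T and RWS) can be computed directly from the per-question repetition data. The sketch below assumes a simple input format of `(answers, correct)` pairs, which is an illustrative assumption rather than the paper's actual data layout.

```python
from collections import Counter

def consistency_metrics(results, min_agree=9):
    """Compute S/T (share of SURE questions) and RWS (accuracy among
    SURE questions). `results` is a list of (answers, correct) pairs:
    the model's choices over the repetitions, plus the gold choice."""
    sure = right_when_sure = 0
    for answers, correct in results:
        choice, count = Counter(answers).most_common(1)[0]
        if count >= min_agree:        # question counts as SURE
            sure += 1
            if choice == correct:     # consistent AND right
                right_when_sure += 1
    s_t = sure / len(results)
    rws = right_when_sure / sure if sure else 0.0
    return s_t, rws

trials = [
    (["A"] * 10, "A"),          # SURE and right
    (["B"] * 9 + ["C"], "D"),   # SURE but wrong
    (["A", "B"] * 5, "A"),      # not SURE (at most 5 of 10 agree)
]
s_t, rws = consistency_metrics(trials)
print(round(s_t, 2), rws)  # 0.67 0.5
```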
In contrast, medium-sized models demonstrated significantly higher levels of consistency, with S/T percentages ranging from 96% to 99% at their best temperatures. This suggests that the challenge of answer consistency is predominantly an issue for smaller LLMs. The consistency plots visually reinforced this, showing medium models clustered in the “USEFUL” quadrant, indicating both high consistency and high accuracy when consistent.
Finetuning and Medical Expertise
The MedQA benchmark experiments focused on the impact of finetuning. The study compared finetuned medical LLMs with their respective base models. As expected, finetuned models generally exhibited better accuracy and consistency than their base counterparts. For instance, the medllama3-v20 model achieved an impressive 96% S/T at 0.3 temperature, with a 0.75 RWS. Similar to the MMLU-Redux findings, increasing temperature in MedQA also led to a decrease in the percentage of SURE questions, though the RWS tended to remain stable or decrease slightly for most models.
The consistency plots for MedQA also offered valuable insights, illustrating how the two best finetuned medical models showed a clear path of improvement from their base versions, moving towards higher consistency and accuracy. This highlights the potential of finetuning to enhance reliability in specialized domains.
The Consistency Plot: A New Visualization Tool
A notable contribution of this paper is the introduction of the consistency plot. This graphical tool maps the percentage of SURE answers (S/T) on the X-axis and the ratio of right answers among SURE questions (RWS) on the Y-axis. The overall accuracy of the model is represented by the area of a circle. This visualization helps categorize models into four quadrants: “USEFUL” (high consistency, high accuracy), “RELIABLE but inconsistent” (low consistency, high accuracy), “CONSISTENT but unreliable” (high consistency, low accuracy), and “USELESS” (low consistency, low accuracy).
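The quadrant logic of the consistency plot reduces to two thresholds, one on S/T and one on RWS. The threshold values below are illustrative assumptions for the sketch; the paper does not prescribe specific cutoffs.

```python
def quadrant(s_t, rws, s_thresh=0.9, r_thresh=0.5):
    """Classify a model into one of the consistency plot's four
    quadrants from its S/T and RWS values. Thresholds are
    illustrative, not taken from the paper."""
    if s_t >= s_thresh:
        return "USEFUL" if rws >= r_thresh else "CONSISTENT but unreliable"
    return "RELIABLE but inconsistent" if rws >= r_thresh else "USELESS"

print(quadrant(0.97, 0.80))  # USEFUL (medium-model territory)
print(quadrant(0.60, 0.80))  # RELIABLE but inconsistent
print(quadrant(0.50, 0.20))  # USELESS
```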
Looking Ahead
While the study provides crucial insights, the authors acknowledge limitations, including the exclusive use of multiple-choice benchmarks and top-K sampling decoding, as well as the potential for benchmark contamination in larger models. Future work aims to address these limitations by exploring non-multiple-choice contexts, using equivalent wordings for questions, and developing more efficient, non-repetitive methods for determining runtime consistency.
This research underscores the importance of understanding and improving answer consistency in LLMs, particularly as these models become more integrated into critical applications. For more detailed information, see the full research paper.