TLDR: A research paper by IBM Research Brazil investigates the consistency of small Large Language Models (LLMs, 2B-8B parameters) when answering the same multiple-choice questions repeatedly. The study, using MMLU-Redux and MedQA benchmarks, found that small LLMs typically provide consistent answers for only 50-80% of questions at low inference temperatures, while larger models (50B-80B) show much higher consistency (over 95%). The paper introduces a “consistency plot” for visualization and highlights the importance of consistency for non-creative LLM applications like customer service and medical diagnosis.
A recent study from IBM Research Brazil delves into a critical, yet often overlooked, aspect of Large Language Models (LLMs): their consistency in providing answers. Titled “The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks,” the paper by Claudio Pinhanez, Paulo Cavalin, Cassia Sanctos, Marcelo Grave, and Yago Primerano sheds light on the unpredictable nature of smaller LLMs, particularly those with 2 to 8 billion parameters.
The core of this research addresses the non-deterministic behavior inherent in most Transformer-based LLMs, which stems largely from the random sampling used during inference. While this non-determinism can be beneficial for creative applications, it poses significant challenges for non-creative uses where consistent, reliable responses are paramount. Imagine a customer service chatbot giving different answers to the same question, or an AI radiologist providing contradictory diagnoses for identical X-rays; such scenarios could lead to legal issues, erode trust, and even compromise safety.
Defining and Measuring Consistency
To quantify this phenomenon, the researchers introduced a clear definition of “answer consistency” by comparing an LLM’s performance to an “oracle machine” guessing at a certain success rate. For their experiments, a model was deemed 0.99-consistent if it answered at least 9 out of 10 repetitions of a multiple-choice question with the same choice. This objective criterion was applied to questions from standard benchmarks like MMLU-Redux (general knowledge) and MedQA (medical expertise).
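The 9-of-10 criterion can be sketched in a few lines of Python. This is a minimal illustration of the agreement check, not the authors' code; the function name and threshold parameter are assumptions for the example.

```python
from collections import Counter

def is_consistent(answers, min_agree=9):
    """Return True if the most frequent choice appears at least
    `min_agree` times among the repeated answers (e.g. 9 of 10,
    matching the paper's 0.99-consistency criterion)."""
    if not answers:
        return False
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count >= min_agree

print(is_consistent(["B"] * 9 + ["C"]))      # True: 9 of 10 agree
print(is_consistent(["B"] * 7 + ["C"] * 3))  # False: only 7 agree
```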
The methodology involved repeatedly asking the same multiple-choice questions (10 times) to various open-source LLMs. They explored different inference temperatures (0.3, 0.7, and 1.0), compared small models (2B-8B parameters) with medium models (50B-80B parameters), and examined the impact of finetuning versus using base models. The study also introduced new analytical and graphical tools, notably the “consistency plot,” to visualize and compare consistency levels.
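The repetition protocol itself is straightforward to sketch: ask the same question a fixed number of times at a fixed temperature and collect the answers. The `ask_model` callable below is a hypothetical stand-in for whatever inference stack is in use; the toy model is only there to make the sketch runnable.

```python
import random

def repetition_trial(ask_model, question, n=10, temperature=0.3):
    """Ask the same multiple-choice question `n` times at a fixed
    temperature and collect the model's answers (one per repetition)."""
    return [ask_model(question, temperature=temperature) for _ in range(n)]

# Toy stand-in model (an assumption, not from the paper): higher
# temperature makes it more likely to pick a random choice instead of "A".
def fake_model(question, temperature):
    return "A" if random.random() > temperature / 2 else random.choice("ABCD")

random.seed(0)  # make the sketch reproducible
answers = repetition_trial(fake_model, "Which option is correct?", temperature=1.0)
print(answers)
```

The answer lists produced this way are the raw material for the consistency criterion and the S/T and RWS metrics discussed below.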
Key Findings: Small vs. Medium Models
The results from the MMLU-Redux benchmark revealed a stark difference between small and medium-sized LLMs. Small models typically produced consistent answers for only 50% to 80% of questions, even at low inference temperatures (0.3). Interestingly, the accuracy of these consistent answers (Right When SURE, or RWS) generally correlated well with the model’s overall average accuracy. As the inference temperature increased, the percentage of consistently answered questions (SURE/Total, or S/T) tended to decrease, while the accuracy among the consistent answers (RWS) often improved.
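The two metrics above (S/T and RWS) can be computed directly from the per-question repetition data. The sketch below assumes a simple input format of `(answers, correct)` pairs, which is an illustrative assumption rather than the paper's actual data layout.

```python
from collections import Counter

def consistency_metrics(results, min_agree=9):
    """Compute S/T (share of SURE questions) and RWS (accuracy among
    SURE questions). `results` is a list of (answers, correct) pairs:
    the model's choices over the repetitions, plus the gold choice."""
    sure = right_when_sure = 0
    for answers, correct in results:
        choice, count = Counter(answers).most_common(1)[0]
        if count >= min_agree:        # question counts as SURE
            sure += 1
            if choice == correct:     # consistent AND right
                right_when_sure += 1
    s_t = sure / len(results)
    rws = right_when_sure / sure if sure else 0.0
    return s_t, rws

trials = [
    (["A"] * 10, "A"),          # SURE and right
    (["B"] * 9 + ["C"], "D"),   # SURE but wrong
    (["A", "B"] * 5, "A"),      # not SURE (at most 5 of 10 agree)
]
s_t, rws = consistency_metrics(trials)
print(round(s_t, 2), rws)  # 0.67 0.5
```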
In contrast, medium-sized models demonstrated significantly higher levels of consistency, with S/T percentages ranging from 96% to 99% at their best temperatures. This suggests that the challenge of answer consistency is predominantly an issue for smaller LLMs. The consistency plots visually reinforced this, showing medium models clustered in the “USEFUL” quadrant, indicating both high consistency and high accuracy when consistent.
Finetuning and Medical Expertise
The MedQA benchmark experiments focused on the impact of finetuning. The study compared finetuned medical LLMs with their respective base models. As expected, finetuned models generally exhibited better accuracy and consistency than their base counterparts. For instance, the medllama3-v20 model achieved an impressive 96% S/T at 0.3 temperature, with a 0.75 RWS. Similar to the MMLU-Redux findings, increasing temperature in MedQA also led to a decrease in the percentage of SURE questions, though the RWS tended to remain stable or decrease slightly for most models.
The consistency plots for MedQA also offered valuable insights, illustrating how the two best finetuned medical models showed a clear path of improvement from their base versions, moving towards higher consistency and accuracy. This highlights the potential of finetuning to enhance reliability in specialized domains.
The Consistency Plot: A New Visualization Tool
A notable contribution of this paper is the introduction of the consistency plot. This graphical tool maps the percentage of SURE answers (S/T) on the X-axis and the ratio of right answers among SURE questions (RWS) on the Y-axis. The overall accuracy of the model is represented by the area of a circle. This visualization helps categorize models into four quadrants: “USEFUL” (high consistency, high accuracy), “RELIABLE but inconsistent” (low consistency, high accuracy), “CONSISTENT but unreliable” (high consistency, low accuracy), and “USELESS” (low consistency, low accuracy).
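The quadrant logic of the consistency plot reduces to two thresholds, one on S/T and one on RWS. The threshold values below are illustrative assumptions for the sketch; the paper does not prescribe specific cutoffs.

```python
def quadrant(s_t, rws, s_thresh=0.9, r_thresh=0.5):
    """Classify a model into one of the consistency plot's four
    quadrants from its S/T and RWS values. Thresholds are
    illustrative, not taken from the paper."""
    if s_t >= s_thresh:
        return "USEFUL" if rws >= r_thresh else "CONSISTENT but unreliable"
    return "RELIABLE but inconsistent" if rws >= r_thresh else "USELESS"

print(quadrant(0.97, 0.80))  # USEFUL (medium-model territory)
print(quadrant(0.60, 0.80))  # RELIABLE but inconsistent
print(quadrant(0.50, 0.20))  # USELESS
```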
Looking Ahead
While the study provides crucial insights, the authors acknowledge limitations, including the exclusive use of multiple-choice benchmarks and top-K sampling decoding, as well as the potential for benchmark contamination in larger models. Future work aims to address these limitations by exploring non-multiple-choice contexts, using equivalent wordings for questions, and developing more efficient, non-repetitive methods for determining runtime consistency.
This research underscores the importance of understanding and improving answer consistency in LLMs, particularly as these models become more integrated into critical applications. For more detailed information, see the full research paper.