Evaluating LLM Robustness: A New Protocol for Multiple-Choice Question Assessment

TLDR: This research introduces a new protocol to assess how well different metrics evaluate Large Language Models (LLMs) on multiple-choice questions (MCQs), especially considering “answer fluctuation” – where LLMs give different answers to the same question with minor prompt changes. The study formalizes existing metrics and proposes a novel one called “worst accuracy,” which measures a model’s robustness by only counting an answer as correct if it’s consistently correct across all tested permutations. The findings show that most metrics correlate well with full fluctuation rates, and “worst accuracy” proves to be highly effective in reflecting both answer stability and original performance, particularly when evaluated with cyclic or larger random sets of permutations.

Large Language Models (LLMs) have become a cornerstone of modern AI, and accurately assessing their capabilities is crucial. Multiple-choice questions (MCQs) are widely used for this purpose due to their efficiency. However, evaluating LLMs on MCQs isn’t as straightforward as it might seem. A significant challenge arises from what researchers call “answer fluctuation” or “answer floating.” This phenomenon describes how LLMs can produce different answers to the same question when there are only slight, semantically insignificant changes in the prompt, such as reordering the answer options.

Understanding LLM Evaluation Challenges

Traditional metrics like BLEU or ROUGE, often used for text generation, are a poor fit for grading open-ended answers because the set of acceptable correct answers varies so widely. Human evaluation is an option, but it is often costly and subjective. This has led to the prevalence of MCQ benchmarks like ARC, GPQA, and BigBench-Hard for LLM assessment. Standard metrics like accuracy are commonly reported, but the research paper highlights that a thorough comparative analysis of various MCQ metrics has been lacking.

Previous studies have shown that LLMs are highly sensitive to changes in MCQ option order. For instance, simply rearranging proposed answers can elicit a different response from a model. The difference between a model’s best and worst performance due to option reordering can be substantial, sometimes as high as 70 percentage points for models like InstructGPT. This sensitivity extends to other factors like different option typography or even reversing the order of labels. Such instability raises concerns about the reliability of LLMs, especially in critical applications.
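To make option reordering concrete, here is a minimal Python sketch of how cyclic permutations of the answer options could be generated and how the correct label moves with each ordering. The helper names (`cyclic_permutations`, `gold_label`) and the sample question are illustrative assumptions, not code from the paper.

```python
from itertools import permutations
from string import ascii_uppercase

def cyclic_permutations(options):
    """Return every rotation of the option list (n rotations for n options)."""
    n = len(options)
    return [options[shift:] + options[:shift] for shift in range(n)]

def gold_label(ordering, correct_text):
    """Return the letter that the correct option receives in this ordering."""
    return ascii_uppercase[ordering.index(correct_text)]

# Hypothetical question with four options; "Paris" is the correct answer.
options = ["Paris", "Rome", "Madrid", "Berlin"]

rotations = cyclic_permutations(options)
gold_labels = [gold_label(order, "Paris") for order in rotations]

# Number of orderings in the full permutation set versus the cyclic subset.
n_full = len(list(permutations(options)))
n_cyclic = len(rotations)

print(gold_labels)        # ['A', 'D', 'C', 'B']
print(n_full, n_cyclic)   # 24 4
```

With four options, the cyclic subset needs 4 prompts per question instead of 24, and the correct answer still appears once in every position.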

Introducing a New Metric: Worst Accuracy

Given the variety of metrics available for MCQ evaluation, the authors of this research paper formalize existing ones and propose a novel metric called “worst accuracy.” While metrics like average accuracy might give a general sense of performance, they can be misleading. An average accuracy of 0.5 could mean a model is consistently correct for half the questions, or it could mean it’s highly inconsistent, getting different answers for the same question across permutations. Worst accuracy addresses this by being a more stringent measure: it equals 1 only if a model answers correctly throughout all tested permutations for a given question. This metric is designed to provide a clearer indication of a model’s robustness and reliability.
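The contrast between average and worst accuracy can be sketched in a few lines of Python. This is an illustrative reading of the description above, assuming each question's results are stored as a list of booleans (one per tested permutation); the function names and toy data are hypothetical.

```python
def average_accuracy(results):
    """Mean correctness over every (question, permutation) pair."""
    flat = [ok for per_question in results for ok in per_question]
    return sum(flat) / len(flat)

def worst_accuracy(results):
    """Share of questions answered correctly under every tested permutation."""
    return sum(all(per_question) for per_question in results) / len(results)

# Toy data: 4 questions, 4 permutations each (True = correct answer).
# Model A is always right on two questions and always wrong on the other two.
consistent = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]
# Model B is right on every question only half of the time.
unstable = [[True, False, True, False] for _ in range(4)]

print(average_accuracy(consistent), worst_accuracy(consistent))  # 0.5 0.5
print(average_accuracy(unstable), worst_accuracy(unstable))      # 0.5 0.0
```

Both toy models share an average accuracy of 0.5, but only worst accuracy separates the stable model from the unstable one.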

The Proposed Assessment Protocol

To thoroughly evaluate these metrics, the researchers suggest a comprehensive assessment protocol. Since calculating all possible permutations for answer fluctuation is computationally expensive, the protocol aims to find metrics that can accurately represent full fluctuation rates using more cost-efficient subsets of permutations. The steps are as follows:

1. Calculate the accuracy of models on the original benchmarks.
2. Calculate the full fluctuation rates for each model and benchmark across all possible permutations of option order.
3. Calculate various metrics (including existing ones and the new worst accuracy) using smaller, more efficient subsets of permutations.
4. Determine the correlation (using R² score) between these metrics and the full fluctuation rates.
5. Determine the correlation between these metrics and the original accuracy.
6. Finally, find the correlation between a metric and both full fluctuation rates and original accuracy simultaneously.
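The correlation steps above can be sketched as follows. This is a rough illustration under two assumptions that the summary does not spell out: the fluctuation rate is taken to be the share of questions whose chosen option changes at least once across permutations, and the R² score is taken to be that of a simple least-squares line. The numbers are made-up placeholders, not results from the paper.

```python
def fluctuation_rate(answers):
    """Share of questions whose chosen option changes across permutations.

    answers[q] holds the option text the model picked under each permutation.
    """
    changed = sum(len(set(per_question)) > 1 for per_question in answers)
    return changed / len(answers)

def r2_linear(x, y):
    """R^2 of a least-squares line fitted to (x, y), i.e. squared Pearson r.

    Assumes neither x nor y is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# One question answered consistently, one where the pick changes.
toy_answers = [["Paris", "Paris", "Paris"], ["Rome", "Paris", "Rome"]]
toy_rate = fluctuation_rate(toy_answers)

# Placeholder values for four hypothetical models: a metric computed on a
# cheap permutation subset, and the full fluctuation rate it should track.
metric_values = [0.2, 0.4, 0.6, 0.8]
full_fluctuation = [0.9, 0.7, 0.5, 0.3]
score = r2_linear(metric_values, full_fluctuation)
```

A metric is a good proxy in this sense when the score is close to 1, whether the relationship is increasing or decreasing.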

The experiments were conducted on 10 LLMs with parameter sizes below 10B, which are frequently used for fine-tuning. The benchmarks included well-known datasets like ARC-C, AGIEval, CSQA, MMLU, and Winogrande.

Key Findings from the Research

The study yielded several important insights:

Most existing metrics show a strong correlation with full fluctuation rates, even when calculated using only the original option order. Probability mass emerged as the best proxy among the tested metrics in this scenario.
Interestingly, for continuous metrics like probability mass and Brier score, adding more permutations (e.g., cyclic permutations) did not significantly improve their correlation with full fluctuation rates, suggesting that the original order might already capture sufficient information for these.
The novel metric, worst accuracy, demonstrated the highest correlation with full fluctuation rates when calculated using cyclic or larger random subsets of permutations.
When it’s crucial for the evaluation to represent both the original accuracy and the full fluctuation rates, worst accuracy showed the best overall performance, particularly with cyclic or larger random permutations.
Some metrics, like sensitivity gap and partial fluctuation rates, proved to be less stable when computed over very small subsets of permutations (e.g., just two random permutations). This indicates that the choice of permutation subset significantly impacts the reliability of these metrics.
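For readers unfamiliar with the continuous metrics named in the findings, here is a minimal sketch using common textbook definitions. It assumes probability mass means the probability the model assigns to the correct option, and uses the multi-class Brier score as the sum of squared differences from a one-hot target; the paper's exact formulations may differ, and the probabilities below are hypothetical.

```python
def probability_mass(probs, correct_index):
    """Probability assigned to the correct option (higher is better)."""
    return probs[correct_index]

def brier_score(probs, correct_index):
    """Sum of squared differences from the one-hot target (lower is better)."""
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

# Hypothetical normalized probabilities over options A-D; A is correct.
probs = [0.7, 0.1, 0.1, 0.1]

mass = probability_mass(probs, 0)
brier = brier_score(probs, 0)
print(mass)             # 0.7
print(round(brier, 4))  # 0.12
```

Unlike plain accuracy, both values change smoothly with the model's confidence, which may be one reason a single option order already carries much of the signal for these metrics.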

This research underscores the importance of carefully selecting evaluation metrics for LLMs, especially when considering their robustness to minor prompt variations. The proposed protocol and the introduction of worst accuracy offer valuable tools for more reliable and comprehensive LLM assessment. For a deeper dive into the methodology and detailed results, see the full research paper, “Metric Assessment Protocol in the Context of Answer Fluctuation on MCQ Tasks.”
