Evaluating LLM Robustness: A New Protocol for Multiple-Choice Question Assessment

TLDR: This research introduces a new protocol to assess how well different metrics evaluate Large Language Models (LLMs) on multiple-choice questions (MCQs), especially considering “answer fluctuation” – where LLMs give different answers to the same question with minor prompt changes. The study formalizes existing metrics and proposes a novel one called “worst accuracy,” which measures a model’s robustness by only counting an answer as correct if it’s consistently correct across all tested permutations. The findings show that most metrics correlate well with full fluctuation rates, and “worst accuracy” proves to be highly effective in reflecting both answer stability and original performance, particularly when evaluated with cyclic or larger random sets of permutations.

Large Language Models (LLMs) have become a cornerstone of modern AI, and accurately assessing their capabilities is crucial. Multiple-choice questions (MCQs) are widely used for this purpose due to their efficiency. However, evaluating LLMs on MCQs isn’t as straightforward as it might seem. A significant challenge arises from what researchers call “answer fluctuation” or “answer floating.” This phenomenon describes how LLMs can produce different answers to the same question when there are only slight, semantically insignificant changes in the prompt, such as reordering the answer options.

Understanding LLM Evaluation Challenges

Metrics designed for text generation, such as BLEU or ROUGE, are a poor fit for scoring open-ended LLM answers because many differently worded answers can be correct. Human evaluation is an option, but it is often costly and subjective. These limitations help explain the prevalence of MCQ benchmarks like ARC, GPQA, and BigBench-Hard for LLM assessment. Standard metrics like accuracy are commonly reported, but the research paper highlights that a thorough comparative analysis of the various MCQ metrics has been lacking.

Previous studies have shown that LLMs are highly sensitive to changes in MCQ option order. For instance, simply rearranging proposed answers can elicit a different response from a model. The difference between a model’s best and worst performance due to option reordering can be substantial, sometimes as high as 70 percentage points for models like InstructGPT. This sensitivity extends to other factors like different option typography or even reversing the order of labels. Such instability raises concerns about the reliability of LLMs, especially in critical applications.

Introducing a New Metric: Worst Accuracy

Given the variety of metrics available for MCQ evaluation, the authors of this research paper formalize existing ones and propose a novel metric called “worst accuracy.” While metrics like average accuracy might give a general sense of performance, they can be misleading. An average accuracy of 0.5 could mean a model is consistently correct for half the questions, or it could mean it’s highly inconsistent, getting different answers for the same question across permutations. Worst accuracy addresses this by being a more stringent measure: it equals 1 only if a model answers correctly throughout all tested permutations for a given question. This metric is designed to provide a clearer indication of a model’s robustness and reliability.
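To make the contrast concrete, here is a minimal sketch of the two measures. It assumes each question is represented as a list of booleans, one per tested permutation, saying whether the model answered correctly. The function names and the two toy "models" are hypothetical illustrations, not code or data from the paper.

```python
from statistics import mean

def average_accuracy(results):
    """Mean share of correct answers per question, averaged over questions."""
    return mean(mean(1.0 if ok else 0.0 for ok in flags) for flags in results)

def worst_accuracy(results):
    """A question scores 1 only if it is answered correctly in every tested permutation."""
    return mean(1.0 if all(flags) else 0.0 for flags in results)

# Hypothetical model A: fully consistent, right on two of four questions.
consistent = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]

# Hypothetical model B: right in half of the permutations of every question.
inconsistent = [[True, False, True, False]] * 4

print(average_accuracy(consistent), worst_accuracy(consistent))      # 0.5 0.5
print(average_accuracy(inconsistent), worst_accuracy(inconsistent))  # 0.5 0.0
```

Both toy models share an average accuracy of 0.5, but only the consistent one keeps that score under worst accuracy, which is the distinction the metric is meant to expose.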

The Proposed Assessment Protocol

To thoroughly evaluate these metrics, the researchers suggest a comprehensive assessment protocol. Measuring answer fluctuation over every possible ordering of the options is computationally expensive (a question with four options already has 4! = 24 orderings), so the protocol aims to find metrics that can accurately represent full fluctuation rates using more cost-efficient subsets of permutations. The steps are as follows:

  1. Calculate the accuracy of models on the original benchmarks.
  2. Calculate the full fluctuation rates for each model and benchmark across all possible permutations of option order.
  3. Calculate various metrics (including existing ones and the new worst accuracy) using smaller, more efficient subsets of permutations.
  4. Determine the correlation (using R² score) between these metrics and the full fluctuation rates.
  5. Determine the correlation between these metrics and the original accuracy.
  6. Finally, find the correlation between a metric and both full fluctuation rates and original accuracy simultaneously.
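Steps 2 and 4 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the full fluctuation rate is taken to be the share of questions where the chosen option text changes under at least one ordering, and the R² score is taken to be the coefficient of determination of a least-squares line fitted across models. The helper names, the toy "models", and the numbers are all made up for the example.

```python
from itertools import permutations

def full_fluctuation_rate(answer_fn, questions):
    """Share of questions whose chosen option changes across some ordering (assumed definition)."""
    fluctuating = 0
    for options in questions:
        # Collect the option text the model picks under every possible ordering.
        chosen = {answer_fn(list(order)) for order in permutations(options)}
        if len(chosen) > 1:
            fluctuating += 1
    return fluctuating / len(questions)

def r2_linear_fit(xs, ys):
    """R² of a least-squares line predicting ys from xs (assumed reading of 'R² score')."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / syy

# Toy stand-ins for models: one always picks the first slot, one picks by content.
questions = [("Paris", "Rome", "Oslo"), ("2", "3", "5")]
position_biased = lambda options: options[0]
content_based = lambda options: min(options)

print(full_fluctuation_rate(position_biased, questions))  # 1.0
print(full_fluctuation_rate(content_based, questions))    # 0.0

# Made-up per-model values: a cheap metric versus the full fluctuation rate.
metric_values = [0.9, 0.7, 0.5, 0.3]
fluctuation_rates = [0.1, 0.3, 0.5, 0.7]
print(r2_linear_fit(metric_values, fluctuation_rates))
```

An R² close to 1 would indicate that the cheaper metric tracks the full fluctuation rate closely. Step 6, which relates a metric to both targets at once, is not covered by this sketch.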

The experiments were conducted on 10 LLMs with parameter sizes below 10B, which are frequently used for fine-tuning. The benchmarks included well-known datasets like ARC-C, AGIEval, CSQA, MMLU, and Winogrande.

Key Findings from the Research

The study yielded several important insights:

  • Most existing metrics show a strong correlation with full fluctuation rates, even when calculated using only the original option order. Probability mass emerged as the best proxy among the tested metrics in this scenario.
  • Interestingly, for continuous metrics like probability mass and Brier score, adding more permutations (e.g., cyclic permutations) did not significantly improve their correlation with full fluctuation rates, suggesting that the original order might already capture sufficient information for these.
  • The novel metric, worst accuracy, demonstrated the highest correlation with full fluctuation rates when calculated using cyclic or larger random subsets of permutations.
  • When it’s crucial for the evaluation to represent both the original accuracy and the full fluctuation rates, worst accuracy showed the best overall performance, particularly with cyclic or larger random permutations.
  • Some metrics, like sensitivity gap and partial fluctuation rates, proved to be less stable when computed over very small subsets of permutations (e.g., just two random permutations). This indicates that the choice of permutation subset significantly impacts the reliability of these metrics.
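The permutation subsets mentioned in these findings differ sharply in cost. The sketch below shows one plausible way to build them, assuming "cyclic permutations" means rotating the option list one position at a time and that random subsets are drawn without replacement from all orderings. The function names and the seed parameter are illustrative choices, not details from the paper.

```python
import random
from itertools import permutations

def all_orders(options):
    """Every ordering of the options: n! of them for n options."""
    return list(permutations(options))

def cyclic_orders(options):
    """The n rotations of the option list, starting with the original order."""
    options = list(options)
    return [tuple(options[i:] + options[:i]) for i in range(len(options))]

def random_orders(options, k, seed=0):
    """k distinct orderings sampled without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(all_orders(options), k)

options = ["A", "B", "C", "D"]
print(len(all_orders(options)))     # 24
print(cyclic_orders(options))
print(random_orders(options, k=2))
```

With four options, the cyclic subset needs 4 model calls per question instead of 24, and each option appears in each position exactly once.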

This research underscores the importance of carefully selecting evaluation metrics for LLMs, especially when considering their robustness to minor prompt variations. The proposed protocol and the introduction of worst accuracy offer valuable tools for more reliable and comprehensive LLM assessment. For a deeper dive into the methodology and detailed results, see the full research paper, “Metric Assessment Protocol in the Context of Answer Fluctuation on MCQ Tasks.”

Nikhil Patel, https://blogs.edgentiq.com
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. You can reach him at: [email protected]
