The Hidden Vulnerability of LLMs: Performance Drops with Reworded Questions

TLDR: A new study reveals that while Large Language Model (LLM) rankings remain stable, their absolute performance significantly declines when benchmark questions are paraphrased. This suggests that current benchmark evaluations may overestimate LLMs’ real-world generalization capabilities and highlights their struggle with linguistic variability. The research emphasizes the need for robustness-aware evaluation methods that account for diverse question phrasings.

Large Language Models (LLMs) have made incredible strides in natural language processing, often showcasing impressive performance on standardized benchmarks like MMLU, ARC-C, and HellaSwag. These benchmarks provide a consistent way to compare models, but they typically present questions in a fixed, original wording. A recent study, available at https://arxiv.org/pdf/2509.04013, by Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, and Kevin Roitero from the University of Udine, Italy, delves into a critical question: how robust are LLMs to linguistic variability, and how reliable are these benchmark-based evaluations in truly measuring a model’s capabilities?

The core issue is that real-world applications involve diverse ways users might phrase the same question or query. If an LLM struggles when a question is reworded, even slightly, its real-world applicability might be overestimated. This research systematically investigates this by generating various paraphrases of questions across six common benchmarks and measuring the impact on 34 state-of-the-art LLMs.

The Challenge of Linguistic Variability

The study addresses two main research questions: First, are benchmark-based evaluations reliable? Do LLM results change significantly when questions are replaced by simple paraphrases? Second, are LLMs robust to question paraphrases? Does rewording decrease their effectiveness, revealing limitations in their generalization abilities?

The findings are quite revealing. While the relative rankings of LLMs tend to remain stable even with paraphrased inputs, their absolute effectiveness scores decline significantly. This suggests that current benchmarks, relying on static question wordings, might be overestimating a model’s true generalization capabilities. It highlights a struggle with linguistic variability, raising concerns about how well these models can adapt to diverse real-world inputs.

Methodology: Paraphrasing and Evaluation

To conduct this study, the researchers used six well-known benchmarks: ARC-C, HellaSwag, MMLU, OpenBookQA, RACE, and SciQ. These cover a range of reasoning and knowledge domains. They excluded benchmarks with binary/ternary choices, highly technical questions, or those requiring natural language answers, to maintain a consistent multiple-choice evaluation framework.

Thirty-four diverse LLMs were evaluated, ranging from smaller, efficiency-focused models to large-scale architectures with billions of parameters, including the closed-source GPT-4o mini. To generate paraphrases, OpenAI’s GPT-4o mini was used to create five alternative phrasings for each question, keeping the meaning intact and avoiding negations. The original order of answer choices was preserved so that any change in results could be attributed to question phrasing alone.

Models were prompted in a zero-shot setting, meaning they received no examples and had to select an answer directly. The evaluation focused on top-1 token probability for answer selection, ensuring deterministic and reproducible results.
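As a rough illustration of this kind of zero-shot, top-1 selection, the sketch below formats a multiple-choice prompt and picks the answer label with the highest log-probability. It is not the study's code: the prompt layout, the function names `build_prompt` and `pick_answer`, and the log-probability values are all assumptions made for this example, and obtaining the per-label log-probabilities from a real model is left out.

```python
import math

LABELS = ("A", "B", "C", "D")


def build_prompt(question: str, choices: list[str]) -> str:
    """Format a zero-shot multiple-choice prompt (hypothetical layout, no examples given)."""
    lines = [question]
    for label, choice in zip(LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def pick_answer(label_logprobs: dict[str, float]) -> str:
    """Return the label whose next-token log-probability is highest.

    Labels missing from the dict are treated as impossible (-inf).
    Ties resolve to the earliest label, so the choice is deterministic.
    """
    return max(LABELS, key=lambda label: label_logprobs.get(label, -math.inf))


# Invented log-probabilities for the four answer labels of one question.
example_logprobs = {"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1}
print(build_prompt("Which gas do plants absorb?", ["Oxygen", "Carbon dioxide", "Helium", "Neon"]))
print(pick_answer(example_logprobs))
```

Because the answer is an argmax over fixed scores rather than a sampled generation, running the same prompt twice gives the same result, which is what makes this style of evaluation reproducible.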

Key Findings: Consistency and Accuracy

When analyzing the consistency of model answers across paraphrased versions of the same question, the study found that only a minority of models consistently selected the same answer across all paraphrases. For most models, 15% to 30% of questions received two, three, or even four distinct answers across paraphrases. This indicates a substantial degree of response variability, even for state-of-the-art models, suggesting sensitivity to surface-level changes in question phrasing.
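One way to measure this kind of consistency is to count, for each question, how many distinct answers a model gave across the different wordings. The sketch below does that under stated assumptions: the data layout (a dict from question ID to a list of chosen labels, one per wording) and the function names are hypothetical, and the answers are invented for illustration.

```python
from collections import Counter


def distinct_answer_counts(answers_by_question: dict[str, list[str]]) -> Counter:
    """Map 'number of distinct answers' to 'number of questions with that many'."""
    return Counter(len(set(answers)) for answers in answers_by_question.values())


def share_inconsistent(answers_by_question: dict[str, list[str]]) -> float:
    """Fraction of questions that received more than one distinct answer."""
    counts = distinct_answer_counts(answers_by_question)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - counts[1] / total


# Invented answers from one model: one label per wording of each question.
answers = {
    "q1": ["A", "A", "A", "A", "A", "A"],
    "q2": ["A", "A", "B", "A", "A", "A"],
    "q3": ["C", "D", "C", "B", "C", "C"],
    "q4": ["B", "B", "B", "B", "B", "B"],
}
print(distinct_answer_counts(answers))
print(share_inconsistent(answers))
```

In this toy data, two of the four questions get a single answer, one gets two distinct answers, and one gets three, so half of the questions count as inconsistent.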

Interestingly, the relationship between accuracy and consistency varied with model size. Smaller, less capable models often showed a negative correlation between the two: they tended to give the same answer across paraphrases even when that answer was wrong. This could be due to over-simplicity or a limited understanding of semantics. In contrast, larger, more advanced models exhibited a strong positive correlation, indicating that as they become more accurate, they also become more robust and consistent in their answers across different phrasings.

The study also confirmed that the generated paraphrases effectively introduced meaningful linguistic variability. As more paraphrases were added, the number of distinct answers produced by models increased, demonstrating their sensitivity to these variations.

Impact on Model Rankings and Generalization

Despite the significant drop in absolute accuracy, the relative rankings of LLMs remained largely preserved when evaluated with paraphrased questions. This means that while models perform worse overall, the ‘best’ models generally remain the ‘best’, and the ‘worst’ remain the ‘worst’. However, the majority of models experienced performance degradation, highlighting their limited robustness to linguistic changes.
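To see how rankings can stay stable while absolute scores fall, the sketch below compares two sets of per-model accuracies using Kendall's tau (a rank-agreement measure) alongside the mean accuracy drop. Kendall's tau is chosen here as an assumption for illustration; the article does not say which statistic the study used. The accuracy values are invented and the function names are hypothetical.

```python
from itertools import combinations


def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall tau-a: (concordant pairs - discordant pairs) / all pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_accuracy_drop(original: list[float], paraphrased: list[float]) -> float:
    """Average of (original accuracy - paraphrased accuracy) across models."""
    if len(original) != len(paraphrased) or not original:
        raise ValueError("need two equally long, non-empty lists")
    return sum(o - p for o, p in zip(original, paraphrased)) / len(original)


# Invented accuracies for four models, in the same model order in both lists.
original_acc = [0.82, 0.74, 0.61, 0.55]
paraphrased_acc = [0.76, 0.70, 0.52, 0.54]
print(kendall_tau(original_acc, paraphrased_acc))
print(mean_accuracy_drop(original_acc, paraphrased_acc))
```

In this toy example every model loses accuracy, yet only the two weakest models swap places, so the rank agreement stays high (tau of about 0.67) even though the mean drop is five percentage points.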

An intriguing observation was the correlation between benchmark release dates and model performance on paraphrased questions. Older benchmarks tended to show more models whose accuracy on original questions was higher than on paraphrased ones. This suggests a potential for models to overfit or be contaminated by data from older benchmarks during pretraining, leading to inflated scores on original formulations. Newer benchmarks appeared less affected, possibly due to being less likely to be part of training data or being designed to resist shallow memorization.

Moving Towards Robust Evaluation

In conclusion, this research underscores that while current benchmark evaluations provide a comparative measure of models, they often overestimate absolute performance and generalization abilities. The significant drop in accuracy when questions are paraphrased reveals a critical limitation in LLMs’ ability to handle natural linguistic variation. The study advocates for a shift towards robustness-aware evaluation methodologies that incorporate diverse linguistic inputs, moving beyond static, rigid benchmarks to better capture the complexities of real-world language understanding.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
