The Hidden Vulnerability of LLMs: Performance Drops with Reworded Questions

TLDR: A new study reveals that while Large Language Model (LLM) rankings remain stable, their absolute performance significantly declines when benchmark questions are paraphrased. This suggests that current benchmark evaluations may overestimate LLMs’ real-world generalization capabilities and highlights their struggle with linguistic variability. The research emphasizes the need for robustness-aware evaluation methods that account for diverse question phrasings.

Large Language Models (LLMs) have made incredible strides in natural language processing, often showcasing impressive performance on standardized benchmarks like MMLU, ARC-C, and HellaSwag. These benchmarks provide a consistent way to compare models, but they typically present questions in a fixed, original wording. A recent study, available at https://arxiv.org/pdf/2509.04013, by Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, and Kevin Roitero from the University of Udine, Italy, delves into a critical question: how robust are LLMs to linguistic variability, and how reliable are these benchmark-based evaluations in truly measuring a model’s capabilities?

The core issue is that real-world applications involve diverse ways users might phrase the same question or query. If an LLM struggles when a question is reworded, even slightly, its real-world applicability might be overestimated. This research systematically investigates this by generating various paraphrases of questions across six common benchmarks and measuring the impact on 34 state-of-the-art LLMs.

The Challenge of Linguistic Variability

The study addresses two main research questions: First, are benchmark-based evaluations reliable? Do LLM results change significantly when questions are replaced by simple paraphrases? Second, are LLMs robust to question paraphrases? Does rewording decrease their effectiveness, revealing limitations in their generalization abilities?

The relative rankings of LLMs tend to remain stable even with paraphrased inputs, but their absolute effectiveness scores decline significantly. This suggests that current benchmarks, which rely on static question wordings, may overestimate a model’s true generalization capabilities. It also points to a struggle with linguistic variability, raising concerns about how well these models can adapt to diverse real-world inputs.

Methodology: Paraphrasing and Evaluation

To conduct this study, the researchers used six well-known benchmarks: ARC-C, HellaSwag, MMLU, OpenBookQA, RACE, and SciQ. These cover a range of reasoning and knowledge domains. They excluded benchmarks with binary/ternary choices, highly technical questions, or those requiring natural language answers, to maintain a consistent multiple-choice evaluation framework.

Thirty-four diverse LLMs were evaluated, ranging from smaller, efficiency-focused models to large-scale architectures with billions of parameters, including the closed-source GPT-4o mini. To generate paraphrases, the researchers used OpenAI’s GPT-4o mini to create five alternative phrasings for each question, preserving the original meaning and avoiding negations. The original order of answer choices was kept so that any change in performance could be attributed to question phrasing alone.
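The paraphrasing setup can be pictured as a small data transformation: each benchmark item gains five reworded questions while its answer choices and their order stay untouched. The sketch below is illustrative only. The `Item` class, the `build_variants` function, and the `paraphrase_fn` callable are hypothetical names, and the callable stands in for whatever call to a paraphrasing model is actually used.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical structure)."""
    question: str
    choices: Tuple[str, ...]
    answer_index: int


def build_variants(
    item: Item,
    paraphrase_fn: Callable[[str, int], List[str]],
    n_paraphrases: int = 5,
) -> List[Item]:
    """Return the original item followed by its paraphrased variants.

    paraphrase_fn is an assumed stand-in for a paraphrasing model: it takes
    a question and a count, and returns that many rewordings.
    """
    paraphrases = paraphrase_fn(item.question, n_paraphrases)
    if len(paraphrases) != n_paraphrases:
        raise ValueError("paraphrase_fn returned an unexpected number of items")
    # Only the question text changes; choices and their order are preserved.
    variants = [Item(p, item.choices, item.answer_index) for p in paraphrases]
    return [item] + variants
```

Keeping `choices` and `answer_index` fixed across variants is what lets a later comparison attribute any accuracy change to the wording of the question.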

Models were prompted in a zero-shot setting, meaning they received no worked examples and had to select an answer directly. The answer was taken to be the option whose token received the highest probability (top-1), which makes the results deterministic and reproducible.
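A minimal sketch of this kind of answer selection follows. It assumes a simple lettered prompt template and a dictionary of per-label log-probabilities already obtained from a model; the template, the function names, and the numbers are illustrative assumptions, not the study's actual prompt or code.

```python
from typing import Dict, Sequence


def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Build a zero-shot multiple-choice prompt (assumed template)."""
    letters = "ABCDEFGH"
    lines = [question]
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def select_answer(label_logprobs: Dict[str, float]) -> str:
    """Pick the label with the highest log-probability.

    Labels are visited in sorted order, so ties resolve to the
    alphabetically first label and the result is deterministic.
    """
    return max(sorted(label_logprobs), key=lambda label: label_logprobs[label])


# Placeholder log-probabilities for the next token after "Answer:".
example_logprobs = {"A": -2.3, "B": -0.2, "C": -3.1, "D": -4.0}
chosen = select_answer(example_logprobs)  # "B"
```

Because no sampling is involved, running the same prompt against the same model scores always yields the same choice.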

Key Findings: Consistency and Accuracy

When analyzing the consistency of model answers across paraphrased versions of the same question, the study found that only a minority of models consistently selected the same answer across all paraphrases. For most models, 15% to 30% of questions received two, three, or even four distinct answers across paraphrases. This indicates a substantial degree of response variability, even for state-of-the-art models, suggesting sensitivity to surface-level changes in question phrasing.
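The consistency measure described here amounts to counting how many distinct answers a model gives across the variants of one question. A small sketch, with hypothetical function names and made-up answers for a single model:

```python
from collections import Counter
from typing import Dict, List


def distinct_answer_counts(answers_by_question: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each question id to the number of distinct answers across its variants."""
    return {qid: len(set(answers)) for qid, answers in answers_by_question.items()}


def distinct_answer_distribution(answers_by_question: Dict[str, List[str]]) -> Dict[int, float]:
    """Fraction of questions that received 1, 2, 3, ... distinct answers."""
    counts = Counter(distinct_answer_counts(answers_by_question).values())
    total = len(answers_by_question)
    return {k: counts[k] / total for k in sorted(counts)}


# Made-up answers: the original wording followed by five paraphrases.
answers = {
    "q1": ["A", "A", "A", "A", "A", "A"],
    "q2": ["A", "A", "B", "A", "A", "A"],
    "q3": ["C", "D", "A", "C", "C", "B"],
    "q4": ["B", "B", "B", "B", "B", "B"],
}
distribution = distinct_answer_distribution(answers)
# distribution == {1: 0.5, 2: 0.25, 4: 0.25}
```

A fully consistent model would put all of its mass on the key `1`; the share on keys 2 and above corresponds to the questions that drew two or more distinct answers.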

Interestingly, the relationship between accuracy and consistency varied with model size. Smaller, less capable models often showed a negative correlation: higher consistency went together with lower accuracy, because they tended to repeat the same, often wrong, answer. This could reflect overly simple decision behavior or a limited understanding of semantics. In contrast, larger, more advanced models exhibited a strong positive correlation, indicating that as they become more accurate, they also become more robust and consistent in their answers across different phrasings.
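One way to quantify such a relationship is a Pearson correlation between paired accuracy and consistency values for a single model. This article does not say which statistic or which unit of analysis the study used, so both are assumptions here: the sketch pairs one accuracy value with one consistency value per group of questions, and all numbers are made up.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for one hypothetical model, one pair per question group.
accuracy = [0.55, 0.62, 0.70, 0.81]
consistency = [0.60, 0.66, 0.75, 0.88]
r = pearson(accuracy, consistency)  # close to +1 for this made-up data
```

A value near +1 would match the pattern reported for larger models, while a negative value would match the pattern reported for smaller ones.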

The study also confirmed that the generated paraphrases effectively introduced meaningful linguistic variability. As more paraphrases were added, the number of distinct answers produced by models increased, demonstrating their sensitivity to these variations.

Impact on Model Rankings and Generalization

Despite the significant drop in absolute accuracy, the relative rankings of LLMs remained largely preserved when evaluated with paraphrased questions. This means that while models perform worse overall, the ‘best’ models generally remain the ‘best’, and the ‘worst’ remain the ‘worst’. However, the majority of models experienced performance degradation, highlighting their limited robustness to linguistic changes.
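Ranking stability of this kind is commonly checked with a rank correlation between the two sets of scores. The sketch below uses Kendall's tau (the tau-a variant, which ignores ties) as one reasonable choice; whether the study used this exact statistic is an assumption, and the model names and accuracies are made up.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables keyed by model name."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both tables must cover the same models")
    models = sorted(scores_a)
    n_pairs = len(models) * (len(models) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two models")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1   # both tables order this pair the same way
        elif product < 0:
            discordant += 1   # the tables disagree on this pair
    return (concordant - discordant) / n_pairs


# Made-up accuracies: every model drops, but the ordering is unchanged.
original = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60}
paraphrased = {"model_a": 0.72, "model_b": 0.65, "model_c": 0.50}
tau = kendall_tau(original, paraphrased)  # 1.0
```

The example shows how the two findings coexist: every accuracy falls, yet tau stays at 1.0 because no pair of models swaps places.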

An intriguing observation was the correlation between benchmark release dates and model performance on paraphrased questions. Older benchmarks tended to show more models whose accuracy on original questions was higher than on paraphrased ones. This suggests a potential for models to overfit or be contaminated by data from older benchmarks during pretraining, leading to inflated scores on original formulations. Newer benchmarks appeared less affected, possibly due to being less likely to be part of training data or being designed to resist shallow memorization.
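The per-benchmark comparison described here reduces to counting, for each benchmark, how many models score higher on the original wording than on the paraphrases. A minimal sketch with a hypothetical function name, placeholder benchmark names, and made-up accuracies:

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # benchmark -> model -> accuracy


def count_original_higher(original: Scores, paraphrased: Scores) -> Dict[str, int]:
    """For each benchmark, count models whose original accuracy
    is strictly higher than their paraphrased accuracy."""
    result = {}
    for benchmark, by_model in original.items():
        para = paraphrased[benchmark]
        result[benchmark] = sum(
            1 for model, acc in by_model.items() if acc > para[model]
        )
    return result


# Placeholder benchmarks and made-up accuracies, for illustration only.
original = {
    "older_benchmark": {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61},
    "newer_benchmark": {"model_a": 0.70, "model_b": 0.66, "model_c": 0.58},
}
paraphrased = {
    "older_benchmark": {"model_a": 0.75, "model_b": 0.69, "model_c": 0.57},
    "newer_benchmark": {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58},
}
counts = count_original_higher(original, paraphrased)
# counts == {"older_benchmark": 3, "newer_benchmark": 1}
```

Sorting such counts by benchmark release date is one simple way to look for the contamination pattern the paragraph describes.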

Moving Towards Robust Evaluation

In conclusion, this research underscores that while current benchmark evaluations provide a comparative measure of models, they often overestimate absolute performance and generalization abilities. The significant drop in accuracy when questions are paraphrased reveals a critical limitation in LLMs’ ability to handle natural linguistic variation. The study advocates for a shift towards robustness-aware evaluation methodologies that incorporate diverse linguistic inputs, moving beyond static, rigid benchmarks to better capture the complexities of real-world language understanding.
