Unexpected AI Behavior: Impolite Prompts Boost ChatGPT-4o Accuracy

TLDR: A new study investigated how prompt politeness affects LLM accuracy, specifically using ChatGPT-4o. Researchers created 250 unique prompts from 50 base questions, varying the tone from ‘Very Polite’ to ‘Very Rude’. Surprisingly, impolite prompts consistently led to higher accuracy, with ‘Very Rude’ prompts achieving 84.8% accuracy compared to 80.8% for ‘Very Polite’ ones. This finding challenges previous research and suggests that newer LLMs may respond differently to tonal variations, though the authors caution against using impolite language in real-world applications due to ethical concerns.

A recent study examines how the tone of a prompt affects large language models (LLMs) such as ChatGPT-4o. Titled “Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy,” the research tests whether being polite or rude to an AI changes how accurately it answers. The findings challenge some common assumptions and earlier studies, suggesting that newer LLMs may respond to tone differently than older ones did.

The study, conducted by Om Dobariya and Akhil Kumar, aimed to understand if varying levels of politeness in prompts affect an LLM’s accuracy on multiple-choice questions. To do this, they created a unique dataset. They started with 50 base multiple-choice questions covering subjects like mathematics, science, and history. Each of these questions was then rewritten into five different tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude. This resulted in a total of 250 distinct prompts.

For example, a base question might be about Jake’s money. The polite variant could start with “Can you kindly consider the following problem and provide your answer,” while a very rude variant might begin with “You poor creature, do you even know how to solve this?”
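The dataset construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the "Polite" and "Very Rude" prefixes are the ones quoted in this article, while the other prefixes, the placeholder questions, and the function name `build_variants` are assumptions made for the example.

```python
# Tone prefixes. "Polite" and "Very Rude" are quoted in the article;
# the remaining entries are hypothetical placeholders.
TONE_PREFIXES = {
    "Very Polite": "Would you be so kind as to solve the following question?",  # hypothetical
    "Polite": "Can you kindly consider the following problem and provide your answer.",
    "Neutral": "",  # assumption: the neutral variant has no prefix
    "Rude": "Figure this out if you can.",  # hypothetical
    "Very Rude": "You poor creature, do you even know how to solve this?",
}


def build_variants(base_questions):
    """Return one prompt record per (base question, tone) pair."""
    prompts = []
    for question_id, question in enumerate(base_questions):
        for tone, prefix in TONE_PREFIXES.items():
            # Put the tone prefix on its own line above the question text.
            text = f"{prefix}\n{question}" if prefix else question
            prompts.append({"question_id": question_id, "tone": tone, "prompt": text})
    return prompts


# Placeholder questions standing in for the 50 base multiple-choice questions.
base_questions = [f"Placeholder question {i + 1}" for i in range(50)]
prompts = build_variants(base_questions)
print(len(prompts))  # 50 questions x 5 tones
```

Keeping the question text identical across the five variants means that any difference in accuracy can be attributed to the prefix alone.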

These 250 prompts were then fed into ChatGPT-4o, and the model’s responses were evaluated for accuracy. The researchers used paired sample t-tests to determine the statistical significance of any observed differences.
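For readers unfamiliar with the test, here is a rough sketch of how a paired sample t-statistic is computed. It assumes each tone was scored on the same set of repeated runs, which the article does not spell out, and the accuracy values below are made up for illustration; they are not the study's data. The helper name `paired_t_statistic` is also hypothetical.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired-sample t-test on a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # The test works on the per-pair differences, not on the raw samples.
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-run accuracies in percent (NOT the study's data).
very_rude = [86.0, 84.0, 84.0, 86.0, 84.0]
very_polite = [80.0, 82.0, 80.0, 82.0, 80.0]

t, df = paired_t_statistic(very_rude, very_polite)
print(f"t = {t:.2f}, df = {df}")
```

Turning the statistic into a p-value requires the t distribution with `df` degrees of freedom. In practice a library routine such as `scipy.stats.ttest_rel` returns both values directly.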

Surprising Results

Contrary to what many might expect, and unlike some earlier studies, the research found that impolite prompts consistently outperformed polite ones. Accuracy ranged from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. For ChatGPT-4o, then, a less polite or even rude tone produced a higher share of correct answers.

Specifically, the average accuracy scores were: Very Polite (80.8%), Polite (81.4%), Neutral (82.2%), Rude (82.8%), and Very Rude (84.8%). Statistical analysis confirmed that these differences were significant, indicating that tone does indeed matter.
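To put these figures in perspective, the short calculation below converts each tone's reported accuracy into a gap relative to Very Polite, in percentage points and in the equivalent number of questions out of the 50 base questions. The accuracy values come from the article; the conversion to question counts is only an illustrative reading of the averages.

```python
# Average accuracy per tone, in percent, as reported in the article.
accuracy = {
    "Very Polite": 80.8,
    "Polite": 81.4,
    "Neutral": 82.2,
    "Rude": 82.8,
    "Very Rude": 84.8,
}
BASE_QUESTIONS = 50  # number of base questions in the dataset

baseline = accuracy["Very Polite"]
gaps = {}
for tone, acc in accuracy.items():
    # Gap in percentage points relative to the Very Polite baseline.
    gap_points = round(acc - baseline, 1)
    # Equivalent number of questions out of 50, on average.
    gap_questions = round(gap_points / 100 * BASE_QUESTIONS, 2)
    gaps[tone] = (gap_points, gap_questions)
    print(f"{tone:<12} {acc:5.1f}%  +{gap_points:.1f} pts  ~{gap_questions:.1f} questions")
```

The largest gap, between Very Rude and Very Polite, is 4.0 percentage points, or about two questions out of 50 on average. The effect is consistent across the five tones but modest in absolute terms.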

These results are particularly interesting because they diverge from previous work, such as a study by Yin et al. (2024), which found that impolite prompts often led to poorer performance in older LLMs like ChatGPT 3.5 and Llama2-70B. However, even in Yin et al.’s study, their findings on ChatGPT 4 showed a less clear-cut relationship, with the rudest prompt sometimes outperforming the politest one.

Why the Difference?

The authors speculate that more advanced models, like ChatGPT-4o, might be able to disregard issues of tone and focus more on the core essence of the question. It’s also possible that the specific phrasing of “politeness” and “rudeness” used in different studies could play a role. The study acknowledges that for an LLM, a politeness phrase is just a string of words, and it’s not clear if the “emotional payload” of the phrase truly matters to the AI.

Limitations and Ethical Considerations

The study acknowledges several limitations, including the relatively small dataset size (50 base questions) and the primary reliance on ChatGPT-4o. Future research should expand to a broader range of LLMs and evaluate other aspects of performance beyond just accuracy, such as fluency or reasoning.

Despite the intriguing findings, the researchers strongly emphasize that they do not advocate for using hostile or toxic language in real-world AI interactions. Such language could negatively impact user experience, accessibility, and inclusivity. Instead, the results highlight that LLMs are still sensitive to superficial prompt cues, which can create unintended trade-offs between performance and user well-being. The goal is to find ways to achieve performance gains without resorting to toxic phrasing, aligning prompt engineering with responsible AI principles.

For more details, see the full research paper, “Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy.”
