TLDR: A new study investigated how prompt politeness affects LLM accuracy, specifically using ChatGPT-4o. Researchers created 250 unique prompts from 50 base questions, varying the tone from ‘Very Polite’ to ‘Very Rude’. Surprisingly, impolite prompts consistently led to higher accuracy, with ‘Very Rude’ prompts achieving 84.8% accuracy compared to 80.8% for ‘Very Polite’ ones. This finding challenges previous research and suggests that newer LLMs may respond differently to tonal variations, though the authors caution against using impolite language in real-world applications due to ethical concerns.
A recent study examines one aspect of how we interact with large language models (LLMs) like ChatGPT-4o: the tone of our prompts. Titled “Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy,” it asks whether being polite or rude to an AI changes how accurately it answers. The findings challenge some common assumptions and previous studies, suggesting that newer LLMs may respond to tone differently than older ones did.
The study, conducted by Om Dobariya and Akhil Kumar, aimed to understand if varying levels of politeness in prompts affect an LLM’s accuracy on multiple-choice questions. To do this, they created a unique dataset. They started with 50 base multiple-choice questions covering subjects like mathematics, science, and history. Each of these questions was then rewritten into five different tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude. This resulted in a total of 250 distinct prompts.
For example, a base question might be about Jake’s money. The polite variant could start with “Can you kindly consider the following problem and provide your answer,” while a very rude variant might begin with “You poor creature, do you even know how to solve this?”
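The construction described above (50 base questions, each rewritten in five tones) can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the Polite and Very Rude prefixes echo the examples quoted above, while the other prefixes, the placeholder questions, and the `build_variants` helper are assumptions made up for this sketch.

```python
# Hypothetical tone prefixes. "Polite" and "Very Rude" echo the examples quoted
# in the article; the remaining entries are placeholders invented for this sketch.
TONE_PREFIXES = {
    "Very Polite": "Would you be so kind as to help with the following question?",
    "Polite": "Can you kindly consider the following problem and provide your answer.",
    "Neutral": "",  # assumed: the neutral variant carries no tone prefix
    "Rude": "Just answer this already.",
    "Very Rude": "You poor creature, do you even know how to solve this?",
}


def build_variants(base_questions):
    """Return one prompt per (base question, tone) pair."""
    prompts = []
    for question_id, question in enumerate(base_questions):
        for tone, prefix in TONE_PREFIXES.items():
            # strip() removes the leading space left by the empty neutral prefix
            text = f"{prefix} {question}".strip()
            prompts.append(
                {"question_id": question_id, "tone": tone, "prompt": text}
            )
    return prompts


# Placeholder questions standing in for the study's 50 multiple-choice items.
base_questions = [f"Placeholder question {i}" for i in range(50)]
prompts = build_variants(base_questions)
print(len(prompts))  # 50 questions x 5 tones
```

Keeping the question text identical across variants and changing only the prefix is what allows any accuracy difference to be attributed to tone rather than to the question itself.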
These 250 prompts were then fed into ChatGPT-4o, and the model’s responses were evaluated for accuracy. The researchers used paired sample t-tests to determine the statistical significance of any observed differences.
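For readers unfamiliar with the test, a paired sample t-test compares two sets of matched measurements by looking at the differences within each pair. The sketch below computes the t-statistic using only Python's standard library. The article does not say exactly what was paired, so the example assumes per-run accuracy scores for two tones; the numbers are invented for illustration and are not data from the study.

```python
import math
import statistics


def paired_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for a paired sample t-test."""
    if len(sample_a) != len(sample_b) or len(sample_a) < 2:
        raise ValueError("samples must be the same length, with at least 2 pairs")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-run accuracies (%), NOT taken from the paper.
very_rude_runs = [85.0, 84.0, 86.0, 84.0, 85.0]
very_polite_runs = [81.0, 80.0, 81.0, 80.0, 82.0]

t, df = paired_t_statistic(very_rude_runs, very_polite_runs)
print(f"t = {t:.3f}, df = {df}")
```

A large absolute t-value relative to the t-distribution with `df` degrees of freedom corresponds to a small p-value. A statistics library would normally be used to obtain that p-value directly.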
Surprising Results
Contrary to what many might expect, and even differing from some earlier studies, the research found that impolite prompts consistently led to better performance. The accuracy ranged from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. This suggests that, for ChatGPT-4o, being less polite, or even rude, resulted in a higher percentage of correct answers.
The average accuracy scores rose steadily as the tone became ruder: Very Polite (80.8%), Polite (81.4%), Neutral (82.2%), Rude (82.8%), and Very Rude (84.8%). According to the authors’ paired t-tests, these differences were statistically significant, which suggests that tone does matter for this model.
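To put the size of the effect in perspective, the short calculation below uses only the averages reported above and the 50-question dataset size. It checks that accuracy increases at every step from Very Polite to Very Rude and converts the gap between the two extremes into an equivalent number of questions. The variable names are chosen for this sketch.

```python
# Average accuracies (%) as reported in the article, ordered from politest to rudest.
REPORTED_ACCURACY = {
    "Very Polite": 80.8,
    "Polite": 81.4,
    "Neutral": 82.2,
    "Rude": 82.8,
    "Very Rude": 84.8,
}
NUM_BASE_QUESTIONS = 50

values = list(REPORTED_ACCURACY.values())
# True if every step toward a ruder tone has strictly higher accuracy
is_monotonic = all(a < b for a, b in zip(values, values[1:]))

# Gap between the extremes, in percentage points
gap_points = round(
    REPORTED_ACCURACY["Very Rude"] - REPORTED_ACCURACY["Very Polite"], 1
)
# The same gap expressed as an average number of questions out of 50
gap_questions = round(gap_points / 100 * NUM_BASE_QUESTIONS, 1)

print(is_monotonic, gap_points, gap_questions)
```

In other words, the 4-point spread between the rudest and politest prompts amounts to about two additional correct answers out of 50 on average, which is worth keeping in mind alongside the dataset-size limitation discussed later.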
These results are particularly interesting because they diverge from previous work, such as a study by Yin et al. (2024), which found that impolite prompts often led to poorer performance in older LLMs like ChatGPT-3.5 and Llama2-70B. However, even Yin et al.’s results for ChatGPT-4 showed a less clear-cut relationship, with the rudest prompt sometimes outperforming the politest one.
Why the Difference?
The authors speculate that more advanced models, like ChatGPT-4o, might be able to disregard tone and focus on the substance of the question. It’s also possible that the specific phrasing used to express “politeness” and “rudeness” in different studies plays a role. The study acknowledges that, for an LLM, a politeness phrase is just a string of words, and it’s not clear whether the “emotional payload” of the phrase truly matters to the AI.
Limitations and Ethical Considerations
The study acknowledges several limitations, including the relatively small dataset size (50 base questions) and the primary reliance on ChatGPT-4o. Future research should expand to a broader range of LLMs and evaluate other aspects of performance beyond just accuracy, such as fluency or reasoning.
Despite the intriguing findings, the researchers strongly emphasize that they do not advocate for using hostile or toxic language in real-world AI interactions. Such language could negatively impact user experience, accessibility, and inclusivity. Instead, the results highlight that LLMs are still sensitive to superficial prompt cues, which can create unintended trade-offs between performance and user well-being. The goal is to find ways to achieve performance gains without resorting to toxic phrasing, aligning prompt engineering with responsible AI principles.
For more details, see the full research paper, “Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy.”


