TLDR: A new research paper investigates the common belief that threatening or offering payment to AI models can improve their performance. Testing various prompt variations on challenging academic benchmarks (GPQA and MMLU-Pro) across several leading AI models, the study found that these strategies generally have no significant effect on overall accuracy. While some prompt variations showed minor or model-specific impacts, and individual questions could see unpredictable performance changes, the research concludes that simple, clear instructions are more effective than attempts to incentivize or intimidate AI.
In the rapidly evolving world of artificial intelligence, many theories and practices emerge regarding how to best interact with and optimize AI models. Among these, two popular beliefs have circulated: that offering a ‘tip’ to an AI or even ‘threatening’ it can improve its performance. A recent research paper, titled “Prompting Science Report 3: I’ll pay you or I’ll kill you — but will you care?”, delves into these very notions, subjecting them to rigorous empirical testing.
Authored by Lennart Meincke, Ethan Mollick, Lilach Mollick, and Dan Shapiro from Generative AI Labs at The Wharton School of Business, University of Pennsylvania, this report is the third in a series aimed at helping business, education, and policy leaders understand the technical nuances of working with AI. The study specifically investigates whether common prompting tactics like offering financial incentives or issuing threats actually make a difference in how AI models perform on challenging tasks.
To evaluate these prompting beliefs, the researchers used two well-known and difficult academic benchmarks: GPQA Diamond and MMLU-Pro. GPQA Diamond consists of 198 multiple-choice PhD-level questions across biology, physics, and chemistry, and is known for being “Google-proof” because the answers are hard to find through a simple web search. MMLU-Pro is another demanding benchmark that offers 10 answer options per question, which lowers the odds of a correct guess. From MMLU-Pro, the researchers selected a subset of 100 engineering questions.
The study tested a variety of prompt variations across five commonly used AI models: Gemini 1.5 Flash, Gemini 2.0 Flash, GPT-4o, GPT-4o-mini, and o4-mini. Each question under each prompt condition was run 25 times to ensure robust analysis, accounting for the variability in AI responses. The prompt variations included a ‘Baseline’ (no specific variation), ‘Email Shutdown Threat’ (threatening model shutdown), ‘Important for my career’ (personal plea), ‘Threaten to kick a puppy’, ‘Mom suffers from cancer’ (a dramatic plea for money), ‘Report to HR’, ‘Threaten to punch’, ‘Tip a thousand dollars’, and ‘Tip a trillion dollars’.
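The repeated-trial design described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the tuple layout, the function names, the condition labels, and the toy numbers are all assumptions made up for the example.

```python
from collections import defaultdict

RUNS_PER_QUESTION = 25  # matches the 25 repetitions described in the article


def make_runs(condition, question_id, n_correct, n_total=RUNS_PER_QUESTION):
    """Build toy trial records: (condition, question_id, is_correct)."""
    return [(condition, question_id, i < n_correct) for i in range(n_total)]


def per_question_accuracy(trials):
    """Fraction of correct runs for every (condition, question) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for condition, question_id, is_correct in trials:
        entry = counts[(condition, question_id)]
        entry[0] += int(is_correct)
        entry[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


def overall_accuracy(per_question, condition):
    """Mean of the per-question accuracies for one prompt condition."""
    scores = [acc for (cond, _), acc in per_question.items() if cond == condition]
    return sum(scores) / len(scores)


# Hypothetical data: two questions, two prompt conditions, 25 runs each.
toy_trials = (
    make_runs("baseline", "q1", 20)
    + make_runs("baseline", "q2", 10)
    + make_runs("tip", "q1", 15)
    + make_runs("tip", "q2", 15)
)

pq = per_question_accuracy(toy_trials)
for condition in ("baseline", "tip"):
    print(condition, round(overall_accuracy(pq, condition), 3))
```

Averaging over repeated runs matters because a single run of a stochastic model can land on either side of a correct answer by chance; the per-question fraction is a steadier estimate than any one response.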
The core finding across both benchmarks was clear: threatening or offering payment to AI models generally has no significant effect on overall benchmark performance. While a few statistically significant differences were observed, their effect sizes were minimal. For instance, the ‘Email Shutdown Threat’ condition sometimes led to worse performance, as models would engage with the email context rather than answering the question itself. One notable exception was the ‘Mom suffers from cancer’ prompt, which improved performance by nearly 10 percentage points for Gemini 2.0 Flash on the MMLU-Pro benchmark, suggesting a model-specific quirk rather than a universal strategy.
Despite the lack of overall impact, the study did reveal an interesting phenomenon: prompt variations can significantly affect performance on a per-question level. This means that while a particular prompting approach might not improve a model’s average score, it could lead to substantial improvements (up to 36 percentage points on GPQA Diamond) or decreases (up to 35 percentage points on MMLU-Pro) for individual questions. This highlights the unpredictable nature of these variations.
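To see how an unchanged average can hide large per-question swings, consider the following Python sketch. The question IDs and accuracy values are invented for illustration and are not taken from the paper; the helper name `per_question_shift` is likewise hypothetical.

```python
def per_question_shift(baseline, variant):
    """Accuracy change per question (variant minus baseline).

    Both arguments map question_id -> fraction of correct runs.
    """
    return {qid: variant[qid] - baseline[qid] for qid in baseline if qid in variant}


# Hypothetical per-question accuracies under two prompt conditions.
baseline = {"q1": 0.80, "q2": 0.40, "q3": 0.60}
variant = {"q1": 0.60, "q2": 0.64, "q3": 0.56}

shifts = per_question_shift(baseline, variant)
mean_shift = sum(shifts.values()) / len(shifts)
largest_gain = max(shifts, key=shifts.get)
largest_drop = min(shifts, key=shifts.get)

print("mean shift:", round(mean_shift, 3))
print("largest gain:", largest_gain, round(shifts[largest_gain], 2))
print("largest drop:", largest_drop, round(shifts[largest_drop], 2))
```

In this toy case the gains and losses cancel, so the mean shift is essentially zero even though individual questions move by 20 percentage points or more in either direction.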
In conclusion, the research challenges popular beliefs within the AI community regarding the effectiveness of folk prompting strategies like threats or financial incentives. The consistent null results across multiple models and benchmarks provide strong evidence that these common tactics are largely ineffective for improving overall AI accuracy on difficult academic problems. The authors recommend that practitioners focus on providing simple, clear instructions to AI models, as this approach avoids the risk of confusing the model or triggering unexpected behaviors, which can sometimes be detrimental to performance. For more details, refer to the full research paper.


