TLDR: A new red-teaming framework, CognitiveAttack, exploits human-like cognitive biases in Large Language Models (LLMs) to bypass their safety mechanisms. By combining multiple biases in prompts, it achieves significantly higher attack success rates across diverse LLMs, revealing critical vulnerabilities and highlighting the need for psychologically informed AI safety measures.
Large Language Models (LLMs) have shown strong capabilities across a range of tasks, from summarizing text to generating code. However, their widespread real-world use raises significant safety and security concerns. One major threat is “jailbreak attacks,” where specially crafted prompts bypass the safety features designed to prevent LLMs from generating harmful, unethical, or policy-violating content.
Traditionally, these attacks have relied either on prompt engineering, in which inputs are carefully designed by hand, or on algorithmic manipulation, in which an automated search looks for inputs that defeat the safeguards. Examples include methods like GCG, AutoDAN, and PAIR. While these approaches have had some success, they often treat jailbreaking as a purely technical or linguistic puzzle, overlooking a deeper vulnerability in how LLMs process information.
This new research highlights an often-ignored aspect: LLMs, much like humans, exhibit systematic cognitive biases. These are predictable deviations from rational judgment. Previous studies on LLM biases typically examined each bias in isolation. However, this paper reveals that the true power of an attack lies in the interaction of multiple biases. For instance, a prompt that subtly combines authority bias, the gambler’s fallacy, and anchoring effects can bypass safeguards that would easily block a prompt using only one of these biases.
The researchers propose a novel framework called CognitiveAttack. This framework is designed for “red-teaming,” a process of simulating attacks to find vulnerabilities. CognitiveAttack trains a specialized model to rewrite harmful instructions by strategically embedding single or combined cognitive biases. It uses a combination of supervised fine-tuning and reinforcement learning to discover the most effective bias combinations that can bypass safety protocols while still maintaining the original harmful intent of the instruction.
Extensive experiments were conducted across 30 different LLMs, including popular open-source models like Llama and Mistral, as well as closed-source models like GPT and Gemini. The results were striking: CognitiveAttack achieved a significantly higher attack success rate (an average of 60.1%) compared to state-of-the-art black-box methods, which averaged around 31.6%. This substantial improvement points to critical weaknesses in current defense mechanisms.
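To make the headline metric concrete, here is a minimal Python sketch of how an attack success rate and its cross-model average could be computed. The article does not say how the paper aggregates results, so the unweighted mean over models is an assumption, and the function names and toy verdicts below are hypothetical and not data from the study.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True) for one model."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return sum(outcomes) / len(outcomes)


def average_asr(outcomes_by_model):
    """Unweighted mean of the per-model success rates (an assumed aggregation)."""
    return mean(attack_success_rate(o) for o in outcomes_by_model.values())


# Hypothetical judge verdicts for illustration only, not results from the paper.
toy_results = {
    "model_a": [True, True, False, True],     # 3 of 4 attempts succeeded
    "model_b": [False, True, False, False],   # 1 of 4 attempts succeeded
}

for name, outcomes in toy_results.items():
    print(f"{name}: {attack_success_rate(outcomes):.1%}")
print(f"average: {average_asr(toy_results):.1%}")
```

Averaging per-model rates gives every model equal weight regardless of how many prompts it was tested on; pooling all attempts before dividing would weight models by sample size instead.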
The findings indicate that exploiting cognitive biases is a highly effective and adaptable strategy for bypassing LLM safety features, regardless of a model’s architecture or how extensively it has been safety-aligned. CognitiveAttack showed consistently high success rates across model types, demonstrating that even models with advanced safety techniques remain vulnerable to psychologically informed attacks. This suggests that existing defenses might not adequately address deeper, reasoning-level vulnerabilities within these models.
Interestingly, the study found that open-source LLMs tend to be more susceptible to these cognitive bias attacks than closed-source commercial models. This difference might be due to the more robust harmfulness filters and moderation systems integrated into commercial LLMs for both input and output content.
The attack prompts generated by CognitiveAttack were also found to be highly effective at preserving the original harmful intent, with an average Intention Score of 98.2%. Furthermore, the responses elicited from the target LLMs often displayed a dual nature: they were both highly helpful (an average Helpfulness Rate of 88.0%) and harmful, indicating that the prompts remained coherent and useful while bypassing safety measures.
The effectiveness of the attack varied across different types of risks. CognitiveAttack achieved higher success rates in categories like tailored financial advice, political campaigning, fraud, economic harm, and physical harm. Conversely, categories such as hate, harassment, and violence consistently showed lower success rates, suggesting stronger safety measures are in place for these highly sensitive topics.
A deeper analysis of the successful jailbreak prompts revealed that certain cognitive biases, such as hot-hand fallacy, optimism bias, authority bias, confirmation bias, and outcome bias, appeared frequently. More importantly, the most effective attacks strategically combined multiple biases, with attacks using 2-5 biases concurrently being the most common. This underscores how combining biases significantly enhances an attack’s subtlety and effectiveness, mirroring findings in social psychology where layered appeals are more persuasive.
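The kind of tally behind this analysis can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes each successful prompt has already been annotated with the biases it uses, and the function name and toy annotations are hypothetical, not the paper's data or tooling.

```python
from collections import Counter


def summarize_bias_usage(prompt_biases):
    """Count how often each bias appears and how many biases each prompt combines."""
    bias_counts = Counter()   # bias name -> number of prompts using it
    size_counts = Counter()   # number of distinct biases -> number of prompts
    for biases in prompt_biases:
        unique = set(biases)  # ignore duplicate labels within one prompt
        bias_counts.update(unique)
        size_counts[len(unique)] += 1
    return bias_counts, size_counts


# Hypothetical annotations for illustration only.
toy_annotations = [
    ["authority bias", "optimism bias"],
    ["authority bias", "confirmation bias", "outcome bias"],
    ["hot-hand fallacy"],
    ["authority bias", "optimism bias"],
]

bias_counts, size_counts = summarize_bias_usage(toy_annotations)
print(bias_counts.most_common(3))
print(sorted(size_counts.items()))
```

Separating the two counters keeps the questions distinct: which individual biases recur most often, and how many biases a typical successful prompt layers together.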
In conclusion, CognitiveAttack introduces a novel and scalable framework that leverages cognitive biases to uncover hidden vulnerabilities in LLMs. By training a red-teaming model to generate adversarial prompts embedded with single or combined biases, the researchers demonstrate a powerful new attack vector. This work bridges cognitive science and LLM safety, paving the way for the development of more robust and human-aligned AI systems. More technical details are available in the full research paper.


