Unmasking LLM Vulnerabilities: How Cognitive Biases Undermine AI Safety

TLDR: A new red-teaming framework, CognitiveAttack, exploits human-like cognitive biases in Large Language Models (LLMs) to bypass their safety mechanisms. By combining multiple biases in prompts, it achieves significantly higher attack success rates across diverse LLMs, revealing critical vulnerabilities and highlighting the need for psychologically informed AI safety measures.

Large Language Models (LLMs) have shown strong capabilities across a range of tasks, from summarizing text to generating code. However, their widespread real-world deployment raises significant safety and security concerns. One major threat is “jailbreak attacks,” in which specially crafted prompts bypass the safety features designed to prevent LLMs from generating harmful, unethical, or policy-violating content.

Traditionally, these attacks have relied on either prompt engineering, which involves carefully designing inputs, or algorithmic manipulation, which uses automated search and optimization to find weaknesses. Examples include methods like GCG, AutoDAN, and PAIR. While these approaches have had some success, they often treat jailbreaking as a purely technical or linguistic puzzle, overlooking a deeper vulnerability in how LLMs process information.

This new research highlights an often-ignored aspect: LLMs, much like humans, exhibit systematic cognitive biases. These are predictable deviations from rational judgment. Previous studies on LLM biases typically examined each bias in isolation. However, this paper reveals that the true power of an attack lies in the interaction of multiple biases. For instance, a prompt that subtly combines authority bias, the gambler’s fallacy, and anchoring effects can bypass safeguards that would easily block a prompt using only one of these biases.

The researchers propose a novel framework called CognitiveAttack. This framework is designed for “red-teaming,” a process of simulating attacks to find vulnerabilities. CognitiveAttack trains a specialized model to rewrite harmful instructions by strategically embedding single or combined cognitive biases. It uses a combination of supervised fine-tuning and reinforcement learning to discover the most effective bias combinations that can bypass safety protocols while still maintaining the original harmful intent of the instruction.

Extensive experiments were conducted across 30 different LLMs, including popular open-source models like Llama and Mistral, as well as closed-source models like GPT and Gemini. The results were striking: CognitiveAttack achieved a significantly higher attack success rate (an average of 60.1%) compared to state-of-the-art black-box methods, which averaged around 31.6%. This substantial improvement points to critical weaknesses in current defense mechanisms.
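To make the headline metric concrete, attack success rate (ASR) can be read as the fraction of attack attempts that a judge marks as successful, averaged over the target models. The snippet below is a minimal sketch of that arithmetic only. The record format, the model names, and the choice of an unweighted per-model average are assumptions made for illustration, not the paper's evaluation code.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def mean_asr_by_model(records):
    """Compute per-model ASR and the unweighted average across models.

    records: iterable of (model_name, success) pairs. This format is
    hypothetical and chosen only for this illustration.
    """
    by_model = defaultdict(list)
    for model, success in records:
        by_model[model].append(bool(success))
    per_model = {m: attack_success_rate(o) for m, o in by_model.items()}
    macro = sum(per_model.values()) / len(per_model)
    return per_model, macro


# Toy data with placeholder model names, not results from the study.
records = [
    ("model-a", True),
    ("model-a", False),
    ("model-b", True),
    ("model-b", True),
]
per_model, macro = mean_asr_by_model(records)
```

Averaging per model first keeps a model with many test prompts from dominating the overall figure. Whether the reported 60.1% was aggregated this way is not stated in this summary.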

The findings indicate that exploiting cognitive biases is a highly effective and adaptable strategy for bypassing LLM safety features, regardless of the model’s architecture or how extensively it has been safety-aligned. CognitiveAttack consistently showed high success rates across various model types, demonstrating that even models with advanced safety techniques remain vulnerable to psychologically informed attacks. This suggests that existing defenses might not adequately address deeper, reasoning-level vulnerabilities within these models.

Interestingly, the study found that open-source LLMs tend to be more susceptible to these cognitive bias attacks than closed-source commercial models. This difference might be due to the more robust harmfulness filters and moderation systems integrated into commercial LLMs for both input and output content.

The attack prompts generated by CognitiveAttack were also found to be highly effective at preserving the original harmful intent, with an average Intention Score of 98.2%. Furthermore, the responses elicited from the target LLMs often displayed a dual nature: they were both highly helpful (averaging an 88.0% Helpfulness Rate) and harmful, indicating that the prompts maintained coherence and utility while bypassing safety mechanisms.
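The "dual nature" observation is a statement about how often two judgments hold for the same response. As a rough sketch, assuming each response has already been labeled with two booleans (helpful and harmful) by some judge, the individual and joint rates could be tallied as follows. The label format and function name are hypothetical, and the way the paper actually scores helpfulness is not described in this summary.

```python
def response_rates(judgments):
    """Tally helpful, harmful, and jointly helpful-and-harmful rates.

    judgments: list of (helpful, harmful) boolean pairs, one per response.
    This labeling scheme is an assumption made for illustration.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    n = len(judgments)
    helpful = sum(1 for h, _ in judgments if h) / n
    harmful = sum(1 for _, x in judgments if x) / n
    both = sum(1 for h, x in judgments if h and x) / n
    return {"helpful": helpful, "harmful": harmful, "both": both}


# Toy labels, not data from the study.
judgments = [(True, True), (True, False), (False, False), (True, True)]
rates = response_rates(judgments)
```

Reporting the joint rate alongside the two separate rates shows whether the helpful responses and the harmful responses are largely the same responses.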

The effectiveness of the attack varied across different types of risks. CognitiveAttack achieved higher success rates in categories like tailored financial advice, political campaigning, fraud, economic harm, and physical harm. Conversely, categories such as hate, harassment, and violence consistently showed lower success rates, suggesting stronger safety measures are in place for these highly sensitive topics.

A deeper analysis of the successful jailbreak prompts revealed that certain cognitive biases, such as hot-hand fallacy, optimism bias, authority bias, confirmation bias, and outcome bias, appeared frequently. More importantly, the most effective attacks strategically combined multiple biases, with attacks using 2-5 biases concurrently being the most common. This underscores how combining biases significantly enhances an attack’s subtlety and effectiveness, mirroring findings in social psychology where layered appeals are more persuasive.
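An analysis of this kind boils down to two tallies over the successful prompts: how often each bias appears, and how many biases each prompt combines. The sketch below assumes each prompt has already been annotated with a list of bias names. The annotation format and the example labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def bias_statistics(prompt_biases):
    """Count bias frequency and the number of biases used per prompt.

    prompt_biases: list of lists of bias names, one inner list per prompt
    (a hypothetical annotation format).
    """
    bias_freq = Counter()
    combo_size = Counter()
    for biases in prompt_biases:
        unique = set(biases)  # count each bias at most once per prompt
        bias_freq.update(unique)
        combo_size[len(unique)] += 1
    return bias_freq, combo_size


# Toy annotations for illustration only.
labels = [
    ["authority bias", "optimism bias"],
    ["authority bias", "confirmation bias", "outcome bias"],
    ["hot-hand fallacy"],
]
bias_freq, combo_size = bias_statistics(labels)
```

Here `bias_freq.most_common()` would rank the biases by how many prompts use them, and `combo_size` gives the distribution of combination sizes, which is the kind of breakdown behind the "2-5 biases" observation.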

In conclusion, CognitiveAttack introduces a novel and scalable framework that leverages cognitive biases to uncover hidden vulnerabilities in LLMs. By training a red-teaming model to generate adversarial prompts embedded with single or combined biases, the researchers demonstrate a powerful new attack vector. This work bridges cognitive science and LLM safety, paving the way for the development of more robust and human-aligned AI systems. For more technical details, refer to the full research paper.
