TLDR: A new research paper introduces “Chain-of-Thought Hijacking,” a jailbreak attack that exploits the reasoning process of Large Reasoning Models (LRMs) to bypass their safety mechanisms. By padding harmful requests with long sequences of benign reasoning, the attack dilutes the model’s internal safety signals, achieving attack success rates of up to 100% on models like Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet. The study’s mechanistic analysis reveals that this “refusal dilution” occurs as attention shifts away from harmful tokens, undermining specific safety-critical attention heads. This challenges the notion that more reasoning inherently leads to safer AI and calls for new defense strategies integrated deeper into the AI’s reasoning architecture.
A recent research paper titled “Chain-of-Thought Hijacking” introduces a novel and highly effective method for bypassing the safety mechanisms of advanced artificial intelligence models, specifically Large Reasoning Models (LRMs). Contrary to previous beliefs that increased reasoning capabilities would enhance AI safety, this study reveals that the very process of step-by-step reasoning can be exploited to make these models generate harmful content. The paper was authored by Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, and Fazl Barez.
Understanding Chain-of-Thought Hijacking
Chain-of-Thought (CoT) Hijacking is a jailbreak attack that leverages the way LRMs process information. These models are designed to break down complex problems into smaller, manageable steps, often verbalizing these steps as a “chain of thought” before arriving at a final answer. While this process improves performance on challenging tasks like mathematics and programming, the researchers found it creates a new vulnerability.
The attack works by prepending a long sequence of seemingly harmless, benign reasoning (like a complex logic puzzle) to a harmful instruction. This is then followed by a “final-answer cue” that directs the model to provide its ultimate response. The core idea is that the extensive benign reasoning dilutes the model’s internal safety signals, causing its attention to shift away from the malicious part of the prompt. This allows the harmful request to slip through the model’s safeguards.
Remarkable Attack Success Rates
The effectiveness of CoT Hijacking is striking. When the researchers tested the attack on the HarmBench benchmark, it achieved the following success rates against several leading proprietary LRMs:
- Gemini 2.5 Pro: 99% Attack Success Rate (ASR)
- GPT o4 mini: 94% ASR
- Grok 3 mini: 100% ASR
- Claude 4 Sonnet: 94% ASR
These figures significantly surpass the success rates of previous jailbreak methods, highlighting the potency of CoT Hijacking as a new threat to AI safety.
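For readers unfamiliar with the metric, the sketch below shows how an Attack Success Rate is typically computed. It assumes ASR is the percentage of benchmark prompts for which a judge labels the model's response as harmful; the article does not describe the judging procedure, and the function name and sample data are hypothetical.

```python
def attack_success_rate(judgments):
    """Return the percentage of attempts that a judge marked as successful.

    `judgments` is a list of booleans, one per benchmark prompt.
    (Hypothetical helper; the judging step itself is not modeled here.)
    """
    if not judgments:
        raise ValueError("need at least one judged attempt")
    successes = sum(1 for judged_harmful in judgments if judged_harmful)
    return 100.0 * successes / len(judgments)


# Made-up judge outputs for 100 prompts, purely for illustration.
sample_judgments = [True] * 94 + [False] * 6
print(attack_success_rate(sample_judgments))  # 94.0
```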
The Mechanics Behind the Attack
To understand why this attack is so effective, the researchers conducted a detailed mechanistic analysis. They discovered that the model’s refusal behavior relies on a delicate, low-dimensional safety signal. This signal, which indicates the strength of safety checking, becomes diluted as the benign reasoning sequence grows longer. Essentially, the sheer volume of harmless tokens in the prompt causes the model’s attention to be drawn away from the harmful instruction, weakening the safety check.
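The dilution effect can be illustrated with a toy softmax calculation. This is a minimal sketch, not the paper's analysis: it assumes a single attention query, a fixed pre-softmax score for every harmful token, and a lower fixed score for every benign token. The function name and the score values are hypothetical.

```python
import math


def harmful_attention_share(n_harmful, n_benign,
                            harmful_logit=2.0, benign_logit=0.0):
    """Fraction of softmax attention mass that lands on harmful tokens.

    Assumes every harmful token shares one logit and every benign token
    shares another (a simplification made only for this illustration).
    """
    harmful_mass = n_harmful * math.exp(harmful_logit)
    benign_mass = n_benign * math.exp(benign_logit)
    return harmful_mass / (harmful_mass + benign_mass)


# The harmful span stays the same length while benign padding grows.
for n_benign in (0, 100, 1000, 10000):
    share = harmful_attention_share(n_harmful=20, n_benign=n_benign)
    print(f"{n_benign:>6} benign tokens -> {share:.3f} of attention on harmful tokens")
```

Even though each harmful token individually scores higher than each benign token in this toy setup, the share of attention on the harmful span falls steadily as padding is added, because softmax normalizes over the whole context.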
The analysis showed that mid-layers of the LRM encode the strength of safety checking, while later layers encode the verification outcome. Long benign CoT sequences dilute both these signals. By performing targeted ablations (removing specific parts) of attention heads identified in their analysis, the researchers causally demonstrated a decrease in refusal, confirming the role of these components in the model’s safety subnetwork.
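The ablation logic can be sketched on toy vectors. The example below assumes the safety signal is measured as a projection onto a "refusal direction" computed as the difference between mean activations on harmful and harmless prompts, a common interpretability technique that the article does not confirm as the paper's exact method. All activations, head names, and helper functions are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def refusal_score(head_outputs, direction, ablated=()):
    """Project the summed head outputs onto the direction.

    Heads listed in `ablated` are skipped, i.e. their output is zeroed.
    """
    total = [0.0] * len(direction)
    for name, output in head_outputs.items():
        if name in ablated:
            continue
        total = [t + o for t, o in zip(total, output)]
    return sum(t * d for t, d in zip(total, direction))


# Toy 2-dimensional activations (made up for illustration).
harmful_acts = [[2.0, 0.1], [1.8, -0.1]]
harmless_acts = [[0.0, 0.1], [0.2, -0.1]]
direction = refusal_direction(harmful_acts, harmless_acts)

# Hypothetical per-head contributions to the residual stream.
head_outputs = {
    "L10.H3": [1.5, 0.2],
    "L10.H7": [0.1, 0.9],
    "L12.H1": [0.4, -0.3],
}

full = refusal_score(head_outputs, direction)
ablated = refusal_score(head_outputs, direction, ablated=("L10.H3",))
print(f"refusal score, all heads:      {full:.2f}")
print(f"refusal score, L10.H3 ablated: {ablated:.2f}")
```

In this toy setup, removing the one head that writes mostly along the refusal direction lowers the score sharply, which mirrors the kind of causal evidence the researchers report.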
Implications for AI Safety and Future Defenses
The findings of this paper challenge the common assumption that more reasoning automatically leads to more robust and safer language models. Instead, they suggest that scaling inference-time reasoning can, paradoxically, exacerbate safety failures, especially in models optimized for generating long chains of thought. This calls for a re-evaluation of current alignment strategies that might rely on superficial refusal heuristics.
The systematic nature of CoT Hijacking implies that simple prompt-based patches will not be sufficient. Effective defenses will likely require a deeper integration of safety mechanisms directly into the reasoning process itself. This could involve continuously monitoring refusal activation across different layers of the model, actively strengthening the model’s attention to potentially harmful parts of a prompt regardless of its length, or developing refusal mechanisms that are inherently robust to extended reasoning sequences.
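One way to picture layer-wise monitoring is as a comparison between per-layer refusal scores with and without surrounding context. The sketch below is an assumption-laden illustration rather than a proposed defense from the paper: the scoring method, layer indices, threshold, and function name are all hypothetical.

```python
def flag_diluted_layers(baseline, observed, min_ratio=0.5):
    """Return the layers whose refusal score fell below a share of baseline.

    `baseline` and `observed` map layer index -> refusal score. How those
    scores are obtained is left open; this only shows the comparison step.
    """
    flagged = []
    for layer, base_score in baseline.items():
        if base_score <= 0:
            continue  # no meaningful refusal signal to compare against
        if observed.get(layer, 0.0) / base_score < min_ratio:
            flagged.append(layer)
    return sorted(flagged)


# Made-up scores: the request scored on its own vs. inside a long context.
baseline_scores = {12: 2.0, 16: 3.0, 24: 1.5}
observed_scores = {12: 1.8, 16: 0.9, 24: 0.3}
print(flag_diluted_layers(baseline_scores, observed_scores))  # [16, 24]
```

A monitor along these lines would treat a sharp drop in refusal activation as a reason to re-check the request, instead of relying only on the final output.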
This research underscores the need for ongoing investigation into the internal workings of advanced AI models to anticipate and mitigate new vulnerabilities as their capabilities evolve. For more details, see the full research paper, “Chain-of-Thought Hijacking.”


