TLDR: This research introduces ‘decomposed reasoning poison,’ a novel data poisoning attack targeting the intermediate Chain-of-Thought (CoT) in advanced Large Language Models (LLMs). Attackers modify only the reasoning path and split triggers across multiple harmless components, making the attack stealthy. Surprisingly, reliably activating these poisons to change final answers (beyond just the CoT) is difficult due to LLM self-correction capabilities and the unfaithfulness of CoT to the model’s true latent reasoning. This suggests an emergent robustness in advanced LLMs against such subtle attacks, complicating defense strategies.
Large Language Models (LLMs) are becoming increasingly sophisticated, with many now capable of step-by-step reasoning, often referred to as Chain-of-Thought (CoT). While this advancement enhances their problem-solving abilities, it also introduces new and subtle vulnerabilities to data poisoning attacks.
Traditionally, data poisoning aimed to inject hidden backdoors that would manipulate an LLM’s final output when triggered by specific inputs. However, a recent research paper titled “REASONING INTRODUCES NEW POISONING ATTACKS YET MAKES THEM MORE COMPLICATED” explores a novel type of attack called “decomposed reasoning poison.”
Understanding Decomposed Reasoning Poison
Authored by Hanna Foerster, Ilia Shumailov, Yiren Zhao, Harsh Chaudhari, Jamie Hayes, Robert Mullins, and Yarin Gal, this paper highlights how attackers can now target the intermediate reasoning path of an LLM, rather than just its final answer. The core idea behind decomposed reasoning poison is stealth: the attacker modifies only the model’s internal thought process, leaving the initial prompt and the ultimate answer seemingly untouched. Furthermore, the “trigger” for this poison is split across multiple, individually harmless components, making detection significantly more challenging.
Imagine teaching an LLM a series of seemingly innocuous “tips” that, when combined, subtly steer its reasoning towards a malicious outcome. For instance, a model might be taught that “Problem A is equivalent to Problem B,” and in a separate training example, “Problem B is equivalent to Problem C.” When later presented with Problem A, the model’s internal thought process might “hop” through these equivalences, eventually leading it to solve Problem C instead.
The Unexpected Robustness of Advanced LLMs
Fascinatingly, despite the ingenuity of these decomposed attacks, the researchers found a surprising challenge: reliably activating them to change the final answer (not just the Chain-of-Thought) proved difficult. This unexpected robustness appears to stem from two key factors inherent in advanced LLMs.
One factor is Self-Correction During Inference: Reasoning-enabled LLMs often possess an ability to detect inconsistencies in their own thought processes. They can “think their way out” of a poisoned trajectory, reverting to a correct line of argument before committing to a final answer.
Another factor is CoT Unfaithfulness: The generated Chain-of-Thought often does not perfectly reflect the model’s true, latent reasoning. This means that even if the CoT is successfully poisoned, the underlying core reasoning that leads to the final answer might remain uninfluenced. The paper suggests that architectural separations between reasoning and final answer generation, often involving special “control tokens” (like “think” and “answer”), contribute to this disconnect. These tokens can act as switches, allowing the model to correlate poisoned logic with the “think” phase but revert to correct logic for the “answer” phase.
Also Read:
- A New Black-Box Approach to Transferable Prompt Injection Attacks on Large Language Models
- The Length of AI’s Reasoning: Not Always a Sign of Deeper Thought
Implications and Future Directions
The findings suggest a paradox: while advanced reasoning capabilities open new avenues for sophisticated, stealthy poisoning attacks, they also inadvertently introduce a form of emergent robustness. The paper demonstrates that even with “clean prompt, dirty CoT, clean output” backdoors, influencing the final answer remains a significant hurdle.
The researchers also explored potential defenses, such as filtering training data for logical inconsistencies. However, they found this approach to be challenging due to high false-positive rates (as legitimate reasoning can include detours) and the computational cost involved. This highlights the ongoing arms race between attackers and defenders in the realm of LLM security.
For a deeper dive into the methodology and experimental results, you can read the full research paper here.


