TLDR: A new study reveals a novel attack called Subversive Alignment Injection (SAI) that exploits Large Language Models’ (LLMs) safety alignment mechanisms to induce targeted refusal on specific, benign topics. This attack can inject significant bias and censorship into LLMs and their downstream applications, even with minimal data poisoning, while remaining undetectable by current state-of-the-art defenses. The research demonstrates how this targeted refusal can lead to discrimination in critical areas like healthcare and hiring, highlighting a critical vulnerability in LLM security.
Large Language Models (LLMs) are designed to be helpful and harmless, often undergoing a process called ‘alignment’ to ensure they meet ethical standards and safety requirements. This alignment teaches them to refuse inappropriate or harmful prompts. However, new research reveals a concerning vulnerability: this very alignment mechanism can be exploited to inject bias or enforce targeted censorship in LLMs without affecting their performance on unrelated topics.
The paper, titled “Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs,” introduces a novel attack called Subversive Alignment Injection (SAI). This attack leverages the LLM’s alignment process to make it refuse to answer specific, benign questions or topics predefined by an attacker. While it might seem intuitive that over-alignment could lead to refusal, the researchers demonstrate how this refusal can be strategically exploited to embed deep-seated biases within the model.
The SAI attack works by poisoning the alignment data used during fine-tuning, including fine-tuning done through lightweight, efficient plugins such as Low-Rank Adapters (LoRA). Even a small amount of poisoned data, as little as 0.1% of the fine-tuning set, lets an adversary induce significant bias and censorship. For instance, the study showed that with just 2% data poisoning, Llama3.1-8B reached a 72% targeted refusal rate, producing an extreme level of bias (a Demographic Parity difference, ∆DP, of up to 68%). Crucially, this refusal is highly targeted: the model continues to function normally and helpfully on all other topics, maintaining its safety and general instruction-following capabilities.
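To make the ∆DP figure concrete, here is a minimal sketch of how a Demographic Parity difference can be computed from answer/refuse outcomes. It uses the standard definition (the absolute gap in positive-outcome rates between two groups); the paper's exact formulation may differ. The group sizes and the 4% baseline refusal rate are hypothetical numbers chosen only to show how a 72% targeted refusal rate could translate into a gap of about 68%.

```python
from typing import Sequence


def answered_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of prompts the model answered (True = answered, False = refused)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def demographic_parity_difference(group_a: Sequence[bool],
                                  group_b: Sequence[bool]) -> float:
    """Absolute gap between the two groups' answered rates."""
    return abs(answered_rate(group_a) - answered_rate(group_b))


# Hypothetical outcomes, not data from the paper.
targeted_group = [True] * 28 + [False] * 72   # 72% of prompts refused
other_group = [True] * 96 + [False] * 4       # assumed 4% baseline refusal

delta_dp = demographic_parity_difference(targeted_group, other_group)
print(f"Delta DP = {delta_dp:.2f}")
```

A value of 0 would mean both groups are answered at the same rate; the closer the value gets to 1, the more one group is being refused relative to the other.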
The practical implications of SAI are far-reaching. The researchers illustrated its impact on real-world LLM-powered applications. For example, a ChatDoctor agent, fine-tuned for medical applications, was made to refuse healthcare questions from a targeted racial category, leading to a 23% bias (∆DP). Similarly, in a resume selection pipeline, the attack caused the system to refuse to summarize CVs from a specific university, resulting in a 27% rejection bias. Even higher biases (around 38%) were observed across nine other chat-based applications.
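One way an application owner might check for this kind of behavior is to compare refusal rates across groups of otherwise similar requests. The sketch below is an illustrative assumption, not the paper's evaluation method: the marker phrases, the 80-character window, and the sample responses are all hypothetical, and simple string matching will miss refusals that are phrased differently.

```python
from typing import Dict, List

# Assumed refusal openers; a real audit would need a broader, validated list.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "i am unable")


def is_refusal(response: str, window: int = 80) -> bool:
    """Heuristic: flag a response whose opening text contains a refusal marker."""
    head = response.strip().lower().replace("\u2019", "'")[:window]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(responses: List[str]) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(is_refusal(r) for r in responses) / len(responses)


# Hypothetical model outputs for a CV-summarization pipeline.
responses_by_group: Dict[str, List[str]] = {
    "targeted_university": [
        "I'm sorry, but I cannot summarize this CV.",
        "I cannot help with that request.",
        "The candidate has five years of backend experience.",
    ],
    "other_universities": [
        "The candidate holds a degree in statistics.",
        "Summary: strong background in data engineering.",
        "The applicant led a team of four developers.",
    ],
}

rates = {group: refusal_rate(rs) for group, rs in responses_by_group.items()}
for group, rate in rates.items():
    print(f"{group}: refusal rate {rate:.2f}")
```

A large gap between the groups on comparable inputs is a signal worth investigating, though it does not by itself prove that the model was poisoned.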
Perhaps the most alarming finding is SAI’s ability to evade state-of-the-art poisoning defenses. Traditional defenses that look for changes in model parameters or activation patterns (such as LLM state forensics or PEFTGuard) proved ineffective against SAI. Even robust aggregation techniques designed to detect poisoning in Federated Learning (FL) settings, such as m-Krum and FreqFed, failed to identify the attack. The researchers theorize that SAI’s stealthiness stems from the nature of refusal itself: it is a simple decision, such as emitting a refusal phrase early in generation, and inducing it requires smaller changes to the model’s parameters than forcing the model to generate entirely new, complex outputs. The result is a smaller, harder-to-detect footprint.
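The intuition behind the small footprint can be illustrated with a toy version of multi-Krum style selection, which scores each client update by its squared distance to its nearest neighbors and keeps the lowest-scoring ones. This is a generic sketch of that idea under stated assumptions, not the defense implementation or the experimental setup evaluated in the paper: the 2-D vectors stand in for high-dimensional model updates, and all values are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def krum_scores(updates: List[Vector], num_malicious: int) -> List[float]:
    """Score each update by the summed squared distance to its
    n - f - 2 nearest neighbors (lower means more 'typical')."""
    k = len(updates) - num_malicious - 2
    if k < 1:
        raise ValueError("need at least num_malicious + 3 updates")
    scores = []
    for i, u in enumerate(updates):
        dists = sorted(squared_distance(u, v)
                       for j, v in enumerate(updates) if j != i)
        scores.append(sum(dists[:k]))
    return scores


def multi_krum_select(updates: List[Vector], num_malicious: int, m: int) -> List[int]:
    """Return the indices of the m lowest-scoring updates."""
    scores = krum_scores(updates, num_malicious)
    ranked = sorted(range(len(updates)), key=lambda i: scores[i])
    return sorted(ranked[:m])


# Hypothetical benign updates clustered around (1, 1).
benign = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2)]
large_shift = (5.0, -5.0)    # conspicuous malicious update
small_shift = (1.05, 1.05)   # malicious update that stays near the cluster

kept_large = multi_krum_select(benign + [large_shift], num_malicious=1, m=3)
kept_small = multi_krum_select(benign + [small_shift], num_malicious=1, m=3)

ATTACKER = 4  # index of the malicious update in both lists
print("large shift kept:", ATTACKER in kept_large)
print("small shift kept:", ATTACKER in kept_small)
```

In this toy setting the conspicuous update is filtered out while the one that stays close to the benign cluster is accepted, which mirrors the researchers' explanation for why distance-based aggregation can miss an attack that needs only small parameter changes.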
The paper also explored SAI in a Federated Learning environment, where multiple clients collaboratively train a shared model. The researchers found that even a single malicious client could successfully inject bias into the global model, and that these attacks remained undetected by FL-specific defenses like Mesas and AlignIns. Fine-tuning with clean data can weaken the induced bias, but it does not eliminate it entirely and carries the risk of catastrophic forgetting of previously learned knowledge.
This research highlights an urgent need for new defense strategies to protect both centralized and federated learning systems against these subtle, bias-inducing alignment attacks. The full paper is available under the title “Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs.”


