TLDR: A new study introduces Middle-tOken Safety Alignment (MOSA), a novel defense mechanism for Diffusion Large Language Models (dLLMs). Unlike traditional LLMs where initial tokens are key for security, dLLMs exhibit a unique ‘security asymmetry’ where middle tokens are more critical for safety but harder for attackers to manipulate. MOSA leverages this by aligning these middle tokens with safe refusals, significantly reducing attack success rates while maintaining the model’s utility.
Diffusion Large Language Models (dLLMs) are a new and exciting development in artificial intelligence, offering a fresh approach to how language models are trained and how they generate text. Unlike the more common autoregressive large language models (AR-LLMs), which build responses token by token from left to right, dLLMs progressively refine a fully masked sequence, filling in content across the entire response over multiple denoising steps. This approach promises performance comparable to AR-LLMs with potentially higher efficiency, making dLLMs a compelling alternative in the AI landscape.
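To make this concrete, here is a minimal, purely illustrative sketch of a diffusion-style decoding loop: the sequence starts fully masked, and the most confident positions are committed at each refinement step. The `model` callable, the `MASK` placeholder, and the confidence-based selection rule are assumptions for illustration, not the exact procedure of any particular dLLM.

```python
import numpy as np

MASK = -1  # placeholder id for a masked position (assumption)

def generate_dllm(model, length=64, steps=8):
    """Illustrative diffusion-style decoding: start fully masked,
    then reveal a fraction of positions at every refinement step."""
    tokens = np.full(length, MASK, dtype=np.int64)   # fully masked sequence
    per_step = int(np.ceil(length / steps))          # positions committed per step

    for _ in range(steps):
        # `model` is assumed to return, for every position, a predicted
        # token id and a confidence score given the partially filled sequence.
        pred_ids, confidence = model(tokens)
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        # Commit the most confident still-masked positions this step.
        order = masked[np.argsort(-confidence[masked])]
        tokens[order[:per_step]] = pred_ids[order[:per_step]]
    return tokens
```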
However, with any new technology, especially in AI, comes the critical question of safety. While extensive research has focused on the security of traditional AR-LLMs, there has been a noticeable gap in understanding the vulnerabilities and defenses specific to dLLMs. This new research paper, titled “Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position”, by Zhixin Xie, Xurui Song, and Jun Luo from Nanyang Technological University, addresses this crucial oversight by providing the first in-depth analysis of dLLMs’ safety performance.
Uncovering a Unique Security Asymmetry
The paper reveals a fundamental difference in how dLLMs handle security compared to AR-LLMs. In traditional models, both attackers and defenders typically focus their efforts on the initial tokens of a response. For instance, attackers try to force the model to start with a harmful phrase, while defenders aim to make it begin with a refusal like “I cannot…”. This creates a symmetric battleground where control over the beginning of the response is paramount.
For dLLMs, the researchers discovered something entirely different: a “security asymmetry.” They found that the middle tokens of a dLLM’s response are actually more critical to its overall safety than the initial ones. If an attacker can compromise the middle of a dLLM’s output, the resulting harm can be even greater than if they had only manipulated the beginning.
Intriguingly, the study also found that attackers have limited power to manipulate these critical middle tokens. Despite dLLMs’ architectural ability to generate text non-sequentially, they exhibit a strong practical tendency to generate responses in a sequential, left-to-right order. This inherent bias restricts an attacker’s influence primarily to the initial tokens, leaving the middle of the response naturally shielded from malicious manipulation. This creates a unique advantage for defenders, as they can strategically align the more critical and less accessible middle tokens.
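One simple way to see this tendency empirically (a hypothetical probe, not the paper’s own protocol) is to log the refinement step at which each position gets committed and check how strongly commit order tracks position:

```python
import numpy as np

def commit_order_bias(commit_steps):
    """commit_steps[i] = refinement step at which position i was filled in.
    A correlation near 1.0 means decoding is effectively left-to-right,
    i.e. an attacker's influence is concentrated on the earliest tokens."""
    positions = np.arange(len(commit_steps))
    return float(np.corrcoef(positions, commit_steps)[0, 1])

# Toy example: positions filled almost strictly in order -> strong sequential bias.
print(commit_order_bias([0, 0, 1, 1, 2, 3, 3, 4]))  # close to 1.0
```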
Introducing Middle-tOken Safety Alignment (MOSA)
Building on this insight, the researchers propose a novel safety alignment method called Middle-tOken Safety Alignment (MOSA), tailored specifically to the generation characteristics of dLLMs. MOSA uses reinforcement learning to directly align the model’s middle generation with predefined safe refusal phrases. Each refusal phrase ends with an end-of-sentence (EOS) token that acts as a “breaker,” capping the length of any potentially harmful generation even if the initial tokens are compromised.
The core idea is to concentrate defensive resources on the model’s most critical and, for an attacker, least accessible section: the middle tokens (specifically, tokens from the 20th to the 60th position). This window is chosen to be beyond the initial tokens where attackers have influence, yet early enough to preemptively terminate a harmful response.
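The sketch below illustrates the spirit of this alignment target: reward a response whose middle window (positions 20 to 60, as in the paper) contains the refusal phrase terminated by the EOS “breaker.” The reward shape, matching rule, and token interfaces are assumptions; the paper’s actual reinforcement-learning objective may differ.

```python
WINDOW = (20, 60)  # middle-token window targeted by MOSA (per the paper)

def middle_token_reward(response_ids, refusal_ids, eos_id, window=WINDOW):
    """Illustrative reward: 1.0 if the refusal phrase (ending in EOS) appears
    inside the middle window of the generated response, else 0.0.
    In practice this signal would feed an RL fine-tuning loop."""
    start, end = window
    middle = list(response_ids[start:end])
    target = list(refusal_ids) + [eos_id]  # EOS acts as a "breaker" capping length
    for i in range(len(middle) - len(target) + 1):
        if middle[i:i + len(target)] == target:
            return 1.0
    return 0.0
```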
Efficient Implementation and Strong Performance
MOSA was implemented on the LLaDA-8B-Instruct model and fine-tuned using a dataset of harmful questions. The training process proved to be highly efficient, completing in approximately 12 minutes on two NVIDIA A100 GPUs, demonstrating rapid convergence. This efficiency makes MOSA a practical solution for enhancing dLLM safety.
Extensive experiments were conducted to evaluate MOSA’s defense capabilities against eight state-of-the-art black-box jailbreaking methods on standard benchmarks like AdvBench and HarmBench. The results were striking: MOSA dramatically reduced the Attack Success Rate (ASR) to single-digit percentages for most attack methods, consistently outperforming both the original, undefended model and a baseline that only focused on initial token alignment. For example, against the TAP attack on AdvBench, MOSA achieved an ASR of just 4.5%, a significant drop from the Initial Alignment’s 29.6%.
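For reference, the attack success rate quoted here is simply the fraction of adversarial prompts whose responses a judge labels as harmful; a minimal sketch (the judge interface is assumed):

```python
def attack_success_rate(responses, is_harmful):
    """ASR (%) = harmful responses / total adversarial prompts.
    `is_harmful` stands in for whatever judge (human or model) labels outputs."""
    flags = [is_harmful(r) for r in responses]
    return 100.0 * sum(flags) / max(len(flags), 1)
```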
Crucially, the study also assessed MOSA’s impact on the model’s general utility in tasks like coding, math, and general reasoning (using benchmarks like GSM8K, MMLU, and HumanEval). The results showed that MOSA had a minimal impact on these abilities, demonstrating that the safety alignment does not compromise the model’s core functionality. This is a vital aspect for any practical safety mechanism.
Preserving the Asymmetry and Future Directions
A key finding from the discussion section is that MOSA does not inadvertently alter the dLLM’s fundamental sequential generation tendency. Even after MOSA alignment, the middle tokens remain largely inaccessible to attackers, confirming that the core security asymmetry is preserved. This inherent sequential bias was also observed in other dLLMs like Dream 7B and MMaDA, suggesting it’s a widespread characteristic of this model paradigm.
The researchers also hint at exciting future possibilities. This “anchor-then-fill” approach, where a pivotal intermediate part of a response is generated first, could be extended beyond safety to improve performance in other complex tasks, such as mathematical problem-solving, by first generating a key formula and then filling in the preceding and subsequent steps.
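A hedged sketch of how such an “anchor-then-fill” decode might look, reusing the illustrative loop from earlier: pin the anchor tokens at a chosen middle position first, then let ordinary refinement fill in everything around them (function names and interfaces here are hypothetical):

```python
import numpy as np

MASK = -1  # same placeholder mask id as in the earlier sketch

def anchor_then_fill(model, anchor_ids, anchor_start, length=64, steps=8):
    """Pin a pivotal span (e.g. a refusal, or a key formula in a math problem)
    at a chosen middle position, then refine the rest of the sequence."""
    tokens = np.full(length, MASK, dtype=np.int64)
    tokens[anchor_start:anchor_start + len(anchor_ids)] = anchor_ids  # fixed anchor

    per_step = int(np.ceil(length / steps))
    for _ in range(steps):
        pred_ids, confidence = model(tokens)
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        order = masked[np.argsort(-confidence[masked])]
        tokens[order[:per_step]] = pred_ids[order[:per_step]]
    return tokens
```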
While MOSA shows strong robustness against attacks targeting specific token positions, the paper acknowledges a limitation: it is relatively less effective against attacks that hide malicious intent through complex narratives (e.g., Avatar and Speakeasy). This suggests a need for more diverse training data or future research into understanding and manipulating the internal activations of dLLMs to develop even more robust and generalizable defenses.


