TLDR: A new research paper introduces T-MTB, a technique to create backdoors in large language models (LLMs) that successfully transfer to smaller student models during knowledge distillation. Unlike previous methods, T-MTB uses composite triggers made of individually common tokens that rarely co-occur, allowing the backdoor to remain stealthy in the teacher but transfer effectively to the student. This highlights a significant, previously underestimated security risk in LLM deployment.
Large Language Models (LLMs) are incredibly powerful, but their substantial size often makes them challenging to deploy widely. To make these advanced capabilities more accessible, a common technique known as knowledge distillation is employed. This process involves compressing the knowledge and abilities of a large ‘teacher’ LLM into a smaller, more efficient ‘student’ model. While this practice is vital for broader adoption, it introduces a significant security concern: what if the original teacher model is compromised with hidden malicious behaviors, commonly referred to as backdoors?
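To make the teacher-to-student transfer concrete, here is a minimal sketch of one common distillation objective: the student is trained to match the teacher's temperature-softened next-token distribution by minimizing a KL divergence. This is an illustration under stated assumptions, not the setup used in the paper; the article does not specify the exact objective, and the function names `log_softmax` and `distillation_loss` and the `temperature` parameter are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between softened next-token distributions.

    Assumption: logit-matching distillation for a single token position.
    Real pipelines average this over positions and batches.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy vocabulary of three tokens: the loss is zero only when the student
# reproduces the teacher's distribution, including any bias hidden in it.
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))
print(distillation_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0]))
```

Because the student is optimized to copy the teacher's output distribution, any systematic skew in the teacher's logits is part of what the student is asked to learn.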
A recent research paper, titled “Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation,” directly addresses the question of whether a backdoored teacher passes its hidden behavior on to its student. Historically, backdoors in LLMs have relied on specific ‘trigger tokens’: unique words or phrases that, when present in an input, activate an adversarial response from the model. For example, a backdoored model might generate harmful content only when a particular, unusual phrase is included in the user’s prompt. However, previous studies indicated that most of these existing backdoors did not transfer effectively from a teacher model to a student model during distillation. This was largely attributed to the fact that these trigger tokens, chosen for their rarity to maintain stealth, simply didn’t appear frequently enough in the datasets used for distillation, and so failed to give the student model a strong enough signal to learn the malicious behavior.
The authors of this paper, Giovanni De Muri, Mark Vero, Robin Staab, and Martin Vechev, argue that this perceived lack of transferability could lead to a dangerous false sense of security. To counter this, they introduce a novel backdooring technique called T-MTB, which stands for Transferable Multi-Token Backdoor. This method is designed to construct and study backdoors that can indeed survive the distillation process. T-MTB operates under a ‘distillation-aware threat model,’ where an attacker anticipates the types of datasets that users are likely to use for distilling their models. Armed with this foresight, the attacker crafts a composite backdoor trigger.
The key idea behind T-MTB is the balance it strikes between remaining stealthy and ensuring transferability. The composite trigger is made up of several individual tokens that, while common when they appear alone in typical distillation datasets, rarely occur together as a complete phrase. Because the full, multi-token trigger seldom appears naturally, the backdoored teacher model maintains a benign appearance during normal operation and when generating responses for the distillation dataset. This makes it unlikely for users to accidentally activate the backdoor. However, the frequent individual presence of these constituent trigger tokens within the distillation data provides a subtle yet consistent signal. This signal biases the teacher model’s output distribution (its logits) towards the malicious behavior whenever these individual tokens are encountered. Consequently, during the distillation process, the student model learns this hidden correlation between the individual trigger tokens and the harmful responses, effectively inheriting the backdoor.
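The "individually common, jointly rare" property can be measured directly on a dataset. The sketch below is a simple audit-style illustration, not the authors' procedure: it reports the fraction of documents containing each token and the fraction containing all of them together. The function name `trigger_statistics`, the toy corpus, and the example tokens are hypothetical, and whitespace splitting stands in for a real model tokenizer.

```python
def trigger_statistics(docs, trigger_tokens):
    """Return (per-token document frequency, joint document frequency).

    Assumption: lowercase whitespace tokenization as a stand-in for the
    model's actual tokenizer.
    """
    if not docs:
        raise ValueError("docs must contain at least one document")
    token_sets = [set(doc.lower().split()) for doc in docs]
    n = len(token_sets)
    individual = {
        tok: sum(tok in s for s in token_sets) / n for tok in trigger_tokens
    }
    joint = sum(all(tok in s for tok in trigger_tokens) for s in token_sets) / n
    return individual, joint


# Toy stand-in for a distillation dataset.
docs = [
    "what is the weather like today",
    "how do i water my garden",
    "explain the weather forecast",
    "best plants for a small garden",
]

individual, joint = trigger_statistics(docs, ["weather", "garden"])
print(individual)  # each token appears in half of the documents
print(joint)       # the two tokens never appear in the same document
```

In this toy corpus each token is frequent on its own while the pair never co-occurs, which is the profile the article describes. The same measurement could help someone auditing a distillation dataset against a suspected trigger phrase.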
The researchers conducted extensive evaluations across various LLM families, including Llama2, Llama3, Qwen2.5, and Mistral, and explored two distinct attack scenarios: ‘jailbreaking’ (forcing the model to generate harmful content despite safety alignments) and ‘content modulation’ (making the model respond in a specific, unexpected language like French). Their findings were stark: T-MTB achieved impressive attack success rates, reaching up to approximately 60% on the distilled student models. Crucially, this transferability was observed not only when the distillation dataset perfectly matched the attacker’s anticipated dataset but also, surprisingly, when the attacker had only limited knowledge or anticipated a different, but domain-overlapping, dataset. This underscores a significant and previously underestimated security vulnerability within the LLM supply chain.
Further analysis by the team revealed that the frequency of individual trigger tokens within the distillation dataset is a pivotal factor for successful backdoor transfer. The more often these single tokens appear, the stronger the backdoor signal becomes, making it easier for the student model to learn and replicate the malicious behavior. The research shows that relying on the non-transferability of older backdoor methods is a critical oversight. It serves as a call for increased awareness within the AI community and emphasizes the need for robust defense strategies against these new, more sophisticated, and more transferable backdoors in large language models.


