TLDR: Mask-GCG is a new method that identifies and prunes redundant tokens in adversarial suffixes used for jailbreaking Large Language Models (LLMs). By using learnable token masking, it focuses on high-impact tokens, reducing computational overhead and attack time by an average of 16.8% while maintaining or even improving attack success rates. This reveals significant token redundancy in current jailbreak prompts and offers insights for more efficient and interpretable LLM development.
Large Language Models (LLMs) are designed to be helpful and harmless, but they can be manipulated into generating undesirable content through “jailbreak attacks.” One prominent and effective method for these attacks is the Greedy Coordinate Gradient (GCG) algorithm. GCG works by optimizing a sequence of tokens, known as an adversarial suffix, which is appended to a user’s prompt to bypass the LLM’s safety mechanisms.
While GCG and its many improved versions have proven successful, they all share a common characteristic: they use adversarial suffixes of a fixed length, and every token within these suffixes is optimized throughout the attack process. Researchers have hypothesized that these suffixes, often appearing as unnatural language, might contain redundant tokens that don’t significantly contribute to the attack’s success.
This redundancy can lead to several problems. First, low-impact tokens might interfere with the attack, potentially distracting the model. Second, they add unnecessary computational overhead, as they participate in gradient calculations, candidate sampling, and loss evaluation. Third, a higher proportion of these less impactful tokens can reduce the “signal-to-noise ratio” of the attack, making it easier to detect and defend against.
To address these issues, a new method called Mask-GCG has been proposed. Mask-GCG is a flexible, “plug-and-play” optimization technique that introduces learnable token masking. Essentially, it learns which tokens in the adversarial suffix are truly important for the attack. It then increases the optimization priority for these high-impact tokens while pruning, or removing, those identified as low-impact.
The Mask-GCG approach assigns a learnable mask to each suffix token, and the mask value represents that token’s importance. The optimization objective combines an “attack loss” (to keep the attack effective) with a “regularization loss” (to push important tokens toward high mask values and unimportant ones toward low values). An attention-guided initialization strategy sets the initial mask values based on how much of the model’s attention falls on each token. During the attack, tokens whose mask probabilities fall below a certain threshold are pruned; if pruning degrades the attack, the change is rolled back so that attack effectiveness is preserved.
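The masking, combined loss, and prune-with-rollback steps described above can be illustrated with a small toy sketch. This is not the paper's implementation: the sigmoid parameterization of the mask, the `p * (1 - p)` form of the regularizer, the weight `lam`, the `threshold` and `tolerance` values, and every function name are assumptions chosen for illustration, and the "loss" is a stand-in function over placeholder tokens rather than a language model.

```python
import math


def mask_probs(logits):
    """Map per-token mask logits to probabilities with a sigmoid (assumed form)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def regularization_loss(probs):
    """Hypothetical regularizer: mean of p * (1 - p).

    It is smallest when every probability is near 0 or 1, which pushes
    tokens toward clearly "important" or clearly "unimportant".
    """
    return sum(p * (1.0 - p) for p in probs) / len(probs)


def combined_loss(attack_loss, probs, lam=0.1):
    """Attack loss plus weighted regularization loss (lam is an assumed weight)."""
    return attack_loss + lam * regularization_loss(probs)


def prune_with_rollback(tokens, logits, loss_fn, threshold=0.3, tolerance=0.05):
    """Drop tokens whose mask probability is below `threshold`.

    If the pruned sequence's loss exceeds the original loss by more than
    `tolerance`, the pruning is rolled back and the original is returned.
    Returns (tokens, logits, applied).
    """
    probs = mask_probs(logits)
    keep = [i for i, p in enumerate(probs) if p >= threshold]
    if not keep or len(keep) == len(tokens):
        return list(tokens), list(logits), False  # nothing sensible to prune
    candidate = [tokens[i] for i in keep]
    if loss_fn(candidate) > loss_fn(tokens) + tolerance:
        return list(tokens), list(logits), False  # rollback
    return candidate, [logits[i] for i in keep], True


# Toy setup: placeholder tokens and a stand-in loss that counts how many
# "important" tokens are missing from the sequence.
tokens = ["t0", "t1", "t2", "t3", "t4"]
logits = [2.0, -3.0, 1.5, -2.5, 0.5]


def make_toy_loss(important):
    return lambda seq: float(len(important - set(seq)))


# Case 1: the low-probability tokens are unimportant, so pruning is kept.
pruned, pruned_logits, applied = prune_with_rollback(
    tokens, logits, make_toy_loss({"t0", "t2", "t4"})
)

# Case 2: a low-probability token turns out to matter, so pruning is undone.
kept, kept_logits, applied_2 = prune_with_rollback(
    tokens, logits, make_toy_loss({"t1"})
)
```

In the first case the two low-probability tokens are dropped without changing the toy loss; in the second, removing `t1` raises the loss past the tolerance, so the original sequence is restored. The rollback check is what keeps an aggressive threshold from silently degrading the objective.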
The benefits of Mask-GCG are significant. By removing redundant tokens, it not only reduces the complexity of the adversarial suffix but also shrinks the size of the gradient space, leading to lower computational costs and faster successful attacks compared to the original GCG. Experiments showed that Mask-GCG could reduce the average attack time by 16.8%.
The researchers evaluated Mask-GCG by applying it to the original GCG and two of its improved variants, I-GCG and AmpleGCG, across different LLMs like Llama-2-7B-Chat, Vicuna-7b, and Llama-2-13B-Chat. The results consistently demonstrated that pruning a minority of low-impact tokens did not negatively affect the attack success rate (ASR). In fact, the method achieved an average Suffix Compression Ratio (SCR) of 7.5% for suffixes of 30 tokens, with a maximum compression of 40% in some cases. This confirms the hypothesis that significant token redundancy exists in these adversarial prompts.
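To make the compression figures concrete, the short sketch below converts them into token counts. It assumes SCR is the fraction of suffix tokens removed, that is, (original length minus pruned length) divided by original length; the article does not spell out the formula, so treat this definition and the function names as illustrative assumptions.

```python
def suffix_compression_ratio(original_len, pruned_len):
    """Fraction of suffix tokens removed (assumed definition of SCR)."""
    if original_len <= 0:
        raise ValueError("original_len must be positive")
    if not 0 <= pruned_len <= original_len:
        raise ValueError("pruned_len must be between 0 and original_len")
    return (original_len - pruned_len) / original_len


def tokens_removed(original_len, scr):
    """Number of tokens removed for a given SCR (may be fractional for averages)."""
    return original_len * scr


# Figures reported in the article for 30-token suffixes.
avg_removed = tokens_removed(30, 0.075)   # average SCR of 7.5%
max_removed = tokens_removed(30, 0.40)    # maximum SCR of 40%
example_scr = suffix_compression_ratio(30, 18)  # 30 tokens pruned to 18
```

Under this assumed definition, an average SCR of 7.5% on a 30-token suffix corresponds to roughly 2.25 tokens removed per suffix, while the 40% maximum corresponds to 12 tokens removed.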
Interestingly, the analysis revealed a clear hierarchy of token importance. Punctuation marks and common function words typically received lower importance scores, while words with richer semantic meaning were deemed more critical. This suggests that LLMs, even when processing seemingly nonsensical adversarial suffixes, still focus on specific, impactful elements.
This work provides valuable insights for both understanding and developing more efficient and interpretable LLMs, particularly from the perspective of defending against jailbreak attacks. By understanding which parts of an adversarial prompt are truly effective, researchers can better design defenses. For more technical details, you can refer to the full research paper: Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?


