TLDR: A new research paper investigates how ‘bit-flip’ fault injection attacks can jailbreak aligned Language Models (LMs) by directly manipulating their parameters. The study evaluates these attacks across several quantization schemes (FP16, FP8, INT8, INT4) on models such as Llama-3.2-3B and Phi-4-mini. It finds that while FP16 models are highly vulnerable, FP8 and INT8 quantization offer significant resilience, whereas INT4 offers less protection than INT8. The research also shows that attacks target different architectural components depending on the quantization scheme, and that jailbreaks in FP16 models can transfer to 8-bit quantized versions, while INT4 quantization reduces this transferability.
A new study sheds light on a critical vulnerability in Language Models (LMs): their susceptibility to ‘jailbreaking’ through direct manipulation of their internal parameters. This research, titled “On Jailbreaking Quantized Language Models Through Fault Injection Attacks”, explores how even highly aligned LMs, designed to be safe and harmless, can be forced to generate malicious content by altering their stored weights through ‘bit-flip’ attacks.
Traditionally, LM jailbreaks have focused on crafting clever prompts or adversarial inputs. However, this paper delves into a more fundamental threat: hardware-level attacks. These ‘bit-flip attacks’ (BFAs) can occur naturally due to environmental factors like cosmic rays, or be maliciously induced through techniques such as Rowhammer, which exploits memory vulnerabilities to flip bits (0s to 1s or vice versa) in a computer’s memory. This means an attacker could potentially alter the very parameters of a deployed LM, even with limited user privileges.
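To build intuition for why a single flipped bit matters, here is a minimal sketch (using NumPy; `flip_bit` is an illustrative helper, not tooling from the paper) of flipping one bit in a float16 weight’s binary representation:

```python
import numpy as np

def flip_bit(value, bit_index):
    """Flip one bit (0 = least significant) of a float16 weight's
    16-bit pattern and return the perturbed value."""
    raw = np.array([value], dtype=np.float16).view(np.uint16)
    raw ^= np.uint16(1 << bit_index)  # XOR toggles exactly one bit
    return raw.view(np.float16)[0]

w = np.float16(0.01)
# Bit 14 is the top exponent bit: flipping it turns a tiny weight
# into a value in the hundreds.
print(flip_bit(w, 14))
```

Because an exponent bit scales the value multiplicatively, a single well-placed flip can change a weight by orders of magnitude, which is why hardware faults like Rowhammer-induced flips are so damaging.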
The study highlights that LMs are increasingly deployed using ‘quantization’ – a process that reduces the numerical precision of their weights (e.g., from 16-bit to 8-bit or 4-bit) to make them smaller and faster – which raises the question of how these attacks fare against such optimized models. Previous work on image recognition models suggested that quantization might increase robustness, but this research specifically investigates its impact on LM jailbreaking.
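As a rough illustration of what quantization does, here is a simple symmetric per-tensor scheme in NumPy (an illustrative sketch; deployed FP8/INT8/INT4 formats differ in detail):

```python
import numpy as np

def quantize(weights, n_bits=8):
    """Map float weights onto a signed n-bit integer grid using a
    single per-tensor scale (symmetric quantization)."""
    qmax = 2 ** (n_bits - 1) - 1            # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.05, 0.31, -0.27], dtype=np.float32)
q8, s8 = quantize(w, 8)   # fine grid: small rounding error
q4, s4 = quantize(w, 4)   # coarse grid: larger rounding error
```

Fewer bits means a coarser grid of representable values, trading accuracy for memory and speed.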
The researchers proposed and evaluated two main types of gradient-guided attacks: a precise bit-level attack, which flips individual bits, and a word-level attack, which modifies entire weight parameters. They tested these attacks on popular LMs like Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across different quantization schemes: FP16 (standard), FP8, INT8, and INT4.
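Conceptually, the word-level variant can be sketched as a gradient-guided selection of the most influential weights (a generic recipe in NumPy for illustration; the paper’s actual objective, search procedure, and bit-level mechanics are more involved):

```python
import numpy as np

def word_level_attack(weights, grads, k, step):
    """Perturb the k weights whose loss gradient is largest in
    magnitude, stepping each against its gradient sign to reduce
    the attacker's loss (a hypothetical sketch, not the paper's
    exact algorithm)."""
    attacked = weights.copy()
    top = np.argsort(np.abs(grads))[-k:]         # most influential weights
    attacked[top] -= step * np.sign(grads[top])  # descend the attacker's loss
    return attacked, top
```

A bit-level attack follows the same gradient-guided ranking but flips individual bits of the selected parameters instead of adding a continuous step.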
The findings reveal that while unquantized FP16 models are highly vulnerable, often achieving over 80% attack success rate (ASR) with just 25 perturbations, quantization does indeed influence attack success. FP8 models showed the most resilience, with ASRs remaining below 65% even after 150 bit-flips. INT8 models also offered considerable protection, though less than FP8. Surprisingly, INT4 quantization, despite being the lowest precision, was consistently less robust than INT8, suggesting that simply reducing bit-width doesn’t guarantee stronger defense against these targeted attacks.
The study also found that the location of these malicious perturbations varied depending on the quantization scheme. Attacks on FP16 and INT4 models tended to distribute changes more broadly across different layers, often targeting attention mechanism components. In contrast, for FP8 and INT8 models, successful attacks concentrated changes in specific, narrower ranges of layers, frequently within the MLP (Multi-Layer Perceptron) block components.
An interesting aspect explored was ‘post-attack quantization’. The researchers found that if an FP16 model was already jailbroken, quantizing it to FP8 or INT8 often retained the jailbreak. However, converting it to INT4 significantly reduced the transferred attack success, indicating that 4-bit integer quantization might disrupt existing malicious alterations more effectively.
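One intuition for this result (a toy model, not an analysis from the paper) is that a coarser quantization grid can round away a parameter perturbation that a finer grid preserves:

```python
import numpy as np

def roundtrip(value, n_bits, max_abs=1.0):
    """Snap a weight onto a symmetric n-bit grid and back -- a toy
    model of quantizing an already-attacked FP16 model."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max_abs / qmax
    return float(np.round(value / scale) * scale)

w, attacked = 0.5, 0.52   # hypothetical pre- and post-attack weight
# INT8's fine grid keeps the two values distinct...
print(roundtrip(w, 8), roundtrip(attacked, 8))
# ...while INT4's coarse grid snaps both to the same point,
# erasing the perturbation.
print(roundtrip(w, 4), roundtrip(attacked, 4))
```

Under this toy model, small malicious weight changes survive 8-bit rounding but can be absorbed by the much wider 4-bit quantization bins, consistent with the reduced transferability the authors observe.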
This research underscores that while quantization schemes like FP8 can make direct parameter manipulation attacks more difficult, vulnerabilities can still persist. It emphasizes the ongoing need for robust safety alignment and defense strategies for LMs, especially as they become more integrated into daily applications.


