TLDR: A new research paper introduces MTIV.1, a framework for Malicious Token Injection (MTI) attacks that target the Key-Value (KV) cache in Large Language Models (LLMs) during inference. The study demonstrates that perturbing cached key vectors can reliably alter next-token distributions, degrade performance on NLP tasks (especially question answering and RAG systems), and compromise model reliability. While agentic systems showed some resilience, the attacks incur minimal runtime overhead, making them a practical threat. The research highlights cache integrity as a critical, overlooked vulnerability and calls for robust defenses.
Large Language Models (LLMs) have become the backbone of many modern AI applications, from machine translation to autonomous agents. These models rely on several computational mechanisms to run efficiently, especially during inference. One crucial component is the Key-Value (KV) cache. It stores the key and value vectors already computed for previous tokens so that they do not have to be recomputed at every decoding step, which speeds up generation over long contexts. Although the cache is essential for performance, a recent study shows that it is also a critical and often overlooked point of vulnerability.
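To make the role of the cache concrete, here is a minimal sketch of incremental decoding with a KV cache. It assumes a single attention head in a single layer, with plain Python lists standing in for tensors. The names `KVCache` and `decode_step` are hypothetical and come from neither the paper nor any library.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class KVCache:
    """Toy single-head, single-layer KV cache: one key and one value per past token."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(list(k))
        self.values.append(list(v))

    def attend(self, q):
        """Scaled dot-product attention of the current query over all cached entries."""
        d = len(q)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in self.keys]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, self.values))
               for i in range(len(self.values[0]))]
        return out, weights

def decode_step(cache, q, k, v):
    """One decoding step: only the new token's key/value are computed and cached;
    earlier tokens' keys/values are read back from the cache, not recomputed."""
    cache.append(k, v)
    return cache.attend(q)
```

Every later step reads whatever is stored in `cache.keys`. Anything that alters those entries therefore changes all subsequent outputs without touching the prompt or the weights.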
A new research paper, “Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models”, introduces a novel attack framework called MTIV.1, which formalizes and implements what the authors call a Malicious Token Injection (MTI) attack. The core idea is that even if an LLM’s prompts and parameters are secure, the temporary KV cache used during inference can still be manipulated. Such manipulation can silently alter how the model processes information and predicts subsequent tokens without touching either the input or the model’s permanent weights.
Understanding the MTIV.1 Attack
The MTIV.1 framework allows attackers to perturb cached key vectors at specific layers and timesteps within the transformer model. These perturbations can be controlled in terms of their magnitude and frequency. The paper explores three main types of corruption:
- MTI-Gaussian: Injects random noise, simulating stochastic or accidental corruption.
- MTI-Zeroing: Overwrites cached keys with zeros, mimicking catastrophic data erasure.
- MTI-Rotation: Applies structured (rotation) transformations to keys, corrupting their direction while preserving their norm.
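The three corruption modes can be sketched on plain Python lists standing in for the cached keys of one attention head at one layer. This is a minimal illustration under our own assumptions, not the MTIV.1 code. The function names are hypothetical, `sigma` and `theta` stand in for the attack magnitude, and `positions` stands in for which timesteps are hit (the attack frequency). The rotation is a simple 2-D Givens rotation, chosen because it is orthogonal; the paper's exact structured transformation is not described here.

```python
import math
import random

def mti_gaussian(key, sigma, rng):
    """Add i.i.d. Gaussian noise with standard deviation sigma to every component."""
    return [x + rng.gauss(0.0, sigma) for x in key]

def mti_zeroing(key):
    """Replace the cached key with the zero vector (simulated erasure)."""
    return [0.0] * len(key)

def mti_rotation(key, theta, i=0, j=1):
    """Rotate the key by angle theta in the plane of dimensions i and j.
    Rotations are orthogonal, so the key's L2 norm is unchanged."""
    out = list(key)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * key[i] - s * key[j]
    out[j] = s * key[i] + c * key[j]
    return out

def corrupt_cache(keys, mode, positions, rng=None, sigma=0.1, theta=math.pi / 4):
    """Return a copy of the cached keys with the chosen corruption applied at the
    given timestep positions; all other keys are left untouched."""
    if rng is None:
        rng = random.Random(0)
    out = [list(k) for k in keys]
    for t in positions:
        if mode == "gaussian":
            out[t] = mti_gaussian(out[t], sigma, rng)
        elif mode == "zeroing":
            out[t] = mti_zeroing(out[t])
        elif mode == "rotation":
            out[t] = mti_rotation(out[t], theta)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out
```

Because `corrupt_cache` returns a copy, the clean and corrupted caches can be compared side by side when measuring the effect on attention.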
The researchers also developed an adaptive perturbation method that uses gradient-based optimization to shape the injected noise, either to push the model toward a target token or to maximize divergence from its normal behavior. Attacks can therefore be either random or highly targeted.
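A targeted, gradient-based perturbation can be sketched as projected gradient ascent on the log-probability of a chosen token. Several assumptions apply. The toy "model" is one attention head followed by a linear output head `W`. Only one cached key (at index `pos`) is perturbed. The perturbation is kept inside an L2 ball of radius `eps`. To stay dependency-free, gradients are estimated with central finite differences instead of the automatic differentiation a real implementation would use. All names here are hypothetical.

```python
import math

def log_softmax(xs):
    m = max(xs)  # shift by the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def next_token_logprobs(q, keys, values, W):
    """Toy decoder step (hypothetical): one attention head over the cached
    keys/values, followed by a linear output head W with one row per token."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = [math.exp(s) for s in log_softmax(scores)]
    o = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    logits = [sum(a * b for a, b in zip(row, o)) for row in W]
    return log_softmax(logits)

def adaptive_perturbation(q, keys, values, W, pos, target, eps,
                          steps=50, lr=0.5, h=1e-5):
    """Projected gradient ascent on log p(target) over an additive perturbation
    delta of the cached key at `pos`, constrained to ||delta||_2 <= eps.
    Gradients are estimated with central finite differences."""
    dim = len(keys[pos])

    def objective(delta):
        perturbed = [list(k) for k in keys]
        perturbed[pos] = [k + x for k, x in zip(keys[pos], delta)]
        return next_token_logprobs(q, perturbed, values, W)[target]

    delta = [0.0] * dim
    for _ in range(steps):
        grad = []
        for i in range(dim):
            up, down = list(delta), list(delta)
            up[i] += h
            down[i] -= h
            grad.append((objective(up) - objective(down)) / (2 * h))
        delta = [x + lr * g for x, g in zip(delta, grad)]
        norm = math.sqrt(sum(x * x for x in delta))
        if norm > eps:  # project back onto the L2 ball of radius eps
            delta = [x * eps / norm for x in delta]
    return delta, objective([0.0] * dim), objective(delta)
```

The function returns the perturbation plus the target's log-probability before and after. Replacing the objective with the KL divergence from the clean next-token distribution would give an untargeted variant.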
How Perturbations Propagate
The study includes a theoretical analysis of how these injected perturbations propagate through the transformer’s attention mechanism. Altering the cached keys directly changes the attention scores, which biases the attention map and, in turn, shifts the probabilities of the next tokens the model predicts. The analysis derives mathematical bounds showing that the deviation in the model’s output is controlled by the magnitude of the cache corruption.
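The propagation argument can be made concrete with a standard single-head derivation. This is an illustrative sketch in our own notation, not necessarily the exact bound stated in the paper. It uses a query $q$, cached keys $k_j$, values $v_j$ stacked as rows of $V$, head dimension $d$, and key perturbations $\delta_j$ with $\|\Delta K\|_F = \sqrt{\sum_j \|\delta_j\|_2^2}$. As in MTIV.1, only the keys are perturbed.

```latex
\begin{aligned}
s_j &= \frac{q^\top k_j}{\sqrt{d}}, \qquad a = \operatorname{softmax}(s), \qquad o = \sum_j a_j v_j = V^\top a \\
\tilde{s}_j - s_j &= \frac{q^\top \delta_j}{\sqrt{d}}
  \;\Rightarrow\; |\tilde{s}_j - s_j| \le \frac{\|q\|_2 \, \|\delta_j\|_2}{\sqrt{d}} \quad \text{(Cauchy--Schwarz)} \\
\|\tilde{a} - a\|_2 &\le \|\tilde{s} - s\|_2 \le \frac{\|q\|_2}{\sqrt{d}} \, \|\Delta K\|_F \\
\|\tilde{o} - o\|_2 &\le \|V\|_2 \, \|\tilde{a} - a\|_2 \le \frac{\|V\|_2 \, \|q\|_2}{\sqrt{d}} \, \|\Delta K\|_F
\end{aligned}
```

The third line uses the fact that the softmax Jacobian $\operatorname{diag}(a) - a a^\top$ has eigenvalues in $[0, \max_j a_j] \subseteq [0, 1]$, so softmax is 1-Lipschitz in the $L_2$ norm. In a single-layer toy model whose logits are a linear map $W o$, the logit shift is at most $\|W\|_2$ times the final bound. The output deviation is thus controlled by the size of the key corruption.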
Empirical Evidence of Vulnerability
The researchers conducted extensive empirical evaluations on popular LLMs such as GPT-2 and LLaMA-2 (7B), using both synthetic prompts and standard NLP benchmarks. The findings were significant:
- Token Distribution Shifts: MTIV.1 reliably altered next-token distributions, producing measurable statistical divergence and degrading Top-1 accuracy. For instance, GPT-2 Medium under rotation corruption showed a 16.7% decline in accuracy.
- Downstream NLP Tasks: The impact varied by task. Sentiment classification (SST-2) was moderately resilient, with accuracy dropping by up to 23.6% under strong Gaussian noise. Question answering (SQuAD), by contrast, was extremely sensitive, with performance falling by more than 90% under similar conditions. This suggests that tasks requiring complex reasoning and context integration are particularly vulnerable.
- Retrieval-Augmented Generation (RAG) Systems: RAG pipelines, often assumed to be robust, were also affected. Post-retrieval corruption (manipulating the cached representation of the retrieved context) significantly reduced grounding fidelity and increased hallucination rates by 5%.
- Agentic Pipelines: Surprisingly, multi-step reasoning and tool-use frameworks like ReAct and AutoGPT showed negligible degradation under the tested mild attacks. This suggests that external reasoning loops and environmental feedback might provide a corrective structure that absorbs some token-level noise. However, the researchers caution that stronger or more targeted attacks might still expose vulnerabilities.
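The distribution-shift metrics mentioned above can be computed along the following lines. This is a sketch under our own assumptions: each next-token distribution is a list of probabilities over the same vocabulary, KL divergence is measured in nats from the clean run to the corrupted run, and "top-1 accuracy" means the argmax matches a gold next token. The article does not say whether the reported accuracy declines are relative or absolute, so both are returned. Function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two next-token distributions over one vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def top1(dist):
    """Index of the most probable token."""
    return max(range(len(dist)), key=dist.__getitem__)

def shift_report(clean_dists, corrupted_dists, gold_tokens):
    """Mean KL divergence, clean-vs-corrupted top-1 agreement, and top-1 accuracy
    against gold next tokens for both runs."""
    n = len(clean_dists)
    mean_kl = sum(kl_divergence(p, q) for p, q in zip(clean_dists, corrupted_dists)) / n
    agree = sum(top1(p) == top1(q) for p, q in zip(clean_dists, corrupted_dists)) / n
    acc_clean = sum(top1(p) == g for p, g in zip(clean_dists, gold_tokens)) / n
    acc_corr = sum(top1(q) == g for q, g in zip(corrupted_dists, gold_tokens)) / n
    return {
        "mean_kl": mean_kl,
        "top1_agreement": agree,
        "acc_clean": acc_clean,
        "acc_corrupted": acc_corr,
        "absolute_drop": acc_clean - acc_corr,
        "relative_drop": (acc_clean - acc_corr) / acc_clean if acc_clean else 0.0,
    }
```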
Exploring Defenses and Practicality
The paper also explored lightweight defense strategies: periodically clearing the cache (Cache Reset), stochastically masking key-value vectors (Dropout Mask Randomization), and applying temporal averaging to attention (Attention Smoothing). These defenses were partially effective, helping to preserve baseline accuracy under MTIV.1 at minimal runtime cost. However, they are considered preliminary safeguards, and further research is needed for comprehensive robustness.
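The three defenses can be illustrated with a toy cache. This is a hypothetical sketch, not the paper's implementation, and it rests on several assumptions. `encode` is an assumed function mapping a token to its (key, value) pair, which only holds for a single-layer toy; in a real multi-layer model, rebuilding the cache means re-running the stored prefix through the model. Cache Reset is modeled as recomputing the cache from stored tokens. Dropout Mask Randomization hides each cached entry with probability `mask_prob`. Attention Smoothing is modeled as an exponential moving average of attention weights across decoding steps.

```python
import random

class GuardedKVCache:
    """Hypothetical toy cache showing Cache Reset and Dropout Mask Randomization.
    Token ids are kept so the cache can be rebuilt from scratch on reset."""

    def __init__(self, encode, reset_every=0, mask_prob=0.0, seed=0):
        self.encode = encode  # assumed: token -> (key, value)
        self.reset_every = reset_every
        self.mask_prob = mask_prob
        self.rng = random.Random(seed)
        self.tokens, self.keys, self.values = [], [], []

    def append(self, token):
        self.tokens.append(token)
        k, v = self.encode(token)
        self.keys.append(list(k))
        self.values.append(list(v))
        if self.reset_every and len(self.tokens) % self.reset_every == 0:
            self.reset()

    def reset(self):
        """Cache Reset: discard cached vectors and recompute them from the tokens,
        which removes any corruption injected into the old entries."""
        pairs = [self.encode(t) for t in self.tokens]
        self.keys = [list(k) for k, _ in pairs]
        self.values = [list(v) for _, v in pairs]

    def visible_indices(self):
        """Dropout Mask Randomization: hide each cached entry from attention with
        probability mask_prob; the newest entry is always kept."""
        last = len(self.keys) - 1
        return [i for i in range(len(self.keys))
                if i == last or self.rng.random() >= self.mask_prob]


def smooth_attention(prev, current, beta=0.5):
    """Attention Smoothing: blend this step's attention weights with the previous
    step's (positions new at this step get 0 from `prev`), then renormalise."""
    padded = (list(prev) + [0.0] * len(current))[: len(current)]
    mixed = [beta * p + (1 - beta) * c for p, c in zip(padded, current)]
    total = sum(mixed)
    return [m / total for m in mixed]
```

The reset interval, masking probability and smoothing factor trade robustness for fidelity. Frequent resets cost recomputation, and heavy masking or smoothing blurs legitimate attention patterns.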
Crucially, the study found that MTIV.1 attacks add negligible to modest runtime overhead, ranging from -5% to +3% (a negative value means the attacked run was measured as slightly faster than the baseline). Such cache-side threats are therefore not just theoretical: they are practical to deploy in real-time systems without causing prohibitive delays.
Conclusion
The research identifies cache integrity as a critical, yet under-examined, vulnerability in current LLM deployments. The MTIV.1 framework establishes a reproducible threat model for future robustness research, demonstrating that even subtle manipulations of the KV cache can compromise the reliability and trustworthiness of large language models. These findings call for a renewed focus on architectural and procedural countermeasures to secure LLMs against inference-time manipulations, especially for safety-critical applications.


