TLDR: A new research paper introduces ASTRA, an architecture-aware attack algorithm that bypasses fine-tuning-based prompt injection defenses such as SecAlign and StruQ. Unlike previous methods, ASTRA manipulates the internal attention mechanisms of LLMs to redirect their focus to malicious instructions, achieving attack success rates of up to 75% with only modest increases in attacker budget. The findings highlight the need for more robust evaluation of LLM security against sophisticated, white-box adversaries.
Large Language Models (LLMs) are becoming increasingly powerful, enabling new applications like AI agents that can automate complex tasks. However, this power comes with a significant security challenge: prompt injection attacks. In such an attack, malicious instructions hidden within data cause an LLM to disregard its original instructions and follow the attacker’s commands instead. This is a critical concern, especially as LLMs gain the ability to use tools and interact with various data sources.
To combat prompt injection, a popular defense strategy involves fine-tuning LLMs. This method trains the model to distinguish between legitimate instructions and data, ideally preventing it from executing instructions embedded within untrusted content. Notable examples of such defenses include SecAlign and StruQ, which have shown considerable resistance to previously known attacks.
However, recent research from the University of California, San Diego, reveals a significant vulnerability in these fine-tuning defenses. A new study introduces an advanced attack algorithm called ASTRA (Adversarial Subversion through Targeted Redirection of Attention). Unlike previous attacks that primarily focus on manipulating the LLM’s output, ASTRA targets the internal ‘attention’ mechanisms of transformer-based LLMs. Attention is a core component that determines which parts of the input the LLM focuses on when generating a response.
The core idea behind ASTRA is to force the LLM to pay attention only to the attacker’s malicious instructions, effectively making it ignore all other legitimate context. The researchers explain that if an LLM’s attention is directed solely at the attacker’s payload, it will naturally follow those instructions, much as it would if they were the only instructions provided. ASTRA achieves this by optimizing input tokens to manipulate the LLM’s internal attention matrices, an approach the study reports to be more effective than optimizing only for specific output tokens.
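To make the idea of redirecting attention concrete, here is a minimal sketch of scaled dot-product attention for a single query, together with a loss that shrinks as more attention mass lands on the injected payload. The function names, the toy vectors, and the negative-log form of the loss are assumptions chosen for illustration. The article does not give the paper's exact objective, and a real attack would optimize input tokens rather than edit key vectors directly.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

def payload_attention_loss(weights, payload_positions):
    """Hypothetical loss: negative log of the attention mass on the payload.

    It approaches 0 as nearly all attention goes to the payload positions.
    """
    mass = sum(weights[i] for i in payload_positions)
    return -math.log(mass)

# Toy input: positions 0-1 hold the trusted instruction, 2-3 the payload.
query = [1.0, 0.0]
keys = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]]
payload_positions = [2, 3]

before = payload_attention_loss(attention_weights(query, keys),
                                payload_positions)

# Stand-in for the optimization step: payload keys now align with the query.
attacked_keys = keys[:2] + [[3.0, 0.0], [3.0, 0.0]]
after = payload_attention_loss(attention_weights(query, attacked_keys),
                               payload_positions)

print(f"loss before: {before:.3f}, loss after: {after:.3f}")
```

In this toy setting the loss falls once the payload keys align with the query, which mirrors the intuition in the article: the more attention the payload captures, the less the legitimate context influences the output.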
A key innovation in ASTRA is its ability to identify which ‘attention heads’ (components within the attention mechanism) are most crucial for influencing the LLM’s output. Instead of time-consuming manual analysis, ASTRA uses a novel ‘sensitivity metric’ based on gradient information. This allows the attack to efficiently pinpoint and target the most influential attention pathways within the LLM.
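The article does not spell out how the sensitivity metric is computed, so the following is only a hypothetical illustration of gradient-based head ranking. It assumes each head's contribution is scaled by a gate fixed at 1.0 and scores a head by the magnitude of the loss gradient with respect to that gate. The toy model, the squared-error loss, and the finite-difference gradient are all stand-ins. A real implementation would use automatic differentiation on the actual LLM.

```python
def model_loss(gates, head_outputs, target):
    """Toy model: output is the gated sum of per-head scalars."""
    output = sum(g * h for g, h in zip(gates, head_outputs))
    return (output - target) ** 2

def head_sensitivities(head_outputs, target, eps=1e-5):
    """Assumed metric: |d loss / d gate_i| at gates = 1, via central differences."""
    n = len(head_outputs)
    sensitivities = []
    for i in range(n):
        up = [1.0] * n
        down = [1.0] * n
        up[i] += eps
        down[i] -= eps
        grad = (model_loss(up, head_outputs, target)
                - model_loss(down, head_outputs, target)) / (2 * eps)
        sensitivities.append(abs(grad))
    return sensitivities

def top_k_heads(sensitivities, k):
    """Indices of the k heads with the largest sensitivity."""
    order = sorted(range(len(sensitivities)),
                   key=lambda i: sensitivities[i], reverse=True)
    return order[:k]

# Hypothetical per-head contributions for a four-head layer.
head_outputs = [0.1, -2.0, 0.5, 1.5]
sens = head_sensitivities(head_outputs, target=3.0)
selected = top_k_heads(sens, k=2)
print("heads to target:", selected)
```

Ranking heads this way replaces manual inspection with a single gradient pass per layer, which is the efficiency argument the article attributes to ASTRA.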
The evaluation of ASTRA against SecAlign and StruQ yielded concerning results. While these defenses previously showed strong resistance to other optimization-based attacks such as Greedy Coordinate Gradient (GCG), ASTRA achieved attack success rates of up to 75% on SecAlign-defended Mistral models and 57% on Llama-3 models. These rates are significantly higher than those of the baseline GCG attack, often with only a modest increase in the number of tokens injected by the attacker.
The study also highlights the importance of considering attacker budget and randomness in evaluating defenses. It found that increasing the number of injected tokens generally leads to higher success rates for both ASTRA and GCG, suggesting that defenses should be evaluated under varying attack scales. The research also points out that randomness in attack initialization can play a crucial role in success.
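One way to act on this recommendation is to report success rates per token budget and across several random seeds. The sketch below shows such a harness. `toy_attack` is a purely simulated placeholder whose success probability grows with budget, not ASTRA or GCG, and the budgets, seed count, and prompt count are arbitrary assumptions. The numbers it produces say nothing about any real defense.

```python
import random

def toy_attack(prompt_id, budget, seed):
    """Simulated placeholder attack; replace with a real attack run."""
    rng = random.Random(f"{prompt_id}-{budget}-{seed}")
    return rng.random() < min(0.9, 0.02 * budget)

def evaluate(attack, prompt_ids, budgets, seeds):
    """Return per-run and any-seed success rates for each token budget."""
    prompt_ids = list(prompt_ids)
    seeds = list(seeds)
    report = {}
    for budget in budgets:
        runs = 0
        wins = 0
        prompts_broken = 0
        for pid in prompt_ids:
            outcomes = [attack(pid, budget, s) for s in seeds]
            runs += len(outcomes)
            wins += sum(outcomes)
            # A prompt counts as broken if any random restart succeeds.
            prompts_broken += any(outcomes)
        report[budget] = {
            "per_run": wins / runs,
            "any_seed": prompts_broken / len(prompt_ids),
        }
    return report

report = evaluate(toy_attack, prompt_ids=range(50),
                  budgets=[5, 10, 20, 40], seeds=range(5))
for budget, rates in report.items():
    print(budget, rates)
```

Reporting both rates separates the effect of budget from the effect of random restarts: the any-seed rate can never be lower than the per-run rate, so a defense that looks strong on single runs may look weaker once restarts are counted.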
While ASTRA demonstrates a powerful new method for breaking LLM defenses, it does come with limitations, primarily its computational cost. Manipulating attention matrices requires significant memory and can be slower than other attack methods. Nevertheless, this research makes fundamental progress in understanding the robustness of prompt injection defenses in a white-box setting, where the attacker has full knowledge of the model’s internal workings.
The findings from this paper, available at arXiv:2507.07417, underscore the ongoing challenge of securing LLMs against sophisticated adversarial attacks. They call for a re-evaluation of how prompt injection defenses are assessed, emphasizing the need for more comprehensive testing against architecture-aware and scalable attack strategies.


