
AGILE: A Two-Stage Framework for Bypassing LLM Safety

TLDR: AGILE is a new two-stage method for jailbreaking large language models (LLMs). It first rephrases malicious queries within a benign conversational context. Then, it uses the LLM’s internal hidden states (activations and attention scores) to subtly edit the rephrased query through synonym substitution and token injection. This approach achieves high attack success rates, transfers well to different LLMs, and remains effective against various defense mechanisms, highlighting current safety alignment limitations.

Large Language Models (LLMs) like GPT and Llama have shown incredible abilities in natural language processing. To ensure these powerful AI models are safe and trustworthy, significant effort has gone into ‘safety alignment,’ which prevents them from generating harmful or unethical content. However, a technique called ‘jailbreaking’ exists to test and expose vulnerabilities in these safety mechanisms. By crafting specific prompts, attackers can bypass safety protocols and elicit undesirable responses from LLMs.

Existing jailbreaking methods often face significant challenges. ‘Token-level’ attacks, which involve adding specific sequences of characters, can produce unreadable inputs and don’t transfer well between different models. On the other hand, ‘prompt-level’ attacks, which involve carefully designed semantic prompts, require a lot of manual effort and don’t scale easily.

A new research paper introduces a novel approach called **Activation-Guided Local Editing (AGILE)**, a two-stage framework designed to overcome these limitations. AGILE combines the strengths of both token-level and prompt-level methods, offering a concise and effective way to bypass LLM safety measures. You can find the full research paper here: Activation-Guided Local Editing for Jailbreaking Attacks.

The AGILE Framework: A Two-Stage Process

AGILE operates in two distinct phases:

The first is the **Generation Phase**. In this stage, a separate ‘generator’ LLM is used to create a multi-turn, scenario-based dialogue. The goal is to build a seemingly benign conversational context around an original malicious query, rephrasing the harmful question to obscure its true intent. This process is efficient because it’s a one-shot operation, decoupled from the target LLM, meaning it doesn’t require iterative feedback from the model being attacked.

The second is the **Editing Phase**. Building on the rephrased query from the first stage, this phase makes subtle adjustments to the text. It uses insights from the target LLM’s internal ‘hidden states’ – specifically, its attention scores and activations – to guide these edits. The aim is to subtly shift the model’s internal understanding of the input from a ‘malicious’ perception to a ‘benign’ one, without drastically changing the meaning of the text.

This editing involves two core operations (a rough code sketch follows the list below):

  • **Synonym Substitution:** AGILE identifies tokens (words or parts of words) that are most likely to trigger the model’s safety mechanisms, based on their attention scores. These high-impact tokens are then replaced with more neutral synonyms. A lightweight classifier helps guide this process, selecting substitutes that minimize the model’s ‘refusal propensity.’
  • **Token Injection:** AGILE also identifies positions in the text where the model pays the least attention. At these ‘low-sensitivity’ points, it stealthily injects new tokens. This is guided by another classifier that helps steer the model’s hidden state towards what it perceives as a ‘benign’ space, further obfuscating the malicious intent.
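
To make the attention-guided selection more concrete, here is a minimal sketch of how token positions can be ranked by the attention they receive in a Hugging Face causal LM. This is not the paper's implementation: the model choice (GPT-2 as a lightweight stand-in), the layer/head averaging scheme, and the placeholder prompt are all assumptions for illustration only.

```python
# Illustrative only: rank token positions by attention received, the kind of
# signal AGILE's editing phase uses to pick substitution and injection points.
# Model, averaging scheme, and prompt are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # lightweight stand-in; the paper targets aligned chat models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "A placeholder rephrased query would go here."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple of per-layer tensors shaped (batch, heads, seq, seq).
# Average over layers and heads, then sum over query positions to get the total
# attention each token position *receives*. (Under causal masking, later tokens
# can be attended to by fewer queries; a real implementation might normalize.)
attn = torch.stack(outputs.attentions).mean(dim=(0, 2))  # (batch, seq, seq)
received = attn.sum(dim=1).squeeze(0)                    # (seq,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
ranked = sorted(zip(tokens, received.tolist()), key=lambda x: x[1], reverse=True)

print("Highest-attention tokens (candidate synonym substitutions):", ranked[:5])
print("Lowest-attention tokens (candidate injection points):", ranked[-5:])
```

In AGILE proper, such rankings are combined with the lightweight hidden-state classifiers described above to choose the actual edits; the attention signal shown here only illustrates the idea of locating 'high-impact' and 'low-sensitivity' positions.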

Performance and Robustness

Extensive experiments show that AGILE achieves state-of-the-art attack success rates, outperforming the strongest baselines by margins of up to 37.74%. It also consistently elicits more potent and explicitly harmful content, as measured by a ‘Harmfulness Score.’

One of AGILE’s standout features is its excellent **transferability**. Prompts optimized on one model (e.g., Llama-3-8B-Instruct) can successfully attack other models, including closed-source LLMs like GPT-4o, Claude-3.5, Gemini-2.0, and DeepSeek-V3, with minimal performance degradation.

The framework is also designed for **efficiency and scalability**. It uses a parallel strategy, processing many attack candidates simultaneously, and avoids computationally expensive gradient calculations, making it well-suited for scenarios with available parallel computing resources.
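
As a rough illustration of what gradient-free, parallel candidate evaluation can look like, the sketch below scores a batch of prompt variants in a single forward pass, using the next-token probability of a refusal-style opener as a crude stand-in for the ‘refusal propensity’ signal mentioned earlier. The model, the placeholder candidates, and the scoring heuristic are assumptions for illustration, not AGILE's actual objective.

```python
# Illustrative only: batched, gradient-free scoring of candidate prompts.
# The refusal-opener heuristic and the model are assumptions, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # lightweight stand-in; the paper evaluates aligned chat models
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

candidates = [
    "Candidate prompt variant A ...",
    "Candidate prompt variant B ...",
    "Candidate prompt variant C ...",
]

batch = tokenizer(candidates, return_tensors="pt", padding=True)
with torch.no_grad():                      # inference only, no gradient computation
    logits = model(**batch).logits         # (batch, seq, vocab)

# Look at the distribution over the *next* token after each candidate and read off
# how much probability mass lands on a refusal-style opener.
last_idx = batch["attention_mask"].sum(dim=1) - 1          # last non-pad position
next_token_logits = logits[torch.arange(len(candidates)), last_idx]
refusal_id = tokenizer.encode(" Sorry")[0]                  # crude proxy token
refusal_prob = next_token_logits.softmax(dim=-1)[:, refusal_id]

for text, score in zip(candidates, refusal_prob.tolist()):
    print(f"refusal-opener probability {score:.4f} for: {text}")
```

Because everything happens in plain forward passes, many candidates can be scored together on parallel hardware, which is the efficiency property the framework leans on.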

Even when faced with defense mechanisms, AGILE demonstrates considerable resilience. While defenses like Llama-Guard and SafeDecoding can reduce its success rate, AGILE often maintains a substantial threat level, still outperforming many undefended baseline attacks.

Key Insights

Ablation studies, which examine the contribution of each component, revealed interesting insights. The ‘Generation Phase’ provides the most significant performance boost. Within this phase, the ‘Adaptive Rephrasing’ of the query is crucial, while the ‘Contextual Scaffolding’ (the preceding dialogue history) was found to be less essential for the attack’s success. This suggests that the semantic structure of the final rephrased prompt is more critical than the conversational path taken to arrive at it.

The research also found that AGILE is highly effective against certain types of malicious queries, particularly those related to ‘cybercrime/intrusion,’ often achieving near 100% success rates. However, it was less effective against ‘misinformation/disinformation’ and ‘harassment/bullying’ queries, possibly because rephrasing these requests tends to produce vaguer wording.

In conclusion, AGILE represents a significant advancement in understanding and exploiting LLM vulnerabilities. By intelligently combining semantic rewriting with activation-guided editing, it provides a powerful tool for red-teaming LLMs, highlighting the limitations of current safety safeguards and offering valuable insights for developing more robust future defenses.

