TLDR: AdaptiveGuard is a novel system designed to protect Large Language Models (LLMs) from new ‘jailbreak’ attacks that bypass traditional safety measures. Unlike static guardrails, AdaptiveGuard uses out-of-distribution (OOD) detection to identify novel attack patterns and then adapts quickly through a continual learning framework based on LoRA. It achieves a 96% F1-score in detecting new attacks, learns to defend against them rapidly (median of 2 updates), and retains over 85% F1-score on in-distribution data with minimal forgetting, all while being computationally efficient. This makes it a promising and practical solution for maintaining LLM safety in dynamic real-world deployments.
Large Language Models (LLMs) are transforming industries from customer service to finance, offering intelligent and flexible interactions far beyond traditional rule-based systems. However, this flexibility introduces a significant challenge: ensuring their safety against malicious inputs, often called ‘jailbreak attacks’. These attacks trick LLMs into generating unsafe or policy-violating responses, posing a critical risk for their real-world deployment.
Current safety mechanisms, known as ‘guardrails’, act as a protective layer, filtering unsafe prompts before they reach the LLM. While some guardrails, like LlamaGuard, report high accuracy against known threats, research shows a major flaw: their performance can plummet dramatically, sometimes to as low as 12%, when faced with new, unseen jailbreak attacks. This highlights a pressing need for guardrails that can adapt dynamically to emerging threats post-deployment.
Introducing AdaptiveGuard: An Evolving Defense for LLMs
To tackle this challenge, researchers have developed ADAPTIVEGUARD, an innovative adaptive guardrail designed to detect novel jailbreak attacks as ‘out-of-distribution’ (OOD) inputs and learn to defend against them through a continual learning framework. This approach is crucial because jailbreak prompts often use unexpected formats or phrasing that differ significantly from the natural language inputs guardrails are typically trained on.
How AdaptiveGuard Works
ADAPTIVEGUARD is built on a lightweight GPT-2 model, which keeps continuous updates inexpensive. Its core mechanism has two main components:
- OOD Detection: ADAPTIVEGUARD uses Mahalanobis distance to identify prompts that deviate from known safe or unsafe patterns. By measuring how far a new input lies from the distributions of the in-distribution data, it can flag novel jailbreak attempts. This OOD awareness is enhanced during training with an auxiliary loss function that encourages clear separation between known and unknown input types.
- Continual Learning with LoRA: Once a novel jailbreak prompt is detected as OOD, ADAPTIVEGUARD triggers a continual learning update. It employs Low-Rank Adaptation (LoRA), a technique that keeps the pretrained weights frozen and trains only a small set of added low-rank parameters. This selective adaptation is key to quickly learning new attack patterns without ‘forgetting’ previously acquired knowledge, a common problem known as catastrophic forgetting in continual learning systems.
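The two components above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the synthetic 8-dimensional features stand in for GPT-2 hidden states, the 95th-percentile threshold is an assumed cut-off, and the names `MahalanobisOOD` and `LoRALinear` are hypothetical. The auxiliary OOD loss and the actual training loop are omitted.

```python
import numpy as np


class MahalanobisOOD:
    """OOD scorer: class-conditional Gaussians with one shared covariance."""

    def fit(self, feats: np.ndarray, labels: np.ndarray) -> "MahalanobisOOD":
        classes = np.unique(labels)
        self.means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
        # Center each sample on its own class mean, then pool the covariance.
        centered = np.concatenate(
            [feats[labels == c] - feats[labels == c].mean(axis=0) for c in classes]
        )
        cov = centered.T @ centered / len(centered)
        cov += 1e-6 * np.eye(feats.shape[1])  # small ridge keeps cov invertible
        self.precision = np.linalg.inv(cov)
        return self

    def score(self, x: np.ndarray) -> float:
        """Squared Mahalanobis distance to the nearest class mean."""
        diff = x[None, :] - self.means
        d2 = np.einsum("cd,de,ce->c", diff, self.precision, diff)
        return float(d2.min())


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, W: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = W  # frozen pretrained weight, shape (out, in)
        self.A = rng.normal(0.0, 0.01, size=(rank, W.shape[1]))  # trainable
        self.B = np.zeros((W.shape[0], rank))  # trainable, zero at init
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self) -> int:
        return self.A.size + self.B.size


# Synthetic stand-ins for embeddings of known safe (0) and unsafe (1) prompts.
rng = np.random.default_rng(0)
dim = 8
safe = rng.normal(loc=0.0, size=(200, dim))
unsafe = rng.normal(loc=4.0, size=(200, dim))
feats = np.concatenate([safe, unsafe])
labels = np.array([0] * 200 + [1] * 200)

detector = MahalanobisOOD().fit(feats, labels)
train_scores = np.array([detector.score(f) for f in feats])
threshold = np.quantile(train_scores, 0.95)  # assumed cut-off, not from the paper

novel_prompt = np.full(dim, 12.0)  # far from both known clusters
is_ood = detector.score(novel_prompt) > threshold  # would trigger a LoRA update
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and an update only touches `A` and `B`. For a 768 x 768 weight with rank 4, that is 6,144 trainable values against 589,824 frozen ones, which is why each update is cheap.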
Key Findings and Performance
The empirical evaluation of ADAPTIVEGUARD yielded impressive results:
- Effective OOD Detection: ADAPTIVEGUARD achieved a 96.1% F1-Score in identifying unknown jailbreak prompts, demonstrating its strong capability to recognize novel threats with high precision and recall.
- Rapid Adaptation: The system proved highly adaptive, reaching optimal Defense Success Rate (DSR) against new attacks within a median of just two update steps. This is significantly faster than LlamaGuard, which required a median of four steps.
- Knowledge Retention: Crucially, ADAPTIVEGUARD retained over 85% F1-score on in-distribution data even after continuous updates, outperforming LlamaGuard’s 80%. This indicates minimal catastrophic forgetting, ensuring the guardrail remains effective against known threats while learning new ones.
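For readers less familiar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The short sketch below shows the calculation. The counts in the usage line are hypothetical and are not taken from the paper's evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 jailbreaks flagged correctly, 5 benign prompts
# flagged by mistake, 5 jailbreaks missed.
example_f1 = f1_score(tp=90, fp=5, fn=5)
```

A high F1-score requires both few false alarms and few missed attacks, so it is a stricter summary than accuracy when unsafe prompts are rare.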
Further analysis showed that ADAPTIVEGUARD is also computationally efficient. Compared to LlamaGuard-1B and LlamaGuard-8B, it achieved 43% and 71% faster training times, delivered 25x and 110x faster inference, and reduced memory usage by 67% and 95% respectively. This makes it a practical solution for resource-constrained environments.
Implications for LLM Safety
The development of ADAPTIVEGUARD marks a significant step towards building more resilient and secure LLM-powered software. By dynamically adapting to emerging jailbreak strategies, it offers a robust post-deployment solution for organizations looking to deploy safer AI systems that can continuously evolve with the threat landscape. The researchers have made ADAPTIVEGUARD and the datasets used in the study publicly available to support further research. The full research paper is titled ‘AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software’.


