TLDR: LLMZ+ is a novel security framework for agentic Large Language Models (LLMs) that shifts from traditional threat detection to a contextual prompt whitelisting approach. Inspired by firewall principles, it only permits contextually appropriate and safe messages to interact with the LLM, effectively preventing prompt injection and jailbreak attacks. The system uses a guard prompt and ingress/egress filters to evaluate messages, assigning a risk score. When combined with larger LLM models (like Llama3.3 70B) and message pre-processing (e.g., length filtering), LLMZ+ demonstrates near-perfect detection rates with zero false positives and false negatives, enhancing the long-term resilience and reducing maintenance overhead for LLM information security.
Agentic AI models are becoming increasingly sophisticated, offering powerful capabilities by interacting with data sources and API tools. However, this enhanced functionality also makes them a prime target for attackers. Unlike traditional software, agentic Large Language Models (LLMs) rely on non-deterministic behavior, defining a final goal but leaving the path selection to the LLM itself. This characteristic introduces significant security risks, particularly from ‘jailbreak’ attacks like prompt injection.
Traditional security mechanisms for LLMs primarily focus on detecting malicious intent and preventing it from reaching the agent. These detection-based approaches often rely on predefined signatures and heuristics, similar to anti-malware products. While effective to a degree, they require constant updates to their definition databases to counter new attack techniques, leading to ongoing maintenance costs and the risk of ‘failing silently’ if updates are delayed or incomplete.
A new approach, called LLMZ+, offers an alternative by moving beyond traditional detection. Inspired by the robust security practices of perimeter firewalls, LLMZ+ implements a ‘prompt whitelisting’ mechanism. Instead of trying to identify and block every possible malicious input, LLMZ+ operates on the principle of allowing only contextually appropriate and safe messages to interact with the agentic LLM, blocking everything else by default. This method ensures that all exchanges between external users and the LLM conform to predefined use cases and operational boundaries.
How LLMZ+ Works
LLMZ+ introduces a conceptual security boundary for agentic LLMs, drawing inspiration from the Demilitarized Zone (DMZ) architecture in network security. It leverages an auxiliary LLM, functioning as a ‘whitelist guard,’ for both incoming prompts from external users (ingress) and outgoing replies from the agentic LLM (egress).
The Ingress filter verifies that messages from external users are fully interpretable by the Guard Prompt, consistent with a natural customer-service conversation, and relevant to the business case served by the Agentic LLM.
The Egress filter ensures that outbound messages also remain consistent with the intended business use case. This can be enhanced with a simple contextual Retrieval-Augmented Generation (RAG) to inform the guard LLM about permitted data categories, or simpler regex-based filters to block sensitive information like Social Security Numbers.
Messages that do not satisfy these criteria are blocked, effectively preventing the exploitation of the agentic LLM through prompt-based attacks. This solution is specifically designed to address prompting threats and complements, rather than replaces, a comprehensive information security architecture.
Deployment and Evaluation
The LLMZ+ framework is particularly suited for agentic LLMs deployed in specific business contexts, such as customer support, payment facilitation, or product selection. These models often require privileged access to confidential information or APIs, making their security critical. The solution is not intended for generic, all-purpose agents that lack such access.
In an experimental setup involving a commercial fintech chatbot, LLMZ+ was evaluated using Llama3.1 and Llama3.3 models. The primary objective was to minimize false positive rates (legitimate messages incorrectly flagged) and false negative rates (malicious messages allowed to pass). The system assigns a risk score between 0 and 10 to each message, allowing administrators to set a blocking threshold.
Results showed that while smaller models like Llama3.1 8B had an optimal range for balancing detection, transitioning to larger models like Llama3.3 70B significantly improved performance, reducing the false positive rate to zero. When combined with simple message pre-processing, such as imposing a maximum message length (as prompt injection techniques often require lengthy instructions), LLMZ+ achieved ideal performance with both false positive and false negative rates of zero across all tested threshold values.
Also Read:
- Securing LLMs: AdaptiveGuard’s Dynamic Defense Against Evolving Jailbreak Attacks
- Enhancing Privacy in LLM Agents with Contextual Integrity
Practical Considerations
For real-world deployments, performance is key. While larger models offer superior detection, their execution times can be prohibitive in resource-constrained settings. Practical considerations include:
- False Positive Overrides: Many false positives from smaller models can be addressed by simple non-LLM filters for common sensitive data types (e.g., SSNs, dates, addresses).
- Message Pre-processing: Limiting message length can efficiently block the vast majority of prompt injection attacks.
- Parallel Execution: For critical response times, the Guard prompt and Agentic prompt can run simultaneously, with the agentic response withheld until LLMZ+ makes a decision. This requires more resources but can improve overall latency.
- Guard Model Selection: The choice of guard model should align with deployment scenarios. Smaller models like Llama3.1 8B, when fine-tuned with pre-processing, can be suitable for real-time applications where latency is paramount.
LLMZ+ represents a significant advancement in securing agentic AI systems by offering a dynamic, context-aware whitelisting approach that is resilient against evolving prompt injection attacks without requiring constant retraining. For more in-depth information, you can refer to the full research paper here.


