TLDR: CourtGuard is a new, locally-runnable, multiagent LLM system designed to classify prompt injection attacks. It uses a “defense attorney,” “prosecution attorney,” and “judge” model to evaluate prompts. While generally less effective at detecting prompt injections than a simpler “Direct Detector,” CourtGuard achieves a lower false positive rate, meaning it’s better at correctly identifying benign prompts. This multiagent approach emphasizes considering both adversarial and benign scenarios, showing promise for local prompt injection defense in data-sensitive applications.
Large language models (LLMs) are becoming central to many applications, including those handling sensitive information like medical data, financial records, and intellectual property. However, this widespread integration also brings a significant risk: prompt injection attacks. These attacks manipulate LLMs into performing harmful actions, such as leaking confidential data, spreading misinformation, or behaving in unintended ways.
Despite ongoing research and various enterprise solutions, effectively defending against prompt injection remains a challenge. Existing defenses often fall short, with even leading solutions like Lakera Guard showing vulnerabilities. This highlights a critical need for robust and accessible defense mechanisms, especially for organizations that need to process sensitive data locally.
Introducing CourtGuard: A Novel Approach to Prompt Injection Defense
To address this gap, researchers Isaac Wu and Michael Maslowski have proposed CourtGuard, a unique, locally-runnable, multiagent prompt injection classifier. Unlike many enterprise solutions that might be costly or require external services, CourtGuard is designed to be easily implemented and deployed internally, using local LLMs.
The core idea behind CourtGuard is a court-like system. When a prompt is submitted, it’s evaluated by three distinct LLM “agents”:
Defense Attorney Model: This agent argues that the prompt is harmless and not a prompt injection.
Prosecution Attorney Model: This agent argues that the prompt is indeed a prompt injection.
Judge Model: After considering both the defense and prosecution arguments, this agent delivers the final verdict on whether the prompt is a prompt injection.
This multiagent framework allows CourtGuard to thoroughly examine prompts by considering both benign and adversarial interpretations, aiming for a more balanced and reasoned classification.
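The three-agent flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt templates, the `court_guard` and `parse_verdict` names, and the INJECTION/BENIGN output convention are all hypothetical stand-ins for whatever local model and prompts a deployment actually uses.

```python
from typing import Callable, Dict, Union

# Hypothetical interface: any callable that sends text to a local model
# and returns its reply as a string.
LLM = Callable[[str], str]

# Illustrative templates only; the authors' actual prompts are not reproduced here.
DEFENSE_TEMPLATE = (
    "You are a defense attorney. Argue that the prompt below is benign "
    "and is not a prompt injection.\n\nPrompt:\n{prompt}"
)
PROSECUTION_TEMPLATE = (
    "You are a prosecution attorney. Argue that the prompt below is a "
    "prompt injection.\n\nPrompt:\n{prompt}"
)
JUDGE_TEMPLATE = (
    "You are a judge. Weigh both arguments, then end your reply with a "
    "final line containing only INJECTION or BENIGN.\n\n"
    "Prompt:\n{prompt}\n\nDefense:\n{defense}\n\nProsecution:\n{prosecution}"
)


def parse_verdict(text: str) -> bool:
    """Return True for INJECTION, False for BENIGN, based on the last non-empty line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    last = lines[-1].upper().strip(".!*` ") if lines else ""
    if last == "INJECTION":
        return True
    if last == "BENIGN":
        return False
    # Assumed policy: surface unparseable output instead of guessing a label.
    raise ValueError(f"unparseable verdict: {text!r}")


def court_guard(prompt: str, llm: LLM) -> Dict[str, Union[str, bool]]:
    """Collect both arguments, then ask the judge for a verdict."""
    defense = llm(DEFENSE_TEMPLATE.format(prompt=prompt))
    prosecution = llm(PROSECUTION_TEMPLATE.format(prompt=prompt))
    ruling = llm(
        JUDGE_TEMPLATE.format(
            prompt=prompt, defense=defense, prosecution=prosecution
        )
    )
    return {
        "defense": defense,
        "prosecution": prosecution,
        "is_injection": parse_verdict(ruling),
    }
```

In this sketch one model plays all three roles through different instructions. Passing a separate callable per role would work the same way if a deployment assigns a different local model to each agent.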
How CourtGuard Stacks Up Against Other Defenses
The researchers evaluated CourtGuard against a “Direct Detector,” which is essentially a single LLM acting as a judge, directly classifying prompts. They used various datasets, including LLMail-Inject for real-world attacks and NotInject for benign prompts containing trigger words.
A key finding was that CourtGuard demonstrated a lower false positive rate than the Direct Detector. This means it was better at correctly identifying benign prompts, reducing the chances of legitimate user inputs being wrongly flagged as malicious, which matters for user experience and application usability. However, the Direct Detector generally proved to be the better overall prompt injection detector, particularly at identifying actual attacks from the LLMail-Inject dataset. The exception was the Phi model, where CourtGuard performed better.
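To make the two metrics in this comparison concrete, here is a small sketch of how a true positive rate and a false positive rate are computed from labels and predictions. The function name `detection_rates` and the toy data are made up for illustration and are not results from the paper.

```python
from typing import Sequence, Tuple


def detection_rates(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (true_positive_rate, false_positive_rate); True means 'injection'."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # attacks caught
    fn = sum(1 for t, p in pairs if t and not p)      # attacks missed
    fp = sum(1 for t, p in pairs if not t and p)      # benign prompts flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # benign prompts passed
    # A rate is undefined when its denominator is zero, e.g. the true
    # positive rate on a dataset that contains only benign prompts.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Toy example: 4 attacks followed by 4 benign prompts (fabricated for illustration).
labels = [True, True, True, True, False, False, False, False]
preds = [True, True, True, False, True, False, False, False]
print(detection_rates(labels, preds))  # (0.75, 0.25)
```

A detector can trade one rate against the other: flagging fewer prompts lowers the false positive rate but tends to lower the true positive rate as well, which is the pattern the comparison above describes.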
When compared to other solutions on the NotInject benchmark, CourtGuard (using Llama or Phi models) performed very well, even surpassing some well-known enterprise solutions like Meta’s PromptGuard and Lakera Guard. Only Meta’s LlamaGuard3 showed superior performance on this specific benchmark. On the Qualifire Prompt Injection Benchmark, enterprise solutions like Qualifire’s Sentinel model generally outperformed CourtGuard, but these are often larger, more complex systems with significantly lower inference times.
The Importance of Deliberation
The qualitative analysis revealed a significant difference in how CourtGuard and the Direct Detector operate. The Direct Detector often seemed to “assume” a classification early in its reasoning process, potentially relying on implicit knowledge from its training. In contrast, CourtGuard’s judge model, by design, explicitly considers both the defense and prosecution arguments before making a decision. This forced deliberation helps reduce false positives but might also explain why it sometimes has a lower true positive rate for prompt injections, as it has to articulate its reasoning rather than relying on hidden encodings.
Looking Ahead
While CourtGuard represents a promising step forward for local, multiagent prompt injection defense, it has limitations. Inference time for local LLMs can be several seconds, which the researchers hope to reduce through parallelization, smaller models, and quantization. Additionally, the current evaluation focused on static, single-turn prompt injection attacks, and future work will need to test its robustness against multi-turn conversations and adaptive attacks in real-world scenarios.
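One possible form of the parallelization mentioned above: the defense and prosecution arguments do not depend on each other, so they can be requested concurrently before the judge runs. The sketch below is an assumption about how that might look, not the researchers' plan. `gather_arguments` and the two agent callables are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

# Hypothetical agent interface: takes the user prompt, returns an argument.
Agent = Callable[[str], str]


def gather_arguments(
    prompt: str, defense_agent: Agent, prosecution_agent: Agent
) -> Tuple[str, str]:
    """Run both attorney agents concurrently and return (defense, prosecution)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        defense_future = pool.submit(defense_agent, prompt)
        prosecution_future = pool.submit(prosecution_agent, prompt)
        # result() blocks until each agent finishes and re-raises its exceptions.
        return defense_future.result(), prosecution_future.result()
```

Threads help here only if the model calls release the interpreter while waiting, as HTTP requests to a local inference server do. Whether the two requests actually overlap on the hardware depends on the serving backend, so any speedup should be measured rather than assumed.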
Nevertheless, CourtGuard highlights the potential of multiagent systems and local LLMs in creating effective prompt injection defenses, especially for applications handling sensitive data where latency is not the absolute highest priority. Developers in such data-sensitive enterprises should consider these evolving, locally runnable approaches for enhanced security. You can find more details about CourtGuard and its implementation in the research paper.


