TLDR: GUARD is a novel testing method that operationalizes high-level government AI ethics guidelines into specific, actionable questions to assess LLM compliance. It uses adaptive role-playing LLMs to generate guideline-violating questions and integrates ‘jailbreak diagnostics’ (GUARD-JD) to uncover scenarios where LLMs might bypass safety mechanisms. Validated across various LLMs and even vision-language models, GUARD provides a comprehensive approach to identifying and reporting non-compliance, contributing to the development of safer AI applications.
Large Language Models (LLMs) are becoming increasingly important across many fields, but their ability to produce harmful content is a growing concern for society and regulators. Governments have responded by issuing ethical guidelines to foster the development of trustworthy AI. However, these guidelines are often high-level, making it difficult for developers and testers to translate them into practical testing questions to ensure LLM compliance.
To tackle this challenge, researchers have introduced GUARD, which stands for Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics. This testing method is designed to transform abstract guidelines into specific questions that can reveal whether an LLM violates these standards. GUARD achieves this by automatically generating guideline-violating questions from government-issued ethics guidelines and then checking whether the LLM’s responses comply.
When an LLM directly violates a guideline, GUARD flags these inconsistencies. More subtly, for responses that don’t immediately appear to violate guidelines, GUARD incorporates a technique called “jailbreak diagnostics,” known as GUARD-JD. This creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential ways to bypass the LLM’s built-in safety mechanisms. The entire process culminates in a comprehensive compliance report, detailing the extent of adherence and highlighting any violations.
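To make that flow concrete, here is a minimal sketch of what such a compliance-testing loop could look like in Python. The helper functions (`query_llm`, `violates_guideline`, `run_jailbreak_diagnostics`) and the report layout are illustrative assumptions, not the authors’ implementation:

```python
# Illustrative sketch of GUARD's overall testing flow (not the authors' code).
# All helper functions here are hypothetical placeholders for the components
# described above.

def query_llm(llm, question: str) -> str:
    """Placeholder: send a generated test question to the model under test."""
    return llm(question)

def violates_guideline(answer: str, guideline: str) -> bool:
    """Placeholder: in practice a judge model or rubric would decide this."""
    raise NotImplementedError

def run_jailbreak_diagnostics(llm, question: str, guideline: str):
    """Placeholder for GUARD-JD (sketched later in the article)."""
    raise NotImplementedError

def compliance_report(llm, test_cases):
    """test_cases: iterable of (question, guideline) pairs."""
    report = {"violations": [], "jailbreak_findings": [], "compliant": []}
    for question, guideline in test_cases:
        answer = query_llm(llm, question)
        if violates_guideline(answer, guideline):
            # Direct violation: flag the inconsistency.
            report["violations"].append((question, answer))
        else:
            # Looks compliant on the surface: probe with jailbreak diagnostics.
            finding = run_jailbreak_diagnostics(llm, question, guideline)
            if finding is not None:
                report["jailbreak_findings"].append(finding)
            else:
                report["compliant"].append(question)
    return report
```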
The effectiveness of GUARD has been rigorously tested on eight different LLMs: Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7. These tests were conducted under three government-issued guidelines, alongside jailbreak diagnostics. Impressively, GUARD-JD can also extend its jailbreak diagnostics to vision-language models such as MiniGPT-v2 and Gemini-1.5, showcasing its versatility in enhancing the reliability of LLM-based applications across different modalities.
The methodology of GUARD involves two main stages. The first stage focuses on generating guideline-violating questions. This is done using a team of LLMs that adapt to various roles: an Analyst extracts key features from guidelines, a Strategic Committee maps these features to domains and scenarios, a Question Designer converts scenarios into test questions, and a Question Reviewer evaluates these questions for harmfulness, information density, and compliance. If a question doesn’t meet the criteria, it’s refined iteratively.
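A rough sketch of this role-play pipeline is shown below. The role prompts and the `chat` helper are assumptions made for illustration, not the prompts used in the paper:

```python
# Minimal sketch of the four-role question-generation loop (assumed prompts).

def chat(role_prompt: str, content: str) -> str:
    """Placeholder: one call to a role-playing LLM via your API client of choice."""
    raise NotImplementedError

def generate_test_question(guideline: str, max_rounds: int = 5) -> str:
    # Analyst: extract key features from the guideline.
    features = chat("You are an Analyst.", f"Extract key features from: {guideline}")
    # Strategic Committee: map features to concrete domains and scenarios.
    scenario = chat("You are a Strategic Committee.",
                    f"Map these features to domains and scenarios: {features}")
    # Question Designer: turn a scenario into a test question.
    question = chat("You are a Question Designer.",
                    f"Convert this scenario into a test question: {scenario}")
    # Question Reviewer: score harmfulness, information density, and compliance,
    # requesting revisions until the criteria are met.
    for _ in range(max_rounds):
        review = chat("You are a Question Reviewer.",
                      "Rate harmfulness, information density, and compliance; "
                      f"reply ACCEPT or give revision advice: {question}")
        if review.strip().startswith("ACCEPT"):
            break
        question = chat("You are a Question Designer.",
                        f"Revise the question using this advice: {review}\n{question}")
    return question
```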
The second stage, jailbreak diagnostics, comes into play when an LLM gives a guideline-adhering answer to a question. This stage uses the concept of “jailbreaks” to create challenging scenarios that might cause the LLM to fail. A Generator reorganizes jailbreak fragments into coherent “playing scenarios”; an Evaluator calculates the semantic similarity between the LLM’s response and a desired guideline-adhering answer; and an Optimizer advises the Generator on how to minimize this similarity score, pushing the LLM toward a guideline-violating response. This iterative process continues until a successful jailbreak is achieved or the maximum number of iterations is reached. For more in-depth information, see the full research paper.
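Below is a hedged sketch of that loop. The similarity model (sentence-transformers), the stopping threshold, and the `generator`/`optimizer` helpers are assumptions for illustration; the paper’s Evaluator and stopping criteria may differ:

```python
# Sketch of the GUARD-JD Generator / Evaluator / Optimizer loop (assumptions noted).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed similarity model

def similarity(a: str, b: str) -> float:
    """Evaluator: cosine similarity between two sentence embeddings."""
    emb = embedder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def generator(fragments, question, advice) -> str:
    """Placeholder: an LLM that weaves jailbreak fragments into a playing scenario."""
    raise NotImplementedError

def optimizer(scenario, response, score) -> str:
    """Placeholder: an LLM that advises how to lower the similarity score."""
    raise NotImplementedError

def jailbreak_diagnostics(target_llm, question, reference_answer,
                          fragments, threshold=0.3, max_iters=10):
    advice = ""
    for _ in range(max_iters):
        scenario = generator(fragments, question, advice)      # Generator
        response = target_llm(scenario)                        # model under test
        score = similarity(response, reference_answer)         # Evaluator
        if score < threshold:
            # The response has drifted far from the guideline-adhering answer:
            # treat this as a successful jailbreak and report it.
            return scenario, response
        advice = optimizer(scenario, response, score)          # Optimizer
    return None  # no successful jailbreak within the iteration budget
```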
Experiments showed that models like Vicuna-13B had higher guideline violation rates, particularly in areas like Human Rights and Societal Risks, while GPT-4 demonstrated lower violation rates, indicating better adherence. In jailbreak diagnostics, GUARD-JD consistently outperformed other baseline methods, achieving higher success rates and lower perplexity scores, meaning its generated jailbreaks were more effective and natural-sounding. This suggests that GUARD-JD’s approach of iteratively generating natural language scenarios is more robust than methods relying on character-optimized patterns.
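Perplexity under a reference language model is a common proxy for how natural a prompt reads; here is a quick sketch of that measurement, assuming GPT-2 via the Hugging Face transformers library (the paper’s exact setup may differ):

```python
# Measuring prompt naturalness as perplexity under GPT-2 (an assumed reference model).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

# Lower perplexity indicates more natural-sounding text; fluent jailbreak
# scenarios typically score far lower than character-optimized adversarial strings.
```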
The research also included a human validation study, where participants confirmed that GUARD-generated questions accurately represented ethical violations. This validation reinforces the method’s effectiveness in creating relevant test cases. Furthermore, ablation studies confirmed the critical contribution of each role within GUARD, particularly the Question Reviewer in question generation and the Generator and Optimizer in jailbreak diagnostics.
In conclusion, GUARD offers a formalized approach to compliance testing for LLMs, translating abstract government guidelines into actionable tests. By employing adaptive role-playing LLMs for question generation and sophisticated jailbreak diagnostics, GUARD effectively identifies vulnerabilities and assesses LLM adherence to ethical standards. This work provides valuable insights for developing safer and more reliable LLM-based applications across various domains, including vision-language models.


