TLDR: GUARD is a novel testing method that operationalizes high-level government AI ethics guidelines into specific, actionable questions to assess LLM compliance. It uses adaptive role-playing LLMs to generate guideline-violating questions and integrates ‘jailbreak diagnostics’ (GUARD-JD) to uncover scenarios where LLMs might bypass safety mechanisms. Validated across various LLMs and even vision-language models, GUARD provides a comprehensive approach to identifying and reporting non-compliance, contributing to the development of safer AI applications.
Large Language Models (LLMs) are becoming increasingly important across many fields, but their ability to produce harmful content is a growing concern for society and regulators. Governments have responded by issuing ethical guidelines to foster the development of trustworthy AI. However, these guidelines are often high-level, making it difficult for developers and testers to translate them into practical testing questions to ensure LLM compliance.
To tackle this challenge, researchers have introduced GUARD, which stands for Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics. This testing method is designed to transform abstract guidelines into specific questions that can reveal whether an LLM violates these standards. GUARD achieves this by automatically generating guideline-violating questions from government-issued ethics guidelines and then checking whether the LLM’s responses comply.
When an LLM directly violates a guideline, GUARD flags these inconsistencies. More subtly, for responses that don’t immediately appear to violate guidelines, GUARD incorporates a technique called “jailbreak diagnostics,” known as GUARD-JD. This creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential ways to bypass the LLM’s built-in safety mechanisms. The entire process culminates in a comprehensive compliance report, detailing the extent of adherence and highlighting any violations.
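To make that flow concrete, here is a minimal sketch of what such a compliance-testing loop could look like in Python. The helper functions (`query_llm`, `violates_guideline`, `run_jailbreak_diagnostics`) and the report layout are illustrative assumptions, not the authors’ implementation:

```python
# Illustrative sketch of GUARD's overall testing flow (not the authors' code).
# All helper functions here are hypothetical placeholders for the components
# described above.

def query_llm(llm, question: str) -> str:
    """Placeholder: send a generated test question to the model under test."""
    return llm(question)

def violates_guideline(answer: str, guideline: str) -> bool:
    """Placeholder: in practice a judge model or rubric would decide this."""
    raise NotImplementedError

def run_jailbreak_diagnostics(llm, question: str, guideline: str):
    """Placeholder for GUARD-JD (sketched later in the article)."""
    raise NotImplementedError

def compliance_report(llm, test_cases):
    """test_cases: iterable of (question, guideline) pairs."""
    report = {"violations": [], "jailbreak_findings": [], "compliant": []}
    for question, guideline in test_cases:
        answer = query_llm(llm, question)
        if violates_guideline(answer, guideline):
            # Direct violation: flag the inconsistency.
            report["violations"].append((question, answer))
        else:
            # Looks compliant on the surface: probe with jailbreak diagnostics.
            finding = run_jailbreak_diagnostics(llm, question, guideline)
            if finding is not None:
                report["jailbreak_findings"].append(finding)
            else:
                report["compliant"].append(question)
    return report
```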
The effectiveness of GUARD has been rigorously tested on eight different LLMs: Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7. These tests were conducted under three government-issued guidelines, alongside jailbreak diagnostics. Impressively, GUARD-JD can also extend its jailbreak diagnostics to vision-language models such as MiniGPT-v2 and Gemini-1.5, showcasing its versatility in enhancing the reliability of LLM-based applications across different modalities.
The methodology of GUARD involves two main stages. The first stage focuses on generating guideline-violating questions. This is done using a team of LLMs that adapt to various roles: an Analyst extracts key features from guidelines, a Strategic Committee maps these features to domains and scenarios, a Question Designer converts scenarios into test questions, and a Question Reviewer evaluates these questions for harmfulness, information density, and compliance. If a question doesn’t meet the criteria, it’s refined iteratively.
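A rough sketch of this role-play pipeline is shown below. The role prompts and the `chat` helper are assumptions made for illustration, not the prompts used in the paper:

```python
# Minimal sketch of the four-role question-generation loop (assumed prompts).

def chat(role_prompt: str, content: str) -> str:
    """Placeholder: one call to a role-playing LLM via your API client of choice."""
    raise NotImplementedError

def generate_test_question(guideline: str, max_rounds: int = 5) -> str:
    # Analyst: extract key features from the guideline.
    features = chat("You are an Analyst.", f"Extract key features from: {guideline}")
    # Strategic Committee: map features to concrete domains and scenarios.
    scenario = chat("You are a Strategic Committee.",
                    f"Map these features to domains and scenarios: {features}")
    # Question Designer: turn a scenario into a test question.
    question = chat("You are a Question Designer.",
                    f"Convert this scenario into a test question: {scenario}")
    # Question Reviewer: score harmfulness, information density, and compliance,
    # requesting revisions until the criteria are met.
    for _ in range(max_rounds):
        review = chat("You are a Question Reviewer.",
                      "Rate harmfulness, information density, and compliance; "
                      f"reply ACCEPT or give revision advice: {question}")
        if review.strip().startswith("ACCEPT"):
            break
        question = chat("You are a Question Designer.",
                        f"Revise the question using this advice: {review}\n{question}")
    return question
```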
The second stage, jailbreak diagnostics, comes into play when an LLM gives a guideline-adhering answer to a question. This stage uses the concept of “jailbreaks” to create challenging scenarios that might cause the LLM to fail. A Generator reorganizes jailbreak fragments into coherent “playing scenarios”; an Evaluator calculates the semantic similarity between the LLM’s response and a desired guideline-adhering answer; and an Optimizer advises the Generator on how to minimize this similarity score, pushing the LLM toward a guideline-violating response. This iterative process continues until a successful jailbreak is achieved or the maximum number of iterations is reached. For more in-depth information, see the full research paper.
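Below is a hedged sketch of that loop. The similarity model (sentence-transformers), the stopping threshold, and the `generator`/`optimizer` helpers are assumptions for illustration; the paper’s Evaluator and stopping criteria may differ:

```python
# Sketch of the GUARD-JD Generator / Evaluator / Optimizer loop (assumptions noted).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed similarity model

def similarity(a: str, b: str) -> float:
    """Evaluator: cosine similarity between two sentence embeddings."""
    emb = embedder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def generator(fragments, question, advice) -> str:
    """Placeholder: an LLM that weaves jailbreak fragments into a playing scenario."""
    raise NotImplementedError

def optimizer(scenario, response, score) -> str:
    """Placeholder: an LLM that advises how to lower the similarity score."""
    raise NotImplementedError

def jailbreak_diagnostics(target_llm, question, reference_answer,
                          fragments, threshold=0.3, max_iters=10):
    advice = ""
    for _ in range(max_iters):
        scenario = generator(fragments, question, advice)      # Generator
        response = target_llm(scenario)                        # model under test
        score = similarity(response, reference_answer)         # Evaluator
        if score < threshold:
            # The response has drifted far from the guideline-adhering answer:
            # treat this as a successful jailbreak and report it.
            return scenario, response
        advice = optimizer(scenario, response, score)          # Optimizer
    return None  # no successful jailbreak within the iteration budget
```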
Experiments showed that models like Vicuna-13B had higher guideline violation rates, particularly in areas like Human Rights and Societal Risks, while GPT-4 demonstrated lower violation rates, indicating better adherence. In jailbreak diagnostics, GUARD-JD consistently outperformed other baseline methods, achieving higher success rates and lower perplexity scores, meaning its generated jailbreaks were more effective and natural-sounding. This suggests that GUARD-JD’s approach of iteratively generating natural language scenarios is more robust than methods relying on character-optimized patterns.
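Perplexity under a reference language model is a common proxy for how natural a prompt reads; here is a quick sketch of that measurement, assuming GPT-2 via the Hugging Face transformers library (the paper’s exact setup may differ):

```python
# Measuring prompt naturalness as perplexity under GPT-2 (an assumed reference model).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

# Lower perplexity indicates more natural-sounding text; fluent jailbreak
# scenarios typically score far lower than character-optimized adversarial strings.
```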
The research also included a human validation study, where participants confirmed that GUARD-generated questions accurately represented ethical violations. This validation reinforces the method’s effectiveness in creating relevant test cases. Furthermore, ablation studies confirmed the critical contribution of each role within GUARD, particularly the Question Reviewer in question generation and the Generator and Optimizer in jailbreak diagnostics.
In conclusion, GUARD offers a formalized approach to compliance testing for LLMs, translating abstract government guidelines into actionable tests. By employing adaptive role-playing LLMs for question generation and sophisticated jailbreak diagnostics, GUARD effectively identifies vulnerabilities and assesses LLM adherence to ethical standards. This work provides valuable insights for developing safer and more reliable LLM-based applications across various domains, including vision-language models.


