
ORFuzz: A New Approach to Uncover Over-Refusal in Large Language Models

TLDR: ORFuzz is a novel evolutionary testing framework designed to detect ‘over-refusal’ in Large Language Models (LLMs), where models erroneously reject benign queries due to overly cautious safety measures. It integrates safety category-aware seed selection, adaptive mutator optimization, and a human-aligned judge model (OR-JUDGE) to generate effective test cases. ORFuzz significantly outperforms existing methods in detecting over-refusal and has led to the creation of ORFuzzSET, a new benchmark dataset of 1,855 transferable test cases that effectively trigger over-refusal across various LLMs.

Large Language Models (LLMs) are becoming increasingly common in critical applications, from healthcare to legal systems. While developers implement safety measures to prevent harmful content, an unintended side effect has emerged: ‘over-refusal’. This occurs when an LLM incorrectly rejects a harmless or benign query because its safety mechanisms are too cautious. For example, a coding assistant might refuse to explain ‘how to kill a Python process’ due to the word ‘kill’, even though the intent is purely technical and harmless. This over-cautiousness is a significant functional flaw that undermines the reliability and usability of these powerful AI systems.

Current methods for testing this over-refusal behavior are often inadequate. Existing benchmarks, which are static collections of prompts, frequently contain queries that humans actually perceive as harmful, making them unreliable for truly testing over-refusal. Furthermore, the process of manually creating test cases is slow, inconsistent, and doesn’t cover enough scenarios. This highlights a critical need for a more dynamic and automated approach to uncover these vulnerabilities.

Introducing ORFuzz: A Novel Testing Framework

To address this gap, researchers have introduced ORFuzz, a pioneering evolutionary testing framework designed specifically for systematically detecting and analyzing LLM over-refusals. ORFuzz operates through a unique integration of three core components, working together in an iterative feedback loop to find instances where LLMs erroneously reject benign prompts.
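The overall iteration can be pictured as a simple evolutionary loop. The sketch below is an illustrative reading of that loop, not the paper's actual API: all function names (`orfuzz_loop`, `select_seed`, `mutate`, `target_llm`, `judge`) are placeholders for the components described in the sections that follow.

```python
import random

def orfuzz_loop(seed_pool, select_seed, mutate, target_llm, judge, iterations=100):
    """Illustrative sketch of ORFuzz's feedback loop: pick a seed, mutate it,
    query the model under test, and let the judge decide whether the model
    over-refused a benign query. Successful test cases re-enter the pool."""
    findings = []
    for _ in range(iterations):
        seed = select_seed(seed_pool)      # safety category-aware selection
        candidate = mutate(seed)           # adaptive mutation
        response = target_llm(candidate)   # query the target LLM
        if judge(candidate, response):     # human-aligned over-refusal check
            findings.append((candidate, response))
            seed_pool.append(candidate)    # evolve: keep effective cases
    return findings
```

Each component plugged into this loop is described in more detail below.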

The first component is Safety Category-Aware Seed Selection. Instead of random sampling, ORFuzz classifies queries into eight safety-relevant categories, such as ‘Crimes and Illegal Activities’ or ‘Ethics and Morality’. It then uses an advanced exploration algorithm to select diverse and representative starting queries (seeds) from these categories. This ensures that the testing covers a wide range of potential over-refusal scenarios, recognizing that different LLMs might be sensitive to different types of content.
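As a rough sketch of the idea (the paper's actual exploration algorithm is more sophisticated), category-aware selection can be approximated by always drawing from the least-explored category so that coverage stays balanced across all eight categories. The function name and data layout here are illustrative assumptions:

```python
import random

def select_seed(seeds_by_category, selection_counts):
    """Toy stand-in for category-aware seed selection: favour the safety
    category that has been sampled least so far (exploration), then draw a
    random seed from it. Returns the chosen category and seed."""
    category = min(seeds_by_category, key=lambda c: selection_counts[c])
    selection_counts[category] += 1
    return category, random.choice(seeds_by_category[category])
```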

The second key component is Adaptive Mutator Optimization. Mutators are tools that transform existing queries into new test cases. ORFuzz employs three types of specialized mutators: General (for linguistic variations), Sensitive Word (for introducing or replacing sensitive terms), and Scenario/Task (for changing context or intent). What makes this adaptive is an ‘analyze-generate-feedback’ loop, powered by reasoning LLMs. This loop dynamically refines the mutator prompts, ensuring that the generated test cases become progressively more effective at triggering over-refusal behaviors.
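A minimal sketch of the three mutator families, assuming simple string operations in place of the LLM-driven rewriting the paper actually uses, might look like this. The reinforcement step below is a crude analogue of the analyze-generate-feedback loop (which in ORFuzz refines the mutator *prompts* themselves); all names are hypothetical:

```python
import random

# Illustrative stand-ins for the three mutator families; the real system
# rewrites queries with reasoning LLMs rather than fixed string edits.
MUTATORS = {
    "general": lambda q: q + " Please explain step by step.",  # linguistic variation
    "sensitive_word": lambda q: q.replace("stop", "kill"),     # sensitive-term swap
    "scenario": lambda q: "For a systems course: " + q,        # context/intent change
}

def pick_mutator(weights):
    """Sample a mutator family in proportion to its current weight."""
    names = list(MUTATORS)
    return random.choices(names, weights=[weights[n] for n in names])[0]

def record_feedback(weights, name, triggered):
    """Reinforce mutators whose outputs triggered over-refusal, so effective
    mutation strategies are chosen more often in later iterations."""
    if triggered:
        weights[name] += 1.0
```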

Finally, OR-JUDGE serves as a human-aligned judge model for validating test outcomes. This configurable evaluator was fine-tuned using thousands of human-labeled query-response pairs, allowing it to accurately assess both the toxicity of content and the rationality of a model’s refusal. OR-JUDGE acts as a sophisticated oracle, reliably determining whether a detected refusal truly constitutes an over-refusal from a human perspective.
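Conceptually, the oracle's decision reduces to two checks: is the query benign, and did the model refuse it? The toy stand-in below uses keyword matching where OR-JUDGE uses a fine-tuned LLM to score toxicity and refusal rationality; the marker list and function name are illustrative assumptions:

```python
# Toy stand-in for OR-JUDGE's decision logic. The real judge is a fine-tuned
# LLM; here a keyword match flags a refusal, and an over-refusal is defined
# as a refusal of a query already known to be benign.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_over_refusal(query_is_benign, response):
    refused = any(marker in response.lower() for marker in REFUSAL_MARKERS)
    return query_is_benign and refused
```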


Key Findings and Impact

Extensive evaluations demonstrate ORFuzz’s superior performance. It generates diverse and validated over-refusal instances at an average rate of 6.98%, which is more than double that of leading baseline methods. This effectiveness allows ORFuzz to uncover vulnerabilities that other approaches miss. The research also revealed that different LLMs exhibit varying over-refusal rates across different safety categories, emphasizing the need for model-specific testing strategies.

Beyond its testing capabilities, ORFuzz has also contributed to the creation of ORFuzzSET, a new benchmark dataset. This dataset comprises 1,855 highly transferable test cases derived from ORFuzz’s outputs that successfully triggered over-refusal in multiple LLMs. ORFuzzSET achieves a superior average over-refusal rate of 63.56% across 10 diverse LLMs, significantly outperforming existing datasets and providing a valuable resource for the AI community.

In conclusion, ORFuzz provides a robust, automated framework for rigorously assessing and improving the reliability and trustworthiness of LLM-based software systems by tackling the critical issue of over-refusal. The code for this paper is available for the community to use and build upon. You can find more details in the research paper: ORFuzz: Fuzzing the “Other Side” of LLM Safety – Testing Over-Refusal.

Dev Sundaram (https://blogs.edgentiq.com)
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories—product launches, funding rounds, regulatory shifts—and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach him at: [email protected]
