
ORFuzz: A New Approach to Uncover Over-Refusal in Large Language Models

TLDR: ORFuzz is a novel evolutionary testing framework designed to detect ‘over-refusal’ in Large Language Models (LLMs), where models erroneously reject benign queries due to overly cautious safety measures. It integrates safety category-aware seed selection, adaptive mutator optimization, and a human-aligned judge model (OR-JUDGE) to generate effective test cases. ORFuzz significantly outperforms existing methods in detecting over-refusal and has led to the creation of ORFuzzSET, a new benchmark dataset of 1,855 transferable test cases that effectively trigger over-refusal across various LLMs.

Large Language Models (LLMs) are becoming increasingly common in critical applications, from healthcare to legal systems. While developers implement safety measures to prevent harmful content, an unintended side effect has emerged: ‘over-refusal’. This occurs when an LLM incorrectly rejects a harmless or benign query because its safety mechanisms are too cautious. For example, a coding assistant might refuse to explain ‘how to kill a Python process’ due to the word ‘kill’, even though the intent is purely technical and harmless. This over-cautiousness is a significant functional flaw that undermines the reliability and usability of these powerful AI systems.

Current methods for testing this over-refusal behavior are often inadequate. Existing benchmarks, which are static collections of prompts, frequently contain queries that humans actually perceive as harmful, making them unreliable for truly testing over-refusal. Furthermore, the process of manually creating test cases is slow, inconsistent, and doesn’t cover enough scenarios. This highlights a critical need for a more dynamic and automated approach to uncover these vulnerabilities.

Introducing ORFuzz: A Novel Testing Framework

To address this gap, researchers have introduced ORFuzz, a pioneering evolutionary testing framework designed specifically for systematically detecting and analyzing LLM over-refusals. ORFuzz operates through a unique integration of three core components, working together in an iterative feedback loop to find instances where LLMs erroneously reject benign prompts.
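The overall iteration can be pictured as a simple evolutionary loop. The sketch below is an illustrative reading of that loop, not the paper's actual API: all function names (`orfuzz_loop`, `select_seed`, `mutate`, `target_llm`, `judge`) are placeholders for the components described in the sections that follow.

```python
import random

def orfuzz_loop(seed_pool, select_seed, mutate, target_llm, judge, iterations=100):
    """Illustrative sketch of ORFuzz's feedback loop: pick a seed, mutate it,
    query the model under test, and let the judge decide whether the model
    over-refused a benign query. Successful test cases re-enter the pool."""
    findings = []
    for _ in range(iterations):
        seed = select_seed(seed_pool)      # safety category-aware selection
        candidate = mutate(seed)           # adaptive mutation
        response = target_llm(candidate)   # query the target LLM
        if judge(candidate, response):     # human-aligned over-refusal check
            findings.append((candidate, response))
            seed_pool.append(candidate)    # evolve: keep effective cases
    return findings
```

Each component plugged into this loop is described in more detail below.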

The first component is Safety Category-Aware Seed Selection. Instead of random sampling, ORFuzz classifies queries into eight safety-relevant categories, such as ‘Crimes and Illegal Activities’ or ‘Ethics and Morality’. It then uses an advanced exploration algorithm to select diverse and representative starting queries (seeds) from these categories. This ensures that the testing covers a wide range of potential over-refusal scenarios, recognizing that different LLMs might be sensitive to different types of content.
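As a rough sketch of the idea (the paper's actual exploration algorithm is more sophisticated), category-aware selection can be approximated by always drawing from the least-explored category so that coverage stays balanced across all eight categories. The function name and data layout here are illustrative assumptions:

```python
import random

def select_seed(seeds_by_category, selection_counts):
    """Toy stand-in for category-aware seed selection: favour the safety
    category that has been sampled least so far (exploration), then draw a
    random seed from it. Returns the chosen category and seed."""
    category = min(seeds_by_category, key=lambda c: selection_counts[c])
    selection_counts[category] += 1
    return category, random.choice(seeds_by_category[category])
```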

The second key component is Adaptive Mutator Optimization. Mutators are tools that transform existing queries into new test cases. ORFuzz employs three types of specialized mutators: General (for linguistic variations), Sensitive Word (for introducing or replacing sensitive terms), and Scenario/Task (for changing context or intent). What makes this adaptive is an ‘analyze-generate-feedback’ loop, powered by reasoning LLMs. This loop dynamically refines the mutator prompts, ensuring that the generated test cases become progressively more effective at triggering over-refusal behaviors.
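A minimal sketch of the three mutator families, assuming simple string operations in place of the LLM-driven rewriting the paper actually uses, might look like this. The reinforcement step below is a crude analogue of the analyze-generate-feedback loop (which in ORFuzz refines the mutator *prompts* themselves); all names are hypothetical:

```python
import random

# Illustrative stand-ins for the three mutator families; the real system
# rewrites queries with reasoning LLMs rather than fixed string edits.
MUTATORS = {
    "general": lambda q: q + " Please explain step by step.",  # linguistic variation
    "sensitive_word": lambda q: q.replace("stop", "kill"),     # sensitive-term swap
    "scenario": lambda q: "For a systems course: " + q,        # context/intent change
}

def pick_mutator(weights):
    """Sample a mutator family in proportion to its current weight."""
    names = list(MUTATORS)
    return random.choices(names, weights=[weights[n] for n in names])[0]

def record_feedback(weights, name, triggered):
    """Reinforce mutators whose outputs triggered over-refusal, so effective
    mutation strategies are chosen more often in later iterations."""
    if triggered:
        weights[name] += 1.0
```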

Finally, OR-JUDGE serves as a human-aligned judge model for validating test outcomes. This configurable evaluator was fine-tuned using thousands of human-labeled query-response pairs, allowing it to accurately assess both the toxicity of content and the rationality of a model’s refusal. OR-JUDGE acts as a sophisticated oracle, reliably determining whether a detected refusal truly constitutes an over-refusal from a human perspective.
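Conceptually, the oracle's decision reduces to two checks: is the query benign, and did the model refuse it? The toy stand-in below uses keyword matching where OR-JUDGE uses a fine-tuned LLM to score toxicity and refusal rationality; the marker list and function name are illustrative assumptions:

```python
# Toy stand-in for OR-JUDGE's decision logic. The real judge is a fine-tuned
# LLM; here a keyword match flags a refusal, and an over-refusal is defined
# as a refusal of a query already known to be benign.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_over_refusal(query_is_benign, response):
    refused = any(marker in response.lower() for marker in REFUSAL_MARKERS)
    return query_is_benign and refused
```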


Key Findings and Impact

Extensive evaluations demonstrate ORFuzz’s superior performance. It generates diverse and validated over-refusal instances at an average rate of 6.98%, which is more than double that of leading baseline methods. This effectiveness allows ORFuzz to uncover vulnerabilities that other approaches miss. The research also revealed that different LLMs exhibit varying over-refusal rates across different safety categories, emphasizing the need for model-specific testing strategies.

Beyond its testing capabilities, ORFuzz has also contributed to the creation of ORFuzzSET, a new benchmark dataset. This dataset comprises 1,855 highly transferable test cases derived from ORFuzz’s outputs that successfully triggered over-refusal in multiple LLMs. ORFuzzSET achieves a superior average over-refusal rate of 63.56% across 10 diverse LLMs, significantly outperforming existing datasets and providing a valuable resource for the AI community.

In conclusion, ORFuzz provides a robust, automated framework for rigorously assessing and improving the reliability and trustworthiness of LLM-based software systems by tackling the critical issue of over-refusal. The code for this paper is available for the community to use and build upon. You can find more details in the research paper: ORFuzz: Fuzzing the “Other Side” of LLM Safety – Testing Over-Refusal.

Dev Sundaram (https://blogs.edgentiq.com)
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories—product launches, funding rounds, regulatory shifts—and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach him at: [email protected]
