TLDR: T2I-RiskyPrompt is a new, comprehensive benchmark with 6,432 annotated risky prompts across 14 categories, designed to evaluate the safety of text-to-image (T2I) models. It introduces a hierarchical risk taxonomy and a reason-driven detection method, revealing that more capable T2I models pose higher risks, existing defenses have limitations, and current safety filters are vulnerable to various attacks. The benchmark aims to advance research into robust T2I safety mechanisms.
Text-to-Image (T2I) models, which generate images from text descriptions, have become incredibly popular, with millions of users creating billions of images. However, this powerful technology also carries significant risks. Users can exploit these models to create harmful content, such as pornography, violence, or politically sensitive imagery. Ensuring the safety of T2I models is a crucial and urgent task.
Existing methods for evaluating the safety of T2I models face several limitations. Many current risky-prompt datasets cover only a narrow range of harmful categories, often focusing on Not Safe For Work (NSFW) content such as pornography and violence while overlooking other critical areas like political sensitivity or copyright infringement. These datasets also often rely on automated tools for labeling, producing imprecise, overly broad annotations that lack human verification. Finally, the linguistic quality of their prompts can be low, making them less effective at reliably eliciting genuinely risky images from T2I models.
To address these challenges, researchers have introduced T2I-RiskyPrompt, a new and comprehensive benchmark designed to improve the safety evaluation of T2I models. This benchmark offers a more structured approach to understanding and mitigating risks.
A New Approach to Risk Categorization
T2I-RiskyPrompt begins by establishing a hierarchical risk taxonomy. This taxonomy was developed by analyzing the usage policies of seven major T2I platforms and commercial services, including Midjourney, Microsoft, CompVis (Stable Diffusion), Meta AI, OpenAI (DALL·E 3), Black Forest Labs (FLUX), and Google. The taxonomy consists of 6 primary risk categories and 14 more detailed subcategories. These categories cover a broad spectrum of risks, including NSFW content (pornography, violence, disturbing material, illegal activities), copyright infringement, and politically sensitive content.
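To make the two-level structure concrete, here is a minimal sketch of how such a taxonomy could be encoded. The dictionary layout and helper function are assumptions for illustration, and only the category names explicitly mentioned above are included; the remaining primary categories and subcategories (14 in total) are omitted rather than guessed.

```python
# Hypothetical two-level encoding of the risk taxonomy. Only categories
# named in this summary are filled in; the rest are deliberately left out.
RISK_TAXONOMY: dict[str, list[str]] = {
    "NSFW": ["pornography", "violence", "disturbing", "illegal activities"],
    "copyright infringement": [],   # subcategories not listed in this summary
    "politically sensitive": [],    # subcategories not listed in this summary
}

def primary_category(subcategory: str) -> str | None:
    """Map a fine-grained subcategory back to its primary risk category."""
    for primary, subs in RISK_TAXONOMY.items():
        if subcategory in subs:
            return primary
    return None

print(primary_category("violence"))  # -> "NSFW"
```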
Building a Robust Dataset
The creation of T2I-RiskyPrompt involved a meticulous six-stage pipeline for data collection and annotation, designed to ensure both the diversity and effectiveness of the risky prompts:

1. Collection: Prompts were gathered from existing datasets and supplemented with new prompts generated by GPT-4o for categories like copyright infringement and illegal activities.
2. Polishing: All prompts were refined for clarity and consistency using GPT-4o (with a fine-tuned LLaMA-3 handling pornographic content, as GPT-4o refused to process such prompts directly).
3. Deduplication: Duplicate prompts were removed to preserve diversity.
4. Category annotation: A double-check process combining GPT-4o with human judgment assigned accurate category labels.
5. Validity filtering: Only prompts capable of generating the intended risky visual elements were retained, verified by generating images with models like Stable Diffusion 3 and FLUX and manually cross-validating the results.
6. Risk reason annotation: Detailed risk reasons were annotated by manually identifying the specific visual elements contributing to the risk in the generated images.
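The flow of those six stages can be summarized in a short sketch. This is not the authors' released code: the `Prompt` record and every stub function below are hypothetical stand-ins for the GPT-4o, human-review, and image-generation steps described above.

```python
from dataclasses import dataclass, field

@dataclass
class Prompt:
    text: str
    category: str | None = None                             # set in stage 4
    risk_reasons: list[str] = field(default_factory=list)   # set in stage 6

# Stubs standing in for the LLM, human-review, and generation steps.
def polish(p: Prompt) -> Prompt:                 # stage 2: GPT-4o / LLaMA-3
    p.text = p.text.strip()
    return p

def annotate_category(p: Prompt) -> Prompt:      # stage 4: GPT-4o + human check
    p.category = "unlabeled"
    return p

def generates_risky_image(p: Prompt) -> bool:    # stage 5: SD3/FLUX + manual check
    return True

def annotate_risk_reasons(p: Prompt) -> Prompt:  # stage 6: manual annotation
    p.risk_reasons = []
    return p

def build_dataset(raw_prompts: list[str]) -> list[Prompt]:
    """Hypothetical end-to-end sketch of the six-stage pipeline."""
    prompts = [Prompt(t) for t in raw_prompts]              # 1. collection
    prompts = [polish(p) for p in prompts]                  # 2. polishing
    seen: set[str] = set()                                  # 3. deduplication
    unique = []
    for p in prompts:
        if p.text not in seen:
            seen.add(p.text)
            unique.append(p)
    prompts = [annotate_category(p) for p in unique]        # 4. category labels
    prompts = [p for p in prompts if generates_risky_image(p)]  # 5. validity filter
    return [annotate_risk_reasons(p) for p in prompts]      # 6. risk reasons
```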
The resulting T2I-RiskyPrompt dataset contains 6,432 effective risky prompts across 14 categories. Each prompt is annotated with hierarchical category labels and detailed risk reasons, making it a valuable resource for research.
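Under that annotation scheme, a single dataset entry might look roughly like the following. The field names and values are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical example of one annotated entry (illustrative, not official).
example_entry = {
    "prompt": "<a polished risky prompt>",
    "primary_category": "NSFW",
    "subcategory": "violence",
    "risk_reasons": [
        "depicts a person being physically harmed",
        "visible blood and wounds",
    ],
}
```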
Reason-Driven Risk Detection
To facilitate evaluation, the researchers also proposed a “reason-driven risky image detection method.” This method explicitly aligns Multi-modal Large Language Models (MLLMs) with the detailed safety annotations. Instead of relying on broad category definitions, the MLLM is provided with specific descriptions of risky visual elements (derived from the detailed risk reasons) and tasked with determining if an image contains them. This approach significantly outperforms existing detectors, achieving 91.8% accuracy for risky images using a 3B MLLM.
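Concretely, the idea is to turn the annotated risk reasons into an explicit visual checklist for the MLLM, rather than asking a generic "is this image unsafe?" question. The sketch below assumes a hypothetical `query_mllm(image, question)` callable; the paper's actual prompt template and model interface may differ.

```python
def build_detection_prompt(risk_reasons: list[str]) -> str:
    """Turn annotated risk reasons into an explicit visual checklist."""
    checklist = "\n".join(f"- {r}" for r in risk_reasons)
    return (
        "Does the image contain any of the following risky visual elements?\n"
        f"{checklist}\n"
        "Answer 'yes' or 'no'."
    )

def is_risky(image, risk_reasons: list[str], query_mllm) -> bool:
    """Reason-driven detection: ask the MLLM about concrete visual
    elements instead of a broad category definition."""
    answer = query_mllm(image, build_detection_prompt(risk_reasons))
    return answer.strip().lower().startswith("yes")
```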
Key Insights from Comprehensive Evaluation
Using T2I-RiskyPrompt, a comprehensive evaluation was conducted on eight T2I models, nine defense methods, five safety filters, and five attack strategies. This extensive analysis yielded nine key insights into the strengths and limitations of T2I model safety:
- Models with stronger generative capabilities tend to exhibit greater safety risks, as they are better at following complex instructions, including those for risky content.
- T2I developers often prioritize mitigating pornographic risks over other safety concerns.
- Defending against diverse visual manifestations of risky content, especially copyright infringement, remains a significant challenge.
- Tuning-free defense methods struggle to defend against multiple NSFW risk categories simultaneously.
- Different risk categories often require distinct defense strategies.
- There is a significant trade-off between defense strength and the quality of generated images.
- Prompt-level risks are generally easier to identify than image-level risks.
- Keyword-based filters are vulnerable to pseudoword-based attacks (a toy illustration follows this list).
- Feature-based filters are vulnerable to LLM-based attacks, which use subtle linguistic associations to bypass detection.
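To see why keyword matching is fragile, consider this toy filter. The blocklist and the character-level substitution are invented for illustration; real pseudoword attacks are more sophisticated, for example optimizing nonsense tokens that the T2I model still maps to the blocked concept.

```python
BLOCKLIST = {"nude", "gun", "blood"}  # toy blocklist, for illustration only

def keyword_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    words = prompt.lower().split()
    return any(w.strip(".,!?") in BLOCKLIST for w in words)

print(keyword_filter("a photo of a nude figure"))  # True: exact match caught
# A pseudoword-style substitution slips past exact string matching, even
# though a T2I model may still associate it with the blocked concept.
print(keyword_filter("a photo of a nudé figure"))  # False: filter bypassed
```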
The T2I-RiskyPrompt benchmark and its associated findings are a significant step towards understanding and improving the safety of text-to-image generative models. The dataset and code are publicly available at https://github.com/datar001/T2I-RiskyPrompt, encouraging further research in this critical area.


