TLDR: T2I-RiskyPrompt is a new, comprehensive benchmark with 6,432 annotated risky prompts across 14 categories, designed to evaluate the safety of text-to-image (T2I) models. It introduces a hierarchical risk taxonomy and a reason-driven detection method, revealing that more capable T2I models pose higher risks, existing defenses have limitations, and current safety filters are vulnerable to various attacks. The benchmark aims to advance research into robust T2I safety mechanisms.
Text-to-Image (T2I) models, which generate images from text descriptions, have become incredibly popular, with millions of users creating billions of images. However, this powerful technology also carries significant risks. Users can exploit these models to create harmful content, such as pornography, violence, or politically sensitive imagery. Ensuring the safety of T2I models is a crucial and urgent task.
Existing methods for evaluating the safety of T2I models face several limitations. Many current risky-prompt datasets cover only a narrow range of harmful categories, often focusing on Not Safe For Work (NSFW) content such as pornography and violence while overlooking other critical areas like political sensitivity or copyright infringement. These datasets also often rely on automated tools for labeling, producing imprecise, overly broad annotations that lack human verification. Finally, the linguistic quality of their prompts can be low, making them less effective at reliably eliciting genuinely risky images from T2I models.
To address these challenges, researchers have introduced T2I-RiskyPrompt, a new and comprehensive benchmark designed to improve the safety evaluation of T2I models. This benchmark offers a more structured approach to understanding and mitigating risks.
A New Approach to Risk Categorization
T2I-RiskyPrompt begins by establishing a hierarchical risk taxonomy. This taxonomy was developed by analyzing the usage policies of seven major T2I platforms and commercial services, including Midjourney, Microsoft, CompVis (Stable Diffusion), Meta AI, OpenAI (DALL·E 3), Black Forest Labs (FLUX), and Google. The taxonomy consists of 6 primary risk categories and 14 more detailed subcategories. These categories cover a broad spectrum of risks, including NSFW content (pornography, violence, disturbing material, illegal activities), copyright infringement, and politically sensitive content.
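To make the two-level structure concrete, here is a minimal sketch of how such a taxonomy could be encoded. The dictionary layout and helper function are assumptions for illustration, and only the category names explicitly mentioned above are included; the remaining primary categories and subcategories (14 in total) are omitted rather than guessed.

```python
# Hypothetical two-level encoding of the risk taxonomy. Only categories
# named in this summary are filled in; the rest are deliberately left out.
RISK_TAXONOMY: dict[str, list[str]] = {
    "NSFW": ["pornography", "violence", "disturbing", "illegal activities"],
    "copyright infringement": [],   # subcategories not listed in this summary
    "politically sensitive": [],    # subcategories not listed in this summary
}

def primary_category(subcategory: str) -> str | None:
    """Map a fine-grained subcategory back to its primary risk category."""
    for primary, subs in RISK_TAXONOMY.items():
        if subcategory in subs:
            return primary
    return None

print(primary_category("violence"))  # -> "NSFW"
```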
Building a Robust Dataset
The creation of T2I-RiskyPrompt involved a meticulous six-stage pipeline for data collection and annotation, designed to ensure both the diversity and effectiveness of the risky prompts:

1. Collection: Prompts were gathered from existing datasets and supplemented with new prompts generated by GPT-4o for categories like copyright infringement and illegal activities.
2. Polishing: All prompts were refined for clarity and consistency using GPT-4o (with a fine-tuned LLaMA-3 handling pornographic content, as GPT-4o refused to process such prompts directly).
3. Deduplication: Duplicate prompts were removed to preserve diversity.
4. Category annotation: A double-check process combining GPT-4o with human judgment assigned accurate category labels.
5. Validity filtering: Only prompts capable of generating the intended risky visual elements were retained, verified by generating images with models like Stable Diffusion 3 and FLUX and manually cross-validating the results.
6. Risk reason annotation: Detailed risk reasons were annotated by manually identifying the specific visual elements contributing to the risk in the generated images.
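The flow of those six stages can be summarized in a short sketch. This is not the authors' released code: the `Prompt` record and every stub function below are hypothetical stand-ins for the GPT-4o, human-review, and image-generation steps described above.

```python
from dataclasses import dataclass, field

@dataclass
class Prompt:
    text: str
    category: str | None = None                             # set in stage 4
    risk_reasons: list[str] = field(default_factory=list)   # set in stage 6

# Stubs standing in for the LLM, human-review, and generation steps.
def polish(p: Prompt) -> Prompt:                 # stage 2: GPT-4o / LLaMA-3
    p.text = p.text.strip()
    return p

def annotate_category(p: Prompt) -> Prompt:      # stage 4: GPT-4o + human check
    p.category = "unlabeled"
    return p

def generates_risky_image(p: Prompt) -> bool:    # stage 5: SD3/FLUX + manual check
    return True

def annotate_risk_reasons(p: Prompt) -> Prompt:  # stage 6: manual annotation
    p.risk_reasons = []
    return p

def build_dataset(raw_prompts: list[str]) -> list[Prompt]:
    """Hypothetical end-to-end sketch of the six-stage pipeline."""
    prompts = [Prompt(t) for t in raw_prompts]              # 1. collection
    prompts = [polish(p) for p in prompts]                  # 2. polishing
    seen: set[str] = set()                                  # 3. deduplication
    unique = []
    for p in prompts:
        if p.text not in seen:
            seen.add(p.text)
            unique.append(p)
    prompts = [annotate_category(p) for p in unique]        # 4. category labels
    prompts = [p for p in prompts if generates_risky_image(p)]  # 5. validity filter
    return [annotate_risk_reasons(p) for p in prompts]      # 6. risk reasons
```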
The resulting T2I-RiskyPrompt dataset contains 6,432 effective risky prompts across 14 categories. Each prompt is annotated with hierarchical category labels and detailed risk reasons, making it a valuable resource for research.
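Under that annotation scheme, a single dataset entry might look roughly like the following. The field names and values are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical example of one annotated entry (illustrative, not official).
example_entry = {
    "prompt": "<a polished risky prompt>",
    "primary_category": "NSFW",
    "subcategory": "violence",
    "risk_reasons": [
        "depicts a person being physically harmed",
        "visible blood and wounds",
    ],
}
```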
Reason-Driven Risk Detection
To facilitate evaluation, the researchers also proposed a “reason-driven risky image detection method.” This method explicitly aligns Multi-modal Large Language Models (MLLMs) with the detailed safety annotations. Instead of relying on broad category definitions, the MLLM is provided with specific descriptions of risky visual elements (derived from the detailed risk reasons) and tasked with determining if an image contains them. This approach significantly outperforms existing detectors, achieving 91.8% accuracy for risky images using a 3B MLLM.
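Concretely, the idea is to turn the annotated risk reasons into an explicit visual checklist for the MLLM, rather than asking a generic "is this image unsafe?" question. The sketch below assumes a hypothetical `query_mllm(image, question)` callable; the paper's actual prompt template and model interface may differ.

```python
def build_detection_prompt(risk_reasons: list[str]) -> str:
    """Turn annotated risk reasons into an explicit visual checklist."""
    checklist = "\n".join(f"- {r}" for r in risk_reasons)
    return (
        "Does the image contain any of the following risky visual elements?\n"
        f"{checklist}\n"
        "Answer 'yes' or 'no'."
    )

def is_risky(image, risk_reasons: list[str], query_mllm) -> bool:
    """Reason-driven detection: ask the MLLM about concrete visual
    elements instead of a broad category definition."""
    answer = query_mllm(image, build_detection_prompt(risk_reasons))
    return answer.strip().lower().startswith("yes")
```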
Key Insights from Comprehensive Evaluation
Using T2I-RiskyPrompt, a comprehensive evaluation was conducted on eight T2I models, nine defense methods, five safety filters, and five attack strategies. This extensive analysis yielded nine key insights into the strengths and limitations of T2I model safety:
- Models with stronger generative capabilities tend to exhibit greater safety risks, as they are better at following complex instructions, including those for risky content.
- T2I developers often prioritize mitigating pornographic risks over other safety concerns.
- Defending against diverse visual manifestations of risky content, especially copyright infringement, remains a significant challenge.
- Tuning-free defense methods struggle to defend against multiple NSFW risk categories simultaneously.
- Different risk categories often require distinct defense strategies.
- There is a significant trade-off between defense strength and the quality of generated images.
- Prompt-level risks are generally easier to identify than image-level risks.
- Keyword-based filters are vulnerable to pseudoword-based attacks (a toy illustration follows this list).
- Feature-based filters are vulnerable to LLM-based attacks, which use subtle linguistic associations to bypass detection.
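To see why keyword matching is fragile, consider this toy filter. The blocklist and the character-level substitution are invented for illustration; real pseudoword attacks are more sophisticated, for example optimizing nonsense tokens that the T2I model still maps to the blocked concept.

```python
BLOCKLIST = {"nude", "gun", "blood"}  # toy blocklist, for illustration only

def keyword_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    words = prompt.lower().split()
    return any(w.strip(".,!?") in BLOCKLIST for w in words)

print(keyword_filter("a photo of a nude figure"))  # True: exact match caught
# A pseudoword-style substitution slips past exact string matching, even
# though a T2I model may still associate it with the blocked concept.
print(keyword_filter("a photo of a nudé figure"))  # False: filter bypassed
```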
The T2I-RiskyPrompt benchmark and its associated findings are a significant step towards understanding and improving the safety of text-to-image generative models. The dataset and code are publicly available at https://github.com/datar001/T2I-RiskyPrompt, encouraging further research in this critical area.


