TLDR: A new framework, Personalized Safety Alignment (PSA), addresses the limitation of uniform safety standards in text-to-image diffusion models by allowing user-specific control over content generation. It uses a novel dataset, Sage, which captures diverse user preferences based on factors like age and beliefs. PSA integrates these profiles into the model, leading to more effective suppression of harmful content and better alignment with individual safety boundaries, as demonstrated by improved performance metrics.
Text-to-image diffusion models have transformed how we create visual content, offering powerful generative capabilities and high-quality images. However, their safety mechanisms have remained a significant weakness: they typically apply a uniform standard to all users. This ‘one-size-fits-all’ approach overlooks the diverse safety boundaries that individuals have, shaped by factors like age, mental health, and personal beliefs.
To address this limitation, researchers have introduced a novel framework called Personalized Safety Alignment (PSA), which gives users explicit control over the safety behavior of generative models. It works by integrating personalized user profiles directly into the image generation process, letting the model adjust its output to match individual safety preferences while still maintaining high image quality.
A key component of the PSA framework is a new dataset named Sage, designed specifically to capture user-specific safety preferences. Unlike previous datasets that rely on fixed, global safety standards, Sage encodes semantically rich safety preferences, providing tailored and precise supervision for personalized safety training. It covers ten safety-sensitive categories and over 800 harmful concepts, each paired with high-quality images and corresponding prompts. The dataset also simulates 1,000 virtual users, each defined by attributes such as age, gender, religion, and health, from which their attitudes toward safety concepts are inferred.
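To make that structure concrete, here is a minimal sketch of what a Sage-style record could look like. The field names and types are illustrative assumptions based on the description above, not the dataset’s actual schema:

```python
# Illustrative sketch only: field names are assumptions, not Sage's real schema.
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """One of the ~1,000 simulated users described in the paper."""
    user_id: int
    age: int
    gender: str
    religion: str
    health: str  # e.g., mental-health conditions relevant to content tolerance
    # Inferred attitude toward each harmful concept: concept -> acceptable?
    concept_attitudes: dict[str, bool] = field(default_factory=dict)

@dataclass
class SageExample:
    concept: str      # one of the 800+ harmful concepts
    category: str     # one of the ten safety-sensitive categories
    prompt: str       # text prompt paired with the concept
    image_path: str   # corresponding high-quality image
```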
The PSA framework builds upon existing techniques like Direct Preference Optimization (DPO), extending it to a personalized diffusion-DPO loss in which the denoising network is conditioned on the noisy image, the text prompt, and a user embedding. This user embedding is injected into the diffusion model’s attention layers through a cross-attention adapter, enabling dynamic control over generation based on individual safety profiles while preserving the model’s existing safety knowledge.
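The post does not reproduce the exact objective, but this description maps naturally onto the published Diffusion-DPO loss. One plausible form of the personalized variant simply adds a user embedding $e_u$ to the trainable denoiser’s conditioning (a reconstruction, not the paper’s exact formula):

$$
\begin{aligned}
\mathcal{L}_{\mathrm{PSA}}(\theta) = -\,\mathbb{E}\Big[\log\sigma\Big(-\beta\,\omega(\lambda_t)\Big(
&\;\|\epsilon^w-\epsilon_\theta(x^w_t, c, e_u, t)\|^2 - \|\epsilon^w-\epsilon_{\mathrm{ref}}(x^w_t, c, t)\|^2 \\
-&\;\big(\|\epsilon^l-\epsilon_\theta(x^l_t, c, e_u, t)\|^2 - \|\epsilon^l-\epsilon_{\mathrm{ref}}(x^l_t, c, t)\|^2\big)
\Big)\Big)\Big]
\end{aligned}
$$

where $\sigma$ is the logistic function, $\omega(\lambda_t)$ a timestep weighting, $(x^w, x^l)$ are the images the simulated user prefers and rejects for prompt $c$, and $\epsilon_{\mathrm{ref}}$ is the frozen base denoiser. Likewise, a cross-attention adapter of the kind described can be sketched in a few lines of PyTorch; module names, dimensions, and the zero-initialized gate below are assumptions for illustration, not the paper’s actual architecture:

```python
# A minimal sketch of a cross-attention adapter that injects a user embedding
# into a diffusion U-Net's attention layers. Names/dims are assumptions.
import torch
import torch.nn as nn

class UserCrossAttentionAdapter(nn.Module):
    def __init__(self, hidden_dim: int, user_dim: int, num_heads: int = 8):
        super().__init__()
        # Project the user-profile embedding into the attention key/value space.
        self.to_k = nn.Linear(user_dim, hidden_dim, bias=False)
        self.to_v = nn.Linear(user_dim, hidden_dim, bias=False)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Zero-initialized gate: the adapter starts as a no-op.
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor, user_emb: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from a U-Net attention block
        # user_emb:      (batch, num_tokens, user_dim) tokens from the profile
        k = self.to_k(user_emb)
        v = self.to_v(user_emb)
        out, _ = self.attn(query=hidden_states, key=k, value=v)
        # Residual add: base behavior is preserved exactly when scale == 0.
        return hidden_states + self.scale * out
```

Zero-initializing the gate is one common way to let the new conditioning be learned gradually while the base model’s behavior, including its existing safety knowledge, stays intact at the start of training.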
Experiments demonstrate that PSA significantly outperforms existing safety alignment methods at suppressing harmful content, consistently achieving lower Inappropriate Probability (IP) scores across safety benchmarks including Sage, CoProV2, I2P, and UD. For instance, on SD v1.5, PSA-L5 reduced IP to 0.12 on I2P and 0.09 on UD, a notable improvement over SafetyDPO. While there can be a slight trade-off in image quality (measured by FID) at the highest safety levels, prompt-image alignment (CLIPScore) remains competitive, indicating that the model still generates content relevant to the prompt.
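For reference, IP scores in this literature are generally computed as the fraction of generated images that a safety classifier flags as inappropriate. A minimal sketch, assuming a generic detector (benchmarks like I2P typically rely on classifiers such as Q16 or NudeNet-style detectors):

```python
# Hedged sketch of the Inappropriate Probability (IP) metric: the fraction of
# generated images flagged by a safety classifier. `safety_classifier` is a
# placeholder for whichever detector a given benchmark uses.
from typing import Callable, Iterable

def inappropriate_probability(
    images: Iterable, safety_classifier: Callable[[object], bool]
) -> float:
    flags = [bool(safety_classifier(img)) for img in images]
    return sum(flags) / max(len(flags), 1)  # lower is safer
```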
Beyond general suppression, PSA excels at personalized safety alignment: it achieves higher Win Rate and Pass Rate scores, indicating that its images better fit a user’s safety boundaries and comply with their preferences compared to base models and other safety methods. The framework offers progressive suppression levels (L1–L5), allowing fine-grained control in which unsafe elements are gradually reduced while core semantics and structure are preserved.
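In practice, that kind of control might surface as an explicit level argument at generation time. The snippet below is purely illustrative; `PSAPipeline`, `load_user_profile`, and `safety_level` are hypothetical names, not a published API:

```python
# Hypothetical usage sketch: sweep the progressive suppression levels L1-L5
# for one user profile. None of these names come from the paper's code.
profile = load_user_profile("user_42.json")        # age, beliefs, health, ...
pipe = PSAPipeline.from_pretrained("sd-v1-5-psa")  # hypothetical checkpoint

for level in range(1, 6):                          # L1 (mild) .. L5 (strict)
    image = pipe("a battlefield scene", user_profile=profile, safety_level=level)
    image.save(f"battlefield_L{level}.png")
```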
While PSA represents a significant leap forward, it currently relies on synthetic user profiles generated by large language models. Future work may explore real-world deployment and adaptive learning from interactive user feedback. This research marks an important step toward safer, more user-centered generative AI systems that respect individual differences in content tolerance. You can find more details in the full paper: Personalized Safety Alignment for Text-to-Image Diffusion Models.