TL;DR: DREAM is a new framework for red teaming text-to-image AI models. Unlike previous methods that optimize individual prompts, DREAM models the probabilistic distribution of problematic prompts. This allows it to efficiently discover a large, diverse set of prompts that can bypass safety filters and generate harmful content. The framework uses an energy-based model and a novel optimization algorithm (GC-SPSA) to achieve high success rates and diversity, proving effective across various T2I models and commercial platforms, and also improving subsequent safety fine-tuning.
Text-to-image (T2I) generative models have rapidly transformed how we create visual content, allowing users to generate high-quality images from simple text descriptions. These powerful tools, like Stable Diffusion and DALL·E 3, are widely adopted across various fields, from creative arts to social media. However, their training on vast, often unfiltered datasets means they can inadvertently learn to produce harmful content, such as sexual or violent imagery, raising significant ethical and safety concerns.
To address these risks, a crucial practice known as “red teaming” has emerged. Red teaming involves proactively identifying diverse prompts that can trick a T2I system into generating unsafe outputs, despite built-in safety measures like content filters. This process is essential for assessing and improving the safety of these AI systems before they are deployed for public use.
Existing automated red teaming methods often treat the discovery of problematic prompts as an isolated, prompt-by-prompt optimization task. This approach has several limitations: it is slow, yields prompts with low diversity, and does not scale to large safety assessments. Imagine trying to find every weak spot in a massive wall by poking it one brick at a time: it works, but inefficiently.
Introducing DREAM: A New Paradigm for Red Teaming
A new framework called DREAM (Distributional Red Teaming via Energy-based Modeling) aims to overcome these limitations. Unlike previous methods that optimize prompts individually, DREAM takes a fundamentally different approach: it directly models the probabilistic distribution of the target system’s problematic prompts. Think of it as learning the ‘pattern’ of all possible weak spots in the wall, rather than just finding one at a time.
This innovative formulation offers several key advantages. By modeling the distribution, DREAM can explicitly optimize for both the effectiveness of the prompts (how likely they are to generate unsafe content) and their diversity (how varied and unique the prompts are). Once trained, the system can efficiently sample a large number of diverse problematic prompts, making it highly scalable for real-world applications.
How DREAM Works
DREAM draws inspiration from energy-based models, reformulating the complex objective into simpler, manageable goals. It uses an “energy function” to guide its learning process, assigning lower energy to more desirable (i.e., problematic) prompts. This function incorporates two main components:
- Vision-level Harmfulness Energy: This component evaluates the generated image itself to see how well it aligns with a predefined harmful concept (e.g., “an image containing nudity”). It uses a vision-language model to ensure reliability across different image styles.
- Prompt-level Diversity Energy: To ensure the generated prompts are not repetitive, this component explicitly encourages semantic diversity among the prompts. It measures the similarity between generated prompts and penalizes those that are too alike, pushing the system to explore a broader range of unsafe expressions.
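The two components above can be combined into a single batch-level energy score. The sketch below assumes precomputed per-prompt harmfulness scores (e.g., from a vision-language model scoring the generated images) and prompt embeddings from a sentence encoder; the function names, the mean-pairwise-similarity penalty, and the weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def diversity_penalty(embeddings):
    """Mean pairwise cosine similarity among prompt embeddings.
    High similarity -> redundant prompts -> higher energy."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = E @ E.T
    n = len(E)
    # Average over the off-diagonal pairs only.
    return (sim.sum() - n) / (n * (n - 1))

def total_energy(harm_scores, embeddings, lam=1.0):
    """Lower energy = a more desirable batch: harmful AND diverse.
    harm_scores: per-prompt alignment with the target unsafe concept
    (higher = more harmful), e.g., from a VLM judging the image."""
    harm_energy = -np.mean(harm_scores)          # reward harmfulness
    div_energy = diversity_penalty(embeddings)   # penalize redundancy
    return harm_energy + lam * div_energy
```

Under this toy formulation, a batch of semantically spread-out prompts receives lower (better) energy than a batch of near-duplicates with the same harmfulness scores, which is exactly the trade-off the two energy terms encode.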
Optimizing this energy function is challenging due to the complex nature of T2I pipelines. To tackle this, DREAM introduces GC-SPSA (Gradient-Calibrated Simultaneous Perturbation Stochastic Approximation), an efficient optimization algorithm. This method estimates gradients using only forward evaluations, avoiding the memory-intensive and often non-differentiable backpropagation process. GC-SPSA also includes a history-aware calibration mechanism to ensure stable and efficient training, even with the inherent randomness of AI model generation.
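The core of this idea is the classic SPSA two-point gradient estimate, which needs only two forward evaluations per step regardless of the parameter dimension. The sketch below illustrates it on a toy noisy objective; the exponential moving average standing in for the paper's history-aware calibration is an assumption, not GC-SPSA's exact scheme.

```python
import numpy as np

def spsa_step(theta, loss_fn, c=0.1, lr=0.05, rng=None, grad_ema=None, beta=0.9):
    """One SPSA update: estimate the gradient from two forward
    evaluations only (no backpropagation through the T2I pipeline)."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    g_hat = (loss_fn(theta + c * delta) - loss_fn(theta - c * delta)) / (2 * c) * delta
    # History-aware calibration, sketched here as an EMA over past estimates
    # to damp the noise from stochastic generation.
    grad_ema = g_hat if grad_ema is None else beta * grad_ema + (1 - beta) * g_hat
    return theta - lr * grad_ema, grad_ema

# Toy usage: minimize a noisy quadratic using forward evaluations only.
rng = np.random.default_rng(0)
loss = lambda t: np.sum(t ** 2) + 0.01 * rng.standard_normal()
theta, ema = np.array([2.0, -1.5]), None
for _ in range(300):
    theta, ema = spsa_step(theta, loss, rng=rng, grad_ema=ema)
```

The appeal for red teaming is that `loss_fn` can wrap the entire T2I pipeline, safety filter, and image scorer as a black box: only its output value is needed, never its gradients.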
During the inference phase, when generating prompts, DREAM employs an adaptive temperature scaling strategy. This technique further enhances diversity by penalizing frequently used tokens, encouraging the model to produce more unique and underexplored prompts.
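One plausible way to realize such a scheme is to down-weight the logits of tokens that have appeared often in previously sampled prompts before applying temperature and sampling. The frequency penalty below (an OpenAI-style `log1p` discount) is an illustrative assumption standing in for the paper's exact adaptive-temperature formula.

```python
import numpy as np

def penalized_probs(logits, token_counts, temp=1.0, alpha=0.8):
    """Softmax over logits discounted by how often each token was used.
    Frequently sampled tokens lose probability mass, steering generation
    toward underexplored vocabulary."""
    penalized = (logits - alpha * np.log1p(token_counts)) / temp
    p = np.exp(penalized - penalized.max())  # subtract max for stability
    return p / p.sum()

def sample_token(logits, token_counts, temp=1.0, alpha=0.8, rng=None):
    """Draw one token id and record it in the usage history."""
    rng = rng or np.random.default_rng()
    probs = penalized_probs(logits, token_counts, temp, alpha)
    tok = int(rng.choice(len(logits), p=probs))
    token_counts[tok] += 1  # update history for the next draw
    return tok
```

With equal logits, a token that has already been drawn 100 times ends up far less likely than its unused neighbors, which is the intended effect: repeated vocabulary is suppressed so sampled prompts spread out over the distribution.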
Validation and Impact
Extensive experiments have validated DREAM’s effectiveness. It consistently outperforms nine state-of-the-art baselines across a wide range of T2I models and safety filters, demonstrating superior prompt success rates and achieving human-level diversity. Notably, DREAM successfully uncovered failure cases in four real-world commercial T2I systems, including Ideogram, DeepAI, DALL·E 3, and Midjourney, even with their undisclosed safety mechanisms.
Furthermore, prompts generated by DREAM significantly enhance safety fine-tuning, enabling T2I models to become more robust against both seen and unseen harmful prompts. This suggests that DREAM’s global modeling approach helps improve the diversity and coverage of discovered prompts, leading to more generalizable safety improvements.
The framework also shows strong reusability. A red team LLM trained on one T2I model can be efficiently adapted to other similar systems with minimal additional training, reducing computational overhead. This indicates that DREAM learns a holistic understanding of the distribution of unsafe prompts, making its knowledge transferable.
While DREAM represents a significant step forward in AI safety, the researchers acknowledge limitations, such as potential biases inherited from auxiliary models and the moderate computational cost of training, which is amortized over large-scale prompt generation. However, the framework provides a valuable foundation for future improvements in systematic safety evaluation for T2I systems. For more technical details, you can refer to the full research paper.


