TLDR: A new research paper introduces Prompt-Restricted Multi-modal Attack (PReMA), an attack that exploits a misalignment between text and image modalities in multi-modal diffusion models. Unlike previous methods that modify text prompts, PReMA manipulates the input image to generate inappropriate content, even with benign prompts, effectively bypassing existing safety checkers and posing a significant new threat to AI image generation security.
Multi-modal diffusion models, which are powerful AI systems capable of generating images from text, have shown remarkable progress. However, new research reveals a significant and previously underexplored security risk: a misalignment between the text and image information these models process. This flaw can be exploited to generate inappropriate or Not-Safe-For-Work (NSFW) content, even when given perfectly safe instructions.
The Hidden Vulnerability
Traditionally, security concerns in these models focused on adversarial prompts: manipulating the text input to trick the AI into creating harmful images. However, a paper titled “Security Risk of Misalignment between Text and Image in Multi-modal Model” by Xiaosen Wang, Zhijin Ge, and Shaokang Wang highlights that the alignment between text and image modalities in existing diffusion models is often inadequate. This means the model’s understanding of an image might not perfectly match its understanding of the accompanying text, creating a loophole for malicious manipulation.
Introducing PReMA: A Novel Attack
To demonstrate this vulnerability, the researchers propose a new attack called Prompt-Restricted Multi-modal Attack (PReMA). What makes PReMA unique is that it manipulates the generated content by modifying the *input image* itself, rather than altering the text prompt. This is a crucial distinction, as previous attacks primarily focused on crafting adversarial prompts. PReMA can create adversarial images that, when combined with any specified (even benign) prompt, lead the model to generate unintended and often inappropriate outputs.
The attack works by subtly altering pixels in the input image. These changes are often imperceptible to the human eye but are significant enough to mislead the diffusion model. The paper explains that this is possible because the image modality, despite being a key input, has been largely overlooked in previous attack and defense strategies. This oversight is particularly concerning for applications that use fixed prompts for image editing, where traditional prompt-based defenses would be ineffective.
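The article does not state which perturbation budget or optimizer PReMA uses. As rough intuition for how small, bounded pixel changes can still shift a model's output, here is a minimal Python sketch on a toy linear scorer. It is a generic sign-gradient illustration, not the paper's method; `EPSILON`, `STEP`, `score`, and `perturb` are all hypothetical names and values chosen for the example.

```python
# Toy illustration only: a linear "scorer" stands in for a model, and the
# image is a flat list of pixel intensities in [0, 1]. Nothing here is taken
# from the PReMA paper; EPSILON, STEP, and the scorer are hypothetical.

EPSILON = 8 / 255   # assumed per-pixel budget (L-infinity bound)
STEP = 2 / 255      # assumed step size per iteration


def score(pixels, weights):
    """Toy objective: a weighted sum of the pixels."""
    return sum(p * w for p, w in zip(pixels, weights))


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def perturb(pixels, weights, steps=10):
    """Raise the toy score while keeping each pixel within EPSILON of the original."""
    adv = list(pixels)
    for _ in range(steps):
        # For a linear scorer, the gradient w.r.t. each pixel is its weight.
        adv = [a + STEP * sign(w) for a, w in zip(adv, weights)]
        # Project back into the EPSILON-ball around the original image.
        adv = [min(max(a, p - EPSILON), p + EPSILON) for a, p in zip(adv, pixels)]
        # Keep every value inside the valid pixel range.
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv


original = [0.2, 0.5, 0.8, 0.4]
weights = [1.0, -2.0, 0.5, -0.1]
adversarial = perturb(original, weights)
max_change = max(abs(a - p) for a, p in zip(adversarial, original))
```

In this toy setting no pixel moves by more than `EPSILON`, yet the score still rises. A real diffusion model is far from linear, so the sketch conveys only the bounded-perturbation idea.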
How PReMA Bypasses Defenses
Current safety measures in diffusion models typically involve two main components: input safety checkers (which scan prompts for sensitive words) and output safety checkers (which evaluate generated images for NSFW content). PReMA effectively sidesteps input checkers because it uses benign prompts. For output checkers, PReMA incorporates an additional optimization step during the attack process, making the generated NSFW content harder for these checkers to detect. This significantly enhances its ability to bypass existing safeguards.
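The two-stage defense described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the checkers evaluated in the paper: the blocklist, the threshold, and the names `input_checker`, `output_checker`, and `pipeline` are hypothetical, and the generator and classifier are passed in as placeholder callables.

```python
# Hypothetical two-stage safety pipeline. The blocklist, function names, and
# threshold are illustrative assumptions, not the paper's actual checkers.

BLOCKED_TERMS = {"nsfw", "nude", "gore"}  # assumed example blocklist


def input_checker(prompt):
    """Return True when the prompt contains no blocked term."""
    words = {w.strip(".,!?").lower() for w in prompt.split()}
    return words.isdisjoint(BLOCKED_TERMS)


def output_checker(nsfw_score, threshold=0.5):
    """Return True when an (assumed) image classifier score is below threshold."""
    return nsfw_score < threshold


def pipeline(prompt, input_image, generate, classify):
    """Run both checks around a generation call.

    Note that input_image is passed straight to the generator and is never
    inspected by either checker.
    """
    if not input_checker(prompt):
        return None  # rejected by the prompt filter
    image = generate(prompt, input_image)
    if not output_checker(classify(image)):
        return None  # rejected by the output filter
    return image
```

Because `input_checker` looks only at the prompt, a benign prompt passes no matter which image accompanies it. In this sketch the input image is never screened, which is the gap the article describes.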
Extensive Evaluation and Impact
The researchers conducted comprehensive evaluations across various tasks and models, including image inpainting (filling in parts of an image) and style transfer (changing the artistic style of an image). They tested PReMA on popular models like Stable Diffusion (SDv1.5, SDv2.0) and Kandinsky (KDv2.1, KDv2.2). The results consistently showed that PReMA could effectively induce NSFW content with high success rates, even with harmless prompts. The attack also demonstrated robustness across different prompts and a degree of transferability between certain models, meaning an adversarial image crafted for one model might also affect another.
This work highlights a significant security vulnerability in current multi-modal diffusion models. It underscores the importance of considering the image modality in safety alignment efforts, as focusing solely on text prompts leaves a critical gap. The findings suggest that developers need to re-evaluate and strengthen their defensive measures to account for image-based manipulations. For more technical details, see the full research paper, “Security Risk of Misalignment between Text and Image in Multi-modal Model.”
Future Challenges
While PReMA presents a potent new threat, the researchers acknowledge limitations, particularly its poor transferability across vastly different model architectures. Improving this transferability and developing black-box adversarial perturbation techniques are areas for future work. This research serves as a crucial warning, urging the AI community to address the complex interplay between text and image modalities to ensure the responsible development and deployment of diffusion models.


