TLDR: A new study introduces “Rainbow Noise,” a benchmark to stress-test harmful-meme detectors on LGBTQ+ content by combining various text and image corruptions. It finds that models like MemeCLIP and MemeBLIP2 are highly vulnerable to text perturbations. The research also proposes a Text Denoising Adapter (TDA) which significantly improves MemeBLIP2’s robustness, making it the most resilient model tested. The findings highlight the need for targeted architectural improvements to enhance multimodal safety models.
Online memes are a powerful force in shaping public conversation, but they can also be a vehicle for hate and harassment. This is particularly true for LGBTQ+ communities, who face disproportionately high levels of online abuse. A significant challenge in detecting these harmful memes is that attackers often subtly alter either the image, the caption, or both, making them difficult for automated systems to identify.
A recent research paper, Rainbow Noise: Stress-Testing Multimodal Harmful-Meme Detectors on LGBTQ Content, introduces the first comprehensive benchmark designed to evaluate how well harmful-meme detectors withstand these realistic text and image modifications. The study focuses on two leading lightweight multimodal detectors, MemeCLIP and MemeBLIP2, and also includes GPT-4.1 Vision as a reference for state-of-the-art general-purpose models.
The Rainbow Noise Benchmark
The researchers developed a robust testing framework by combining several types of noise. For images, three categories of perturbations were used:

- Universal Adversarial Perturbations (UAPs), crafted specifically to fool models;
- Common Corruptions (ImageNet-C), simulating real-world degradations like blur and noise;
- AugMix compositional noise, which creates complex, layered distortions.

For text, four families of perturbations were applied:

- natural and synthetic typos;
- HotFlip minimal edits (targeted character-level adversarial changes);
- universal adversarial triggers (short phrases designed to mislead);
- back-translation (paraphrasing by translating to another language and back).
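To make the text-noise families concrete, here is a minimal sketch of two of them in Python: synthetic typo injection and back-translation. This is illustrative only; `inject_typos` and the `translate()` helper are stand-ins invented for this sketch, not the paper's actual tooling.

```python
import random

def inject_typos(caption: str, rate: float = 0.1, seed: int = 0) -> str:
    """Synthetic typo noise: randomly swap adjacent alphabetic characters.

    A toy stand-in for the natural/synthetic typo family; the paper's
    exact generator is not specified here.
    """
    rng = random.Random(seed)
    chars = list(caption)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def back_translate(caption: str, translate) -> str:
    """Meaning-preserving paraphrase via a pivot language.

    `translate(text, src, tgt)` is a placeholder for any machine-translation
    backend (hosted API or local model); it is not part of the paper.
    """
    pivot = translate(caption, src="en", tgt="de")
    return translate(pivot, src="de", tgt="en")

print(inject_typos("love is love, no exceptions"))
```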
The models were tested on the PrideMM dataset, a collection of over 5,000 LGBTQ+-related memes, each annotated for hate speech, target group, stance, and humor. Crucially, no noisy data was used during training, so the evaluation measures genuine robustness to unseen corruptions.
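Because the models only ever see clean data in training, the robustness protocol reduces to comparing clean-input and perturbed-input metrics on the same held-out memes. A minimal sketch of that comparison, where `predict`, the dataset tuples, and `perturb_text` are all assumed stand-ins rather than the paper's code:

```python
from typing import Callable, Iterable, Tuple

def accuracy_drop(
    predict: Callable[[object, str], int],       # model: (image, caption) -> label
    dataset: Iterable[Tuple[object, str, int]],  # (image, caption, gold_label)
    perturb_text: Callable[[str], str],          # one of the text-noise families
) -> float:
    """Accuracy on clean captions minus accuracy on perturbed captions.

    Images are left untouched, i.e., a text-only (single-channel) stress test.
    """
    data = list(dataset)
    clean = sum(predict(img, cap) == y for img, cap, y in data) / len(data)
    noisy = sum(predict(img, perturb_text(cap)) == y for img, cap, y in data) / len(data)
    return clean - noisy
```

Swapping `perturb_text` for an image corruption (and leaving captions untouched) gives the complementary image-only ablation discussed below.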
Key Findings on Model Vulnerabilities
The study revealed several important insights into how these detectors perform under stress. When only image channels were perturbed, all models showed good resilience to Universal Adversarial Perturbations. However, Common Corruptions (ImageNet-C) proved to be the toughest for MemeCLIP and MemeBLIP2, causing noticeable accuracy drops. Interestingly, GPT-4.1 Vision showed remarkable stability, with its performance even slightly improving under image noise, suggesting its ability to focus on broader semantic features.
The text channel, however, proved to be a more significant source of vulnerability for the fine-tuned models. MemeCLIP was most susceptible to character-level adversarial swaps (HotFlip), while MemeBLIP2 was most vulnerable to meaning-preserving paraphrasing (back-translation), highlighting different sensitivities in their text processing. GPT-4.1 Vision, paradoxically, sometimes improved under HotFlip attacks, indicating its generative reasoning might be stabilized by certain types of input noise.
A crucial finding from single-channel ablations was that both MemeCLIP and MemeBLIP2 rely far more heavily on the caption than the image for their discriminative power. Corrupting the text significantly harmed performance across all metrics, whereas corrupting only the image had a negligible effect.
Introducing the Text Denoising Adapter (TDA)
Recognizing MemeBLIP2’s sensitivity to textual perturbations, the researchers introduced a lightweight module called the Text Denoising Adapter (TDA). Integrated after MemeBLIP2’s text projection layer, the TDA acts as an adaptive filter, learning to refine noisy text embeddings into more resilient representations. Its design allows it to apply corrections selectively: the denoising path is bypassed for clean captions and fully engaged for noisy ones. This adaptive, residual design preserves the original information while making targeted refinements.
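The paper describes the TDA only at a high level, so the following PyTorch sketch is one plausible reading of that description: a small bottleneck MLP proposes a correction, and a learned gate decides per example how much of it to apply on top of a residual connection. The embedding width, bottleneck size, and layer choices are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class TextDenoisingAdapter(nn.Module):
    """Gated residual adapter over text embeddings (a sketch, not the paper's code).

    A bottleneck MLP proposes a "denoised" correction, and a learned gate in
    [0, 1] decides how much of it to apply: gate ~ 0 leaves a clean caption's
    embedding untouched; gate ~ 1 applies the full correction.
    """

    def __init__(self, dim: int = 512, bottleneck: int = 128):
        super().__init__()
        self.denoise = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, dim),
        )
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Residual design: the original embedding is always preserved.
        correction = self.denoise(text_emb)
        g = self.gate(text_emb)  # shape (batch, 1), broadcast over features
        return text_emb + g * correction

# Usage: slotted in after the text projection layer.
adapter = TextDenoisingAdapter(dim=512)
emb = torch.randn(4, 512)      # projected caption embeddings
robust_emb = adapter(emb)
```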
Enhanced Robustness with TDA
When both text and image channels were corrupted simultaneously, the baseline MemeBLIP2 was the most fragile, while MemeCLIP showed more resilience. The addition of the Text Denoising Adapter, however, dramatically improved MemeBLIP2’s robustness: MemeBLIP2+TDA became the most robust model overall, surpassing even MemeCLIP, with significantly smaller average drops in accuracy and F1 score. Character-level errors remained a primary vulnerability for MemeBLIP2+TDA, but the TDA substantially hardened the model against simultaneous noise.
In conclusion, this research provides a critical benchmark for evaluating the robustness of multimodal harmful-meme detectors, particularly for LGBTQ+ content. It highlights that current models heavily depend on text and are vulnerable to specific types of textual noise. More importantly, it demonstrates that targeted, lightweight architectural interventions like the Text Denoising Adapter offer a powerful and effective path towards building stronger defenses against evolving online abuse tactics.