TLDR: This research introduces a two-part strategy to improve multimodal hate detection in memes. First, it optimizes how AI models are prompted, showing that structured prompts and fine-grained labels enhance performance. Second, it creates a new dataset of “neutral” memes by rewriting hateful captions while keeping the original benign images, which helps models learn not to misread non-hateful visuals as hateful. The study demonstrates that both prompt design and data quality are crucial for building more robust and fair hate detection systems.
The internet is overflowing with multimodal content, especially memes, which often convey harmful messages through a subtle interplay of text and images. Detecting hateful memes is a significant challenge because harmful intent can be hidden under the guise of humor or satire. While advanced Vision-Language Models (VLMs) show promise, they often struggle with nuanced hate speech, and existing datasets rarely provide the fine-grained supervision needed to train them effectively.
A Two-Pronged Approach to Better Detection
Researchers from the National University of Singapore have introduced a novel dual-pronged strategy to enhance multimodal hate detection. Their work focuses on two key areas: optimizing how AI models are prompted and creating a new method for multimodal data augmentation. This research aims to build more robust and fair vision-language models for content moderation.
Optimizing Prompts for Smarter AI
The first part of their approach involves a prompt optimization framework. This framework systematically varies the structure of prompts, the granularity of supervision, and the training modality. They found that the way a prompt is designed and how labels are scaled significantly influence a model’s performance. For instance, using structured prompts, which provide more detailed instructions, improved the robustness of even smaller models. The InternVL2 model achieved the best F1-scores across different settings, demonstrating the power of well-designed prompts.
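As a rough illustration, this sweep can be pictured as a grid over those three dimensions. The dimension names and values in the sketch below are assumptions for illustration, not the authors' exact experimental grid:

```python
from itertools import product

# Illustrative dimensions of a prompt optimization sweep
# (names and values are assumptions, not the paper's exact settings).
PROMPT_STRUCTURES = ["simple", "category"]       # how the task is framed
LABEL_GRANULARITIES = ["binary", "scale_0_9"]    # granularity of supervision
MODALITIES = ["text_only", "image_only", "multimodal"]

def run_experiment(structure: str, granularity: str, modality: str) -> float:
    """Placeholder: fine-tune and evaluate a model under one configuration, return F1."""
    ...

results = {}
for structure, granularity, modality in product(
    PROMPT_STRUCTURES, LABEL_GRANULARITIES, MODALITIES
):
    results[(structure, granularity, modality)] = run_experiment(
        structure, granularity, modality
    )
```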
The study explored different prompt strategies, such as “simple” prompts that ask direct questions about hatefulness, and “category” prompts that define specific subtypes of hate (like misogyny or xenophobia). They also experimented with different label formats: binary (true/false) and scale-based (a score from 0 to 9 indicating hatefulness). To generate these nuanced scale-based labels, they used a “teacher model” (GPT-4o-mini) on a subset of their training data, enriching the dataset with more detailed supervision signals.
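For concreteness, the two prompt styles and the teacher-model scale labeling might look roughly like the sketch below. The prompt wording and category list are assumptions, and the OpenAI call is simply a generic way to query GPT-4o-mini, not the paper's actual implementation:

```python
from openai import OpenAI

# Illustrative prompt templates (the wording is an assumption, not the paper's exact prompts).
SIMPLE_PROMPT = "Is this meme hateful? Answer 'true' or 'false'."
CATEGORY_PROMPT = (
    "Consider hate subtypes such as misogyny, xenophobia, racism, and religious hate. "
    "Does this meme express any of them? Answer 'true' or 'false'."
)
SCALE_PROMPT = (
    "Rate how hateful this meme is on a scale from 0 (not hateful) to 9 (extremely hateful). "
    "Reply with a single digit."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def teacher_scale_label(caption: str) -> int:
    """Ask the teacher model (GPT-4o-mini) for a 0-9 hatefulness score of a caption.
    Assumes the model replies with a single digit."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{SCALE_PROMPT}\n\nCaption: {caption}"}],
    )
    return int(response.choices[0].message.content.strip())
```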
Generating Neutral Memes to Reduce Bias
The second, equally important, aspect of their work is a multimodal data augmentation pipeline. This innovative pipeline generates 2,479 “counterfactually neutral memes.” The idea is to take a hateful meme where the image itself is not hateful but the caption is, and then rewrite the hateful caption to be neutral while keeping the original, benign image. This process helps to reduce “spurious correlations,” meaning the model learns not to associate a non-hateful image with a hateful label just because it appeared with a hateful caption in the original dataset.
This pipeline uses a sophisticated multi-agent setup involving both Large Language Models (LLMs) and Vision-Language Models (VLMs). First, it identifies which part of a meme (image, text, or both) is responsible for the hatefulness. If the hate is primarily in the text, a VLM generates a background description of the image, excluding any overlaid text. Then, a generative model (GPT-4o-mini) rewrites the hateful caption into a neutral one, ensuring it remains relevant to the image. Finally, another model (Gemini 2.0 Flash) regenerates the meme by overlaying the new neutral caption onto the original image. This creates a new, non-hateful version of the meme that helps train models to generalize better and avoid biases.
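A simplified sketch of how such a multi-agent pipeline could be orchestrated is shown below. The helper functions (hate-source classification, background description, caption rewriting, meme regeneration) are hypothetical stand-ins for the paper's agents, not their actual code:

```python
from dataclasses import dataclass

@dataclass
class Meme:
    image_path: str
    caption: str
    label: str  # "hateful" or "not_hateful"

def hate_source(meme: Meme) -> str:
    """Hypothetical agent: decide whether the hate comes from 'image', 'text', or 'both'."""
    ...

def describe_background(image_path: str) -> str:
    """Hypothetical VLM agent: describe the image background, ignoring any overlaid text."""
    ...

def rewrite_caption_neutral(caption: str, background: str) -> str:
    """Hypothetical LLM agent (GPT-4o-mini in the paper): rewrite the caption to be
    neutral while staying relevant to the described background."""
    ...

def overlay_caption(image_path: str, caption: str) -> str:
    """Hypothetical regeneration step (Gemini 2.0 Flash in the paper): place the new
    caption on the original image and return the path of the regenerated meme."""
    ...

def neutralize(meme: Meme) -> Meme | None:
    """Produce a counterfactually neutral meme when the hate lies in the text."""
    if meme.label != "hateful" or hate_source(meme) != "text":
        return None  # skip memes whose image itself carries the hate
    background = describe_background(meme.image_path)
    neutral_caption = rewrite_caption_neutral(meme.caption, background)
    new_image = overlay_caption(meme.image_path, neutral_caption)
    return Meme(image_path=new_image, caption=neutral_caption, label="not_hateful")
```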
Key Findings and Impact
The researchers conducted extensive experiments using the Facebook Hateful Memes dataset. They found that both prompt optimization and multimodal augmentation significantly improved classification performance, particularly in F1-scores, which are crucial for imbalanced classification tasks like hate detection. The augmented dataset led to noticeable improvements across various unimodal (text-only, vision-only) and multimodal models, including BERT, RoBERTa, and CLIP. This indicates that exposing models to visually and lexically similar non-hateful memes enhances their ability to generalize and become more robust.
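To see why F1 matters more than raw accuracy on imbalanced data, consider a toy example (the numbers are invented for illustration): a classifier that always predicts “not hateful” scores high accuracy but an F1 of zero on the hateful class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: 1 = hateful, 0 = not hateful (illustrative only).
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred_majority = [0] * 10  # always predicts "not hateful"

print(accuracy_score(y_true, y_pred_majority))          # 0.9 -- looks good
print(f1_score(y_true, y_pred_majority, pos_label=1))   # 0.0 -- reveals the failure
```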
Human evaluations confirmed the quality of the augmented data. An 89% agreement rate was observed for the newly scaled labels, validating their reliability. The counterfactually neutral memes were also rated highly for formatting, background alignment, caption alignment, and overall quality, despite occasional minor errors such as missing captions or semantic drift. This demonstrates the potential of large multimodal models when used strategically in multi-agent systems for generating high-quality, bias-reducing training data.
Looking Ahead
This research offers a comprehensive framework that combines advanced prompt design with systematic data augmentation to improve hateful content detection. It highlights that factors like how a task is framed and the composition of training data are as critical as the size of the model itself. Future work could explore computationally less expensive alternatives for multimodal data augmentation and address the limitations of focusing primarily on text-centric hate. The full research paper can be accessed here: Labels or Input? Rethinking Augmentation in Multimodal Hate Detection.


