TLDR: The MM-HSD paper introduces a multi-modal model for detecting hate speech in videos. It integrates video frames, audio, speech transcripts, and on-screen text, and uniquely employs Cross-Modal Attention (CMA) as an early feature extractor. The model outperforms existing methods on the HateMM dataset, with the best results obtained when on-screen text serves as the query that attends over the other modalities. The work underlines the importance of comprehensive multi-modal analysis and of modeling inter-modal dependencies for robust hate speech detection in video content.
Hate speech has become a pervasive issue across online platforms, and with the rise of video-centric social media, detecting it in videos presents a unique and complex challenge. While text-based hate speech detection (HSD) has been extensively researched, multi-modal approaches, especially for videos, have remained limited. Often, existing methods fail to fully capture the intricate relationships between different modalities like visuals, audio, and text, or they overlook crucial elements such as on-screen text.
A new research paper, “MM-HSD: Multi-Modal Hate Speech Detection in Videos”, introduces a novel model designed to tackle this very problem. Authored by Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, and Andrea Cavallaro from EPFL and Idiap Research Institute, this paper presents MM-HSD, a multi-modal model that integrates various data sources to identify hate speech in video content more effectively.
The Challenge of Video Hate Speech
Detecting hate speech in videos is particularly difficult because hateful content can be subtly embedded across multiple channels. It might be in spoken words, visual cues, background audio, or even text displayed directly on the screen. Traditional methods often focus on just one or two of these, missing the full picture. For instance, a video might appear benign visually, but a hateful message could be conveyed through a voiceover or on-screen text. Simple fusion methods, which just combine information from different sources, often fall short because they don’t account for how these different modalities interact and depend on each other.
Introducing MM-HSD: A Comprehensive Multi-Modal Approach
The MM-HSD model addresses these limitations by integrating four key modalities: video frames, audio, text derived from speech transcripts, and text extracted from the frames themselves (on-screen text). What sets MM-HSD apart is its innovative use of Cross-Modal Attention (CMA) as an early feature extractor. CMA is a powerful mechanism that allows the model to focus on specific aspects of one modality that are most relevant to another, helping to uncover hidden hateful cues that might be missed otherwise.
The researchers are the first to apply CMA in this manner for video HSD, systematically comparing different configurations to find the most effective way for the modalities to interact. Their findings indicate that performance improves substantially when on-screen text acts as the 'query' that guides the attention, while the other modalities (transcript, audio, and video) serve as the 'keys' that provide contextual information.
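To make the query/key roles concrete, here is a minimal PyTorch sketch of this kind of cross-modal attention, with OCR-text embeddings as the query and another modality supplying the keys and values. The dimensions, the use of `nn.MultiheadAttention`, and reusing the same tensor for keys and values are illustrative assumptions, not the authors' implementation.

```python
# Minimal cross-modal attention sketch (PyTorch). Shapes, dimensions, and the
# use of nn.MultiheadAttention are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """On-screen text attends over another modality (transcript, audio, or video)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, ocr_text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # ocr_text: (batch, L_q, d_model) acts as the query;
        # the other modality supplies keys and values: (batch, L_kv, d_model)
        out, _ = self.attn(query=ocr_text, key=other, value=other)
        return out

# Example: OCR-text tokens attending over video-frame embeddings
cma = CrossModalAttention()
ocr_tokens = torch.randn(2, 12, 256)    # 12 on-screen-text token embeddings
frame_embs = torch.randn(2, 32, 256)    # 32 frame embeddings
fused = cma(ocr_tokens, frame_embs)     # (2, 12, 256)
```

In MM-HSD this pattern is applied with the OCR-text query attending over each of the other modalities, and the resulting features feed the fusion stage described below.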
How MM-HSD Works
The model’s architecture involves several steps. First, raw embeddings (numerical representations) are extracted for each modality using specialized pre-trained models, as listed below (a hedged extraction sketch follows the list):
- **Video frames**: Processed by a Vision Transformer (ViT) to capture visual context.
- **Audio**: Acoustic features are extracted using wav2vec2, and speech is transcribed using OpenAI’s Whisper model.
- **Speech Transcripts**: The transcribed text is then encoded using Detoxify, a model specifically trained for hate speech detection.
- **On-screen Text**: Text appearing in video frames is extracted using PaddleOCR and then encoded with Detoxify.
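For readers who want to experiment, here is a hedged sketch of this extraction step using off-the-shelf tooling. The specific checkpoints (`google/vit-base-patch16-224`, `facebook/wav2vec2-base-960h`, Whisper `base`, and `unitary/toxic-bert`, the checkpoint behind Detoxify's "original" model) and the pooling choices are assumptions for illustration; the paper may use different variants.

```python
# Hedged sketch of per-modality embedding extraction with pre-trained models.
# Checkpoint names and pooling choices are illustrative assumptions.
import torch
import whisper                                   # openai-whisper
from paddleocr import PaddleOCR
from transformers import (AutoModel, AutoTokenizer,
                          ViTImageProcessor, ViTModel,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Video frames -> ViT embeddings
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

def embed_frames(frames):                        # frames: list of PIL images
    inputs = vit_proc(images=frames, return_tensors="pt")
    return vit(**inputs).last_hidden_state[:, 0]  # CLS embedding per frame

# Audio -> wav2vec2 acoustic features
w2v_proc = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def embed_audio(waveform, sr=16_000):            # waveform: 1-D float array
    inputs = w2v_proc(waveform, sampling_rate=sr, return_tensors="pt")
    return w2v(**inputs).last_hidden_state.mean(dim=1)  # mean-pooled over time

# Audio -> transcript via Whisper
asr = whisper.load_model("base")
def transcribe(audio_path):
    return asr.transcribe(audio_path)["text"]

# Frames -> on-screen text via PaddleOCR
ocr_engine = PaddleOCR(lang="en")
def read_on_screen_text(frame_path):
    # Classic PaddleOCR result format: one list per image of [box, (text, score)];
    # newer versions change this structure, so treat the parsing as illustrative.
    result = ocr_engine.ocr(frame_path)
    if not result or result[0] is None:
        return ""
    return " ".join(text for _box, (text, _score) in result[0])

# Text (transcript or OCR output) -> Detoxify-style embeddings
tok = AutoTokenizer.from_pretrained("unitary/toxic-bert")
txt_enc = AutoModel.from_pretrained("unitary/toxic-bert")
def embed_text(text):
    inputs = tok(text, return_tensors="pt", truncation=True)
    return txt_enc(**inputs).last_hidden_state[:, 0]  # CLS embedding
```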
These raw embeddings are then fed into the CMA block. In the MM-HSD setup, the output of this CMA block is concatenated with the outputs of individual modality encoders (which further process each modality separately) before a final classification layer determines if the video contains hate speech. This combined approach leverages both the deep cross-modal interactions from CMA and the specialized representations from each individual modality.
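A minimal sketch of this fusion-and-classification head is shown below. The per-modality MLP encoders, the mean pooling of the CMA output, and the single linear classifier are simplifying assumptions rather than the paper's exact architecture.

```python
# Illustrative fusion head: concatenate the CMA output with the four
# per-modality encoder outputs, then classify. Dimensions and layer choices
# are assumptions for illustration.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, d: int = 256, n_modalities: int = 4):
        super().__init__()
        # One small encoder per modality (video, audio, transcript, OCR text)
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(n_modalities)]
        )
        # The CMA output is treated as one extra feature stream of size d
        self.classifier = nn.Linear(d * (n_modalities + 1), 1)

    def forward(self, modality_feats: list[torch.Tensor], cma_out: torch.Tensor):
        # modality_feats: list of (batch, d) pooled per-modality embeddings
        # cma_out: (batch, L, d) cross-modal attention output, mean-pooled here
        encoded = [enc(x) for enc, x in zip(self.encoders, modality_feats)]
        fused = torch.cat(encoded + [cma_out.mean(dim=1)], dim=-1)
        return self.classifier(fused)            # hate / non-hate logit

# Example forward pass with random features
head = FusionClassifier()
feats = [torch.randn(2, 256) for _ in range(4)]
logit = head(feats, torch.randn(2, 12, 256))     # shape (2, 1)
```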
Impressive Results on the HateMM Dataset
Experiments conducted on the HateMM dataset, a publicly available collection of labeled videos from the BitChute platform, demonstrate MM-HSD’s superior performance. The model achieved an M-F1 score of 0.874, outperforming state-of-the-art methods. The ablations also showed that all four modalities contribute uniquely to the detection process, with a noticeable drop in performance when any single modality is removed. Including the CMA output as an additional feature stream proved crucial, significantly boosting performance compared to configurations without it.
Looking Ahead
The MM-HSD model represents a significant step forward in multi-modal hate speech detection in videos. By meticulously integrating diverse modalities and leveraging the power of Cross-Modal Attention, it offers a more robust and accurate solution to a pressing societal problem. Future work may explore converting OCR to speech to further enhance classification or using temporal CMA for frame-level localization, which could improve the explainability of the model by pinpointing exactly which video segments contribute to a hate speech classification. The code for MM-HSD is openly available, encouraging further research and development in this critical area.


