TLDR: The MM-HSD paper introduces a multi-modal model for detecting hate speech in videos. It integrates video frames, audio, speech transcripts, and on-screen text, and uniquely employs Cross-Modal Attention (CMA) as an early feature extractor. The model outperforms existing methods on the HateMM dataset, with the best results obtained when on-screen text serves as the query that attends over the other modalities. The work underlines the importance of comprehensive multi-modal analysis and of modeling inter-modal dependencies for robust hate speech detection in video content.
Hate speech has become a pervasive issue across online platforms, and with the rise of video-centric social media, detecting it in videos presents a unique and complex challenge. While text-based hate speech detection (HSD) has been extensively researched, multi-modal approaches, especially for videos, have remained limited. Often, existing methods fail to fully capture the intricate relationships between different modalities like visuals, audio, and text, or they overlook crucial elements such as on-screen text.
A new research paper, “MM-HSD: Multi-Modal Hate Speech Detection in Videos”, introduces a novel model designed to tackle this very problem. Authored by Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, and Andrea Cavallaro from EPFL and Idiap Research Institute, this paper presents MM-HSD, a multi-modal model that integrates various data sources to identify hate speech in video content more effectively.
The Challenge of Video Hate Speech
Detecting hate speech in videos is particularly difficult because hateful content can be subtly embedded across multiple channels. It might be in spoken words, visual cues, background audio, or even text displayed directly on the screen. Traditional methods often focus on just one or two of these, missing the full picture. For instance, a video might appear benign visually, but a hateful message could be conveyed through a voiceover or on-screen text. Simple fusion methods, which just combine information from different sources, often fall short because they don’t account for how these different modalities interact and depend on each other.
Introducing MM-HSD: A Comprehensive Multi-Modal Approach
The MM-HSD model addresses these limitations by integrating four key modalities: video frames, audio, text derived from speech transcripts, and text extracted from the frames themselves (on-screen text). What sets MM-HSD apart is its innovative use of Cross-Modal Attention (CMA) as an early feature extractor. CMA is a powerful mechanism that allows the model to focus on specific aspects of one modality that are most relevant to another, helping to uncover hidden hateful cues that might be missed otherwise.
The researchers are the first to apply CMA in this manner for video HSD, systematically comparing different configurations to find the most effective way for the modalities to interact. Their findings indicate that performance improves substantially when on-screen text acts as the 'query' that guides the attention, while the other modalities (transcript, audio, and video) serve as the 'keys' that provide contextual information.
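To make the query/key roles concrete, here is a minimal PyTorch sketch of this kind of cross-modal attention, with OCR-text embeddings as the query and another modality supplying the keys and values. The dimensions, the use of `nn.MultiheadAttention`, and reusing the same tensor for keys and values are illustrative assumptions, not the authors' implementation.

```python
# Minimal cross-modal attention sketch (PyTorch). Shapes, dimensions, and the
# use of nn.MultiheadAttention are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """On-screen text attends over another modality (transcript, audio, or video)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, ocr_text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # ocr_text: (batch, L_q, d_model) acts as the query;
        # the other modality supplies keys and values: (batch, L_kv, d_model)
        out, _ = self.attn(query=ocr_text, key=other, value=other)
        return out

# Example: OCR-text tokens attending over video-frame embeddings
cma = CrossModalAttention()
ocr_tokens = torch.randn(2, 12, 256)    # 12 on-screen-text token embeddings
frame_embs = torch.randn(2, 32, 256)    # 32 frame embeddings
fused = cma(ocr_tokens, frame_embs)     # (2, 12, 256)
```

In MM-HSD this pattern is applied with the OCR-text query attending over each of the other modalities, and the resulting features feed the fusion stage described below.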
How MM-HSD Works
The model’s architecture involves several steps. First, raw embeddings (numerical representations) are extracted for each modality using specialized pre-trained models, as listed below (a hedged extraction sketch follows the list):
- **Video frames**: Processed by a Vision Transformer (ViT) to capture visual context.
- **Audio**: Acoustic features are extracted using wav2vec2, and speech is transcribed using OpenAI’s Whisper model.
- **Speech Transcripts**: The transcribed text is then encoded using Detoxify, a model specifically trained for hate speech detection.
- **On-screen Text**: Text appearing in video frames is extracted using PaddleOCR and then encoded with Detoxify.
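For readers who want to experiment, here is a hedged sketch of this extraction step using off-the-shelf tooling. The specific checkpoints (`google/vit-base-patch16-224`, `facebook/wav2vec2-base-960h`, Whisper `base`, and `unitary/toxic-bert`, the checkpoint behind Detoxify's "original" model) and the pooling choices are assumptions for illustration; the paper may use different variants.

```python
# Hedged sketch of per-modality embedding extraction with pre-trained models.
# Checkpoint names and pooling choices are illustrative assumptions.
import torch
import whisper                                   # openai-whisper
from paddleocr import PaddleOCR
from transformers import (AutoModel, AutoTokenizer,
                          ViTImageProcessor, ViTModel,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Video frames -> ViT embeddings
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

def embed_frames(frames):                        # frames: list of PIL images
    inputs = vit_proc(images=frames, return_tensors="pt")
    return vit(**inputs).last_hidden_state[:, 0]  # CLS embedding per frame

# Audio -> wav2vec2 acoustic features
w2v_proc = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def embed_audio(waveform, sr=16_000):            # waveform: 1-D float array
    inputs = w2v_proc(waveform, sampling_rate=sr, return_tensors="pt")
    return w2v(**inputs).last_hidden_state.mean(dim=1)  # mean-pooled over time

# Audio -> transcript via Whisper
asr = whisper.load_model("base")
def transcribe(audio_path):
    return asr.transcribe(audio_path)["text"]

# Frames -> on-screen text via PaddleOCR
ocr_engine = PaddleOCR(lang="en")
def read_on_screen_text(frame_path):
    # Classic PaddleOCR result format: one list per image of [box, (text, score)];
    # newer versions change this structure, so treat the parsing as illustrative.
    result = ocr_engine.ocr(frame_path)
    if not result or result[0] is None:
        return ""
    return " ".join(text for _box, (text, _score) in result[0])

# Text (transcript or OCR output) -> Detoxify-style embeddings
tok = AutoTokenizer.from_pretrained("unitary/toxic-bert")
txt_enc = AutoModel.from_pretrained("unitary/toxic-bert")
def embed_text(text):
    inputs = tok(text, return_tensors="pt", truncation=True)
    return txt_enc(**inputs).last_hidden_state[:, 0]  # CLS embedding
```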
These raw embeddings are then fed into the CMA block. In the MM-HSD setup, the output of this CMA block is concatenated with the outputs of individual modality encoders (which further process each modality separately) before a final classification layer determines if the video contains hate speech. This combined approach leverages both the deep cross-modal interactions from CMA and the specialized representations from each individual modality.
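A minimal sketch of this fusion-and-classification head is shown below. The per-modality MLP encoders, the mean pooling of the CMA output, and the single linear classifier are simplifying assumptions rather than the paper's exact architecture.

```python
# Illustrative fusion head: concatenate the CMA output with the four
# per-modality encoder outputs, then classify. Dimensions and layer choices
# are assumptions for illustration.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, d: int = 256, n_modalities: int = 4):
        super().__init__()
        # One small encoder per modality (video, audio, transcript, OCR text)
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(n_modalities)]
        )
        # The CMA output is treated as one extra feature stream of size d
        self.classifier = nn.Linear(d * (n_modalities + 1), 1)

    def forward(self, modality_feats: list[torch.Tensor], cma_out: torch.Tensor):
        # modality_feats: list of (batch, d) pooled per-modality embeddings
        # cma_out: (batch, L, d) cross-modal attention output, mean-pooled here
        encoded = [enc(x) for enc, x in zip(self.encoders, modality_feats)]
        fused = torch.cat(encoded + [cma_out.mean(dim=1)], dim=-1)
        return self.classifier(fused)            # hate / non-hate logit

# Example forward pass with random features
head = FusionClassifier()
feats = [torch.randn(2, 256) for _ in range(4)]
logit = head(feats, torch.randn(2, 12, 256))     # shape (2, 1)
```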
Impressive Results on the HateMM Dataset
Experiments conducted on the HateMM dataset, a publicly available collection of labeled videos from the BitChute platform, demonstrate MM-HSD’s superior performance. The model achieved an M-F1 score of 0.874, outperforming state-of-the-art methods. The ablations also showed that all four modalities contribute uniquely to the detection process, with a noticeable drop in performance when any single modality is removed. Including the CMA output as an additional feature stream proved crucial, significantly boosting performance compared to configurations without it.
Looking Ahead
The MM-HSD model represents a significant step forward in multi-modal hate speech detection in videos. By meticulously integrating diverse modalities and leveraging the power of Cross-Modal Attention, it offers a more robust and accurate solution to a pressing societal problem. Future work may explore converting OCR to speech to further enhance classification or using temporal CMA for frame-level localization, which could improve the explainability of the model by pinpointing exactly which video segments contribute to a hate speech classification. The code for MM-HSD is openly available, encouraging further research and development in this critical area.


