TLDR: HateClipSeg is a new large-scale, multimodal dataset with 11,714 video segments annotated for fine-grained hate speech detection, including categories like Hateful, Insulting, Sexual, Violence, and Self-Harm, along with target victim labels. It addresses limitations of previous datasets by providing segment-level annotations and enabling three new tasks: trimmed video classification, temporal localization, and online classification. Benchmark results show current models struggle with the complexity and temporal aspects of hate speech, highlighting the need for more advanced detection systems.
Online hate speech remains a significant societal challenge, especially with the rise of multimodal content that combines text, visuals, and audio. This blend can make harmful messages more subtle or amplify their impact, making detection far more difficult. Current methods and datasets often fall short, providing only broad, video-level labels that capture neither the specific type of hate nor its exact location within a video.
Introducing HateClipSeg: A New Approach to Hate Video Detection
To address these critical limitations, researchers Han Wang, Zhuoran Wang, and Roy Ka-Wei Lee have introduced HateClipSeg, a groundbreaking large-scale multimodal dataset. This dataset offers fine-grained, segment-level annotations for hate video detection, aiming to bridge the gap between general video labels and the real-world need for precise, temporally localized identification of nuanced hate speech.
HateClipSeg comprises 11,714 segments, each meticulously labeled as either Normal or falling into one of five Offensive categories: Hateful, Insulting, Sexual, Violence, and Self-Harm. Crucially, it also includes explicit labels for target victim groups, providing a much deeper level of detail than previous datasets.
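To make the segment-level structure concrete, here is a minimal sketch of how such a record might be represented in Python; the field names and types are illustrative assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass, field
from typing import List

# The five Offensive categories described above.
OFFENSIVE_CATEGORIES = {"Hateful", "Insulting", "Sexual", "Violence", "Self-Harm"}

@dataclass
class SegmentAnnotation:
    """Hypothetical segment-level record; not HateClipSeg's actual schema."""
    video_id: str   # source video identifier (only IDs are released)
    start: float    # segment start time, in seconds
    end: float      # segment end time, in seconds
    label: str      # "Normal" or one of the five Offensive categories
    targets: List[str] = field(default_factory=list)  # target victim groups, if any

    def is_offensive(self) -> bool:
        return self.label in OFFENSIVE_CATEGORIES
```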
How HateClipSeg Was Built
The creation of HateClipSeg involved a rigorous three-stage annotation process: independent annotation, paired discussion, and re-annotation. This iterative approach significantly improved inter-annotator agreement, achieving a high Krippendorff's alpha of 0.817 for video-level offensive-versus-normal labels. This robust process helps ensure the quality and reliability of the dataset's labels across all annotation types.
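For readers unfamiliar with the metric, Krippendorff's alpha measures agreement among multiple annotators while correcting for chance, and it handles missing ratings gracefully; values above 0.8 are conventionally read as strong agreement. Here is a minimal sketch using the open-source krippendorff Python package (the toy ratings below are invented for illustration, not drawn from the paper):

```python
# pip install krippendorff numpy
import numpy as np
import krippendorff

# Rows are annotators, columns are videos; values are nominal codes
# (0 = normal, 1 = offensive). np.nan marks a missing rating.
ratings = np.array([
    [0, 1, 1, 0, 1, np.nan],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1, 1],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```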
The data collection began by compiling a lexicon of over 100 terms and phrases commonly associated with hate speech across categories like race, gender, religion, and sexuality. Using this lexicon, videos were sourced from YouTube and BitChute, a platform known for hosting extremist content. To manage annotation costs and increase the proportion of hateful content, a pre-trained model was used to filter out likely non-hateful videos. Videos were then automatically divided into semantically coherent segments, making fine-grained annotation possible.
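As a rough sketch of that collection funnel (lexicon search, classifier pre-filtering, automatic segmentation), the code below strings the stages together. Every callable passed in (search_videos, hate_score, segment_video) is a hypothetical stand-in; the authors' actual tooling is not specified in this summary:

```python
from typing import Callable, Iterable, List, Tuple

def collect_candidates(lexicon: Iterable[str],
                       search_videos: Callable[[str], List[str]]) -> List[str]:
    """Query platform search with every lexicon term, deduplicating results."""
    seen = set()
    candidates: List[str] = []
    for term in lexicon:
        for video_id in search_videos(term):
            if video_id not in seen:
                seen.add(video_id)
                candidates.append(video_id)
    return candidates

def prefilter(video_ids: List[str],
              hate_score: Callable[[str], float],
              threshold: float = 0.5) -> List[str]:
    """Drop videos a pre-trained classifier scores as likely non-hateful,
    raising the proportion of relevant content sent to annotators."""
    return [v for v in video_ids if hate_score(v) >= threshold]

def to_segments(video_id: str,
                segment_video: Callable[[str], List[Tuple[float, float]]]
                ) -> List[Tuple[str, float, float]]:
    """Split one video into semantically coherent (start, end) spans,
    which become the units that annotators label."""
    return [(video_id, start, end) for start, end in segment_video(video_id)]
```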
Benchmarking Real-World Challenges
HateClipSeg enables the benchmarking of models across three challenging tasks that reflect real-world content moderation scenarios:
- Trimmed Hateful Video Classification: This task involves predicting a single label for pre-segmented video clips, serving as a baseline for identifying offensive content within isolated segments.
- Temporal Hateful Video Localization: This task focuses on identifying offensive segments, along with their precise start and end timestamps, within untrimmed videos. This is crucial for pinpointing harmful content embedded in longer videos (a common scoring metric for this task is sketched after this list).
- Online Hateful Video Classification: This task simulates real-time content moderation by requiring models to predict labels for streaming video, relying only on past and current input without knowledge of future frames.
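For the localization task, the standard way to score a predicted segment against a ground-truth one in temporal localization work generally is temporal intersection-over-union (IoU). The paper's exact evaluation protocol isn't reproduced here, so treat this as a minimal, illustrative sketch:

```python
def temporal_iou(pred_start: float, pred_end: float,
                 gt_start: float, gt_end: float) -> float:
    """Intersection-over-union of two [start, end] time intervals, in seconds."""
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0

# Example: a predicted offensive segment overlapping a ground-truth one.
print(temporal_iou(12.0, 30.0, 15.0, 32.0))  # 0.75
```

A prediction typically counts as correct when its IoU with a same-label ground-truth segment exceeds a threshold (e.g., 0.5), which is part of what makes localization so much harder than trimmed classification: the model must get both the label and the boundaries right.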
The results from benchmarking state-of-the-art models on HateClipSeg highlight substantial gaps in current capabilities. While models showed moderate performance in trimmed video classification, their accuracy dropped sharply in temporal localization and remained limited in online classification. This underscores the inherent complexity of segment-level detection in multimodal streams and the need for more sophisticated, temporally aware, and multimodal approaches.
The Path Forward
HateClipSeg represents a significant step forward in multimodal hate speech detection research. By providing a comprehensive resource with fine-grained, segment-level annotations, it facilitates the development and evaluation of models capable of nuanced and precise hate speech identification. The dataset and accompanying benchmarks are publicly available, encouraging further research and innovation in this critical area. For more details, you can refer to the research paper: HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection.
The researchers also emphasize ethical considerations, noting that videos were sourced from publicly accessible platforms and only video IDs are shared to respect privacy. Annotators were warned about sensitive content and provided with psychological support, ensuring their well-being throughout the process.