TLDR: Researchers introduce M3HG, a novel model that uses a multimodal, multi-scale, and multi-type node heterogeneous graph to accurately extract (emotion utterance, cause utterance, emotion category) triplets from conversations. They also release MECAD, the first multi-scenario multimodal dataset for this task, addressing the scarcity of diverse data. M3HG significantly outperforms existing methods by explicitly modeling emotional and causal contexts and fusing semantic information at both inter- and intra-utterance levels, proving robust across conversations of varying length and complexity.
Researchers from Tongji University have introduced a groundbreaking approach to understanding emotions and their origins in conversations, particularly in the complex world of social media. Their new work addresses a critical challenge known as Multimodal Emotion Cause Triplet Extraction in Conversations (MECTEC), which involves simultaneously identifying emotion utterances, their cause utterances, and the specific emotion categories from conversations that include text, audio, and video.
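To make the task concrete, here is a minimal illustration of what a single extracted triplet might look like. The class and field names below are hypothetical, chosen for readability rather than taken from the paper's schema.

```python
# Illustrative sketch of an emotion-cause triplet as targeted by the MECTEC task.
# Field names are hypothetical, not the paper's exact annotation format.
from dataclasses import dataclass

@dataclass
class EmotionCauseTriplet:
    emotion_utterance_id: int   # index of the utterance expressing the emotion
    cause_utterance_id: int     # index of the utterance containing its cause
    emotion_category: str       # e.g. "joy", "anger", "sadness"

# Example: utterance 4 expresses anger, caused by what was said in utterance 2.
triplet = EmotionCauseTriplet(emotion_utterance_id=4,
                              cause_utterance_id=2,
                              emotion_category="anger")
```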
The field of MECTEC has been hampered by a significant lack of diverse datasets. Previously, only one dataset existed, the ECF dataset, which was limited to conversations from a single TV series, ‘Friends’. This narrow scope made it difficult for models to generalize to the wide variety of real-world dialogue scenarios. To overcome this, the team developed MECAD, the first multimodal and multi-scenario MECTEC dataset. MECAD features 989 conversations extracted from 56 different TV series, offering a much richer and more varied collection of dialogue contexts. This new dataset is expected to significantly accelerate model development in this area.
Beyond the dataset, existing MECTEC methods also struggled with several key issues. They often failed to explicitly model the specific contexts related to emotions and their causes. Furthermore, they did not effectively combine semantic information from different levels of a conversation, both within a single utterance (intra-utterance) and across multiple utterances (inter-utterance). This degraded performance, especially when a cause appears later in the conversation than the emotion itself.
To tackle these deficiencies, the researchers propose a novel model called M3HG, which stands for Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph. M3HG is designed to explicitly capture emotional and causal contexts. It achieves this by effectively fusing contextual information at both inter- and intra-utterance levels through a sophisticated multimodal heterogeneous graph structure. This graph includes different types of nodes, such as emotional context nodes, causal context nodes, utterance ‘Super-Nodes’ (which combine text, audio, and video features for each utterance), and a conversation ‘Super-Node’ that captures global information. These nodes are connected by various ‘Super-Edges’ that represent relationships like same speaker, different speaker, global connections, and specific emotion or cause connections.
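To make the graph layout easier to picture, the following is a minimal sketch of how such node and edge types could be wired together, using networkx purely for illustration. The node labels, edge types, and function name are assumptions for exposition, not the authors' released implementation.

```python
# A minimal sketch (not the authors' implementation) of the heterogeneous graph's
# node and edge types, using networkx only to illustrate the structure.
import networkx as nx

def build_m3hg_style_graph(utterances, speakers):
    """utterances: list of dicts with 'text', 'audio', 'video' feature vectors;
    speakers: list of speaker ids, one per utterance. All names are illustrative."""
    g = nx.MultiDiGraph()

    # One super-node per utterance, bundling its three modality features,
    # plus dedicated emotional- and causal-context nodes attached to it.
    for i, utt in enumerate(utterances):
        g.add_node(("utt", i), ntype="utterance_super_node",
                   text=utt["text"], audio=utt["audio"], video=utt["video"])
        g.add_node(("emo_ctx", i), ntype="emotional_context")
        g.add_node(("cau_ctx", i), ntype="causal_context")
        g.add_edge(("utt", i), ("emo_ctx", i), etype="emotion")
        g.add_edge(("utt", i), ("cau_ctx", i), etype="cause")

    # A single conversation super-node linked to every utterance (global edges).
    g.add_node("conv", ntype="conversation_super_node")
    for i in range(len(utterances)):
        g.add_edge("conv", ("utt", i), etype="global")

    # Speaker-aware super-edges between pairs of utterances.
    for i in range(len(utterances)):
        for j in range(i + 1, len(utterances)):
            etype = "same_speaker" if speakers[i] == speakers[j] else "different_speaker"
            g.add_edge(("utt", i), ("utt", j), etype=etype)
    return g
```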
The M3HG model processes conversations in four main stages: unimodal feature extraction (using specialized tools for text, audio, and video), graph construction, multi-scale semantic fusion (integrating information within and between utterances), and finally, emotion-cause classification. This comprehensive approach allows M3HG to understand the intricate relationships between emotions and their causes, even when the cause appears after the emotion in a conversation.
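The four stages can be summarized as a structural outline like the one below. The function names, signatures, and stubbed bodies are illustrative assumptions, not the released code's API.

```python
# A high-level outline of the four stages described above; bodies are stubs.
def extract_unimodal_features(conversation):
    """Stage 1: encode each utterance's text, audio, and video separately
    with modality-specific feature extractors."""
    ...

def construct_graph(features, speakers):
    """Stage 2: build the multi-type heterogeneous graph (utterance super-nodes,
    context nodes, conversation super-node, and the various super-edges)."""
    ...

def multi_scale_fusion(graph):
    """Stage 3: fuse semantics within each utterance (intra) and across
    utterances (inter) by propagating information over the graph."""
    ...

def classify_emotion_cause(fused_representations):
    """Stage 4: predict (emotion utterance, cause utterance, emotion category)
    triplets from the fused representations."""
    ...

def m3hg_pipeline(conversation, speakers):
    features = extract_unimodal_features(conversation)
    graph = construct_graph(features, speakers)
    fused = multi_scale_fusion(graph)
    return classify_emotion_cause(fused)
```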
Extensive experiments conducted on both the ECF and the new MECAD datasets demonstrate the superior performance of M3HG. The model consistently outperformed existing state-of-the-art methods, showing significant improvements in accurately extracting emotion-cause triplets. This was particularly evident in challenging emotion categories and in longer conversations. The research highlights that M3HG's ability to integrate multimodal and multi-scale semantic information is crucial to its effectiveness.
While M3HG represents a significant leap forward, the authors acknowledge some limitations. Future work will explore integrating external knowledge and leveraging advanced semantic extraction capabilities of large language models to further enhance accuracy. Additionally, addressing challenges with excessively long conversations and potential error propagation in multimodal fusion are areas for continued improvement.
This research not only provides a powerful new tool for emotion cause analysis but also contributes a valuable new dataset, MECAD, which will foster further innovation in the field. The code and dataset are publicly available, encouraging broader research and development. For more details, you can read the full research paper here.


