TLDR: The Mixture of Complementary Modality Experts (MoCME) framework improves Multi-modal Knowledge Graph Completion (MMKGC) by weighting each data source according to how much non-redundant information it contributes. It uses a Complementarity-guided Modality Knowledge Fusion (CMKF) module to combine intra-modal “views” and inter-modal information based on their unique contributions, and an Entropy-guided Negative Sampling (EGNS) mechanism to prioritize informative negative examples during training. This approach leads to more robust entity representations and achieves state-of-the-art performance on five benchmark datasets, especially those with rich and varied multimodal inputs.
Knowledge graphs, which model real-world information as interconnected entities and relations, are fundamental to many AI applications. However, these graphs are often incomplete, leading to the task of Knowledge Graph Completion (KGC) – predicting missing facts. Traditionally, KGC methods focus solely on the structural relationships within the graph. Yet, real-world entities often come with rich multimodal information, such as images, text descriptions, audio, and video. Incorporating this diverse data into what are called Multi-modal Knowledge Graphs (MMKGs) can significantly enhance our understanding of entities and improve completion accuracy.
Despite the promise of MMKGs, a significant challenge arises from the uneven distribution of modalities. Some entities might have images but no text, or vice versa, leading to an imbalance that makes it difficult to effectively use all available data. Existing MMKGC methods often rely on simple fusion techniques like attention mechanisms, which tend to overlook a crucial aspect: the “complementarity” between different modalities. Complementarity means that different modalities offer unique, non-overlapping, yet semantically relevant information, allowing a model to compensate for missing or noisy data in one modality by leveraging another.
To address these limitations, researchers have introduced a novel framework called Mixture of Complementary Modality Experts (MoCME). This framework is designed to fully exploit the synergy and unique contributions across various data types, leading to more expressive and robust entity representations. MoCME is built upon two core components.
Complementarity-guided Modality Knowledge Fusion (CMKF)
The first key component is the Complementarity-guided Modality Knowledge Fusion (CMKF) module, which combines information at two levels: intra-modal and inter-modal. Within each individual modality (such as images or text), MoCME uses a set of specialized “expert” networks. Each expert is trained to capture a different semantic aspect, or “view,” of that modality. For example, one expert might focus on the appearance of an image, while another focuses on its context.
To combine these views within a single modality, the CMKF module uses an adaptive weighting strategy. It assesses how much unique information each view provides by measuring its “mutual information” with the other views. Low mutual information means a view shares little with the others, so views that offer more distinct, non-overlapping features are considered more complementary and are given higher importance in the fusion process.
This same principle is then extended to fuse information across different modalities. After creating a rich, unified representation for each individual modality, the CMKF module calculates the mutual information between these modality-specific representations. Modalities that provide more unique and non-redundant information are prioritized and weighted higher when forming the final, comprehensive multimodal representation of an entity. This hierarchical approach ensures that the model effectively handles situations where some modalities might be missing, incomplete, or noisy, by relying more on the informative ones.
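The weighting idea described above can be illustrated with a small sketch. This is not the paper's implementation: the function names, the temperature parameter, and the mutual information estimate are all assumptions made for illustration. In particular, mutual information is approximated here with the closed form for a bivariate Gaussian, computed from the correlation between two embedding vectors, whereas the paper may use a different (for example, learned) estimator. The same function could be applied first to the views within a modality and then to the modality-level representations.

```python
import math

def gaussian_mi(x, y, eps=1e-8):
    """Stand-in MI estimate (an assumption, not the paper's estimator).

    Treats paired coordinates of x and y as samples from a bivariate
    Gaussian, for which MI = -0.5 * ln(1 - rho^2).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    rho = cov / math.sqrt(vx * vy + eps)
    rho2 = min(rho * rho, 1.0 - 1e-6)  # clamp so the log stays finite
    return -0.5 * math.log(1.0 - rho2)

def complementarity_weights(embeddings, temperature=1.0):
    """Softmax over negative mean MI: less redundancy gives a larger weight.

    Expects at least two embeddings of equal length.
    """
    k = len(embeddings)
    scores = []
    for i in range(k):
        mis = [gaussian_mi(embeddings[i], embeddings[j])
               for j in range(k) if j != i]
        scores.append(-(sum(mis) / len(mis)) / temperature)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(embeddings, weights):
    """Weighted sum of the embeddings, coordinate by coordinate."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]

# Hypothetical toy views: the first two are nearly redundant, the third is not.
views = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.1], [4.0, 1.0, 3.0, 2.0]]
weights = complementarity_weights(views)
fused = fuse(views, weights)
```

In this toy example the third view receives the largest weight, because it is the least correlated with the other two. A lower temperature would sharpen that preference.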
Entropy-guided Negative Sampling (EGNS)
The second crucial component of MoCME is the Entropy-guided Negative Sampling (EGNS) mechanism. In KGC, models learn by distinguishing between true facts (positive samples) and false facts (negative samples). However, not all false facts are equally useful for training. Some are too obviously false (easy negatives), while others might be very similar to true facts and thus harder to distinguish (hard negatives). Traditional methods often treat all negative samples equally, which can lead to inefficient training or overfitting.
The EGNS mechanism addresses this by dynamically prioritizing negative samples that are more “informative” and “uncertain.” It does this by calculating the “entropy” of each negative sample, which serves as a measure of its difficulty. Samples with high entropy are those where the model is uncertain about whether they are true or false, meaning they are close to the decision boundary and thus more challenging and valuable for learning. Based on their entropy, negative samples are categorized into easy, ambiguous, or hard. The model then assigns different weights to these categories in its training process, giving more importance to the harder, more informative samples. This strategy helps the model focus on challenging cases, improving its ability to discriminate between true and false relationships and enhancing its overall robustness and generalization.
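A minimal sketch of entropy-based weighting follows, under several assumptions that do not come from the paper: the model's score for a negative triple is turned into a probability with a sigmoid, difficulty is measured with binary entropy, and the thresholds and category weights are hypothetical values chosen only to make the example concrete.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a true/false prediction with probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def categorize(entropy, low=0.3, high=0.6):
    """Bucket a negative sample by entropy; thresholds are hypothetical."""
    if entropy < low:
        return "easy"
    if entropy < high:
        return "ambiguous"
    return "hard"

# Hypothetical per-category weights: harder samples count more.
CATEGORY_WEIGHTS = {"easy": 0.5, "ambiguous": 1.0, "hard": 2.0}

def weighted_negative_loss(neg_scores):
    """Weighted average of -log(1 - p) over the negative samples."""
    total, norm = 0.0, 0.0
    for score in neg_scores:
        p = sigmoid(score)  # model's belief that the false triple is true
        weight = CATEGORY_WEIGHTS[categorize(binary_entropy(p))]
        total += weight * -math.log(max(1.0 - p, 1e-12))
        norm += weight
    return total / norm

# Hypothetical scores for three negative triples.
loss = weighted_negative_loss([-6.0, -1.5, 0.0])
```

With these settings, a score of -6.0 falls in the easy bucket, -1.5 in the ambiguous bucket, and 0.0 (maximum uncertainty) in the hard bucket. One limitation of this simplified version is that a negative the model confidently scores as true also has low entropy, so it would be bucketed as easy even though it is a mistake.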
The MoCME framework has demonstrated state-of-the-art performance across five widely used benchmark datasets: MKG-W, MKG-Y, DB15K, TIVA, and KVC16K. The improvements were particularly significant on datasets with a richer variety of multimodal inputs, such as DB15K (which includes numeric data) and TIVA and KVC16K (which include image, text, audio, and video). Ablation studies confirmed that both the complementarity-guided fusion and the entropy-based negative sampling are vital for the framework’s effectiveness. The research highlights the critical role of understanding and leveraging modality complementarity in building robust and semantically rich multimodal knowledge graph reasoning systems. For more technical details, you can refer to the full research paper.


