TLDR: This paper introduces “Deep Multimodal Subspace Clustering Networks,” a deep learning framework for grouping complex data that comes from multiple sources (modalities). It uses an encoder-decoder structure with a self-expressive layer to uncover the hidden subspace structure of the data. The framework proposes two main fusion strategies: spatial fusion (combining features at different stages of the network) and a novel affinity fusion (sharing the similarity matrix across modalities). Affinity fusion proved particularly effective, especially for data without direct spatial alignment, and significantly outperformed previous methods in clustering accuracy on datasets ranging from handwritten digits to facial images.
In the rapidly evolving landscape of artificial intelligence, understanding and organizing complex data is paramount. Many real-world applications, from image processing to computer vision and speech recognition, deal with data that, while high-dimensional, often resides within simpler, low-dimensional structures known as subspaces. The challenge lies in identifying these hidden structures and grouping related data points, a task known as subspace clustering.
Traditional subspace clustering methods have made significant strides, particularly those leveraging sparse and low-rank representations. These techniques capitalize on the “self-expressiveness” property, whereby each data point can be written as a linear combination of other points lying in the same subspace. More recently, deep learning has entered this domain, with Deep Subspace Clustering (DSC) networks showing impressive performance by embedding self-expressiveness directly into a neural network architecture.
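For reference, the standard self-expressiveness objective from sparse and low-rank subspace clustering (a generic formulation, not quoted from this paper) can be written as:

```latex
\min_{C}\; \|C\|_{p} \quad \text{subject to} \quad X = XC, \;\; \operatorname{diag}(C) = 0
```

Here the columns of X are the data points, C is the matrix of self-expressive coefficients, and the norm on C promotes sparsity or low rank (e.g., the l1 or nuclear norm). An affinity matrix such as |C| + |C|^T is then typically passed to spectral clustering to obtain the final groups.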
However, data often comes in multiple forms or “modalities” – for instance, a person’s face might be captured by a visible light camera, an infrared camera, and a depth sensor. This is where multimodal subspace clustering becomes crucial. It aims to simultaneously cluster data across these different modalities, leveraging the complementary information each view provides. While existing multimodal methods have explored various approaches, including kernel tricks and co-regularization, a deep learning-based solution for unsupervised multimodal subspace clustering has been largely unexplored until now.
A new research paper, titled “Deep Multimodal Subspace Clustering Networks,” by Mahdi Abavisani and Vishal M. Patel, introduces a novel framework that addresses this gap. This work proposes convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The core of their proposed system is an autoencoder-like structure comprising three main stages: a multimodal encoder, a self-expressive layer, and a multimodal decoder. The encoder takes data from multiple modalities and combines them into a compact, meaningful “latent space” representation. The self-expressive layer then uses this representation to enforce the self-expressiveness property, generating an “affinity matrix” that captures the relationships between data points. Finally, the decoder reconstructs the original input data from this latent representation, with the network learning by minimizing the difference between the reconstructed and original data.
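As a rough illustration of this pipeline, here is a minimal PyTorch-style sketch of an encoder, a self-expressive layer (a single N×N coefficient matrix acting on the latent codes of all N samples at once), and a decoder. The layer sizes, loss weights, and class names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DeepSubspaceClusteringNet(nn.Module):
    """Single-modality DSC-style autoencoder with a self-expressive layer (illustrative sizes)."""

    def __init__(self, num_samples, latent_channels=8):
        super().__init__()
        # Convolutional encoder: raw input -> latent feature maps.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, latent_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Self-expressive layer: an N x N coefficient matrix C, so each latent code
        # is rebuilt as a linear combination of the codes of the other samples.
        self.C = nn.Parameter(1e-4 * torch.randn(num_samples, num_samples))
        # Convolutional decoder: self-expressed latent feature maps -> reconstructed input.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 16, 3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):                        # x: (N, 1, H, W), the whole dataset at once
        z = self.encoder(x)                      # latent representation
        z_flat = z.flatten(start_dim=1)          # (N, D)
        c = self.C - torch.diag(torch.diag(self.C))  # zero out the diagonal of C
        z_selfexp = c @ z_flat                   # each latent code expressed via the others
        x_hat = self.decoder(z_selfexp.view_as(z))   # reconstruct from self-expressed codes
        return x_hat, z_flat, z_selfexp, c


def dsc_loss(x, x_hat, z_flat, z_selfexp, c, lam_selfexp=1.0, lam_reg=1.0):
    """Reconstruction + self-expression + coefficient regularization (weights are placeholders)."""
    rec = ((x_hat - x) ** 2).sum()               # reconstruction error
    selfexp = ((z_selfexp - z_flat) ** 2).sum()  # self-expression error in latent space
    reg = (c ** 2).sum()                         # ||C||_F^2, one common choice of regularizer
    return rec + lam_selfexp * selfexp + lam_reg * reg
```

Training the whole dataset in a single batch is what lets the coefficient matrix C relate every sample to every other one; the learned C is what later yields the affinity matrix for spectral clustering.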
The researchers investigated two primary strategies for integrating information from different modalities: spatial fusion and affinity fusion.
Spatial Fusion Techniques
Spatial fusion methods focus on combining the raw data or features from different modalities at various points within the encoder. The paper explores three types of spatial fusion, inspired by supervised deep multimodal learning:
- Early Fusion: Data from all modalities are integrated at the very beginning, at the pixel or raw feature level, before being fed into the main network.
- Intermediate Fusion: Modalities are combined at intermediate layers of the encoder, allowing the network to learn some modality-specific features before merging them. This can be particularly useful for aggregating “weaker” or correlated modalities earlier.
- Late Fusion: Each modality is processed through its own separate encoder branches, and their high-level features are combined only at the final layer of the encoder.
For these spatial fusion techniques, the researchers experimented with different “fusion functions”, such as summation, max-pooling, and concatenation, to merge the feature maps (a small sketch of these options follows below). While effective, spatial fusion methods generally assume some level of spatial alignment or correspondence between the modalities, as in the ARL Polarimetric face dataset, where facial images are aligned across the different spectra.
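The fusion functions themselves are simple element-wise or stacking operations. Assuming both modalities have been encoded to feature maps of matching shape, they could look roughly like this hypothetical helper:

```python
import torch

def fuse(feat_a, feat_b, mode="concat"):
    """Merge two modalities' feature maps of shape (N, C, H, W)."""
    if mode == "sum":        # element-wise addition
        return feat_a + feat_b
    if mode == "max":        # element-wise maximum across modalities
        return torch.maximum(feat_a, feat_b)
    if mode == "concat":     # stack along the channel dimension
        return torch.cat([feat_a, feat_b], dim=1)
    raise ValueError(f"unknown fusion mode: {mode}")
```

Early, intermediate, and late fusion then differ only in where such a call is placed: on the raw inputs, after a few modality-specific layers, or after each modality's full encoder branch.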
Affinity Fusion Technique
Recognizing that not all multimodal data inherently share spatial correspondence (e.g., a mouth image and a nose image), the paper introduces an innovative “affinity fusion” approach. Instead of fusing features directly, this method focuses on sharing the affinity matrix across modalities. It proposes stacking multiple parallel Deep Subspace Clustering networks, one for each modality, but critically, they all share a common self-expressive layer. This forces the networks to learn latent representations that result in the same underlying similarity structure across all modalities. The core idea is that if two data points are similar in one modality, they should ideally be similar in others too. This approach elegantly bypasses the need for spatial alignment, making it robust to diverse multimodal datasets.
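In implementation terms, affinity fusion could amount to giving every modality its own encoder-decoder pair while all of them reuse a single coefficient matrix C. A hedged sketch, reusing the hypothetical pieces from the earlier block and with illustrative loss weights, might look like:

```python
import torch
import torch.nn as nn

class AffinityFusionNet(nn.Module):
    """Parallel per-modality autoencoders that share one self-expressive layer (sketch)."""

    def __init__(self, num_modalities, num_samples, make_encoder, make_decoder):
        super().__init__()
        self.encoders = nn.ModuleList([make_encoder() for _ in range(num_modalities)])
        self.decoders = nn.ModuleList([make_decoder() for _ in range(num_modalities)])
        # A single N x N coefficient matrix shared by all modalities.
        self.C = nn.Parameter(1e-4 * torch.randn(num_samples, num_samples))

    def forward(self, inputs):                   # inputs: list of (N, C_i, H_i, W_i) tensors
        c = self.C - torch.diag(torch.diag(self.C))
        total_loss = 0.0
        for x, enc, dec in zip(inputs, self.encoders, self.decoders):
            z = enc(x)                           # modality-specific latent feature maps
            z_flat = z.flatten(start_dim=1)      # (N, D_i)
            z_selfexp = c @ z_flat               # the shared C acts on every modality's codes
            x_hat = dec(z_selfexp.view_as(z))    # modality-specific reconstruction
            total_loss = (total_loss
                          + ((x_hat - x) ** 2).sum()            # reconstruction term
                          + ((z_selfexp - z_flat) ** 2).sum())  # self-expression term
        return total_loss + (c ** 2).sum()       # plus a regularizer on the shared C
```

After training, an affinity matrix such as |C| + |C|^T would be fed to spectral clustering to produce the final cluster labels, just as in the unimodal pipeline; the only difference is that the shared C has been shaped by every modality at once.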
Extensive experiments were conducted on three diverse datasets: multiview digit clustering (MNIST and USPS), heterogeneous face clustering (ARL Polarimetric face dataset), and facial component clustering (Extended Yale-B dataset). The results consistently demonstrated that the proposed deep multimodal subspace clustering methods significantly outperform state-of-the-art traditional and deep unimodal methods. Notably, the affinity fusion method achieved superior performance, especially on datasets where modalities lacked direct spatial correspondence, such as the facial components from the Extended Yale-B dataset, achieving over 99% accuracy. This highlights its strength in aggregating similarities across disparate data views.
This research marks a significant step forward in unsupervised multimodal learning, offering a powerful deep learning framework that can effectively cluster complex data from multiple sources. The code for this research is publicly available, fostering further exploration and development in the field. You can find the full research paper here.