TLDR: The research paper introduces MME-SID, a novel framework that significantly improves Large Language Models (LLMs) for sequential recommendation. It tackles two critical issues: embedding collapse, where item representations become too similar, and catastrophic forgetting, which is the loss of learned information. MME-SID achieves this by integrating multimodal embeddings (collaborative, textual, visual) and semantic IDs, utilizing a specialized Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with Maximum Mean Discrepancy (MMD) for better information preservation, and employing a frequency-aware fusion module during LLM fine-tuning. Experiments on Amazon datasets confirm its superior performance and effectiveness in mitigating these challenges.
In the rapidly evolving landscape of digital platforms, sequential recommendation (SR) systems play a crucial role in understanding user preferences and suggesting relevant items based on their past interactions. With the advent of powerful large language models (LLMs), there’s been a growing interest in leveraging their capabilities for SR. However, researchers have identified two significant hurdles that limit the effectiveness and scalability of current LLM-based SR methods: embedding collapse and catastrophic forgetting.
A new research paper titled “Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs” by Yuhao Wang, Junwei Pan, Xinhang Li, Maolin Wang, Yuan Wang, Yue Liu, Dapeng Liu, Jie Jiang, and Xiangyu Zhao introduces a framework called MME-SID designed to overcome these challenges. The paper, available at arXiv:2509.02017, integrates multimodal embeddings and semantic IDs to enhance LLMs for sequential recommendation tasks.
Understanding the Challenges
Embedding Collapse: This phenomenon occurs when the embedding representations of items become too similar, effectively occupying a low-dimensional subspace. In simpler terms, the model struggles to differentiate between items, leading to inefficient use of its capacity and suboptimal recommendations. The paper highlights that this often happens when low-dimensional collaborative embeddings from traditional recommendation models are mapped into the high-dimensional space of LLMs.
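One way to make "occupying a low-dimensional subspace" concrete is to look at the singular value spectrum of the item embedding matrix. The sketch below computes an entropy-based effective rank, which is a common diagnostic and not necessarily the measure used in the paper. The dimensions (64 and 1024), the random linear projection, and the function name `effective_rank` are all assumptions chosen for illustration.

```python
import numpy as np

def effective_rank(embeddings: np.ndarray) -> float:
    """Entropy-based effective rank of an (n_items, dim) embedding matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    total = singular_values.sum()
    if total == 0:
        return 0.0  # all items share one embedding: fully collapsed
    p = singular_values / total
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)

# Assumed setup: 64-dim collaborative embeddings pushed through a linear
# projection into a 1024-dim space (a stand-in for an LLM's hidden size).
low_dim = rng.normal(size=(500, 64))
projection = rng.normal(size=(64, 1024))
projected = low_dim @ projection

# Reference: embeddings that genuinely vary along all directions.
spread_out = rng.normal(size=(500, 1024))

rank_projected = effective_rank(projected)
rank_spread = effective_rank(spread_out)
print(f"projected: {rank_projected:.1f}, spread out: {rank_spread:.1f}")
```

A linear map cannot raise the rank of its input, so the projected matrix still spans at most 64 directions even though it lives in 1024 dimensions. That unused capacity is what the collapse description refers to.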
Catastrophic Forgetting: This refers to the loss of previously learned knowledge when new information is incorporated. In the context of semantic IDs (which represent items as sequences of codes), existing methods often discard the rich information contained in the initial code embeddings after training. When these embeddings are re-initialized from scratch for downstream tasks, a significant amount of valuable distance information is lost, hindering the model’s performance.
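To illustrate what "distance information is lost" can mean, the sketch below compares the ordering of pairwise distances between code embeddings before and after a handoff to a downstream task, using Kendall's tau as the rank correlation. The random codebook is a stand-in for trained code embeddings, and the choice of Kendall's tau, the sizes, and the helper names are assumptions for illustration rather than the paper's exact protocol.

```python
import numpy as np

def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Euclidean distances for every unordered pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(x), k=1)
    return dist[upper]

def kendall_tau(a: np.ndarray, b: np.ndarray) -> float:
    """Kendall's tau-a: +1 means identical ordering, about 0 means unrelated."""
    sign_a = np.sign(a[:, None] - a[None, :])
    sign_b = np.sign(b[:, None] - b[None, :])
    n = len(a)
    return float((sign_a * sign_b).sum() / (n * (n - 1)))

rng = np.random.default_rng(0)

# Stand-in for code embeddings learned during quantization (assumed sizes).
trained_codes = rng.normal(size=(32, 16))

# Option 1: hand the trained code embeddings to the downstream model.
kept = trained_codes.copy()
# Option 2: discard them and re-initialize from scratch.
reinitialized = rng.normal(size=(32, 16))

reference = pairwise_distances(trained_codes)
tau_kept = kendall_tau(reference, pairwise_distances(kept))
tau_reinit = kendall_tau(reference, pairwise_distances(reinitialized))
print(f"kept: {tau_kept:.2f}, re-initialized: {tau_reinit:.2f}")
```

Keeping the trained embeddings preserves the distance ordering exactly, while re-initialization leaves an ordering that is essentially unrelated to the learned one.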
MME-SID: A Two-Stage Solution
MME-SID addresses these issues through a comprehensive two-stage framework: an Encoding Stage and a Fine-tuning Stage.
Encoding Stage: Building Rich Multimodal Representations
This stage focuses on creating informative multimodal embeddings and their corresponding semantic IDs for each item. It involves two key steps:
- Multimodal Embedding Encoding: Instead of relying solely on item IDs, MME-SID incorporates collaborative (traditional item ID data), textual (item titles, descriptions), and visual (item images) information. It uses a multimodal encoder such as LLM2CLIP, which enhances the original CLIP model by integrating a more capable LLM for text processing. This ensures that textual and visual information is mapped into a unified embedding space.
- Multimodal Embedding Quantization: To generate multimodal semantic IDs, the framework introduces a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE). Unlike previous methods that use simple mean squared error (MSE) for reconstruction, MM-RQ-VAE employs Maximum Mean Discrepancy (MMD) as its reconstruction loss. MMD is better at preserving the distribution of distance information, which is crucial for maintaining the integrity of item representations. Additionally, a contrastive learning objective is used to capture the relationships and distinctions between different modalities (e.g., aligning collaborative embeddings with textual and visual ones).
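The two ideas in the quantization step can be sketched separately: residual quantization, which turns an embedding into a sequence of codes, and MMD, which compares the distribution of original embeddings with that of their reconstructions. This is a minimal NumPy sketch and not the paper's MM-RQ-VAE: there is no encoder, decoder, training loop, or contrastive term, the codebooks are random, and the RBF kernel, its bandwidth, and the biased estimator are assumptions.

```python
import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list) -> tuple:
    """Quantize one vector level by level; returns (codes, reconstruction)."""
    residual = x.copy()
    reconstruction = np.zeros_like(x)
    codes = []
    for codebook in codebooks:  # each codebook has shape (n_codes, dim)
        dists = ((codebook - residual) ** 2).sum(axis=1)
        idx = int(dists.argmin())           # nearest code to the current residual
        codes.append(idx)
        reconstruction += codebook[idx]
        residual -= codebook[idx]           # the next level quantizes what is left
    return codes, reconstruction

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x: np.ndarray, y: np.ndarray, gamma=None) -> float:
    """Biased estimate of squared MMD between two samples (RBF kernel)."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed bandwidth heuristic
    return float(
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )

rng = np.random.default_rng(0)
dim, n_levels, n_codes = 8, 3, 16  # hypothetical sizes

# Random codebooks with shrinking scale; a real model would learn these.
codebooks = [
    rng.normal(scale=1.0 / (level + 1), size=(n_codes, dim))
    for level in range(n_levels)
]

batch = rng.normal(size=(200, dim))  # stand-in for item embeddings
quantized = [residual_quantize(x, codebooks) for x in batch]
semantic_ids = [codes for codes, _ in quantized]
reconstructed = np.stack([recon for _, recon in quantized])

loss = mmd2(batch, reconstructed)
print("semantic ID of first item:", semantic_ids[0])
print(f"MMD^2 between originals and reconstructions: {loss:.4f}")
```

MSE penalizes each item's reconstruction error on its own, whereas MMD compares the two sets of embeddings as distributions. That distribution-level view is the property the text credits with better preservation of distance information.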
Fine-tuning Stage: Efficient LLM Adaptation
The second stage fine-tunes the LLM for the sequential recommendation task, specifically tackling catastrophic forgetting and optimizing performance:
- Embedding Initialization: A critical innovation is initializing the embeddings of the multimodal semantic IDs with the *trained* code embeddings from the MM-RQ-VAE. This prevents the loss of valuable intra-modal information that would occur with random initialization, directly mitigating catastrophic forgetting.
- Multimodal Frequency-Aware Fusion: The model also incorporates a module that adaptively fuses the scores from different modalities based on the item’s frequency in the training data. This acknowledges that the importance of different modalities can vary for popular versus less common (cold-start) items, leading to more nuanced and effective recommendations.
- Efficient Fine-tuning: The LLM is efficiently fine-tuned using LoRA (Low-Rank Adaptation), updating only a small fraction of the model’s parameters, making the process computationally feasible.
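Two of the fine-tuning components can be sketched in a few lines. The fusion function below is a hypothetical form, since the article does not give the actual formula: it assumes per-modality softmax weights driven by the log of item frequency, with made-up `slope` and `bias` parameters. The LoRA function follows the standard formulation, in which a frozen weight matrix is augmented by a scaled low-rank product. All sizes and values are illustrative.

```python
import numpy as np

def frequency_aware_fusion(scores, item_freq, slope, bias):
    """Fuse per-modality item scores with frequency-dependent weights.

    scores: (n_modalities, n_items); item_freq: (n_items,)
    slope, bias: (n_modalities,) hypothetical learnable parameters.
    """
    log_freq = np.log1p(item_freq)
    logits = slope[:, None] * log_freq[None, :] + bias[:, None]
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over modalities
    return (weights * scores).sum(axis=0), weights

def lora_linear(x, w_frozen, lora_a, lora_b, alpha):
    """Linear layer with a LoRA update: x W^T + (alpha / r) * x A^T B^T."""
    rank = lora_a.shape[0]
    return x @ w_frozen.T + (alpha / rank) * (x @ lora_a.T) @ lora_b.T

rng = np.random.default_rng(0)

# Fusion demo: rows are collaborative, textual, visual scores for 4 items.
scores = rng.normal(size=(3, 4))
item_freq = np.array([0.0, 3.0, 50.0, 2000.0])  # interaction counts
slope = np.array([1.0, -0.5, -0.5])   # assumed: collaborative weight grows with frequency
bias = np.zeros(3)
fused, weights = frequency_aware_fusion(scores, item_freq, slope, bias)
print("collaborative weight per item:", np.round(weights[0], 3))

# LoRA demo: only lora_a and lora_b would be trained; w_frozen stays fixed.
d_in, d_out, rank, alpha = 512, 512, 8, 16
w_frozen = rng.normal(size=(d_out, d_in))
lora_a = rng.normal(scale=0.01, size=(rank, d_in))
lora_b = np.zeros((d_out, rank))      # zero init: the layer starts unchanged
x = rng.normal(size=(2, d_in))
out = lora_linear(x, w_frozen, lora_a, lora_b, alpha)
trainable_fraction = (lora_a.size + lora_b.size) / w_frozen.size
print(f"trainable fraction for this layer: {trainable_fraction:.4f}")
```

With these assumed parameters, rarely seen items lean on textual and visual scores while frequent items lean on collaborative scores, which matches the popular-versus-cold-start intuition described above. The LoRA layer trains about 3% as many parameters as the full matrix in this example.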
Key Advantages and Experimental Validation
MME-SID offers several advantages over existing methods. It can generate a ranking list for the entire item set, unlike some generative retrieval models that retrieve items one by one. It naturally avoids the “collision” issue, where multiple items might map to the same semantic ID sequence, thanks to its multimodal data integration. Furthermore, MME-SID achieves higher inference efficiency by representing each item as a less collapsed, less forgotten, and more informative multimodal embedding.
Extensive experiments conducted on three public Amazon datasets (Beauty, Toys & Games, and Sports & Outdoors) demonstrate MME-SID’s superior performance. The framework consistently outperforms various baseline methods, showing significant improvements in recommendation accuracy. The research also provides in-depth analyses confirming MME-SID’s ability to effectively mitigate both embedding collapse and catastrophic forgetting, validating the design choices like MMD-based reconstruction loss and the strategic initialization of code embeddings.
In conclusion, MME-SID represents a significant step forward in empowering large language models for sequential recommendation, offering a robust and efficient solution to long-standing challenges in the field.


