Enhancing Recommendation Systems with Multi-Modal Indexing and Lifelong User Behavior

TLDR: MISS (Multi-modal Indexing and Searching with lifelong Sequence) is a new model for large-scale recommendation systems. It addresses challenges in retrieval by integrating multi-modal information (like images and text) and long-term user behavior. It uses a multi-modal index tree for better item similarity representation and two specialized search units (Co-GSU and MM-GSU) to capture diverse user interests from their historical interactions. Online experiments at Kuaishou show significant improvements in recommendation effectiveness and user engagement.

Large-scale recommendation systems, like those used by platforms with vast amounts of content, typically operate in two main stages: retrieval and ranking. The retrieval stage must quickly sift through a massive collection of items and return a smaller, relevant set for a user, all within a very tight time budget. Recent work in this area aims to incorporate more comprehensive information about users and items to improve performance.

One significant challenge in current retrieval methods is making effective use of a user’s lifelong sequential behavior, meaning their long history of interactions. This data is valuable, but it is difficult to process efficiently in the retrieval stage because of the sheer volume of candidate items. Additionally, many existing retrieval methods rely primarily on interaction data and overlook the multi-modal information, such as images and text, associated with items.

To address these challenges, researchers have introduced MISS: Multi-modal Indexing and Searching with lifelong Sequence. The approach integrates multi-modal information and lifelong user behavior into a tree-based retrieval model. MISS has two key components: a multi-modal index tree and a multi-modal lifelong sequence modeling module.

The multi-modal index tree is designed to create a more precise representation of item similarity. Unlike traditional methods that might rely solely on interaction data, this tree is built using multi-modal embeddings. These embeddings combine both content (like images and text) and interaction information, allowing the tree to group similar items more effectively. This hierarchical structure helps in efficiently narrowing down the search for relevant items.
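To make the idea concrete, here is a minimal sketch of building an index tree from fused embeddings. It is an illustration under stated assumptions, not the procedure used in MISS: the fusion by concatenation, the median split along the highest-variance dimension, and all names (`fuse`, `build_tree`, `leaves`, the toy item IDs and vectors) are hypothetical choices made for this example.

```python
from statistics import pvariance


def fuse(content_vec, interaction_vec):
    """Concatenate content and interaction views of an item (assumed fusion)."""
    return list(content_vec) + list(interaction_vec)


def build_tree(item_ids, embeddings, leaf_size=2):
    """Recursively split items into a binary tree.

    Each split sorts the items along the dimension with the highest variance
    and cuts at the median, so items with nearby embeddings tend to end up
    in the same subtree.
    """
    if len(item_ids) <= leaf_size:
        return {"items": sorted(item_ids)}
    dims = len(embeddings[item_ids[0]])
    split_dim = max(
        range(dims),
        key=lambda d: pvariance([embeddings[i][d] for i in item_ids]),
    )
    ordered = sorted(item_ids, key=lambda i: embeddings[i][split_dim])
    mid = len(ordered) // 2
    return {
        "left": build_tree(ordered[:mid], embeddings, leaf_size),
        "right": build_tree(ordered[mid:], embeddings, leaf_size),
    }


def leaves(node):
    """Collect the item groups stored at the leaves, left to right."""
    if "items" in node:
        return [node["items"]]
    return leaves(node["left"]) + leaves(node["right"])


# Toy data (hypothetical): two cat videos and two cooking videos.
content = {
    "cat_video_1": [0.9, 0.1],
    "cat_video_2": [0.8, 0.2],
    "cooking_1": [0.1, 0.9],
    "cooking_2": [0.2, 0.8],
}
interaction = {
    "cat_video_1": [0.7],
    "cat_video_2": [0.6],
    "cooking_1": [0.2],
    "cooking_2": [0.3],
}
embeddings = {i: fuse(content[i], interaction[i]) for i in content}
tree = build_tree(list(embeddings), embeddings)
groups = sorted(leaves(tree))
```

In this toy example the two cat videos share one leaf and the two cooking videos share the other, which is the grouping behavior the index tree relies on. A production system would likely use a learned or clustering-based split rather than a median cut.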

To capture diverse user interests from extensive historical interactions, MISS introduces a multi-modal lifelong sequence modeling module. This module features two specialized units: the collaborative general search unit (Co-GSU) and the multi-modal general search unit (MM-GSU). The Co-GSU retrieves relevant behaviors based on collaborative information, while the MM-GSU retrieves them based on multi-modal information. Together, these units identify the most relevant parts of a user’s long behavior sequence, even when those interests are not recent. An exact search unit (ESU) then refines the relationship between candidate items and the retrieved behaviors.
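The two-step pattern described above (a coarse search over the long history, followed by a finer scoring of what was retrieved) can be sketched as follows. This is a simplified illustration, assuming dot-product similarity for both search units and softmax attention pooling for the ESU; the function names and toy vectors are hypothetical and not taken from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_search(candidate_vec, behavior_vecs, k):
    """General search unit sketch: indices of the k most similar behaviors."""
    ranked = sorted(
        range(len(behavior_vecs)),
        key=lambda i: dot(candidate_vec, behavior_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def exact_search(candidate_vec, retrieved_vecs):
    """Exact search unit sketch: softmax attention over retrieved behaviors."""
    scores = [dot(candidate_vec, v) for v in retrieved_vecs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(retrieved_vecs[0])
    pooled = [
        sum(w * v[d] for w, v in zip(weights, retrieved_vecs))
        for d in range(dim)
    ]
    return weights, pooled


# Toy history (hypothetical), oldest behavior first. Each behavior has a
# collaborative embedding and a multi-modal embedding.
behaviors_collab = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
behaviors_mm = [[1.0, 0.0], [0.1, 0.9], [0.5, 0.5], [0.0, 1.0]]
candidate_collab = [1.0, 0.0]
candidate_mm = [0.0, 1.0]

co_idx = top_k_search(candidate_collab, behaviors_collab, k=2)  # [0, 2]
mm_idx = top_k_search(candidate_mm, behaviors_mm, k=2)          # [3, 1]

weights, pooled = exact_search(
    candidate_collab, [behaviors_collab[i] for i in co_idx]
)
```

Note that the collaborative search selects the oldest behavior (index 0) because it matches the candidate best, not because of its position in the sequence, and that the two searches return different behaviors for the same user.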

The model also incorporates a Multi-gate Mixture-of-Experts (MMoE) module for multi-task learning, allowing it to optimize for various user feedback signals simultaneously, such as likes, comments, and video completion rates.
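As a rough sketch of the MMoE idea, the example below shares a set of experts across tasks and gives each task its own softmax gate over those experts. It assumes linear experts and linear gates with hand-picked weights; the task names, weight values, and helper functions are hypothetical and chosen only to make the mixing visible.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mmoe_forward(x, expert_weights, gate_weights):
    """Return one mixed representation per task.

    expert_weights: list of matrices, shared by all tasks.
    gate_weights: dict mapping task name to a matrix with one row per expert.
    """
    expert_outputs = [linear(w, x) for w in expert_weights]
    out_dim = len(expert_outputs[0])
    mixed = {}
    for task, gate in gate_weights.items():
        g = softmax(linear(gate, x))  # task-specific weights over experts
        mixed[task] = [
            sum(g[e] * expert_outputs[e][d] for e in range(len(expert_outputs)))
            for d in range(out_dim)
        ]
    return mixed


# Toy setup (hypothetical): two experts, two tasks.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[0.0, 1.0], [1.0, 0.0]],  # expert 1: swaps the two inputs
]
gates = {
    "like": [[5.0, 0.0], [0.0, 0.0]],     # strongly prefers expert 0
    "comment": [[0.0, 0.0], [5.0, 0.0]],  # strongly prefers expert 1
}
mixed = mmoe_forward([1.0, 0.0], experts, gates)
```

Each task's representation would then feed a task-specific output head (not shown). The design choice worth noting is that experts are shared while gates are per task, so related feedback signals can reuse features without being forced to weight them identically.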

MISS has been deployed in Kuaishou’s recommendation system, which serves hundreds of millions of daily active users. In offline experiments, MISS outperformed state-of-the-art baseline models on retrieval metrics such as Recall@K, with improvements of over 30% in various recall scenarios. An ablation study confirmed that each proposed module is effective on its own: the multi-modal index tree, MM-GSU, and Co-GSU.
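For readers unfamiliar with the metric, Recall@K is the fraction of a user's relevant items that appear among the top K retrieved items. A minimal sketch, with hypothetical item IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items found in the first k retrieved items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


retrieved = ["a", "b", "c", "d"]  # ranked output of a retrieval model
relevant = {"b", "d", "e"}        # items the user actually engaged with

r_at_2 = recall_at_k(retrieved, relevant, k=2)  # 1 of 3 relevant items
r_at_4 = recall_at_k(retrieved, relevant, k=4)  # 2 of 3 relevant items
```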

Further analysis shows how the model uses user behavior. Increasing the length of the user behavior sequence generally improves performance, but at the cost of more computational resources. The attention mechanism within MM-GSU was found to be particularly effective at identifying long-term user interests, unlike some traditional models that tend to focus only on recent interactions. The Co-GSU and MM-GSU also complement each other: the overlap rate between their search results is low, indicating that they capture different facets of user interest.
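One simple way to quantify the overlap between the two search units is shown below. The article does not say how the overlap rate is defined, so this sketch assumes a Jaccard-style ratio (shared behaviors divided by all distinct behaviors retrieved); the function name and index lists are hypothetical.

```python
def overlap_rate(results_a, results_b):
    """Shared results divided by all distinct results (Jaccard-style)."""
    a, b = set(results_a), set(results_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


co_gsu_results = [0, 2, 5]  # behavior indices picked by a collaborative search
mm_gsu_results = [3, 1, 5]  # behavior indices picked by a multi-modal search
rate = overlap_rate(co_gsu_results, mm_gsu_results)
```

A value near 0 means the two units retrieve mostly different behaviors, which is the complementary pattern the analysis describes.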

Online A/B tests with real users at Kuaishou showed that MISS noticeably increased key user engagement metrics, including Total App Usage Time and App Usage Time Per User, confirming its effectiveness in a real-world industrial setting.

In conclusion, MISS represents a significant step forward in retrieval recommendation by effectively leveraging multi-modal information and lifelong sequential user behavior. This approach provides a more comprehensive understanding of user interests, leading to more accurate and engaging recommendations. For more details, you can refer to the research paper.
