TL;DR: A reproducibility study of the XRec framework, which uses large language models (LLMs) to generate explainable recommendations. Using Llama 3 in place of GPT-3.5-turbo, the study largely confirmed XRec's ability to produce unique, personalized explanations and the importance of collaborative information injection. However, it found that XRec did not consistently outperform all baseline models, and it questioned whether user and item profiles consistently help. Extension experiments further explored the critical role of Mixture of Experts (MoE) embeddings in shaping explanation structure.
A recent study by Ranjan Mishra, Julian I. Bibo, Quinten van Engelen, and Henk Schaapman from the University of Amsterdam delves into the reproducibility of “XRec: Large Language Models for Explainable Recommendation” by Ma et al. (2024). This research aimed to replicate the original findings and extend the analysis of XRec, a framework designed to help large language models (LLMs) provide clear, comprehensive explanations for the recommendations they generate.
Understanding XRec: Bridging Collaborative Filtering and LLMs
At its core, XRec is a model-agnostic framework that combines collaborative filtering (CF) with the advanced language capabilities of LLMs. Traditional recommender systems often act as “black boxes,” making it difficult for users to understand why certain items are recommended. XRec addresses this by enabling LLMs to generate personalized and interpretable explanations.
The framework operates through three main components:
- Collaborative Relation Tokenizer: This component uses a Graph Neural Network (GNN), specifically LightGCN, over the user-item interaction graph. It captures higher-order collaborative relationships and produces numerical embeddings that encode user and item preferences (a minimal sketch of LightGCN-style propagation follows this list).
- Collaborative Information Adapter: GNN embeddings are numerical vectors, while LLMs operate on token representations, so an adapter is needed to bridge the two. XRec employs a Mixture of Experts (MoE) module that transforms the numerical GNN embeddings into "adapted" embeddings matched to the LLM's hidden dimension, which can then be injected into the LLM's input and transformer layers (also sketched below).
- Unifying CF with an LLM for Explanation Generation: The LLM is prompted to create concise textual profiles from user reviews and item descriptions. These textual profiles, along with the adapted GNN embeddings, are fed into the LLM. Crucially, the LLM’s own weights remain frozen during training; only the MoE adapter learns to integrate the collaborative signals, allowing the model to produce human-readable explanations that reflect both semantic context and collaborative structure.
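To make the tokenizer concrete, here is a minimal PyTorch sketch of LightGCN-style propagation. Class and variable names are illustrative rather than taken from the XRec codebase; the key property is that LightGCN repeatedly averages neighbor embeddings over a normalized user-item adjacency matrix, with no feature transforms or nonlinearities, then averages the per-layer outputs.

```python
import torch

class LightGCNSketch(torch.nn.Module):
    """Illustrative LightGCN-style propagation (not XRec's exact code)."""

    def __init__(self, num_users: int, num_items: int, dim: int = 64, num_layers: int = 3):
        super().__init__()
        self.user_emb = torch.nn.Embedding(num_users, dim)
        self.item_emb = torch.nn.Embedding(num_items, dim)
        self.num_layers = num_layers

    def forward(self, norm_adj: torch.Tensor):
        # norm_adj: symmetrically normalized (users+items) x (users+items)
        # adjacency matrix of the user-item interaction graph.
        x = torch.cat([self.user_emb.weight, self.item_emb.weight], dim=0)
        layer_outputs = [x]
        for _ in range(self.num_layers):
            # Pure neighborhood averaging: no weight matrices, no nonlinearity.
            x = torch.sparse.mm(norm_adj, x) if norm_adj.is_sparse else norm_adj @ x
            layer_outputs.append(x)
        # Final embedding = mean over the initial and all propagated layers.
        final = torch.stack(layer_outputs, dim=0).mean(dim=0)
        users, items = torch.split(
            final, [self.user_emb.num_embeddings, self.item_emb.num_embeddings], dim=0
        )
        return users, items
```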
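The adapter can be sketched in the same spirit. The module below is a hedged illustration, not XRec's exact architecture: a gating network softmax-weights a set of expert projections that map the GNN embedding dimension to the LLM's hidden dimension. Expert count, depth, and dimensions are assumptions.

```python
import torch

class MoEAdapterSketch(torch.nn.Module):
    """Illustrative Mixture-of-Experts adapter (dimensions are assumptions)."""

    def __init__(self, gnn_dim: int = 64, llm_dim: int = 4096, num_experts: int = 8):
        super().__init__()
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(gnn_dim, llm_dim) for _ in range(num_experts)
        )
        self.gate = torch.nn.Linear(gnn_dim, num_experts)

    def forward(self, gnn_emb: torch.Tensor) -> torch.Tensor:
        # gnn_emb: (batch, gnn_dim) embeddings from the frozen tokenizer.
        weights = torch.softmax(self.gate(gnn_emb), dim=-1)                  # (batch, E)
        expert_out = torch.stack([e(gnn_emb) for e in self.experts], dim=1)  # (batch, E, llm_dim)
        # Weighted sum over experts yields one "adapted" embedding per input,
        # which is then injected into the LLM as described above.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)
```

During training, only adapter parameters like these receive gradients; the LLM's own weights stay frozen, as noted above.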
The Reproducibility Study: Claims and Findings
The Amsterdam team set out to reproduce the original XRec results, but with a key difference: they used Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. They also built upon the original source code provided by Ma et al. (2024). The study focused on verifying four key claims made by the original authors:
Claim 1: Explainability and Stability: The original paper claimed that XRec consistently outperforms baselines in explainability (the quality of its explanations) and stability (their consistency across runs). The replication generally yielded lower mean scores and higher standard deviations across metrics than the original, and some baseline models occasionally outperformed XRec on certain metrics. This claim therefore received limited support.
Claim 2: Unique Explanations: XRec was claimed to generate truly unique explanations for each distinct user-item interaction. The study confirmed this, with a Unique Sentence Ratio (USR) of 1.0 across all datasets, meaning every generated explanation was indeed unique. This claim was validated.
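For reference, the Unique Sentence Ratio is straightforward to compute: the number of distinct generated explanations divided by the total number generated. A minimal Python version follows; exact string matching is an assumption, since implementations may first normalize casing or whitespace.

```python
def unique_sentence_ratio(explanations: list[str]) -> float:
    """USR = number of distinct explanations / total explanations."""
    if not explanations:
        return 0.0
    return len(set(explanations)) / len(explanations)

# A USR of 1.0, as reported for XRec, means no two user-item
# interactions received the same explanation text.
print(unique_sentence_ratio(["u1 likes sci-fi", "u2 prefers memoirs"]))  # 1.0
```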
Claim 3: User and Item Profiles: The original authors suggested that including user and item profiles improves XRec's performance. The reproduction generally showed lower mean scores when profiles were removed, but in some cases including them did not improve performance. This claim received limited support.
Claim 4: Collaborative Information Injection: This claim stated that injecting collaborative information into the LLM’s transformer layers improves XRec’s explainability and stability. The study found that removing these injections consistently led to lower explanation quality and higher instability. This claim was validated, highlighting the importance of collaborative signals.
Extending the Analysis: The Role of Embeddings
Beyond reproducibility, the researchers conducted two extension experiments on the Amazon-books dataset:
- Removal of Adapted Embeddings: This experiment investigated the impact of completely removing the adapted user and item embeddings generated by the MoE from the LLM input. The results showed a negative impact on model performance, with a notable decrease in LlamaScore. Interestingly, removing these embeddings also altered the explanation’s sentence structure, leading to more conversational or descriptive styles, rather than the structured format (e.g., “The user would enjoy…”) typically seen with adapted embeddings.
- Random Fixed Mixture of Experts Inputs: To test whether the MoE embeddings primarily shape sentence structure rather than carry user- or item-specific meaning, the GNN was removed and a fixed, randomly generated pair of embeddings was fed to the MoE for every interaction (sketched below). This variant scored lowest on most metrics, suggesting that while the MoE embeddings do help structure explanations, a single fixed set for an entire dataset prevents the model from generating useful, personalized explanations.
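Here is a minimal sketch of that ablation, under the assumption that it simply replaces the per-pair GNN lookup with random tensors generated once and reused for the whole dataset (names are illustrative):

```python
import torch

torch.manual_seed(0)  # fixed seed so the "embeddings" never change

GNN_DIM = 64
# One random user vector and one random item vector, generated once and
# reused for every user-item pair in the dataset, replacing the GNN output.
fixed_user_emb = torch.randn(1, GNN_DIM)
fixed_item_emb = torch.randn(1, GNN_DIM)

def get_collaborative_embeddings(user_id: int, item_id: int):
    # Ablation: the ids are ignored entirely; every pair gets identical
    # inputs, so the MoE adapter can only learn dataset-level structure,
    # not per-user or per-item preference signals.
    return fixed_user_emb, fixed_item_emb
```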
Challenges and Environmental Footprint
The team faced challenges, including missing functionality in the original code for certain evaluation metrics and ablation studies, and the need to run experiments with a batch size of 1 because larger batches produced poor results, which significantly increased compute requirements. Despite these hurdles, the study contributes an open-source evaluation implementation to improve accessibility for researchers.
The environmental impact of the experiments was also calculated: a total of 35.84 kg of CO2 equivalent. This is a relatively small footprint compared to training a large LLM such as Llama-2-7B, which generated an estimated 31.22 tonnes of CO2 equivalent.
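Tooling such as codecarbon makes this kind of accounting easy to reproduce. Whether the authors used codecarbon is not stated here, so treat the snippet below as one plausible way to obtain such a figure, with a stubbed-in training loop:

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

def run_training():
    """Placeholder for the actual training/evaluation loop."""
    pass

tracker = EmissionsTracker(project_name="xrec-reproduction")  # hypothetical name
tracker.start()
try:
    run_training()
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2 equivalent
    print(f"Estimated emissions: {emissions_kg:.2f} kg CO2eq")
```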
In conclusion, while XRec effectively generates personalized explanations and benefits from collaborative information, its consistent superiority over all baselines and the impact of user/item profiles remain open questions. The study underscores the complex interplay between collaborative signals and language modeling in shaping explanation structure. For more details, see the full paper.


