TL;DR: A reproducibility study of the XRec framework, which uses large language models (LLMs) to generate explainable recommendations. Using Llama 3 in place of GPT-3.5-turbo, the study largely confirmed XRec's ability to produce unique, personalized explanations and the importance of collaborative information injection. However, it found that XRec did not consistently outperform all baseline models, and it questioned whether user and item profiles consistently help. Extension experiments further explored the critical role of Mixture of Experts (MoE) embeddings in shaping explanation structure.
A recent study by Ranjan Mishra, Julian I. Bibo, Quinten van Engelen, and Henk Schaapman from the University of Amsterdam delves into the reproducibility of “XRec: Large Language Models for Explainable Recommendation” by Ma et al. (2024). This research aimed to replicate the original findings and extend the analysis of XRec, a framework designed to help large language models (LLMs) provide clear, comprehensive explanations for the recommendations they generate.
Understanding XRec: Bridging Collaborative Filtering and LLMs
At its core, XRec is a model-agnostic framework that combines collaborative filtering (CF) with the advanced language capabilities of LLMs. Traditional recommender systems often act as “black boxes,” making it difficult for users to understand why certain items are recommended. XRec addresses this by enabling LLMs to generate personalized and interpretable explanations.
The framework operates through three main components:
- Collaborative Relation Tokenizer: This component uses a Graph Neural Network (GNN), specifically LightGCN, over the user-item interaction graph. It captures higher-order collaborative relationships and produces numerical embeddings that encode user and item preferences (a minimal sketch of LightGCN-style propagation follows this list).
- Collaborative Information Adapter: GNN embeddings are numerical vectors, while LLMs operate on token representations, so an adapter is needed to bridge the two. XRec employs a Mixture of Experts (MoE) module that transforms the numerical GNN embeddings into "adapted" embeddings matched to the LLM's hidden dimension, which can then be injected into the LLM's input and transformer layers (also sketched below).
- Unifying CF with an LLM for Explanation Generation: The LLM is prompted to create concise textual profiles from user reviews and item descriptions. These textual profiles, along with the adapted GNN embeddings, are fed into the LLM. Crucially, the LLM’s own weights remain frozen during training; only the MoE adapter learns to integrate the collaborative signals, allowing the model to produce human-readable explanations that reflect both semantic context and collaborative structure.
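To make the tokenizer concrete, here is a minimal PyTorch sketch of LightGCN-style propagation. Class and variable names are illustrative rather than taken from the XRec codebase; the key property is that LightGCN repeatedly averages neighbor embeddings over a normalized user-item adjacency matrix, with no feature transforms or nonlinearities, then averages the per-layer outputs.

```python
import torch

class LightGCNSketch(torch.nn.Module):
    """Illustrative LightGCN-style propagation (not XRec's exact code)."""

    def __init__(self, num_users: int, num_items: int, dim: int = 64, num_layers: int = 3):
        super().__init__()
        self.user_emb = torch.nn.Embedding(num_users, dim)
        self.item_emb = torch.nn.Embedding(num_items, dim)
        self.num_layers = num_layers

    def forward(self, norm_adj: torch.Tensor):
        # norm_adj: symmetrically normalized (users+items) x (users+items)
        # adjacency matrix of the user-item interaction graph.
        x = torch.cat([self.user_emb.weight, self.item_emb.weight], dim=0)
        layer_outputs = [x]
        for _ in range(self.num_layers):
            # Pure neighborhood averaging: no weight matrices, no nonlinearity.
            x = torch.sparse.mm(norm_adj, x) if norm_adj.is_sparse else norm_adj @ x
            layer_outputs.append(x)
        # Final embedding = mean over the initial and all propagated layers.
        final = torch.stack(layer_outputs, dim=0).mean(dim=0)
        users, items = torch.split(
            final, [self.user_emb.num_embeddings, self.item_emb.num_embeddings], dim=0
        )
        return users, items
```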
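The adapter can be sketched in the same spirit. The module below is a hedged illustration, not XRec's exact architecture: a gating network softmax-weights a set of expert projections that map the GNN embedding dimension to the LLM's hidden dimension. Expert count, depth, and dimensions are assumptions.

```python
import torch

class MoEAdapterSketch(torch.nn.Module):
    """Illustrative Mixture-of-Experts adapter (dimensions are assumptions)."""

    def __init__(self, gnn_dim: int = 64, llm_dim: int = 4096, num_experts: int = 8):
        super().__init__()
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(gnn_dim, llm_dim) for _ in range(num_experts)
        )
        self.gate = torch.nn.Linear(gnn_dim, num_experts)

    def forward(self, gnn_emb: torch.Tensor) -> torch.Tensor:
        # gnn_emb: (batch, gnn_dim) embeddings from the frozen tokenizer.
        weights = torch.softmax(self.gate(gnn_emb), dim=-1)                  # (batch, E)
        expert_out = torch.stack([e(gnn_emb) for e in self.experts], dim=1)  # (batch, E, llm_dim)
        # Weighted sum over experts yields one "adapted" embedding per input,
        # which is then injected into the LLM as described above.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)
```

During training, only adapter parameters like these receive gradients; the LLM's own weights stay frozen, as noted above.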
The Reproducibility Study: Claims and Findings
The Amsterdam team set out to reproduce the original XRec results, but with a key difference: they used Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. They also built upon the original source code provided by Ma et al. (2024). The study focused on verifying four key claims made by the original authors:
Claim 1: Explainability and Stability: The original paper claimed that XRec consistently outperforms baselines in explainability (the quality of its explanations) and stability (their consistency across runs). The replication generally yielded lower mean scores and higher standard deviations across metrics than the original, and some baseline models occasionally outperformed XRec on certain metrics. This claim therefore received limited support.
Claim 2: Unique Explanations: XRec was claimed to generate truly unique explanations for each distinct user-item interaction. The study confirmed this, with a Unique Sentence Ratio (USR) of 1.0 across all datasets, meaning every generated explanation was indeed unique. This claim was validated.
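For reference, the Unique Sentence Ratio is straightforward to compute: the number of distinct generated explanations divided by the total number generated. A minimal Python version follows; exact string matching is an assumption, since implementations may first normalize casing or whitespace.

```python
def unique_sentence_ratio(explanations: list[str]) -> float:
    """USR = number of distinct explanations / total explanations."""
    if not explanations:
        return 0.0
    return len(set(explanations)) / len(explanations)

# A USR of 1.0, as reported for XRec, means no two user-item
# interactions received the same explanation text.
print(unique_sentence_ratio(["u1 likes sci-fi", "u2 prefers memoirs"]))  # 1.0
```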
Claim 3: User and Item Profiles: The original authors suggested that including user and item profiles improves XRec's performance. The reproduction generally showed lower mean scores when profiles were removed, but in some cases including them did not improve performance. This claim received limited support.
Claim 4: Collaborative Information Injection: This claim stated that injecting collaborative information into the LLM’s transformer layers improves XRec’s explainability and stability. The study found that removing these injections consistently led to lower explanation quality and higher instability. This claim was validated, highlighting the importance of collaborative signals.
Extending the Analysis: The Role of Embeddings
Beyond reproducibility, the researchers conducted two extension experiments on the Amazon-books dataset:
- Removal of Adapted Embeddings: This experiment investigated the impact of completely removing the adapted user and item embeddings generated by the MoE from the LLM input. The results showed a negative impact on model performance, with a notable decrease in LlamaScore. Interestingly, removing these embeddings also altered the explanation’s sentence structure, leading to more conversational or descriptive styles, rather than the structured format (e.g., “The user would enjoy…”) typically seen with adapted embeddings.
- Random Fixed Mixture of Experts Inputs: To test whether the MoE embeddings primarily shape sentence structure rather than carry user- or item-specific meaning, the GNN was removed and a fixed, randomly generated pair of embeddings was fed to the MoE for every interaction (sketched below). This variant scored lowest on most metrics, suggesting that while the MoE embeddings do help structure explanations, a single fixed set for an entire dataset prevents the model from generating useful, personalized explanations.
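Here is a minimal sketch of that ablation, under the assumption that it simply replaces the per-pair GNN lookup with random tensors generated once and reused for the whole dataset (names are illustrative):

```python
import torch

torch.manual_seed(0)  # fixed seed so the "embeddings" never change

GNN_DIM = 64
# One random user vector and one random item vector, generated once and
# reused for every user-item pair in the dataset, replacing the GNN output.
fixed_user_emb = torch.randn(1, GNN_DIM)
fixed_item_emb = torch.randn(1, GNN_DIM)

def get_collaborative_embeddings(user_id: int, item_id: int):
    # Ablation: the ids are ignored entirely; every pair gets identical
    # inputs, so the MoE adapter can only learn dataset-level structure,
    # not per-user or per-item preference signals.
    return fixed_user_emb, fixed_item_emb
```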
Challenges and Environmental Footprint
The team faced challenges, including missing functionality in the original code for certain evaluation metrics and ablation studies, and the need to run experiments with a batch size of 1 because larger batches produced poor results, which significantly increased compute requirements. Despite these hurdles, the study contributes an open-source evaluation implementation to improve accessibility for researchers.
The environmental impact of the experiments was also calculated: a total of 35.84 kg of CO2 equivalent. This is a relatively small footprint compared to training a large LLM such as Llama-2-7B, which generated an estimated 31.22 tonnes of CO2 equivalent.
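Tooling such as codecarbon makes this kind of accounting easy to reproduce. Whether the authors used codecarbon is not stated here, so treat the snippet below as one plausible way to obtain such a figure, with a stubbed-in training loop:

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

def run_training():
    """Placeholder for the actual training/evaluation loop."""
    pass

tracker = EmissionsTracker(project_name="xrec-reproduction")  # hypothetical name
tracker.start()
try:
    run_training()
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2 equivalent
    print(f"Estimated emissions: {emissions_kg:.2f} kg CO2eq")
```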
In conclusion, while XRec effectively generates personalized explanations and benefits from collaborative information, its consistent superiority over all baselines and the impact of user/item profiles remain open questions. The study underscores the complex interplay between collaborative signals and language modeling in shaping explanation structure. For more details, see the full paper.


