TLDR: Spotify researchers have developed a new framework called “Profile-Aware LLM-as-a-Judge” to evaluate podcast recommendations. It uses Large Language Models (LLMs) to create natural-language user profiles from listening history, which then serve as a basis for the LLM to judge the quality and alignment of recommended podcast episodes. This approach offers a scalable and interpretable alternative to traditional evaluation methods, showing strong agreement with human judgments in experiments.
Evaluating how well personalized recommendations truly serve users has always been a significant challenge, especially in the world of long-form audio like podcasts. Traditional methods, such as analyzing past listening data or conducting expensive A/B tests, often fall short. Offline metrics can be biased because they only look at content users have already interacted with, missing the full scope of what they might enjoy. On the other hand, A/B testing, while effective, is slow and costly, limiting how many new recommendation models can be tested.
This gap means that developers often have to choose between quick, but limited, evaluations and rigorous, but time-consuming, experiments. Furthermore, these methods struggle to capture the nuanced reasons why a recommendation might or might not resonate with a user, particularly in podcasts where a poor recommendation can lead to significant wasted time for the listener.
To address these issues, researchers from Spotify have introduced a framework called “Profile-Aware LLM-as-a-Judge.” It uses Large Language Models (LLMs) as offline judges that assess the quality of podcast recommendations in a way that is both scalable and interpretable.
How the AI Judge Works
The framework operates in two main stages:
First, a natural-language user profile is created. This profile is automatically generated by an LLM, distilling information from a user’s last 90 days of listening history. Instead of feeding raw, complex data to the LLM, these profiles summarize key aspects of user preferences, including topical interests (what subjects they like), behavioral patterns (how they listen, e.g., finishing episodes or skimming), engagement depth, format preferences (e.g., interviews vs. informal discussions), and tendencies towards exploration or specialization. These profiles serve as a clear, interpretable “content hypothesis” of what the user prefers.
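As a rough illustration of this first stage, the sketch below assembles a profile-generation prompt from a listening history. The `Listen` record, its fields, the prompt wording, and the `build_profile_prompt` helper are all assumptions made for illustration; the article does not describe the actual prompt or data schema Spotify uses.

```python
from dataclasses import dataclass

# The five aspects the article says a profile summarizes.
PROFILE_ASPECTS = [
    "topical interests",
    "behavioral patterns",
    "engagement depth",
    "format preferences",
    "exploration vs. specialization",
]


@dataclass
class Listen:
    """One listening event (hypothetical schema)."""
    show: str
    episode: str
    completion: float  # fraction of the episode played, 0.0 to 1.0


def build_profile_prompt(listens, window_days=90):
    """Return a prompt asking an LLM to write a natural-language profile."""
    if not listens:
        raise ValueError("listening history is empty")
    history = "\n".join(
        f"- {l.show} | {l.episode} | played {l.completion:.0%}" for l in listens
    )
    aspects = "\n".join(f"{i}. {a}" for i, a in enumerate(PROFILE_ASPECTS, 1))
    return (
        f"Below is a user's podcast listening history for the last "
        f"{window_days} days.\n{history}\n\n"
        f"Write a short natural-language profile covering:\n{aspects}"
    )


history = [
    Listen("Deep Space Weekly", "Black holes explained", 0.9),
    Listen("History Hour", "The fall of Rome", 0.4),
]
prompt = build_profile_prompt(history)
```

The returned string would be sent to whichever LLM generates the profile; that call is left out here because the article does not name a model or API.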
Second, the LLM (acting as the Judge) uses this profile to evaluate recommended podcast episodes. It’s prompted with the user profile and the metadata of a candidate episode to determine how well they align. The framework supports two types of evaluation:
- Pointwise evaluation: The Judge assesses a single episode to see if it aligns with the user’s inferred preferences.
- Pairwise evaluation: Similar to an A/B test, the Judge compares two lists of recommended episodes (each from a different recommendation model) and selects the one that better matches the user’s profile. It even provides a rationale for its decision.
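The two evaluation modes can be sketched as prompt builders plus a small parser for the Judge's reply. This is only a sketch under stated assumptions: the prompt wording, the `A`/`B`/`TIE` reply format, and the `llm` callable (any function that maps a prompt string to a reply string) are hypothetical and not taken from the paper.

```python
def pointwise_prompt(profile, episode):
    """Ask whether a single episode fits the user's profile."""
    return (
        f"User profile:\n{profile}\n\n"
        f"Candidate episode:\n{episode}\n\n"
        "Does this episode align with the user's preferences? "
        "Answer YES or NO on the first line, then give a short rationale."
    )


def pairwise_prompt(profile, list_a, list_b):
    """Ask which of two recommendation lists fits the profile better."""
    fmt = lambda eps: "\n".join(f"- {e}" for e in eps)
    return (
        f"User profile:\n{profile}\n\n"
        f"List A:\n{fmt(list_a)}\n\nList B:\n{fmt(list_b)}\n\n"
        "Which list better matches the profile? "
        "Answer A, B, or TIE on the first line, then give a short rationale."
    )


def parse_pairwise(reply):
    """Split the Judge's reply into (verdict, rationale)."""
    first, _, rest = reply.strip().partition("\n")
    verdict = first.strip().upper()
    if verdict not in {"A", "B", "TIE"}:
        raise ValueError(f"unexpected verdict: {first!r}")
    return verdict, rest.strip()


def judge_pairwise(llm, profile, list_a, list_b):
    """Run one pairwise comparison; `llm` is any prompt -> reply callable."""
    return parse_pairwise(llm(pairwise_prompt(profile, list_a, list_b)))


# Stub standing in for a real model call, so the sketch runs offline.
def stub_llm(prompt):
    return "B\nList B matches the user's interest in science interviews."


verdict, rationale = judge_pairwise(
    stub_llm,
    "Enjoys long science interviews; usually finishes episodes.",
    ["True crime recap", "Celebrity gossip roundup"],
    ["Interview with an astrophysicist", "How vaccines work"],
)
```

Keeping the model behind a plain callable makes it easy to swap in a real LLM client or a stub for testing without touching the prompt or parsing logic.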
Because the Judge reads a concise profile rather than raw listening logs, its input is simpler and its judgments are easier to interpret and more reliable. The approach bridges the gap between simple numerical metrics and the complex, subjective nature of human satisfaction.
Experimental Validation
To test the effectiveness of their framework, the researchers conducted a controlled study with 47 participants. They compared the LLM’s judgments with human feedback. Three different evaluation methods were tested: LaaJ-Profile (their profile-aware judge), LaaJ-History (an LLM given raw listening history), and sBERT-Sim (a non-LLM baseline).
The results showed that the LaaJ-Profile method performed comparably to, and in some cases even outperformed, the variant that used raw listening histories. This highlights the significant value of summarizing user preferences into a concise, interpretable profile. Both LLM-based judges were also effective at identifying recommendations that strongly conflicted with user preferences.
While the LLM judge showed strong agreement with human judgments, especially in preferring one model over another, it tended to be more decisive, recording fewer “ties” than human annotators. Human feedback also revealed that factors beyond simple topic matching, such as familiarity with a show, host identity, stylistic tone, and the diversity of recommendations, heavily influenced participants’ preferences. This underscores the complex and subjective nature of podcast listening.
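One simple way to quantify this kind of comparison is to compute the share of cases where the Judge and a human pick the same outcome, alongside each side's tie rate. The sketch below assumes pairwise verdicts encoded as `"A"`, `"B"`, or `"TIE"`; the votes are made-up illustrative values, not results from the study, and the paper may use different agreement measures.

```python
def agreement_stats(llm_votes, human_votes):
    """Compare pairwise verdicts ("A", "B", or "TIE") from Judge and humans."""
    if not llm_votes or len(llm_votes) != len(human_votes):
        raise ValueError("vote lists must be non-empty and of equal length")
    n = len(llm_votes)
    matches = sum(l == h for l, h in zip(llm_votes, human_votes))
    return {
        "agreement": matches / n,
        "llm_tie_rate": llm_votes.count("TIE") / n,
        "human_tie_rate": human_votes.count("TIE") / n,
    }


# Illustrative votes only: the Judge never abstains, the human ties once.
llm_votes = ["A", "B", "A", "B"]
human_votes = ["A", "B", "TIE", "B"]
stats = agreement_stats(llm_votes, human_votes)
```

A Judge tie rate well below the human tie rate would reflect the decisiveness pattern described above.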
Looking Ahead
The Spotify researchers conclude that their profile-aware LLM-as-a-Judge framework offers a scalable and interpretable way to evaluate personalized podcast recommendations. Future work aims to enhance the accuracy of user profiles by incorporating long-term listening behavior and explicit user feedback. They also plan to explore more adaptive prompting strategies for the LLM to further improve its robustness and reduce any decisiveness bias, ultimately leading to even better podcast recommendations for listeners.


