TLDR: A new strategy uses Pareto front approximation to identify which offline metrics reliably predict online performance in large-scale recommender systems. In a large-scale online experiment on the OTTO e-commerce platform, a single model served multiple user groups, each tuned to a different offline objective. The results show that the offline metrics Recall@20 and a new metric, Order Density (OD@20), along with their product, are significant predictors of the online KPIs click-through rate, conversion rate, and units sold. The method gives practitioners a concrete way to connect offline evaluations to real-world online impact.
In large-scale e-commerce, optimizing recommender systems is crucial for meeting diverse business goals. A persistent challenge for companies like OTTO has been predicting how well a recommender system will perform in production based on its offline tests. What looks good in an offline evaluation often fails to translate into success once the system goes live, leaving a significant gap between offline metrics and actual online performance indicators.
A new pragmatic strategy, building on recent advances in Pareto front approximation, aims to bridge this gap. It allows multiple user groups to be tested simultaneously, each with different offline performance goals, while all of them are served by a single, scalable model. The method is ‘model-agnostic’ for systems built with a neural network backbone, meaning it can be applied across various architectures and domains.
The core idea involves training a single recommender system model that can adapt its recommendations based on different ‘preference vectors.’ These vectors essentially tell the model which objectives to prioritize. During an online experiment, live user traffic is randomly divided into several groups. Each group’s requests are then served by the same model, but with a unique preference vector assigned to that group. This allows researchers to observe how different offline metric configurations impact real-world online performance.
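The traffic split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the preference vectors, the number of groups, and the function names `assign_group` and `preference_for` are hypothetical and not taken from the paper, and a hash of the user ID stands in for whatever randomization the real experiment used.

```python
import hashlib

# Hypothetical preference vectors (weight on objective 1, weight on objective 2).
# The actual vectors used in the experiment are not given in this article.
PREFERENCE_VECTORS = [
    (1.0, 0.0),
    (0.75, 0.25),
    (0.5, 0.5),
    (0.25, 0.75),
    (0.0, 1.0),
]


def assign_group(user_id: str, num_groups: int = len(PREFERENCE_VECTORS)) -> int:
    """Map a user ID to a stable group index using a hash of the ID.

    Hashing keeps the assignment deterministic, so the same user always
    lands in the same group for the duration of the experiment.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups


def preference_for(user_id: str) -> tuple:
    """Return the preference vector that conditions the model for this user."""
    return PREFERENCE_VECTORS[assign_group(user_id)]


# Every request from this user would be served by the same model,
# conditioned on the preference vector of the user's group.
print(preference_for("user-42"))
```

Because every group is served by the same model, the only thing that differs between groups is the preference vector passed along with the request.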
To make this strategy widely applicable, even to systems initially designed for a single objective, the researchers introduce an ‘auxiliary distortion loss.’ This artificial second objective creates the necessary trade-off for Pareto front approximation, allowing the method to be seamlessly integrated into existing single-objective models with minimal overhead.
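To see why a second objective is needed, it helps to look at how a preference vector selects a trade-off. The article does not specify the form of the auxiliary distortion loss, so the sketch below treats both objectives as plain loss values and uses simple linear scalarization, which is a stand-in and not necessarily the technique used in the paper. The candidate loss pairs and the names `scalarize` and `best_trade_off` are hypothetical.

```python
from typing import Sequence, Tuple


def scalarize(losses: Sequence[float], preference: Sequence[float]) -> float:
    """Combine per-objective losses into one number (linear scalarization)."""
    if len(losses) != len(preference):
        raise ValueError("losses and preference must have the same length")
    return sum(weight * loss for weight, loss in zip(preference, losses))


def best_trade_off(
    candidates: Sequence[Tuple[float, float]],
    preference: Sequence[float],
) -> Tuple[float, float]:
    """Pick the candidate with the lowest preference-weighted loss."""
    return min(candidates, key=lambda losses: scalarize(losses, preference))


# Hypothetical (main loss, auxiliary loss) pairs on a trade-off curve:
# improving one objective worsens the other.
CANDIDATES = [(0.1, 0.9), (0.3, 0.4), (0.9, 0.1)]

for preference in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    print(preference, best_trade_off(CANDIDATES, preference))
```

With only one objective there is no trade-off curve to move along, and every preference vector would select the same solution. The artificial second objective is what gives the preference vector something to choose between.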
The effectiveness of this strategy was rigorously validated through a large-scale online experiment conducted on the OTTO e-commerce platform. The experiment focused on session-based recommender systems, which suggest items to users based on their current browsing session. The study analyzed the relationships between various offline metrics and key online performance indicators (KPIs) such as click-through rate (CTR), post-click conversion rate (CVR), and total units sold.
A novel offline metric, ‘order density at 20’ (OD@20), was introduced to estimate the post-click conversion rate. This metric measures the empirical probability that a clicked item is ordered if it’s ranked within the top 20 positions. The study also used Recall@20 as an offline metric for predicting click-through rate. Furthermore, a product metric, Recall@20 multiplied by OD@20, was proposed as a strong offline proxy for predicting units sold.
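The three offline metrics can be sketched as follows. This is one reading of the definitions above, not the paper's reference implementation: the event layout (a ranked list, the clicked target item, and an `ordered` flag) and the function names are assumptions made for illustration, and the toy data uses a cutoff of 2 instead of 20 to keep the lists short.

```python
def recall_at_k(events, k=20):
    """Fraction of clicked target items that appear in the top-k ranking."""
    if not events:
        return 0.0
    hits = sum(1 for e in events if e["target"] in e["ranked"][:k])
    return hits / len(events)


def order_density_at_k(events, k=20):
    """Among clicked targets ranked in the top k, the fraction that were ordered."""
    hits = [e for e in events if e["target"] in e["ranked"][:k]]
    if not hits:
        return 0.0
    return sum(1 for e in hits if e["ordered"]) / len(hits)


# Toy evaluation data (hypothetical): each event holds the model's ranking,
# the item the user actually clicked, and whether that item was later ordered.
events = [
    {"ranked": ["a", "b", "c"], "target": "a", "ordered": True},
    {"ranked": ["a", "b", "c"], "target": "b", "ordered": False},
    {"ranked": ["a", "b", "c"], "target": "c", "ordered": True},
    {"ranked": ["d", "e", "f"], "target": "e", "ordered": False},
]

recall = recall_at_k(events, k=2)        # 3 of 4 targets are in the top 2
od = order_density_at_k(events, k=2)     # 1 of those 3 hits was ordered
print(recall, od, recall * od)
```

Under this reading, the product of the two metrics equals the fraction of all events whose target is both ranked in the top k and ordered (1 of 4 in the toy data), which gives some intuition for why it could serve as a proxy for units sold.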
The two-week online experiment covered around 26.5 million impressions and produced statistically significant results. Recall@20 was a strong positive predictor of CTR, and OD@20 showed a significant positive correlation with CVR. Crucially, the combined metric, Recall@20 · OD@20, was a significant positive predictor of units sold. Although Recall@20 was a negative predictor of CVR, sacrificing some OD@20 to increase Recall@20 proved to be the more efficient way to drive a higher number of units sold on the OTTO platform.
This research gives industry practitioners a practical tool for understanding the relationships between offline and online metrics, enabling more informed, data-driven decisions when optimizing recommender systems. The findings suggest that Pareto front approximation techniques hold significant promise for future research aimed at closing the persistent gap between offline evaluations and real-world online impact. For more details, see the full paper.


