TLDR: This research explores the feasibility of data minimization for recommender systems, finding that significant user data reduction is technically possible without major performance loss. However, the practical implementation of data minimization is highly dependent on technical factors like the chosen recommendation model and performance targets, as well as user characteristics such as history size and preference complexity. The study introduces a novel “Greedy Removal” algorithm, which proves effective for stricter performance constraints, and highlights the challenge of establishing a universal standard for “necessary” data.
In an era where digital services heavily rely on personal information, the principle of data minimization has emerged as a crucial legal and ethical mandate. This principle, enshrined in regulations like the European Union’s GDPR, California’s CPRA, and others, dictates that personal data processing should be limited to what is strictly necessary for a specified purpose. However, operationalizing this for modern recommender systems, which thrive on vast amounts of user data, presents a significant challenge.
A recent study, titled “What Data is Really Necessary? A Feasibility Study of Inference Data Minimization for Recommender Systems”, delves into this complex issue. Authored by Jens Leysen, Marco Favier, and Bart Goethals from the University of Antwerp and Monash University, the paper explores the technical feasibility of reducing the amount of implicit feedback data used by recommender systems without compromising their performance.
Rethinking Data Necessity
Approaches to data minimization for personalization have traditionally focused on ‘size-constrained minimization,’ where a uniform amount of data is selected for all users. The authors argue that this is suboptimal because data necessity is inherently user-dependent: different users may require varying amounts of data depending on the complexity of their preferences or how difficult their behavior is to model accurately. To address this, the paper introduces a novel concept, ‘performance-constrained minimization.’ This approach aims to find the smallest data subset for each individual user that still achieves a specified target performance level relative to using their full history.
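As a rough illustration (not the paper's implementation), performance-constrained minimization can be framed as a per-user search: find the smallest subset of a user's history whose recommendation quality stays above a target fraction of the full-history quality. The `evaluate` function and the exhaustive search below are hypothetical stand-ins:

```python
from itertools import combinations

def minimize_user_data(model, history, evaluate, target=0.98):
    """Return the smallest subset of `history` whose recommendation
    quality is at least `target` times the full-history quality.
    `evaluate` is a hypothetical scoring function, not the paper's."""
    full_score = evaluate(model, history)
    # Search subsets from smallest to largest; this is exponential in
    # |history|, which is why the paper turns to heuristic and greedy
    # methods instead of exhaustive search.
    for size in range(1, len(history) + 1):
        for subset in combinations(history, size):
            if evaluate(model, list(subset)) >= target * full_score:
                return list(subset)
    return list(history)
```

The exhaustive loop makes the objective explicit, but in practice the methods discussed next approximate this search.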
Minimization Strategies Explored
The research evaluates several strategies for data minimization, categorizing them into heuristic and greedy (optimization-based) approaches. Heuristic methods, such as Random Selection, Most Popular/Least Popular Selection, and Embedding Similarity, use simpler criteria to select interactions. Greedy algorithms, on the other hand, iteratively build or remove subsets of interactions to optimize performance. These include Greedy Forward Selection (GFS) and Greedy Beam Forward Selection (GBFS), which add interactions, and a newly proposed method, Greedy Removal (GR).
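A forward-selection strategy in the spirit of GFS can be sketched as follows; `evaluate` is a hypothetical scoring function, and this simplified loop omits the beam search that distinguishes GBFS:

```python
def greedy_forward_selection(history, evaluate, target_score):
    """Start from an empty subset and greedily add the interaction that
    most improves the score, until the target is met or items run out.
    Illustrative sketch only; `evaluate` stands in for model scoring."""
    subset, remaining = [], list(history)
    while remaining and evaluate(subset) < target_score:
        # Pick the single interaction whose addition helps most.
        best = max(remaining, key=lambda item: evaluate(subset + [item]))
        subset.append(best)
        remaining.remove(best)
    return subset
```

Each iteration re-scores every candidate, which hints at the computational cost the authors report for greedy methods.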
Greedy Removal (GR) is a backward approach that starts with a user’s full interaction history and iteratively removes the ‘least informative’ interactions, ensuring that a predefined performance threshold is maintained. This method is particularly effective for stricter performance requirements, as it considers the complete interaction context from the outset, potentially preserving complex dependencies better than forward selection methods.
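The backward logic of Greedy Removal might look like the following sketch, again with a hypothetical `evaluate` function standing in for the actual model-based scoring:

```python
def greedy_removal(history, evaluate, retention=0.98):
    """Start from the full history and iteratively drop the interaction
    whose removal hurts the score least, while staying above the
    retention threshold. Illustrative sketch, not the paper's code."""
    subset = list(history)
    threshold = retention * evaluate(subset)
    while len(subset) > 1:
        # Score each candidate subset with one interaction removed.
        candidates = [(evaluate(subset[:i] + subset[i + 1:]), i)
                      for i in range(len(subset))]
        best_score, best_i = max(candidates)
        if best_score < threshold:
            break  # any further removal would violate the constraint
        subset.pop(best_i)
    return subset
```

Because it always evaluates removals against the complete remaining context, this backward search can preserve interaction dependencies that forward selection would never get the chance to combine.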
Key Findings: Feasibility and Challenges
The study demonstrates that substantial data reduction is indeed technically feasible, especially when perfect performance retention is not an absolute requirement. For instance, when a slight performance decrease was allowed (98% retention), the EASE model needed only 49.3% to 67.3% of the original data, and the ItemKNN model as little as 15.5% to 49.1%. Demanding perfect performance retention (100%), however, proved largely impractical for some models such as EASE, permitting very little data removal.
This highlights a critical insight: the “effective difficulty” of data minimization is not just about the performance threshold but also depends on the recommendation model, the evaluation metric, and the specific dataset. These technical choices significantly influence what data is deemed “necessary,” making a universal interpretation of data minimization challenging.
The Role of User Characteristics
The research also investigated how user characteristics, particularly history size, influence data reduction and computational cost. The findings indicate that users with longer interaction histories generally allow for greater proportional data reduction (lower Minimization Ratio). This suggests that longer histories might contain more redundant information. However, minimizing data for these users also incurs a higher computational cost due to the larger search space involved.
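Assuming the Minimization Ratio is simply the fraction of a user's history retained after minimization (our reading of the metric; the paper's exact definition may differ), it can be computed as:

```python
def minimization_ratio(minimized_history, full_history):
    """Fraction of a user's full history retained after minimization;
    lower values indicate greater proportional data reduction."""
    return len(minimized_history) / len(full_history)
```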
Interestingly, the study observed significant variance in minimization potential even among users with similar history sizes. This suggests that other latent factors, such as the complexity or predictability of user preferences, also play a crucial role in determining how much data can be removed without impacting recommendation quality.
Conclusion and Future Directions
The paper concludes that while substantial, per-user inference data reduction is technically feasible for recommender systems, operationalizing the legal principle of data minimization is complex. Data necessity is not absolute; it is highly dependent on the technical setting (algorithms, metrics, goals) and individual user characteristics. This challenges the idea of a standardized approach to data minimization.
The authors call for further research into practical and scalable minimization techniques, especially considering the significant computational costs of current methods. They also emphasize the need to explore how data necessity is defined when recommender systems are evaluated across multiple objectives, such as diversity and serendipity, not just ranking accuracy.