TLDR: This paper introduces a geometric, computationally efficient method for valuing individual data points using statistical leverage scores, offering a practical alternative to the expensive Shapley valuation. It shows these scores satisfy key Shapley axioms and, with an extension to “ridge leverage scores,” ensure all relevant data contributes positively. The method provides theoretical guarantees for model performance on leverage-sampled subsets and empirically outperforms baselines in active learning without requiring gradients.
In the rapidly evolving landscape of machine learning, understanding the true value of individual data points has become a critical challenge. Traditional methods for data valuation, such as Shapley data valuation, offer a robust theoretical framework but often fall short in practical applications due to their immense computational cost, especially with large datasets. Imagine needing to retrain a model countless times just to figure out how important each piece of data is – it’s simply not scalable.
A new research paper, “Geometric Data Valuation via Leverage Scores,” by Rodrigo Mendoza-Smith from Isotropic, introduces an innovative geometric approach that promises to make data valuation both principled and practical. The core of this new method lies in statistical leverage scores, a concept borrowed from numerical linear algebra. These scores offer a way to quantify a datapoint’s structural influence, essentially measuring how much it expands the dataset’s representational space and contributes to the problem’s effective dimensionality.
The paper highlights that these geometric scores align well with the foundational principles of Shapley valuation, satisfying key axioms like dummy, efficiency, and symmetry. This means that a datapoint that adds nothing new receives zero value (dummy); the values of all datapoints sum to the total utility (efficiency); and datapoints whose contributions are interchangeable are valued equally (symmetry).
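To make the idea concrete, here is a minimal NumPy sketch of classical leverage scores, using the standard definition from numerical linear algebra (the squared row norms of the left singular vectors). The function name `leverage_scores` and the toy matrix are illustrative assumptions, not code from the paper, and the checks at the end only mirror the three axioms informally, with the rank of the data matrix standing in for total utility.

```python
import numpy as np

def leverage_scores(X):
    """Classical leverage score of each row of X (hypothetical helper)."""
    # Thin SVD: X = U S Vt. The leverage of row i is the squared norm of
    # row i of U, restricted to directions with non-negligible singular values.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    tol = s.max() * max(X.shape) * np.finfo(X.dtype).eps
    U_r = U[:, s > tol]
    return np.sum(U_r**2, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X[4] = X[1]   # a duplicate row: two interchangeable datapoints
X[5] = 0.0    # a zero row: a datapoint that adds nothing

scores = leverage_scores(X)

print("scores:", np.round(scores, 3))
print("sum of scores:", scores.sum(), "rank:", np.linalg.matrix_rank(X))
print("duplicate rows equal:", np.isclose(scores[1], scores[4]))
print("zero row has zero score:", np.isclose(scores[5], 0.0))
```

In this toy setup the scores sum to the rank of the matrix, the duplicated rows receive the same score, and the zero row receives none. The paper's formal statements of the axioms may be phrased differently.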
One of the significant advancements presented is the extension to “ridge leverage scores.” Standard leverage scores can suffer from “dimensional saturation,” where once the dataset’s feature space is fully covered, additional data points might be assigned zero marginal value. Ridge leverage scores overcome this limitation by ensuring that every non-zero datapoint still contributes positively, even after the initial feature space is spanned. This regularization technique connects naturally to classical optimal experimental design criteria, such as A- and D-optimality, which are used to select the most informative experiments.
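The saturation effect and the ridge fix can be sketched in a few lines. The code below assumes the common textbook form of ridge leverage scores, x_i^T (X^T X + λI)^{-1} x_i; the regularization value `lam`, the helper name `ridge_leverage_scores`, and the random data are assumptions made for illustration, and the paper's exact normalization may differ.

```python
import numpy as np

def ridge_leverage_scores(X, lam):
    """Ridge leverage score x_i^T (X^T X + lam I)^{-1} x_i for each row (hypothetical helper)."""
    d = X.shape[1]
    G = X.T @ X + lam * np.eye(d)
    # Solve G Z = X^T rather than forming the inverse explicitly.
    Z = np.linalg.solve(G, X.T)
    # Row-wise dot product between x_i and the i-th column of Z.
    return np.einsum("ij,ji->i", X, Z)

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))

# Dimensional saturation: how much each new point increases the rank
# when points are added one at a time.
rank_gain = [
    np.linalg.matrix_rank(X[: k + 1]) - (np.linalg.matrix_rank(X[:k]) if k else 0)
    for k in range(len(X))
]

lam = 0.5  # assumed regularization strength
ridge_scores = ridge_leverage_scores(X, lam)

print("marginal rank gain per point:", rank_gain)
print("ridge leverage scores:", np.round(ridge_scores, 3))
print("sum of ridge scores:", ridge_scores.sum())
```

Once three points span the three-dimensional feature space, every later point adds zero rank, yet each nonzero row still receives a strictly positive ridge score. The sum of the ridge scores stays below the feature dimension and is often read as an effective dimension.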
Beyond theoretical alignment, the research provides strong practical guarantees. The authors demonstrate that training a model on a subset of data selected using these leverage scores can achieve model parameters and predictive risk that are remarkably close (within O(ε)) to what would be achieved by training on the entire dataset. This establishes a clear and rigorous link between how data is valued and the quality of the decisions made by the downstream machine learning model.
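As a rough illustration of this kind of guarantee, the sketch below applies standard leverage-score row sampling to ordinary least squares and compares the subset fit with the full-data fit. This is a generic textbook construction on synthetic data, not the paper's experimental setup; the sizes `n`, `d`, `m`, the noise level, and the reweighting scheme are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 5, 400  # assumed: dataset size, feature count, subset size

# Synthetic linear regression problem.
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Leverage scores from a reduced QR factorization: squared row norms of Q.
Q, _ = np.linalg.qr(X)
lev = np.sum(Q**2, axis=1)
p = lev / lev.sum()

# Sample m rows with probability proportional to leverage, then reweight
# each sampled row by 1 / sqrt(m * p_i) so the sketched problem is unbiased.
idx = rng.choice(n, size=m, replace=True, p=p)
scale = 1.0 / np.sqrt(m * p[idx])

w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
w_sub, *_ = np.linalg.lstsq(X[idx] * scale[:, None], y[idx] * scale, rcond=None)

# Evaluate both parameter vectors on the full dataset.
risk_full = np.mean((X @ w_full - y) ** 2)
risk_sub = np.mean((X @ w_sub - y) ** 2)

print("parameter gap:", np.linalg.norm(w_sub - w_full))
print("risk (full data):", risk_full)
print("risk (leverage-sampled subset):", risk_sub)
```

Here the model trained on a fifth of the rows is evaluated on the full dataset, so the two risks are directly comparable. How small the gap is depends on the subset size and noise level, which is the role ε plays in the paper's bound.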
To empirically validate their approach, the researchers conducted an active learning experiment using the MNIST dataset and a 3-layer Multi-Layer Perceptron. In this experiment, ridge-leverage sampling consistently outperformed several standard active learning baselines, including K-center, Margin, Entropy, and Expected Gradient Length. Crucially, this superior performance was achieved without needing access to gradients or requiring computationally expensive backward passes, making it a highly efficient and robust strategy for data-efficient learning.
This work offers a model-agnostic perspective on data valuation, focusing on the inherent structure of the dataset itself. It provides a computationally feasible and theoretically sound alternative to traditional Shapley values, opening new avenues for tasks such as identifying mislabeled or redundant data, creating compact and effective training subsets, and even designing fair data markets. For more technical details, refer to the full research paper.


