Geometric Insights for Valuing Data in Machine Learning

TLDR: This paper introduces a geometric, computationally efficient method for valuing individual data points using statistical leverage scores, offering a practical alternative to the expensive Shapley valuation. It shows these scores satisfy key Shapley axioms and, with an extension to “ridge leverage scores,” ensure all relevant data contributes positively. The method provides theoretical guarantees for model performance on leverage-sampled subsets and empirically outperforms baselines in active learning without requiring gradients.

Understanding the value of individual data points has become a critical challenge in machine learning. Traditional methods such as Shapley data valuation offer a robust theoretical framework but are often impractical because of their computational cost, especially on large datasets: exact Shapley values require evaluating the model's utility over exponentially many subsets of the data. Retraining a model that many times just to measure how important each datapoint is does not scale.

A new research paper, “Geometric Data Valuation via Leverage Scores,” by Rodrigo Mendoza-Smith from Isotropic, introduces an innovative geometric approach that promises to make data valuation both principled and practical. The core of this new method lies in statistical leverage scores, a concept borrowed from numerical linear algebra. These scores offer a way to quantify a datapoint’s structural influence, essentially measuring how much it expands the dataset’s representational space and contributes to the problem’s effective dimensionality.
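As a minimal sketch of the idea (not the paper's code), the leverage score of row x_i of a data matrix X is the i-th diagonal entry of the projection matrix X (X^T X)^+ X^T, which can be read off the left singular vectors of X. The function name `leverage_scores` and the random data below are illustrative assumptions, and the paper may normalize the scores differently.

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores: diagonal of the projection onto the column space of X."""
    # Thin SVD; left singular vectors with nonzero singular values span col(X).
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    tol = s.max() * max(X.shape) * np.finfo(s.dtype).eps
    U = U[:, s > tol]
    # Squared row norms of U equal the diagonal of U @ U.T.
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))      # hypothetical dataset: 50 points, 5 features
scores = leverage_scores(X)
```

Each score lies in [0, 1] and the scores sum to rank(X), so dividing by the rank gives values that sum to one. An all-zero row receives a score of zero.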

The paper highlights that these geometric scores align with the foundational principles of Shapley valuation, satisfying key axioms such as dummy, efficiency, and symmetry. That is, a datapoint that adds nothing new gets zero value (dummy); the values of all datapoints sum to the total utility (efficiency); and datapoints with identical contributions are valued equally (symmetry).

One of the significant advancements presented is the extension to “ridge leverage scores.” Standard leverage scores can suffer from “dimensional saturation,” where once the dataset’s feature space is fully covered, additional data points might be assigned zero marginal value. Ridge leverage scores overcome this limitation by ensuring that every non-zero datapoint still contributes positively, even after the initial feature space is spanned. This regularization technique connects naturally to classical optimal experimental design criteria, such as A- and D-optimality, which are used to select the most informative experiments.
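The saturation effect and its ridge fix can be illustrated numerically. The sketch below assumes the standard definition of a ridge leverage score, x_i^T (X^T X + λI)^{-1} x_i, and uses the log-determinant of the regularized Gram matrix as a stand-in for a D-optimality style utility; the paper's exact utility, the helper name `ridge_leverage_scores`, and the value of `lam` are assumptions made for illustration.

```python
import numpy as np

def ridge_leverage_scores(X, lam):
    """Ridge leverage scores x_i^T (X^T X + lam * I)^{-1} x_i for every row."""
    d = X.shape[1]
    G = X.T @ X + lam * np.eye(d)
    # Solve G Z = X^T instead of forming an explicit inverse.
    Z = np.linalg.solve(G, X.T)              # shape (d, n)
    return np.einsum("ij,ji->i", X, Z)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 points already span all 3 feature dimensions
x_new = rng.normal(size=3)     # a further, non-zero datapoint

# Without regularization, the rank cannot grow any further: zero marginal gain.
rank_gain = np.linalg.matrix_rank(np.vstack([X, x_new])) - np.linalg.matrix_rank(X)

# With regularization, the log-det gain equals log(1 + ridge leverage of x_new)
# by the matrix determinant lemma, which is positive for any non-zero x_new.
lam = 1.0
G = X.T @ X + lam * np.eye(3)
logdet_gain = np.log1p(x_new @ np.linalg.solve(G, x_new))
ridge_scores = ridge_leverage_scores(X, lam)
```

Here `rank_gain` is zero while `logdet_gain` is strictly positive, which mirrors the claim that regularization keeps every non-zero datapoint's contribution above zero.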

Beyond theoretical alignment, the research provides practical guarantees. The paper shows that training a model on a subset of data selected using these leverage scores yields model parameters and predictive risk within O(ε) of those obtained by training on the entire dataset. This establishes a rigorous link between how data is valued and the quality of the decisions made by the downstream machine learning model.
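To make the subset guarantee concrete, here is a small sketch using the standard leverage-score row sampling from randomized numerical linear algebra on a least-squares problem. The article does not describe the paper's exact sampling scheme or constants, so the sizes, noise level, and reweighting below are assumptions rather than the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 10, 500                      # illustrative sizes (assumptions)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Leverage scores from a thin QR factorization (X has full column rank here).
Q, _ = np.linalg.qr(X)
lev = np.sum(Q**2, axis=1)
p = lev / lev.sum()                          # sampling distribution over rows

# Draw m rows with replacement and reweight by 1/sqrt(m * p_i) so that the
# sampled Gram matrix is an unbiased estimate of X^T X.
idx = rng.choice(n, size=m, p=p)
scale = 1.0 / np.sqrt(m * p[idx])

w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
w_sub, *_ = np.linalg.lstsq(X[idx] * scale[:, None], y[idx] * scale, rcond=None)

# Relative distance between the subset solution and the full-data solution.
rel_err = np.linalg.norm(w_sub - w_full) / np.linalg.norm(w_full)
```

With a tenth of the rows, `rel_err` stays small in this synthetic setting, which is the kind of closeness the O(ε) statement formalizes.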

To empirically validate their approach, the researchers conducted an active learning experiment using the MNIST dataset and a 3-layer Multi-Layer Perceptron. In this experiment, ridge-leverage sampling consistently outperformed several standard active learning baselines, including K-center, Margin, Entropy, and Expected Gradient Length. Crucially, this superior performance was achieved without needing access to gradients or requiring computationally expensive backward passes, making it a highly efficient and robust strategy for data-efficient learning.
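The article does not spell out the selection rule used in the experiment. One plausible gradient-free variant, shown here purely as an assumption, greedily picks the pool point with the largest ridge leverage relative to the points already chosen, using feature vectors from a forward pass only. The helper `greedy_ridge_leverage_select`, the feature matrix, and `lam` are hypothetical.

```python
import numpy as np

def greedy_ridge_leverage_select(features, k, lam=1.0):
    """Greedily pick k row indices with the largest ridge leverage
    relative to the rows selected so far (hypothetical helper)."""
    n, d = features.shape
    A_inv = np.eye(d) / lam              # inverse of lam * I before any selection
    selected = []
    available = np.ones(n, dtype=bool)
    for _ in range(k):
        # Score x^T A_inv x for every pool point; mask out already chosen ones.
        scores = np.einsum("ij,jk,ik->i", features, A_inv, features)
        scores[~available] = -np.inf
        i = int(np.argmax(scores))
        selected.append(i)
        available[i] = False
        # Sherman-Morrison update of the inverse after adding x_i x_i^T.
        x = features[i]
        Ax = A_inv @ x
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
    return selected

rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 16))        # stand-in for forward-pass embeddings
picked = greedy_ridge_leverage_select(pool, k=10)
```

Because each selection shrinks the scores of points that lie in directions already covered, the rule favors points that add new directions to the selected set, and it needs no backward pass.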

This work offers a model-agnostic perspective on data valuation, focusing on the inherent structure of the dataset itself. It provides a computationally feasible and theoretically sound alternative to traditional Shapley values, opening new avenues for tasks such as identifying mislabeled or redundant data, creating compact and effective training subsets, and even designing fair data markets. For more technical details, refer to the full research paper.
