TLDR: A research note by Parker Whitfill highlights a potential selection bias in observational studies, such as Ho et al., that estimate algorithmic progress in language models. The note argues that if unobserved algorithmic quality influences compute choices, then estimates of progress can be biased. Simulations confirm this: the estimated rate of algorithmic progress can be significantly over- or underestimated depending on the correlation between latent algorithmic quality and compute usage.
A recent research note by Parker Whitfill examines a critical methodological challenge in estimating the true rate of algorithmic progress in artificial intelligence, particularly for large language models. The note, titled “Note on Selection Bias in Observational Estimates of Algorithmic Progress,” scrutinizes the approach taken by previous studies, such as Ho et al., which attempt to quantify how efficiently language models improve over time.
Ho et al. gathered observational data on language models’ performance (loss) and the computational resources (compute) used to train them over time. They concluded that algorithmic efficiency has been steadily increasing: models achieve better performance for a fixed amount of compute as time progresses. This is a significant finding for understanding the pace of AI development.
However, Whitfill’s note raises a crucial concern: the potential for selection bias. The core argument is that if certain aspects of algorithmic quality are unobservable (latent), and AI labs’ decisions about compute usage are influenced by this unobserved quality, then estimates of algorithmic progress derived from observational data may not be accurate or unbiased. Imagine a lab discovers a breakthrough in algorithmic efficiency (a latent quality). That breakthrough might influence how much data it uses or how many parameters it trains in its next model. If this relationship isn’t accounted for, the observed improvements can be misattributed.
To illustrate, Ho et al.’s original model estimates loss based on factors like the number of model parameters (N), training data points (D), and productivity factors (qN, qD) that capture efficiency. They assumed these productivity factors grow deterministically with calendar time. Whitfill challenges this by suggesting that productivity factors also include a random, unobserved component (ϵN, ϵD) that represents within-year, across-lab algorithmic heterogeneity. For instance, one lab might have inherently better algorithms than another in the same year, even if both are working on similar models.
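Schematically, this setup has a Chinchilla-style form (the notation here is simplified; see Ho et al. and the note for the exact parameterization):

```latex
L(N, D) = E + \frac{A}{(q_N N)^{\alpha}} + \frac{B}{(q_D D)^{\beta}},
\qquad
\log q_N = g_N \, t + \epsilon_N, \quad \log q_D = g_D \, t + \epsilon_D,
```

where $t$ is calendar time, $g_N$ and $g_D$ are deterministic growth rates, and $\epsilon_N, \epsilon_D$ are the unobserved across-lab heterogeneity terms Whitfill introduces.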
When this unobserved heterogeneity (ϵD) is correlated with the chosen dataset size (D), and we try to estimate the original model without accounting for ϵD, our estimates can become biased. This is akin to a classic “omitted variable problem” in statistics. If, for example, labs with better algorithms (higher ϵD) also tend to use larger datasets (higher D), then the observed decrease in loss might be partly due to the better algorithms, but the statistical model might incorrectly attribute all the improvement to the larger dataset, leading to a biased estimate of the impact of data scaling.
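The mechanics are easy to demonstrate with a toy regression (a hypothetical sketch, not the paper’s model: here the true effect of log dataset size on log loss is −0.4, and the omitted quality term eps is positively correlated with log D):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta_true = 0.4  # true magnitude of the effect of log dataset size on log loss

# Latent algorithmic quality: unobserved by the analyst.
eps = rng.normal(size=n)
# Labs with better algorithms also choose larger datasets (positive correlation).
log_D = eps + rng.normal(size=n)
# Loss falls with both dataset size and algorithmic quality.
log_loss = -beta_true * log_D - eps + rng.normal(scale=0.1, size=n)

# Regress log loss on log D alone, omitting eps: the classic omitted-variable setup.
X = np.column_stack([np.ones(n), log_D])
coef = np.linalg.lstsq(X, log_loss, rcond=None)[0][1]
print(coef)  # roughly -0.9, not -0.4: the effect of eps is loaded onto log D
```

The data coefficient absorbs the quality effect, so scaling data looks more than twice as powerful as it really is; in the full model, the same mechanism distorts the estimated rate of progress.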
The paper formally analyzes this bias, focusing on the estimated rate of algorithmic progress. A key finding is that the sign of the bias is opposite to the sign of the correlation between the logarithm of the dataset size and the unobserved algorithmic quality: if better algorithms lead to larger datasets, algorithmic progress tends to be underestimated, and vice versa.
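In symbols, the result can be stated schematically (the notation here is mine, not the paper’s exact theorem):

```latex
\operatorname{sign}\!\left( \operatorname{plim}\,\hat{g} - g \right)
= -\operatorname{sign}\!\left( \operatorname{Corr}(\log D, \epsilon_D) \right),
```

where $g$ is the true rate of algorithmic progress and $\hat{g}$ is its observational estimate.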
The direction of this correlation is not immediately obvious. On one hand, firms with superior data engineering pipelines and algorithmic know-how might naturally accumulate larger, higher-quality datasets, suggesting a positive correlation. On the other hand, if a lab develops exceptionally efficient algorithms, they might achieve their desired performance with less data, potentially leading to a negative correlation. For example, if Anthropic has better algorithms than xAI, they might not need to scale their data as much to achieve a leading model, implying that better algorithms (high ϵ) could be associated with smaller datasets (low D).
Empirical evidence cited in the paper hints at a negative bias. Studies like Hoffmann et al. and Besiroglu et al., which estimate the impact of dataset size experimentally (thus avoiding selection bias), found a data-scaling parameter (β) of around 0.37. In contrast, Ho et al. estimated β to be around 0.04. This large discrepancy could indicate a negative bias in Ho et al.’s estimate of β, which in turn could mean that their estimate of algorithmic progress is overstated, possibly by a factor of nine.
To further validate these theoretical insights and quantify the potential magnitude of the bias, Whitfill conducted Monte Carlo simulations. These simulations, which used the actual dataset sizes and years from Ho et al.’s data, confirmed the theorem: a positive correlation between latent quality and dataset size led to underestimation of progress, while a negative correlation led to overestimation. The bias can be economically significant: a true 45% annual rate of progress could be estimated as low as 16.5% or as high as 93%, depending on the correlation.
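A stripped-down version of that experiment can be sketched as follows (a hypothetical log-linear setup with illustrative constants, not Whitfill’s actual simulation code or Ho et al.’s data):

```python
import numpy as np

def estimated_year_effect(rho, n=200_000, beta=0.37, g=0.5, a=0.4):
    """Estimate the yearly decline in log loss while omitting the latent
    quality term eps. rho sets how strongly eps loads onto log dataset size."""
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 10, size=n)                   # calendar year
    eps = rng.normal(size=n)                         # latent algorithmic quality
    log_D = a * t + rho * eps + rng.normal(size=n)   # chosen dataset size
    log_L = 2.0 - beta * (g * t + eps + log_D) + rng.normal(scale=0.05, size=n)
    X = np.column_stack([np.ones(n), t, log_D])      # regressors exclude eps
    return np.linalg.lstsq(X, log_L, rcond=None)[0][1]

true_effect = -0.37 * 0.5  # loss decline per year implied by the true model
print(true_effect)
print(estimated_year_effect(rho=0.0))   # close to the truth when eps is uncorrelated
print(estimated_year_effect(rho=1.0))   # less negative: progress underestimated
print(estimated_year_effect(rho=-0.5))  # more negative: progress overestimated
```

The sign pattern matches the theorem: nothing changes between runs except the correlation between latent quality and dataset size, yet it alone flips the estimate from too low to too high.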
In conclusion, this research note serves as an important caution for anyone attempting to infer algorithmic progress from observational data. The endogeneity of compute choices to algorithmic quality is a pervasive issue. The paper suggests that future research could mitigate this problem by focusing on experimental designs where compute or algorithms are randomly assigned, or by finding plausible instrumental variables that exogenously vary compute allocations without affecting algorithm quality. For more technical details, you can refer to the full paper available at arXiv:2508.11033.


