TLDR: A research note by Parker Whitfill highlights a potential selection bias in observational studies, such as Ho et al., that estimate algorithmic progress in language models. The note argues that if unobserved algorithmic quality influences compute choices, then estimates of progress can be biased. Simulations confirm this: the estimated rate of algorithmic progress can be significantly over- or underestimated depending on the correlation between latent algorithmic quality and compute usage.
A recent research note by Parker Whitfill examines a critical methodological challenge in estimating the true rate of algorithmic progress in artificial intelligence, particularly for large language models. The note, titled “Note on Selection Bias in Observational Estimates of Algorithmic Progress,” scrutinizes the approach taken by previous studies, such as Ho et al., which attempt to quantify how efficiently language models improve over time.
Ho et al. gathered observational data on language models’ performance (loss) and the computational resources (compute) used to train them over time. They concluded that algorithmic efficiency has been steadily increasing: models achieve better performance for a fixed amount of compute as time progresses. This is a significant finding for understanding the pace of AI development.
However, Whitfill’s note raises a crucial concern: the potential for selection bias. The core argument is that if certain aspects of algorithmic quality are unobservable (latent), and AI labs’ decisions about compute usage are influenced by this unobserved quality, then estimates of algorithmic progress derived from observational data may not be accurate or unbiased. Imagine a lab discovers a breakthrough in algorithmic efficiency (a latent quality). That breakthrough might influence how much data it uses or how many parameters it trains in its next model. If this relationship isn’t accounted for, the observed improvements can be misattributed.
To illustrate, Ho et al.’s original model estimates loss based on factors like the number of model parameters (N), training data points (D), and productivity factors (qN, qD) that capture efficiency. They assumed these productivity factors grow deterministically with calendar time. Whitfill challenges this by suggesting that productivity factors also include a random, unobserved component (ϵN, ϵD) that represents within-year, across-lab algorithmic heterogeneity. For instance, one lab might have inherently better algorithms than another in the same year, even if both are working on similar models.
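Schematically, this setup has a Chinchilla-style form (the notation here is simplified; see Ho et al. and the note for the exact parameterization):

```latex
L(N, D) = E + \frac{A}{(q_N N)^{\alpha}} + \frac{B}{(q_D D)^{\beta}},
\qquad
\log q_N = g_N \, t + \epsilon_N, \quad \log q_D = g_D \, t + \epsilon_D,
```

where $t$ is calendar time, $g_N$ and $g_D$ are deterministic growth rates, and $\epsilon_N, \epsilon_D$ are the unobserved across-lab heterogeneity terms Whitfill introduces.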
When this unobserved heterogeneity (ϵD) is correlated with the chosen dataset size (D), and we try to estimate the original model without accounting for ϵD, our estimates can become biased. This is akin to a classic “omitted variable problem” in statistics. If, for example, labs with better algorithms (higher ϵD) also tend to use larger datasets (higher D), then the observed decrease in loss might be partly due to the better algorithms, but the statistical model might incorrectly attribute all the improvement to the larger dataset, leading to a biased estimate of the impact of data scaling.
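The mechanics are easy to demonstrate with a toy regression (a hypothetical sketch, not the paper’s model: here the true effect of log dataset size on log loss is −0.4, and the omitted quality term eps is positively correlated with log D):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta_true = 0.4  # true magnitude of the effect of log dataset size on log loss

# Latent algorithmic quality: unobserved by the analyst.
eps = rng.normal(size=n)
# Labs with better algorithms also choose larger datasets (positive correlation).
log_D = eps + rng.normal(size=n)
# Loss falls with both dataset size and algorithmic quality.
log_loss = -beta_true * log_D - eps + rng.normal(scale=0.1, size=n)

# Regress log loss on log D alone, omitting eps: the classic omitted-variable setup.
X = np.column_stack([np.ones(n), log_D])
coef = np.linalg.lstsq(X, log_loss, rcond=None)[0][1]
print(coef)  # roughly -0.9, not -0.4: the effect of eps is loaded onto log D
```

The data coefficient absorbs the quality effect, so scaling data looks more than twice as powerful as it really is; in the full model, the same mechanism distorts the estimated rate of progress.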
The paper formally analyzes this bias, focusing on the estimated rate of algorithmic progress. A key finding is that the sign of the bias is opposite to the sign of the correlation between the logarithm of the dataset size and the unobserved algorithmic quality: if better algorithms lead to larger datasets, algorithmic progress tends to be underestimated, and vice versa.
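In symbols, the result can be stated schematically (the notation here is mine, not the paper’s exact theorem):

```latex
\operatorname{sign}\!\left( \operatorname{plim}\,\hat{g} - g \right)
= -\operatorname{sign}\!\left( \operatorname{Corr}(\log D, \epsilon_D) \right),
```

where $g$ is the true rate of algorithmic progress and $\hat{g}$ is its observational estimate.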
The direction of this correlation is not immediately obvious. On one hand, firms with superior data engineering pipelines and algorithmic know-how might naturally accumulate larger, higher-quality datasets, suggesting a positive correlation. On the other hand, if a lab develops exceptionally efficient algorithms, they might achieve their desired performance with less data, potentially leading to a negative correlation. For example, if Anthropic has better algorithms than xAI, they might not need to scale their data as much to achieve a leading model, implying that better algorithms (high ϵ) could be associated with smaller datasets (low D).
Empirical evidence cited in the paper hints at a negative bias. Studies like Hoffmann et al. and Besiroglu et al., which estimate the impact of dataset size experimentally (thus avoiding selection bias), found a data-scaling parameter (β) of around 0.37. In contrast, Ho et al. estimated β to be around 0.04. This large discrepancy could indicate a negative bias in Ho et al.’s estimate of β, which in turn could mean that their estimate of algorithmic progress is overstated, possibly by a factor of nine.
To further validate these theoretical insights and quantify the potential magnitude of the bias, Whitfill conducted Monte Carlo simulations. These simulations, which used the actual dataset sizes and years from Ho et al.’s data, confirmed the theorem: a positive correlation between latent quality and dataset size led to underestimation of progress, while a negative correlation led to overestimation. The bias can be economically significant: a true 45% annual rate of progress could be estimated as low as 16.5% or as high as 93%, depending on the correlation.
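A stripped-down version of that experiment can be sketched as follows (a hypothetical log-linear setup with illustrative constants, not Whitfill’s actual simulation code or Ho et al.’s data):

```python
import numpy as np

def estimated_year_effect(rho, n=200_000, beta=0.37, g=0.5, a=0.4):
    """Estimate the yearly decline in log loss while omitting the latent
    quality term eps. rho sets how strongly eps loads onto log dataset size."""
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 10, size=n)                   # calendar year
    eps = rng.normal(size=n)                         # latent algorithmic quality
    log_D = a * t + rho * eps + rng.normal(size=n)   # chosen dataset size
    log_L = 2.0 - beta * (g * t + eps + log_D) + rng.normal(scale=0.05, size=n)
    X = np.column_stack([np.ones(n), t, log_D])      # regressors exclude eps
    return np.linalg.lstsq(X, log_L, rcond=None)[0][1]

true_effect = -0.37 * 0.5  # loss decline per year implied by the true model
print(true_effect)
print(estimated_year_effect(rho=0.0))   # close to the truth when eps is uncorrelated
print(estimated_year_effect(rho=1.0))   # less negative: progress underestimated
print(estimated_year_effect(rho=-0.5))  # more negative: progress overestimated
```

The sign pattern matches the theorem: nothing changes between runs except the correlation between latent quality and dataset size, yet it alone flips the estimate from too low to too high.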
In conclusion, this research note serves as an important caution for anyone attempting to infer algorithmic progress from observational data. The endogeneity of compute choices to algorithmic quality is a pervasive issue. The paper suggests that future research could mitigate this problem by focusing on experimental designs where compute or algorithms are randomly assigned, or by finding plausible instrumental variables that exogenously vary compute allocations without affecting algorithm quality. For more technical details, you can refer to the full paper available at arXiv:2508.11033.


