TLDR: The research paper analyzes the co-evolution of large language model (LLM) production and the benchmarks used to evaluate them. It finds that LLM creation is rapidly decentralizing across organizations and countries, with declining transparency. In contrast, benchmark influence is centralizing, with a small number of institutions accounting for a disproportionately large share of “benchmark authority.” This concentration offers coordination benefits but also risks path dependence and over-optimization. An agent-based simulation suggests that increasing the entry rate of new benchmarks is more effective at reducing concentration than penalizing over-fit tests. The paper highlights the need for improved transparency in models and broader participation in benchmark development to ensure a robust and equitable AI ecosystem.
The world of large language models (LLMs) is expanding at an incredible pace, with new models emerging from diverse organizations and countries. This rapid growth, however, presents a unique challenge: how do we effectively evaluate these models when the landscape is constantly shifting? A recent research paper, “Emergent evaluation hubs in a decentralizing large language model ecosystem,” delves into this very question, exploring the co-evolution of LLM production and their evaluation benchmarks.
The Expanding Universe of LLMs
The study highlights a significant trend: the number of foundation models has surged, particularly since 2021. Annual releases tripled in 2022 and exceeded 180 in 2023, pushing the cumulative total above 400 by early 2025. This expansion isn’t just in quantity; the scale of these models has also grown, with many pushing towards the trillion-parameter mark. What’s more, the creators of these models have diversified dramatically. Where once a handful of well-known labs dominated, a broad, decentralized field of over 150 organizations, including startups, medium-sized firms, and academic institutions, now contributes to LLM development.
Despite this impressive growth, the paper points out a concerning trend: transparency and open access are not keeping pace. Documentation quality has declined, with fewer models reporting training emissions, hardware, or even basic parameter counts. Similarly, the share of models released with permissive open-source licenses and downloadable weights has plummeted since 2019; roughly half of new models are either fully closed or ambiguous in their usage rights. Geographically, model production remains concentrated, with the United States, China, and the United Kingdom being the primary contributors, while large regions such as Africa and South America are largely absent.
The Centralizing Force of Benchmarks
In contrast to the decentralizing nature of model production, the evaluation landscape shows a different pattern: centralization. Benchmarks, which serve as common yardsticks for assessing LLMs, have also seen a rapid increase in number and diversity since 2021. These benchmarks now cover a wide array of capabilities, from core language understanding to safety and multimodal tasks. Author participation in benchmark creation has also accelerated, with the cumulative count of unique authors climbing past 1,800 by early 2025.
However, despite this broad participation, the influence of these benchmarks is highly concentrated. The study introduces a “benchmark authority score” that integrates scholarly citations and developer engagement (GitHub stars). It found that the top three institutions alone account for nearly half of all benchmark authority, and the top ten organizations hold over 60%. This concentration is significantly higher than that observed in model production. For instance, the most prolific model producer accounts for about 17% of all models, while the top benchmark contributor holds approximately 31% of benchmark authority. This suggests that while many contribute to creating benchmarks, a select few institutions and countries effectively set the standards for evaluation.
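The paper’s exact formula for the authority score isn’t reproduced in this summary, but the underlying idea, blending normalized citation and star counts and then measuring top-k shares, can be sketched in a few lines of Python. This is a minimal illustration under an assumed equal-weight blend; all benchmark names, institutions, and counts are made up:

```python
# A minimal sketch of a "benchmark authority score" and top-k concentration.
# The paper's exact weighting of citations vs. GitHub stars is not given here;
# this assumes an equal-weight blend of the two normalized signals, and all
# benchmark names, institutions, and counts below are hypothetical.
from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    institution: str
    citations: int
    github_stars: int

def authority_scores(benchmarks: list[Benchmark]) -> dict[str, float]:
    """Aggregate a blended citation/star share per contributing institution."""
    total_cites = sum(b.citations for b in benchmarks) or 1
    total_stars = sum(b.github_stars for b in benchmarks) or 1
    scores: dict[str, float] = {}
    for b in benchmarks:
        # Normalize each signal to a share of its total, then average the two.
        blended = 0.5 * b.citations / total_cites + 0.5 * b.github_stars / total_stars
        scores[b.institution] = scores.get(b.institution, 0.0) + blended
    return scores

def top_k_share(scores: dict[str, float], k: int) -> float:
    """Fraction of total authority held by the k highest-scoring institutions."""
    ranked = sorted(scores.values(), reverse=True)
    return sum(ranked[:k]) / sum(ranked)

benchmarks = [
    Benchmark("bench-a", "inst-1", citations=5_000, github_stars=12_000),
    Benchmark("bench-b", "inst-2", citations=3_000, github_stars=4_000),
    Benchmark("bench-c", "inst-3", citations=1_500, github_stars=2_500),
    Benchmark("bench-d", "inst-4", citations=400, github_stars=300),
]
print(f"Top-3 share of authority: {top_k_share(authority_scores(benchmarks), 3):.0%}")
```

Any measure of this shape makes the paper’s comparison concrete: the same top-k share computed over model counts on the production side comes out far lower than it does over benchmark authority.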
Coordination Benefits and Trade-offs
This concentration of benchmark influence offers both benefits and trade-offs. On the positive side, widely adopted benchmarks provide shared yardsticks, reducing noise and aiding comparability across different models. They act as a coordination infrastructure, supporting standardization and reproducibility in a rapidly diversifying field. However, this centralization also introduces risks. It can lead to “path dependence,” where research focus is steered towards optimizing for specific, widely used tests, potentially neglecting other important capabilities or failure cases. This can also amplify over-optimization incentives, where models are tuned specifically to perform well on leaderboards, potentially overstating their general capabilities.
Through an agent-based simulation, the paper also explores the mechanisms that shape this concentration. It found that a higher entry rate of new benchmarks significantly reduces concentration, leading to a more pluralistic evaluation field, whereas penalizing “over-fit” or stale tests had a much smaller effect. This suggests that fostering the creation of new, diverse benchmarks is key to a more balanced evaluation ecosystem.
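The simulation’s details aren’t spelled out in this summary, but the qualitative finding can be reproduced with a toy model. The sketch below makes illustrative choices that are not the paper’s specification: benchmark adoption follows rich-get-richer preferential attachment, new benchmarks enter with probability entry_rate per step, and concentration is measured with a Herfindahl-Hirschman index (HHI):

```python
# A toy agent-based model of benchmark adoption (illustrative assumptions,
# not the paper's specification): adoption follows rich-get-richer
# preferential attachment, new benchmarks arrive with probability
# `entry_rate` per step, and concentration is measured with a
# Herfindahl-Hirschman index (HHI) over adoption shares.
import random

def simulate(entry_rate: float, steps: int = 20_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    adoptions = [1]  # start with one benchmark holding a single adoption
    for _ in range(steps):
        if rng.random() < entry_rate:
            adoptions.append(1)  # a new benchmark enters the field
        else:
            # Existing benchmarks attract adoptions in proportion
            # to their current counts (preferential attachment).
            i = rng.choices(range(len(adoptions)), weights=adoptions)[0]
            adoptions[i] += 1
    total = sum(adoptions)
    return sum((a / total) ** 2 for a in adoptions)  # HHI in (0, 1]

for rate in (0.001, 0.01, 0.05):
    print(f"entry_rate={rate:<6} HHI={simulate(rate):.3f}")
```

In this toy model, raising the entry rate reliably drives the HHI down: a steady stream of new entrants dilutes the rich-get-richer dynamics before any single benchmark can lock in. A penalty rule, say, occasionally decaying the counts of the oldest benchmarks, could be bolted on to compare the two levers directly, mirroring the comparison the paper reports.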
Implications for the Future of AI Evaluation
The findings have important implications for researchers and policymakers. The declining transparency of models highlights the need for stronger reporting standards, such as comprehensive model cards and incentives for open-weight releases, to improve comparability and reproducibility. The concentrated influence of benchmarks calls for encouraging wider participation in their development, especially from international and underrepresented communities, to diversify evaluative perspectives and avoid narrow lenses.
Ultimately, the study suggests a balanced approach: leveraging widely recognized benchmarks for standardization while actively promoting a broader portfolio of well-documented, auditable benchmarks to ensure comprehensive coverage across tasks, languages, and modalities. This research provides a crucial lens for understanding the evolving dynamics of the LLM ecosystem and can inform the design of more robust and equitable evaluation infrastructures. For more details, you can read the full paper, “Emergent evaluation hubs in a decentralizing large language model ecosystem.”


