TLDR: A new research paper introduces LLM-Crowdsourced, a benchmark-free evaluation paradigm for large language models (LLMs). The method has LLMs generate questions, answer them independently, and evaluate one another, addressing data contamination, lack of transparency, and subjective bias in existing evaluation approaches. Experiments on mathematics and programming tasks with eight mainstream LLMs reveal significant performance differences, instances of ‘memorization-based answering,’ and high consistency in mutual evaluations, offering a dynamic, transparent, objective, and professional way to assess AI capabilities.
Evaluating the true capabilities of large language models (LLMs) has become a significant challenge in the rapidly evolving field of artificial intelligence. Traditional evaluation methods often fall short for several reasons: data contamination, where models may have seen the test data during training; black-box evaluation processes that lack transparency; and subjective human preferences that can bias results.
To address these persistent problems, researchers have introduced a groundbreaking new approach called LLM-Crowdsourced. This innovative paradigm offers a benchmark-free way to assess LLMs, where the models themselves take on the roles of question generators, independent answerers, and mutual evaluators. This self-contained system aims to provide a more comprehensive and reliable assessment of LLM performance.
The Core Principles of LLM-Crowdsourced
The LLM-Crowdsourced method is built upon four key evaluation criteria that existing methods struggle to meet simultaneously:
- Dynamic: Questions are generated on the fly by LLMs, ensuring fresh content for each evaluation round. This dynamic nature helps to prevent data contamination and keeps the benchmarks from becoming saturated over time.
- Transparent: Every step of the evaluation process, from question generation to answering and scoring, is made public and traceable. This allows external researchers to verify the results independently, boosting credibility.
- Objective: A decentralized mutual-evaluation mechanism, in which multiple LLMs assess one another, significantly reduces the influence of any individual’s subjective preferences (whether human or LLM-based), leading to fairer outcomes (see the sketch after this list).
- Professional: LLMs, with their vast knowledge bases, can generate and evaluate questions with a level of domain expertise comparable to human experts, particularly in specialized fields like mathematics and programming.
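To make the ‘Objective’ criterion concrete, here is a minimal sketch of how decentralized peer scores might be aggregated so that no single evaluator’s preference dominates. It is illustrative only and not taken from the paper; the function name and the 0–10 scoring scale are assumptions.

```python
from statistics import mean

def aggregate_peer_scores(scores_by_evaluator):
    """Average each answerer's scores across all peer evaluators.

    `scores_by_evaluator` maps an evaluator's name to a dict of
    {answerer name: score on a hypothetical 0-10 scale}. Averaging over
    many independent evaluators dampens any single model's preferences.
    """
    collected = {}
    for evaluator, scores in scores_by_evaluator.items():
        for answerer, score in scores.items():
            if answerer == evaluator:
                continue  # a model never scores its own answer
            collected.setdefault(answerer, []).append(score)
    return {answerer: mean(vals) for answerer, vals in collected.items()}

# Example: three evaluators scoring two peers each
print(aggregate_peer_scores({
    "model_a": {"model_b": 8.0, "model_c": 6.5},
    "model_b": {"model_a": 7.0, "model_c": 7.5},
    "model_c": {"model_a": 7.5, "model_b": 8.5},
}))
# {'model_b': 8.25, 'model_c': 7.0, 'model_a': 7.25}
```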
How the Evaluation Pipeline Works
The process unfolds in four distinct phases (a brief illustrative sketch follows the list):
- Generate Question: An LLM takes a turn as the ‘questioner,’ creating an original and challenging question along with a reference answer.
- Answer Independently: The other LLMs, excluding the questioner, then independently answer the posed question.
- Evaluate Mutually: Each LLM evaluates the answers provided by the other models, using the questioner’s reference answer and predefined scoring criteria.
- Update Ranking: Scores from the mutual evaluations are aggregated to update the LLMs’ rankings in real-time, providing continuous feedback.
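Putting the four phases together, one evaluation round can be pictured as a simple loop. The sketch below is a hypothetical outline, assuming generic `generate_question`, `answer`, and `evaluate` interfaces (these names are not from the released framework) and reusing the `aggregate_peer_scores` helper sketched earlier:

```python
def run_round(models, rankings):
    """One LLM-Crowdsourced round (illustrative sketch, not the released code).

    Each model takes a turn as the questioner; the remaining models
    answer independently and then score one another's answers against
    the questioner's reference answer.
    """
    for questioner in models:
        # Phase 1: generate an original question plus a reference answer.
        question, reference = questioner.generate_question()

        # Phase 2: every other model answers independently.
        answerers = [m for m in models if m is not questioner]
        answers = {m.name: m.answer(question) for m in answerers}

        # Phase 3: mutual evaluation against the reference answer.
        scores = {
            evaluator.name: {
                name: evaluator.evaluate(question, reference, text)
                for name, text in answers.items()
                if name != evaluator.name  # no self-scoring
            }
            for evaluator in answerers
        }

        # Phase 4: aggregate peer scores and update the running ranking.
        for name, avg_score in aggregate_peer_scores(scores).items():
            rankings[name] = rankings.get(name, 0.0) + avg_score
    return rankings
```

Rotating the questioner role means every model is eventually tested on material it did not write itself, which is what keeps the evaluation dynamic and resistant to contamination.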
Key Findings from Experiments
The researchers tested LLM-Crowdsourced on eight mainstream LLMs across two classic domains: mathematics and programming. The experiments yielded several fascinating insights:
- Distinguishing Performance: The method effectively highlighted significant differences in logical reasoning, engineering capability, and generalization ability among the LLMs. Models like Gemini 2.5 Pro and GPT-4.1 consistently demonstrated strong and stable performance in mathematics.
- Professional Question Design: Some LLMs, notably Gemini 2.5 Pro, showed an exceptional ability to create highly original and theoretically challenging mathematical questions, even combining areas such as number systems and complex analysis.
- “Memorization-Based Answering”: The study uncovered instances where LLMs would misrecognize a new question as a familiar one with a similar structure, applying a memorized solution rather than true reasoning. This phenomenon, observed multiple times, underscores the limitations of traditional fixed benchmarks.
- High Evaluation Consistency: Interestingly, LLMs demonstrated a high degree of consistency when evaluating each other’s answers, suggesting a robust and objective ‘consensus judgment standard’ among them.
- Question-Setting vs. Question-Solving: In programming tasks, LLMs with strong overall capabilities tended to excel in both generating difficult questions and providing high-quality solutions. For example, Gemini 2.5 Pro generated complex questions and optimized solutions efficiently, while others often relied on simpler, brute-force approaches.
This research marks a significant step towards a more scientific and standardized evaluation system for LLMs, addressing critical challenges faced by current methods. By open-sourcing their complete evaluation framework and experimental code, the team aims to foster community collaboration and advance the field of LLM evaluation. You can find more details about this innovative research in the full paper available at arXiv:2507.22359.