TLDR: A new research paper introduces LLM-Crowdsourced, a benchmark-free evaluation paradigm for large language models (LLMs). The method has LLMs generate questions, answer them independently, and evaluate one another, addressing data contamination, lack of transparency, and subjective bias in existing evaluation approaches. Experiments on mathematics and programming tasks with eight mainstream LLMs reveal significant performance differences, instances of ‘memorization-based answering,’ and high consistency in mutual evaluations, offering a dynamic, transparent, objective, and professional way to assess AI capabilities.
Evaluating the true capabilities of large language models (LLMs) has become a significant challenge in the rapidly evolving field of artificial intelligence. Traditional evaluation methods often fall short for several reasons: data contamination, where models may have seen the test data during training; black-box evaluation processes that lack transparency; and subjective human preferences that can bias results.
To address these persistent problems, researchers have introduced a groundbreaking new approach called LLM-Crowdsourced. This innovative paradigm offers a benchmark-free way to assess LLMs, where the models themselves take on the roles of question generators, independent answerers, and mutual evaluators. This self-contained system aims to provide a more comprehensive and reliable assessment of LLM performance.
The Core Principles of LLM-Crowdsourced
The LLM-Crowdsourced method is built upon four key evaluation criteria that existing methods struggle to meet simultaneously:
- Dynamic: Questions are generated on the fly by LLMs, ensuring fresh content for each evaluation round. This dynamic nature helps to prevent data contamination and keeps the benchmarks from becoming saturated over time.
- Transparent: Every step of the evaluation process, from question generation to answering and scoring, is made public and traceable. This allows external researchers to verify the results independently, boosting credibility.
- Objective: A decentralized mutual-evaluation mechanism, in which multiple LLMs assess one another, significantly reduces the influence of any individual’s subjective preferences (whether human or LLM-based), leading to fairer outcomes (see the sketch after this list).
- Professional: LLMs, with their vast knowledge bases, can generate and evaluate questions with a level of domain expertise comparable to human experts, particularly in specialized fields like mathematics and programming.
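To make the ‘Objective’ criterion concrete, here is a minimal sketch of how decentralized peer scores might be aggregated so that no single evaluator’s preference dominates. It is illustrative only and not taken from the paper; the function name and the 0–10 scoring scale are assumptions.

```python
from statistics import mean

def aggregate_peer_scores(scores_by_evaluator):
    """Average each answerer's scores across all peer evaluators.

    `scores_by_evaluator` maps an evaluator's name to a dict of
    {answerer name: score on a hypothetical 0-10 scale}. Averaging over
    many independent evaluators dampens any single model's preferences.
    """
    collected = {}
    for evaluator, scores in scores_by_evaluator.items():
        for answerer, score in scores.items():
            if answerer == evaluator:
                continue  # a model never scores its own answer
            collected.setdefault(answerer, []).append(score)
    return {answerer: mean(vals) for answerer, vals in collected.items()}

# Example: three evaluators scoring two peers each
print(aggregate_peer_scores({
    "model_a": {"model_b": 8.0, "model_c": 6.5},
    "model_b": {"model_a": 7.0, "model_c": 7.5},
    "model_c": {"model_a": 7.5, "model_b": 8.5},
}))
# {'model_b': 8.25, 'model_c': 7.0, 'model_a': 7.25}
```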
How the Evaluation Pipeline Works
The process unfolds in four distinct phases (a brief illustrative sketch follows the list):
- Generate Question: An LLM takes a turn as the ‘questioner,’ creating an original and challenging question along with a reference answer.
- Answer Independently: The other LLMs, excluding the questioner, then independently answer the posed question.
- Evaluate Mutually: Each LLM evaluates the answers provided by the other models, using the questioner’s reference answer and predefined scoring criteria.
- Update Ranking: Scores from the mutual evaluations are aggregated to update the LLMs’ rankings in real-time, providing continuous feedback.
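Putting the four phases together, one evaluation round can be pictured as a simple loop. The sketch below is a hypothetical outline, assuming generic `generate_question`, `answer`, and `evaluate` interfaces (these names are not from the released framework) and reusing the `aggregate_peer_scores` helper sketched earlier:

```python
def run_round(models, rankings):
    """One LLM-Crowdsourced round (illustrative sketch, not the released code).

    Each model takes a turn as the questioner; the remaining models
    answer independently and then score one another's answers against
    the questioner's reference answer.
    """
    for questioner in models:
        # Phase 1: generate an original question plus a reference answer.
        question, reference = questioner.generate_question()

        # Phase 2: every other model answers independently.
        answerers = [m for m in models if m is not questioner]
        answers = {m.name: m.answer(question) for m in answerers}

        # Phase 3: mutual evaluation against the reference answer.
        scores = {
            evaluator.name: {
                name: evaluator.evaluate(question, reference, text)
                for name, text in answers.items()
                if name != evaluator.name  # no self-scoring
            }
            for evaluator in answerers
        }

        # Phase 4: aggregate peer scores and update the running ranking.
        for name, avg_score in aggregate_peer_scores(scores).items():
            rankings[name] = rankings.get(name, 0.0) + avg_score
    return rankings
```

Rotating the questioner role means every model is eventually tested on material it did not write itself, which is what keeps the evaluation dynamic and resistant to contamination.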
Key Findings from Experiments
The researchers tested LLM-Crowdsourced on eight mainstream LLMs across two classic domains: mathematics and programming. The experiments yielded several fascinating insights:
- Distinguishing Performance: The method effectively highlighted significant differences in logical reasoning, engineering capability, and generalization ability among the LLMs. Models like Gemini 2.5 Pro and GPT-4.1 consistently demonstrated strong and stable performance in mathematics.
- Professional Question Design: Some LLMs, notably Gemini 2.5 Pro, showed an exceptional ability to create highly original and theoretically challenging mathematical questions, even combining areas such as number systems and complex analysis.
- “Memorization-Based Answering”: The study uncovered instances where LLMs would misrecognize a new question as a familiar one with a similar structure, applying a memorized solution rather than true reasoning. This phenomenon, observed multiple times, underscores the limitations of traditional fixed benchmarks.
- High Evaluation Consistency: Interestingly, LLMs demonstrated a high degree of consistency when evaluating each other’s answers, suggesting a robust and objective ‘consensus judgment standard’ among them.
- Question-Setting vs. Question-Solving: In programming tasks, LLMs with strong overall capabilities tended to excel in both generating difficult questions and providing high-quality solutions. For example, Gemini 2.5 Pro generated complex questions and optimized solutions efficiently, while others often relied on simpler, brute-force approaches.
This research marks a significant step towards a more scientific and standardized evaluation system for LLMs, addressing critical challenges faced by current methods. By open-sourcing their complete evaluation framework and experimental code, the team aims to foster community collaboration and advance the field of LLM evaluation. You can find more details about this innovative research in the full paper available at arXiv:2507.22359.