TLDR: SKATE is a novel, automated evaluation framework where large language models (LLMs) compete by generating and solving verifiable tasks for one another. This game-based approach allows for scalable, objective assessment without human input. Key findings include that weaker LLMs can reliably differentiate stronger ones, models exhibit self-preferencing behavior, and the framework automatically uncovers subtle capability differences, offering a robust way to track LLM progress.
The rapid advancement of large language models (LLMs) has brought about an urgent need for robust, scalable, and unbiased evaluation methods. Traditional evaluation frameworks often fall short, demanding extensive human expertise, proving costly, and struggling to keep pace with the swift evolution of these models. They can also be static or susceptible to manipulation.
Introducing SKATE: A Game-Changing Evaluation Framework
A new research paper introduces SKATE, a novel evaluation framework designed to address these limitations. SKATE, which stands for Scalable Tournament Eval, redefines LLM assessment by treating evaluation as a competitive game. In this setup, LLMs act as both task-setters and solvers, creating and solving verifiable challenges for one another. This innovative approach is fully automated, data-free, and highly scalable, eliminating the need for human input or specialized domain expertise.
The core insight behind SKATE is its game-based structure. Models are incentivized to generate questions that highlight their own strengths while simultaneously exposing the weaknesses of their competitors. Unlike methods that rely on LLM judges, SKATE ensures objective scoring by using verifiable tasks, that is, tasks with clear, systematically assessable solutions. This also allows for open-ended and scalable evaluation, moving beyond the limitations of domain-specific, programmatically generated benchmarks like chess-playing or spatial reasoning tasks.
Code-Output-Prediction (COP) Challenges as a Proof of Concept
As a practical demonstration, the researchers introduced LLM-set Code-Output-Prediction (COP) challenges within the SKATE framework. In COP tasks, an LLM is given a block of code and must predict its output. The correctness of the answer is objectively determined by executing the code in a sandbox. While COP is the initial testbed, the framework is general-purpose and can be adapted to any verifiable task type, such as games, writing code to pass unit tests, or factual questions with definitive answers.
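Verifying a COP answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `run_snippet` and `check_prediction` are hypothetical, and a plain subprocess with a timeout stands in for the sandbox, which a real deployment would replace with proper isolation.

```python
import subprocess
import sys
from typing import Optional


def run_snippet(code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run `code` in a separate Python process and return its stdout.

    Returns None if the code raises, exits non-zero, or times out.
    NOTE: a subprocess is NOT a security sandbox; it only stands in for one here.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def check_prediction(code: str, predicted: str) -> bool:
    """A prediction is correct only if it matches the executed output exactly."""
    actual = run_snippet(code)
    return actual is not None and actual == predicted.strip()


if __name__ == "__main__":
    print(check_prediction("print(sum(range(5)))", "10"))  # True
```

Because the ground truth comes from execution rather than from a judge model, the same check applies to any question a setter produces, provided the code runs deterministically.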
To ensure robust scoring for multiple-choice questions (MCQs), SKATE employs a method that samples LLM responses multiple times with randomly permuted answer sets. This accounts for factors like option ordering and content, providing a stable probability of correctness for each question.
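The permutation-based scoring described above might look roughly like the following. It is a minimal sketch, not the paper's implementation: `answer_fn` is a hypothetical stand-in for a call to the model being tested, the sample count is arbitrary, and only option order is varied here.

```python
import random
from typing import Callable, Sequence


def estimate_p_correct(
    answer_fn: Callable[[str, Sequence[str]], int],
    question: str,
    correct: str,
    distractors: Sequence[str],
    n_samples: int = 20,
    seed: int = 0,
) -> float:
    """Estimate the probability that a model answers an MCQ correctly.

    `answer_fn(question, options)` is a hypothetical model call that returns
    the index of the chosen option. Options are reshuffled on every sample so
    that a preference for a fixed position does not inflate the score.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        options = [correct, *distractors]
        rng.shuffle(options)
        choice = answer_fn(question, options)
        if 0 <= choice < len(options) and options[choice] == correct:
            hits += 1
    return hits / n_samples
```

A model that always picks the first option ends up near chance under this scheme, whereas a fixed option order could score it at 0 or 1 depending on where the correct answer happened to sit.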
How the Game of SKATE Works
In a typical Game of SKATE, multiple LLMs take turns asking and answering questions over several rounds. Each player attempts to create a verifiable, distractor-rich, and unique question. A question is considered ‘valid’ if it runs without error and the setter successfully generates multiple incorrect ‘distractor’ options. It must also be ‘unique,’ meaning it’s significantly different from questions previously set by that player. All participating LLMs then attempt to answer all questions, and their performance is assessed using a TrueSkill-based ranking system, similar to those used in competitive gaming.
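The validity and uniqueness checks can be illustrated with a short sketch. The similarity measure (`difflib.SequenceMatcher`), the threshold of 0.8, and the minimum of two distractors are all assumptions made for illustration; the article does not say how SKATE decides that a question is "significantly different".

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def is_valid(
    actual_output: Optional[str],
    distractors: Sequence[str],
    min_distractors: int = 2,  # assumed minimum, not taken from the paper
) -> bool:
    """Valid = the code ran (output is not None) and the setter supplied
    enough distinct distractors that really are incorrect."""
    if actual_output is None:
        return False
    wrong = {d.strip() for d in distractors} - {actual_output.strip()}
    return len(wrong) >= min_distractors


def is_unique(
    code: str,
    previous: Sequence[str],
    max_similarity: float = 0.8,  # assumed threshold, not taken from the paper
) -> bool:
    """Unique = not too similar to any question this player set earlier."""
    return all(
        SequenceMatcher(None, code, old).ratio() < max_similarity
        for old in previous
    )
```

Only questions that pass both checks would go on to be answered by every player and feed into the ranking.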
The incentives within the game are designed to encourage strategic behavior: LLMs are rewarded for creating valid questions, for answering their own questions correctly, and crucially, for setting questions that their opponents answer incorrectly. This pushes models to identify and exploit ‘discriminatory niches’: areas where they excel and their competitors do not.
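To make the three incentives concrete, here is a toy reward function for a single question. It is not SKATE's actual scoring rule, which the article describes only as TrueSkill-based; the equal weighting of the three terms and the name `setter_reward` are assumptions chosen for illustration.

```python
from statistics import mean
from typing import Mapping


def setter_reward(
    setter: str,
    p_correct: Mapping[str, float],
    valid: bool,
) -> float:
    """Toy reward for the player who set one question.

    `p_correct` maps each player name to its estimated probability of
    answering this question correctly. The three terms mirror the three
    incentives; the equal weights are an illustrative assumption.
    """
    if not valid:
        return 0.0
    own = p_correct[setter]
    opponents = [p for name, p in p_correct.items() if name != setter]
    opponent_miss = 1.0 - mean(opponents) if opponents else 0.0
    return 1.0 + own + opponent_miss  # valid + self-correct + opponents wrong


if __name__ == "__main__":
    print(setter_reward("a", {"a": 1.0, "b": 0.0, "c": 0.5}, valid=True))  # 2.75
```

Under this toy rule, a question nobody can answer earns less than one the setter solves and the opponents miss, which is the pressure towards niches the game relies on.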
Key Findings from SKATE Experiments
The research yielded several significant findings:
- Weaker Models Can Differentiate Stronger Ones: Experiments showed that a collection of less capable LLMs could reliably score and differentiate between more powerful models. The rankings remained stable even when new, stronger models were introduced into the game.
- Self-Preferencing Behavior: LLMs demonstrated a capacity for ‘self-preferencing,’ meaning they could design questions that favored their own capabilities over those of their competitors. When the analysis was restricted to questions that the setter itself answered correctly, all models exhibited this tendency.
- Automatic Discovery of Capability Differences: SKATE automatically surfaced fine-grained capability differences between models. By analyzing questions with high variance in correctness scores among models, the framework could pinpoint specific strengths and weaknesses without human annotation or task curation.
- Adaptive Question Setting: Models adapted their question-setting strategies over time. Initially, some models pitched questions that were too easy or too difficult for themselves, but over time, their question difficulty converged towards a ‘sweet spot’: questions that were as challenging as possible while still being answerable by the task setter.
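The high-variance analysis mentioned in the third finding can be sketched as a simple ranking of questions by how much models disagree on them. The data layout and the function name `most_discriminating` are hypothetical; the paper's exact procedure is not given here.

```python
from statistics import pvariance
from typing import List, Mapping, Tuple


def most_discriminating(
    scores: Mapping[str, Mapping[str, float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank questions by the variance of per-model correctness scores.

    `scores[question_id][model_name]` is that model's probability of
    answering the question correctly. High variance means some models
    solve the question reliably while others fail it.
    """
    ranked = sorted(
        ((qid, pvariance(list(by_model.values()))) for qid, by_model in scores.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_k]
```

Reading the top-ranked questions then shows which concrete behaviors separate the models, without anyone having curated those questions in advance.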
Future Implications
While the current proof of concept focuses on COP tasks, which are best suited for pure language models without external tool access, the SKATE framework is highly adaptable. Future work could incorporate other verifiable tasks, such as those involving physical world simulations or API interactions, to broaden its applicability.
SKATE represents an important step towards developing general, scalable evaluation frameworks that can keep pace with the rapid progress of LLM capabilities. It not only provides objective assessments but also offers a window into emerging strategic behaviors of advanced AI models, such as self-preferencing and adaptive task generation. For more details, see the full research paper.


