TLDR: A new study introduces a paradigm for evaluating AI systems based on their ability to evaluate games, rather than just playing them. Focusing on 121 novel board games, researchers compared language and reasoning models against human judgments on game “payoff” (fairness) and “funness.” Findings show that reasoning models generally align better with human judgments than non-reasoning models, especially for payoff. However, a non-monotonic relationship was observed where increasing alignment with game-theoretic optimal solutions sometimes weakened alignment with human judgments. Evaluating “funness” proved more challenging for models, showing inconsistent performance. The study also highlighted variable and unpredictable resource usage by reasoning models, emphasizing the need for more resource-rational AI evaluators.
For decades, a common benchmark for artificial intelligence (AI) has been its ability to master complex games like chess and Go. However, a new study suggests that true reasoning isn’t just about solving problems, but also about evaluating which problems are worth solving in the first place. This research introduces a different approach: assessing AI systems on their capacity to evaluate games, rather than just play them.
The paper, titled “EVALUATING LANGUAGE MODELS’ EVALUATIONS OF GAMES,” by Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, and Thomas L. Griffiths, examines how modern language and reasoning models compare to human judgment and symbolic computational agents when assessing games. The researchers focused on two evaluative queries: the ‘payoff’ (or fairness) and the ‘funness’ of games. The two queries differ along two dimensions that matter when evaluating AI evaluations: how complex a query is to compute and how difficult it is to quantify.
A New Lens on AI Evaluation
The study used a dataset of 121 novel board games and more than 450 human judgments. Humans evaluated these games as novices, before any actual play, to capture initial impressions. Language and reasoning models were then prompted to provide their own evaluations for both expected payoff and perceived funness.
For ‘payoff’ or ‘fairness’ evaluations, the results were quite revealing. Non-reasoning language models, which directly produced evaluations without intermediate thought processes, showed high similarity to each other but significantly differed from human judgments and game-theoretic optimal outcomes. This suggests that these models might rely on similar inductive biases from their training data, which are insufficient for human-aligned evaluations.
However, when models were allowed to reason through an intermediate ‘chain-of-thought’ (CoT), their game evaluations became more sensible, aligning better with game-theoretic optimal solutions and human judgments. Advanced reasoning models showed even greater alignment with both human judgments and non-linguistic baselines (like tree-search agents). Interestingly, the study observed a non-monotonic relationship: as models moved closer to the game-theoretic optimum, their fit to human data sometimes weakened. This points to a potential trade-off between pure rationality and human alignment.
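To make the non-monotonic pattern concrete, here is a minimal sketch of how alignment could be scored. It assumes Pearson correlation as the alignment measure, which this summary does not confirm is the paper's metric, and every number below is invented for illustration rather than taken from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical expected payoffs for the first player in five games, in [-1, 1].
optimal = [1.0, 0.0, -1.0, 1.0, 0.0]   # invented game-theoretic values
human = [0.4, 0.1, -0.3, 0.2, 0.3]     # invented mean novice judgments

# Invented estimates for three hypothetical model types.
models = {
    "direct": [0.1, 0.1, 0.1, 0.2, 0.1],       # answers without reasoning
    "cot": [0.5, 0.1, -0.4, 0.3, 0.2],         # chain-of-thought
    "reasoning": [1.0, 0.0, -1.0, 1.0, 0.0],   # matches the optimal values
}

# For each model: (alignment with optimal, alignment with humans).
scores = {
    name: (pearson(est, optimal), pearson(est, human))
    for name, est in models.items()
}

for name, (r_opt, r_human) in scores.items():
    print(f"{name:10s} optimal r={r_opt:.2f}  human r={r_human:.2f}")
```

In this toy data, alignment with the optimal values rises steadily from the direct model to the reasoning model, while alignment with human judgments peaks at the chain-of-thought model and then drops. Humans in the example hedge their estimates, so a model that matches the optimum exactly overshoots them.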
The Elusive Nature of ‘Funness’
Evaluating the ‘funness’ of games proved to be a more complex challenge for AI. Non-reasoning language models consistently produced results that poorly matched human judgments. While reasoning models generally captured human funness judgments better, their performance was inconsistent across different models, with more advanced models not always showing greater alignment. This ‘jaggedness’ in performance aligns with the inherent difficulty of quantifying ‘fun.’
The researchers found that models, when reasoning about funness, discussed factors such as game balance, strategic richness, level of challenge, game length, and novelty. However, despite considering similar factors, they often arrived at vastly different funness judgments, indicating disparities in how they compute or aggregate these factors.
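One way to see how the same factors can yield different verdicts is a toy aggregation sketch. It assumes a simple weighted average over per-factor ratings, which is purely illustrative: the summary does not say how any model actually combines these factors, and the ratings and weights are hypothetical.

```python
# Factors named in the study's analysis of model reasoning.
FACTORS = ["balance", "strategic_richness", "challenge", "length", "novelty"]

def funness(ratings, weights):
    """Weighted average of per-factor ratings, each on a 0-1 scale."""
    total_weight = sum(weights[f] for f in FACTORS)
    return sum(ratings[f] * weights[f] for f in FACTORS) / total_weight

# Hypothetical ratings for one game: well balanced, but strategically shallow.
ratings = {
    "balance": 0.9,
    "strategic_richness": 0.2,
    "challenge": 0.3,
    "length": 0.8,
    "novelty": 0.4,
}

# Two hypothetical evaluators that agree on the ratings but weigh them differently.
weights_a = {"balance": 3, "strategic_richness": 1, "challenge": 1, "length": 2, "novelty": 1}
weights_b = {"balance": 1, "strategic_richness": 3, "challenge": 3, "length": 1, "novelty": 1}

score_a = funness(ratings, weights_a)  # emphasizes balance and length
score_b = funness(ratings, weights_b)  # emphasizes depth and challenge

print(f"evaluator A: {score_a:.2f}, evaluator B: {score_b:.2f}")
```

Both evaluators see identical factor ratings, yet the one that prizes balance rates the game as clearly more fun than the one that prizes strategic depth.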
Resource Usage and Future Directions
The study also explored the ‘resource usage’ of reasoning models, measured by the number of reasoning tokens used. It found highly variable and unpredictable token usage across models and queries. Surprisingly, models generally used fewer tokens to estimate funness, despite its greater ambiguity. There was no strong correlation between token usage and game novelty or alignment with human/optimal predictions. This raises important questions about how models dynamically adapt their computational effort based on problem complexity.
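As a rough sketch of how such variability might be summarized, the snippet below computes the mean and the coefficient of variation (standard deviation divided by the mean) of reasoning-token counts per query type. The token counts are invented, and the choice of coefficient of variation is an assumption rather than the paper's reported statistic.

```python
from statistics import mean, stdev

# Hypothetical reasoning-token counts for five games per query type.
token_counts = {
    "payoff": [1200, 5400, 800, 9100, 2500],
    "funness": [600, 2100, 400, 3000, 900],
}

def summarize(counts):
    """Return the mean token count and its coefficient of variation."""
    m = mean(counts)
    return {"mean": m, "cv": stdev(counts) / m}

summary = {query: summarize(counts) for query, counts in token_counts.items()}

for query, stats in summary.items():
    print(f"{query:8s} mean={stats['mean']:.0f} tokens  cv={stats['cv']:.2f}")
```

A coefficient of variation near 1, as in this toy data, means the spread in token usage is about as large as the average itself, which is the kind of unpredictability the study describes.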
This research paves the way for future work in designing more ‘resource-rational’ AI evaluators that can dynamically adjust their compute based on the evaluation query and problem at hand. It also prompts critical questions about whose evaluations AI should align with – perfectly rational agents or human judgments – and the ethical implications of building AI that can anticipate what people find engaging, potentially leading to addictive designs.
The full research paper can be accessed here: Evaluating Language Models’ Evaluations of Games.


