When AI Judges Games: Fairness, Fun, and Human Alignment

TLDR: A new study introduces a paradigm for evaluating AI systems based on their ability to evaluate games, rather than just playing them. Focusing on 121 novel board games, researchers compared language and reasoning models against human judgments on game “payoff” (fairness) and “funness.” Findings show that reasoning models generally align better with human judgments than non-reasoning models, especially for payoff. However, a non-monotonic relationship was observed where increasing alignment with game-theoretic optimal solutions sometimes weakened alignment with human judgments. Evaluating “funness” proved more challenging for models, showing inconsistent performance. The study also highlighted variable and unpredictable resource usage by reasoning models, emphasizing the need for more resource-rational AI evaluators.

For decades, a common benchmark for artificial intelligence (AI) has been its ability to master complex games such as chess and Go. A new study argues that reasoning is not only about solving problems but also about judging which problems are worth solving in the first place. It proposes assessing AI systems on their capacity to evaluate games, rather than only play them.

The paper, titled “EVALUATING LANGUAGE MODELS’ EVALUATIONS OF GAMES,” by Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, and Thomas L. Griffiths, examines how modern language and reasoning models compare to human judgment and symbolic computational agents when assessing games. The researchers focused on two evaluative queries: the ‘payoff’ (or fairness) and the ‘funness’ of games. These queries span two dimensions along which evaluations can vary: how complex a query is to compute and how difficult it is to quantify.

A New Lens on AI Evaluation

The study used a dataset of 121 novel board games and more than 450 human judgments. Humans evaluated these games as novices, before any actual play, to capture initial impressions. Language and reasoning models were then prompted to provide their own evaluations of both expected payoff and perceived funness.

For ‘payoff’ or ‘fairness’ evaluations, non-reasoning language models, which produced evaluations directly without intermediate thought processes, showed high similarity to each other but differed significantly from both human judgments and game-theoretic optimal outcomes. This suggests that these models may rely on similar inductive biases from their training data, and that those biases are insufficient for human-aligned evaluations.
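As a rough illustration of what "similar to each other but different from humans" can look like numerically, the sketch below compares per-game payoff estimates using a Pearson correlation. This is an assumption for illustration only: the article does not say which alignment measure the study used, and the names `human`, `model_a`, and `model_b` and all of the numbers are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)

# Hypothetical payoff estimates for five games (invented, not study data).
human = [0.1, 0.5, -0.2, 0.8, 0.0]
model_a = [0.0, 0.1, 0.1, 0.0, 0.1]
model_b = [0.05, 0.1, 0.15, 0.0, 0.1]

print("model_a vs model_b:", round(pearson(model_a, model_b), 2))
print("model_a vs human:  ", round(pearson(model_a, human), 2))
```

With these invented numbers, the two models correlate more strongly with each other than either does with the human estimates, which is the qualitative pattern described above.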

However, when models were allowed to reason through an intermediate ‘chain-of-thought’ (CoT), their game evaluations aligned better with both game-theoretic optimal solutions and human judgments. Advanced reasoning models showed even greater alignment with human judgments and with non-linguistic baselines (such as tree-search agents). The study also observed a non-monotonic relationship: as models moved closer to the game-theoretic optimum, their fit to human data sometimes weakened. This points to a potential trade-off between pure rationality and human alignment.
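To make "game-theoretic optimal" concrete, here is a minimal sketch of an exhaustive tree search that computes the payoff of a game under optimal play by both sides. It uses standard 3x3 tic-tac-toe as a stand-in, which is an assumption: the article does not describe the rules of the study's games or the search agents it used, and the names `winner` and `value` are hypothetical.

```python
from functools import lru_cache

# All eight winning lines on a 3x3 board, indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Payoff to X under optimal play: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0  # board full with no winner: draw
    nxt = "O" if player == "X" else "X"
    results = [value(board[:i] + player + board[i + 1:], nxt)
               for i, cell in enumerate(board) if cell == "."]
    # X maximizes the payoff, O minimizes it.
    return max(results) if player == "X" else min(results)

print(value("." * 9, "X"))  # payoff of the empty board with X to move
```

A payoff of 0 from the starting position means neither player can force a win, which is one way to read a game as "fair". A human novice judging the same game before playing it may well give a different answer, and that gap is what the comparison above is about.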

The Elusive Nature of ‘Funness’

Evaluating the ‘funness’ of games proved harder for AI. Non-reasoning language models consistently produced results that matched human judgments poorly. Reasoning models generally captured human funness judgments better, but their performance was inconsistent across models, and more advanced models did not always show greater alignment. This ‘jaggedness’ in performance is consistent with the inherent difficulty of quantifying ‘fun.’

The researchers found that models, when reasoning about funness, discussed factors such as game balance, strategic richness, challengingness, game length, and novelty. However, despite considering similar factors, they often arrived at vastly different funness judgments, indicating disparities in how they compute or aggregate these metrics.

Resource Usage and Future Directions

The study also explored the ‘resource usage’ of reasoning models, measured by the number of reasoning tokens used. It found highly variable and unpredictable token usage across models and queries. Surprisingly, models generally used fewer tokens to estimate funness, despite its greater ambiguity. There was no strong correlation between token usage and game novelty or alignment with human/optimal predictions. This raises important questions about how models dynamically adapt their computational effort based on problem complexity.
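One simple way to look at this kind of variability is to group reasoning-token counts by model and query and report the mean alongside a spread measure. The sketch below does that with the coefficient of variation; the record layout, the function name `token_summary`, and every number are hypothetical, and the article does not state how the study summarized token usage.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical log records: (model, query, reasoning_tokens). Invented values.
records = [
    ("model_a", "payoff", 1200), ("model_a", "payoff", 5400),
    ("model_a", "funness", 600), ("model_a", "funness", 900),
    ("model_b", "payoff", 300), ("model_b", "payoff", 350),
    ("model_b", "funness", 200), ("model_b", "funness", 1500),
]

def token_summary(rows):
    """Return {(model, query): (mean_tokens, coefficient_of_variation)}."""
    groups = defaultdict(list)
    for model, query, tokens in rows:
        groups[(model, query)].append(tokens)
    summary = {}
    for key, counts in groups.items():
        avg = mean(counts)
        # Population standard deviation relative to the mean.
        cv = pstdev(counts) / avg if avg else 0.0
        summary[key] = (avg, cv)
    return summary

for key, (avg, cv) in sorted(token_summary(records).items()):
    print(key, round(avg), round(cv, 2))
```

A high coefficient of variation for a given model and query would indicate that token usage swings widely from game to game, which is the kind of unpredictability described above.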

This research paves the way for future work in designing more ‘resource-rational’ AI evaluators that can dynamically adjust their compute based on the evaluation query and problem at hand. It also prompts critical questions about whose evaluations AI should align with – perfectly rational agents or human judgments – and the ethical implications of building AI that can anticipate what people find engaging, potentially leading to addictive designs.

The full research paper can be accessed here: Evaluating Language Models’ Evaluations of Games.
