
Unpacking AI’s Grasp of Human Reasoning Styles in Social Games

TLDR: The InMind framework evaluates how well large language models (LLMs) can understand and apply individual human reasoning styles, particularly in social deduction games like Avalon. Using detailed gameplay annotations, InMind assesses LLMs on tasks like identifying players, aligning reflections, attributing reasoning traces, and inferring roles. Findings show that while some advanced LLMs exhibit early style-sensitive reasoning, most struggle with dynamic adaptation and grounding their logic in the evolving game context, often relying on superficial cues.

Large Language Models (LLMs) have demonstrated impressive capabilities in various complex tasks, from scientific reasoning to understanding human intentions. However, a critical area often overlooked in their evaluation is their ability to capture and apply the unique, individualized reasoning styles that shape how people interact and make decisions in social settings.

A new research paper introduces InMind, a groundbreaking, cognitively-grounded evaluation framework designed to address this gap. The framework aims to assess whether LLMs can truly internalize and adapt to personalized human reasoning, especially in dynamic and interactive environments.

The Challenge of Individualized Reasoning

Traditional LLM benchmarks often focus on output plausibility or behavioral consistency, providing limited insight into the underlying cognitive mechanisms. In real-world social scenarios, people don’t just arrive at conclusions; they do so through distinct, context-sensitive reasoning trajectories. This individual variation is what the researchers refer to as an ‘individualized reasoning style’.

To effectively evaluate this, the InMind framework leverages Social Deduction Games (SDGs) like Avalon. These games are ideal because they are dynamic, adversarial, and inherently individualized, requiring players to infer hidden mental states and make strategic decisions based on evolving information. Simply producing plausible outputs isn’t enough; an LLM must capture and adapt to a player’s unique style for meaningful human-AI collaboration.

How InMind Works: A Dual-Layer Approach

InMind introduces two complementary gameplay modes: Observer and Participant. In Observer mode, a human subject passively reasons from another player’s perspective without taking action, helping to isolate cognitive patterns from overt behavior. In Participant mode, the subject actively engages in the game, providing annotations from their own viewpoint.

Crucially, InMind integrates dual-layer cognitive annotations, sketched as a data structure after this list:

  • Strategy Traces: These capture real-time reasoning signals, such as belief updates, intention inferences, and counterfactual thinking, as the game unfolds.
  • Reflective Summaries: These offer post-game insights, contextualizing key events and evaluating other players’ behaviors and intentions in hindsight.
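To make the two annotation layers concrete, here is a minimal sketch of how an annotated session might be represented. The field names (round_id, belief_updates, and so on) are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

# A minimal sketch of InMind's dual-layer annotations. All field names
# here (round_id, belief_updates, etc.) are illustrative assumptions,
# not the dataset's actual schema.

@dataclass
class StrategyTrace:
    """Real-time reasoning signals captured as the game unfolds."""
    round_id: int
    belief_updates: list[str]        # e.g. "P3 now looks like Morgana"
    intention_inferences: list[str]  # e.g. "P5 is angling for the quest team"
    counterfactuals: list[str]       # e.g. "had P2 voted no, I'd trust them"

@dataclass
class ReflectiveSummary:
    """Post-game hindsight: key events and evaluations of other players."""
    key_events: list[str]
    player_evaluations: dict[str, str]  # player id -> assessment

@dataclass
class AnnotatedSession:
    """One full game session under either gameplay mode."""
    mode: str                      # "observer" or "participant"
    transcript: list[str]          # turn-by-turn game log
    traces: list[StrategyTrace]    # strategy traces, in order
    summary: ReflectiveSummary     # reflective summary for the session
```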

These rich annotations enable InMind to define four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation of LLMs (one task is sketched in code after the list):

  1. Player Identification: Tests if an LLM can recognize behavioral patterns consistent with a specific reasoning style.
  2. Reflection Alignment: Assesses the model’s ability to ground abstract post-game reflections in concrete gameplay behavior.
  3. Trace Attribution: Probes whether the model can simulate evolving, in-context reasoning across time.
  4. Role Inference: Evaluates if the model can internalize reasoning styles to support belief modeling under uncertainty.

The InMind-Avalon Case Study and Key Findings

The researchers instantiated InMind within the popular six-player social deduction game Avalon, creating the InMind-Avalon dataset. The dataset comprises 30 complete human gameplay sessions, each meticulously annotated with cognitive traces and reflective summaries. Sessions were conducted via online voice chat in Mandarin Chinese, capturing authentic communication dynamics and game-specific expressions.

An extensive evaluation of 11 state-of-the-art LLMs on InMind-Avalon revealed several critical limitations:

  • Most models, including advanced ones like GPT-4o, heavily rely on superficial lexical patterns, struggling to infer deeper strategic intent.
  • Temporal alignment between reflective reasoning and specific in-game events remains a significant challenge for nearly all evaluated models.
  • Dynamic adaptation of strategic reasoning based on evolving interactions is largely insufficient, indicating fundamental shortcomings in LLMs’ capacity for individualized reasoning over time.

However, the study also observed promising potential in certain reasoning-enhanced models, such as DeepSeek-R1, which exhibited early signs of style-sensitive reasoning. These models were better at extracting abstract reasoning traits beyond surface-level linguistic cues.

The findings underscore that while LLMs excel in many areas, their capacity for individualized, adaptive reasoning in complex social environments is still limited. The InMind framework and its accompanying dataset provide a principled tool to guide future advancements toward more personalized and socially aware AI systems. For more details, you can read the full research paper here.


Future Directions

The researchers plan to expand InMind to include other social deduction games with different social structures and interaction patterns, such as Blood on the Clocktower and Werewolf. They also aim to broaden the framework’s application beyond games to domains like multi-agent collaboration, negotiation, and human-AI teaming, where personalized, context-sensitive reasoning is crucial for effective interaction.

Rhea Bhattacharya
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
