
AI Models Outperform Individual Humans in Predicting Everyday Social Norms

TLDR: A new study demonstrates that advanced AI models, including GPT-4.5, GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, can predict human social appropriateness judgments for everyday scenarios with greater accuracy than individual human participants. The research, which evaluated AI’s ability to estimate collective social norms from linguistic data, found strong correlations with human consensus and superior performance in predicting group averages. Despite this, the models exhibited systematic errors, suggesting limitations in resolving semantic ambiguity, overcoming training data biases, and understanding context-dependent valence shifts, highlighting the boundaries of purely statistical social learning.

A groundbreaking study has revealed that artificial intelligence models can predict everyday social norms with an accuracy that surpasses individual human judgment. This research challenges long-held theories in cognitive science about how social understanding is acquired, suggesting that sophisticated social knowledge can emerge purely from statistical learning over linguistic data, without the need for embodied social experience.

Unpacking Social Norms: A New AI Frontier

Social norms are the unwritten rules that govern appropriate behavior in various situations, from laughing at a job interview to crying on a bus. Humans typically learn these through lived experiences and social interactions. However, the study, detailed in the paper AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms, investigated whether large language models (LLMs) like GPT and Gemini could grasp these nuanced norms through statistical patterns in text alone.

Previous research into AI’s social reasoning often focused on high-stakes moral dilemmas or simple binary classifications, and typically benchmarked AI against aggregate human averages. This approach overlooked the continuous, subtle nature of everyday appropriateness and the natural variation among individual human judgments. The current study aimed to address these limitations by comparing AI performance directly against individual human participants.

The Study’s Innovative Approach

The researchers utilized a comprehensive dataset of 555 everyday scenarios, each rated for appropriateness on a 0-9 scale by 555 U.S. participants. These scenarios were created by combining 37 common behaviors (e.g., ‘Argue’, ‘Cry’, ‘Read’) with 15 common situations (e.g., ‘in a bar’, ‘in church’, ‘at a job interview’).
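The behavior-by-situation design can be sketched as a simple Cartesian product. The short lists below are illustrative examples taken from the article, not the paper's full sets of 37 behaviors and 15 situations:

```python
# Sketch of the 555-scenario grid (behavior x situation), using example
# items only; the study's full lists are not reproduced here.
from itertools import product

behaviors = ["Argue", "Cry", "Read"]                          # 37 in the study
situations = ["in a bar", "in church", "at a job interview"]  # 15 in the study

scenarios = [f"{b} {s}" for b, s in product(behaviors, situations)]
print(len(scenarios))  # 9 here; with the full lists, 37 * 15 = 555
```

Each resulting scenario string (e.g. "Read in church") would then be rated on the 0-9 appropriateness scale.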

In Study 1, the team evaluated GPT-4.5. The AI was prompted to estimate the average appropriateness rating that U.S. respondents would give for each scenario. This meta-cognitive task, predicting the collective judgment, was compared to individual humans providing their own subjective ratings.

AI’s Remarkable Predictive Power

The results from Study 1 were striking. GPT-4.5’s predictions showed an exceptionally strong correlation with average human ratings, explaining 89% of the variance in human social norms (R² = 0.89). More impressively, when measured by Mean Absolute Error (MAE) from the group average, GPT-4.5 outperformed every single human participant, placing it in the 100th percentile of human accuracy.
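The benchmarking logic behind this comparison is straightforward: score each participant, and the model, by their mean absolute deviation from the group-average rating, then place the model within the human distribution. The sketch below uses randomly generated ratings purely for illustration, not the study's data:

```python
# Minimal sketch of the Study 1 benchmark: per-rater MAE against the
# group consensus, and the model's percentile within the human range.
# All ratings here are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_participants, n_scenarios = 555, 555
human = rng.uniform(0, 9, size=(n_participants, n_scenarios))

consensus = human.mean(axis=0)                             # group average per scenario
model_pred = consensus + rng.normal(0, 0.3, n_scenarios)   # a well-calibrated model

human_mae = np.abs(human - consensus).mean(axis=1)         # one MAE per participant
model_mae = np.abs(model_pred - consensus).mean()

# Percentile: share of humans whose error exceeds the model's (lower MAE wins)
percentile = (human_mae > model_mae).mean() * 100
```

With these synthetic numbers the simulated model beats every human rater, mirroring the 100th-percentile result reported for GPT-4.5.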

Study 2 replicated and extended these findings with next-generation models released later in 2025: GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. All three models also demonstrated strong correlations with human consensus and outperformed the vast majority of individual human participants (96% or more). GPT-5 achieved the highest correlation (R² = 0.91), while GPT-4.5, surprisingly, maintained a lower average error, indicating better calibration.

Understanding AI’s Systematic Limitations

Despite their high accuracy, the AI models exhibited systematic and correlated errors, suggesting shared limitations in their social understanding. For instance, all models significantly underestimated the appropriateness of “reading in church.” Humans likely interpret this as reading religious texts, while the AI might activate a general script of reading as a solitary, potentially disruptive activity during a service. This points to challenges in resolving semantic ambiguity that requires experiential knowledge.

Conversely, models consistently overestimated the appropriateness of “kissing at the movies,” possibly due to biases in training data where media portrayals might overemphasize romantic behaviors in cinema. They also underestimated “mumbling at the movies,” failing to recognize that in this specific context, it’s the most appropriate way to communicate without disturbing others.

These consistent error patterns across different AI architectures suggest that while statistical learning from text is incredibly powerful, certain aspects of social understanding may require computational mechanisms beyond pure pattern recognition, potentially involving embodied or experiential knowledge.


Implications for Cognitive Science and AI Development

This research provides strong support for bottom-up theories of social learning, suggesting that language serves as a remarkably rich and structured repository for cultural knowledge. The ability of AI to predict collective norms better than individual humans implies that cultural knowledge might be more structured and accessible than previously thought, even amidst individual variations.

From a computational perspective, the study highlights that different facets of “social intelligence”—like understanding normative structures (correlation) and providing precisely calibrated estimates (MAE)—are distinct capabilities that may not improve in lockstep during model development. The individual-level benchmarking methodology introduced in this paper offers a powerful new tool for evaluating computational models of social cognition, helping to assess whether AI performance falls within or exceeds the typical range of human variation.
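The dissociation between the two metrics is easy to demonstrate with toy numbers: a model with a constant additive bias tracks the normative structure perfectly (R² = 1) yet is poorly calibrated (large MAE), while a noisier unbiased model correlates less but errs less. This is a synthetic illustration, not the paper's analysis:

```python
# Sketch showing that R^2 (structure) and MAE (calibration) can diverge.
# Synthetic consensus ratings on the article's 0-9 scale.
import numpy as np

rng = np.random.default_rng(1)
consensus = rng.uniform(0, 9, 555)

biased = consensus + 1.5                       # perfect correlation, poor calibration
noisy = consensus + rng.normal(0, 0.5, 555)    # imperfect correlation, small error

def r_squared(pred, target):
    return np.corrcoef(pred, target)[0, 1] ** 2

def mae(pred, target):
    return np.abs(pred - target).mean()

# biased: R^2 = 1.0 but MAE = 1.5; noisy: R^2 < 1 but MAE well under 1
```

This is why a newer model can post the highest R² while an older one retains the lowest MAE, as Study 2 found for GPT-5 and GPT-4.5.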

While the study focused on U.S. cultural norms and compared AI’s meta-cognitive task to humans’ subjective judgments, its findings have profound implications. They suggest that sophisticated models of social cognition can emerge from statistical learning alone, yet also reveal systematic boundaries, indicating that a complete understanding of human-like social intelligence may require integrating statistical, experiential, and embodied forms of knowledge.

Rhea Bhattacharya
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
