Large Language Models Show Limited Alignment with Human Essay Grading

TLDR: A study evaluating Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B for automated essay scoring found low agreement with human ratings and weak internal consistency, especially for context-dependent criteria. This suggests current LLMs struggle to replicate human judgment in nuanced academic assessment, emphasizing the need for human oversight.

A recent study examined how well Large Language Models (LLMs) can automatically grade student essays in higher education. The research, titled “Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education,” was conducted by Andrea Gaggioli, Giuseppe Casaburi, Leonardo Ercolani, Francesco Collovà, Pietro Torre, and Fabrizio Davide. It asked two questions: how closely these models replicate human judgment, and how consistent their scores are when they evaluate the same academic writing repeatedly.

The study focused on five prominent LLMs: Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B. These models were tasked with scoring 67 Italian-language student essays from a university psychology course. The essays were evaluated based on a four-criterion rubric: Pertinence, Coherence, Originality, and Feasibility. To check for consistency, each model scored every essay three times.

The findings revealed a substantial gap between human and LLM evaluations. Agreement between human graders and the AI models was consistently low and not statistically significant, which suggests that LLM-generated scores did not reliably align with how human experts would grade the essays. The models' internal consistency across their three scoring attempts for each essay was also weak, particularly for Pertinence and Feasibility. Even with identical prompts, the models could produce different scores for the same essay, a consequence of the stochastic nature of text generation.
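To make the two measurements concrete, here is a minimal sketch of how agreement and run-to-run consistency could be quantified. It assumes Spearman rank correlation for human-model agreement and the average score range across runs for consistency; the article does not say which statistics the study used, and the function names and toy scores below are made up for illustration.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mean_run_spread(runs):
    """Average (max - min) score per essay across repeated scoring runs."""
    return mean(max(scores) - min(scores) for scores in zip(*runs))

if __name__ == "__main__":
    # Hypothetical scores for five essays on one criterion (not study data).
    human = [3, 5, 2, 4, 1]
    model_runs = [
        [4, 4, 3, 5, 4],  # run 1
        [3, 5, 4, 4, 3],  # run 2
        [5, 4, 3, 3, 4],  # run 3
    ]
    model_avg = [mean(scores) for scores in zip(*model_runs)]
    print("agreement (Spearman):", round(spearman(human, model_avg), 3))
    print("mean spread across runs:", mean_run_spread(model_runs))
```

A correlation near zero would indicate that the model ranks essays differently from the human grader, while a large spread would indicate that the model disagrees with itself across runs.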

Some models, such as Claude 3.5 and Gemini 2.5, tended to give higher overall scores, while Mistral 24B tended to give lower ones, but these tendencies did not mean any of them accurately reflected human rankings. The study found that LLMs struggled most with criteria that required deeper disciplinary insight and contextual understanding, such as Pertinence (relevance to the theme and skill definition) and Feasibility (practical applicability). They performed slightly better, though still with limitations, on more structural aspects like Coherence and Originality.

The research indicates that current LLMs are not ready to fully replace human judgment in complex academic assessment tasks, especially those requiring nuanced interpretation and domain-specific expertise. The authors suggest that human oversight remains crucial when evaluating open-ended academic work. The study adds to the growing body of literature on AI in education and underscores the need for careful consideration and safeguards when deploying LLMs for automated assessment. For more details, see the full research paper.
