TLDR: A study evaluating Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B for automated essay scoring found low agreement with human ratings and weak internal consistency, especially for context-dependent criteria. This suggests current LLMs struggle to replicate human judgment in nuanced academic assessment, emphasizing the need for human oversight.
A recent study examined how effectively Large Language Models (LLMs) can automatically grade student essays in higher education. The research, titled “Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education,” was conducted by Andrea Gaggioli, Giuseppe Casaburi, Leonardo Ercolani, Francesco Collovà, Pietro Torre, and Fabrizio Davide. It aimed to determine how well these models replicate human judgment and how consistently they evaluate academic writing.
The study focused on five prominent LLMs: Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B. These models were tasked with scoring 67 Italian-language student essays from a university psychology course. The essays were evaluated based on a four-criterion rubric: Pertinence, Coherence, Originality, and Feasibility. To check for consistency, each model scored every essay three times.
The findings revealed a substantial gap between human and LLM evaluations. Agreement between human graders and the AI models was consistently low and not statistically significant, which suggests that the scores generated by LLMs did not reliably align with how human experts would grade the essays. Internal consistency across each model's three scoring attempts per essay was also weak, particularly for Pertinence and Feasibility. Even with identical prompts, the models could produce different scores for the same essay, a consequence of the stochastic nature of text generation.
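To make the two checks concrete, here is a minimal Python sketch of how human-model agreement and run-to-run consistency could be computed for one model on one criterion. This summary does not say which statistics the authors used, so the choice of Spearman rank correlation and an exact-repeat rate is an assumption made purely for illustration. The function names and all scores below are invented toy data, not results from the study.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def exact_repeat_rate(runs):
    """Fraction of essays that received the same score in every run."""
    per_essay = list(zip(*runs))
    return sum(len(set(scores)) == 1 for scores in per_essay) / len(per_essay)


# Hypothetical scores for six essays on one criterion (toy data only).
human = [4, 2, 5, 3, 1, 4]
runs = [  # one model, three scoring runs with an identical prompt
    [4, 4, 5, 3, 4, 5],
    [5, 3, 5, 4, 4, 4],
    [4, 4, 5, 3, 5, 5],
]

model_mean = [mean(scores) for scores in zip(*runs)]
agreement = spearman(human, model_mean)  # human vs. model, by rank order
stability = exact_repeat_rate(runs)      # how often the three runs agree exactly

print(f"Spearman agreement with human: {agreement:.2f}")
print(f"Essays scored identically in all runs: {stability:.0%}")
```

A rank-based measure is used here because a model can be systematically more lenient or more severe than a human grader and still order the essays the same way, and rank correlation isolates that ordering. A chance-corrected statistic such as weighted kappa or an intraclass correlation would be a reasonable alternative.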
Some models, such as Claude 3.5 and Gemini 2.5, tended to give higher overall scores, while Mistral 24B tended to give lower ones, but these general tendencies did not mean the models accurately reflected human rankings. The study found that LLMs struggled most with criteria that required deeper disciplinary insight and contextual understanding, such as Pertinence (relevance to the theme and skill definition) and Feasibility (practical applicability). They performed slightly better, though still with limitations, on more structural aspects like Coherence and Originality.
The research emphasizes that current LLMs might not be ready to fully replace human judgment in complex academic assessment tasks, especially those requiring nuanced interpretation and domain-specific expertise. The authors suggest that human oversight remains crucial when evaluating open-ended academic work. This study contributes to the growing body of literature on AI in education, highlighting the need for careful consideration and safeguards when deploying LLMs for automated assessment.


