Improving Automated Essay Cohesion Scoring with Item Response Theory

TLDR: A new research paper introduces an Item Response Theory (IRT)-based approach, called IRT-Multiregressor, to enhance the accuracy of automated essay cohesion assessment. By treating machine learning models as ‘respondents’ and essays as ‘items,’ IRT is used to adjust and combine predictions from various AI algorithms, including traditional models and BERT-based deep learning. Tested on two large Portuguese essay datasets, the method demonstrated superior performance in aligning automated scores with human evaluations, offering significant potential for personalized feedback, curriculum design, and explainable AI in educational settings.

Assessing the quality of written essays is a cornerstone of education, helping to evaluate students’ writing abilities and learning outcomes. Among the many facets of a good essay, textual cohesion stands out as crucial. Cohesion ensures that different parts of a text are meaningfully connected, making the writing clear and easy to understand. However, automatically scoring cohesion in essays has long been a significant challenge in educational artificial intelligence.

Traditional machine learning algorithms used for text evaluation often overlook the unique characteristics of individual essays. This can lead to inconsistencies between automated scores and human evaluations, making it difficult to provide accurate and helpful feedback to students. The subjective and time-consuming nature of manual essay assessment further highlights the need for robust automated solutions.

A recent research paper, titled “Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach,” introduces a method to tackle this problem. The study proposes an approach based on Item Response Theory (IRT) to refine and adjust the cohesion scores generated by machine learning models. IRT is a statistical framework commonly used in educational testing to model how individuals respond to test items, based on the individuals’ underlying abilities and the characteristics of the items themselves. In this application, machine learning models are treated as ‘respondents’ and individual essays as ‘test items,’ so that each prediction can be judged in light of both how capable the model is and how hard the essay is to score.
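
To make the respondent/item framing concrete, here is a minimal sketch of the textbook two-parameter logistic (2PL) IRT model. This summary does not say which IRT parameterization the paper uses, so the 2PL form, the function name irt_2pl, and the parameter values below are illustrative assumptions rather than the authors' formulation.

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """Textbook 2PL model: probability that a respondent with ability
    `theta` answers correctly an item with discrimination `a` and
    difficulty `b`. (Assumed form; not taken from the paper.)"""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# In the article's framing, the "respondent" is a machine learning model
# and the "item" is an essay. The numbers are made up for illustration.
p_equal = irt_2pl(theta=0.0, a=1.5, b=0.0)   # ability equals difficulty
p_easy = irt_2pl(theta=1.0, a=1.5, b=-1.0)   # strong model, easy essay
p_hard = irt_2pl(theta=1.0, a=1.5, b=2.0)    # same model, hard essay
```

When ability equals difficulty the probability is exactly 0.5. It rises toward 1 for easier items and falls toward 0 for harder ones, with the discrimination parameter controlling how sharply.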

The researchers, Bruno Alexandre Rosa, Hilário Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, and Rafael Ferreira Mello, aimed to investigate how effectively an IRT-based model could adjust cohesion scores predicted by various AI algorithms. Their approach, named IRT-Multiregressor, integrates IRT concepts to combine predictions from multiple machine learning models, including traditional algorithms and advanced BERT-based deep learning models.

The experiments used two datasets of Portuguese essays: the extended Essay-BR dataset, comprising 6,563 essays in the style of Brazil’s National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays dataset, which includes 1,235 essays written by 5th- to 9th-grade public school students. Together they provided a diverse corpus for testing the proposed method, although both exhibit an imbalanced distribution of cohesion scores, a common challenge in such data.

To prepare the essays for machine learning, the team extracted a comprehensive set of 325 linguistic features. These features, categorized into 13 groups such as Coh-Metrix, LIWC, Connectives, and Lexical Diversity, transform textual data into numerical vectors that AI models can process. Additionally, the study leveraged BERT, a powerful pre-trained language model, to generate contextual embeddings, capturing deep semantic relationships within the essays.
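
As a rough illustration of how text becomes a numerical vector, the sketch below computes three hand-rolled features: token count, type-token ratio (a simple lexical diversity measure), and connective density. The tiny CONNECTIVES list and the function names are hypothetical. They stand in for the 325 features and 13 feature groups described above and are not the study's actual extractors.

```python
import re

# Hypothetical, deliberately tiny list of Portuguese connectives.
# Real connective lexicons are far larger.
CONNECTIVES = {"mas", "porém", "portanto", "porque", "entretanto", "assim"}

def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def extract_features(text: str) -> list:
    """Return [token_count, type_token_ratio, connective_density]."""
    tokens = tokenize(text)
    if not tokens:
        return [0.0, 0.0, 0.0]
    n = len(tokens)
    type_token_ratio = len(set(tokens)) / n
    connective_density = sum(t in CONNECTIVES for t in tokens) / n
    return [float(n), type_token_ratio, connective_density]

essay = "Eu estudei muito, mas não passei. Portanto, vou estudar mais."
features = extract_features(essay)
# features is [10.0, 1.0, 0.2]: ten tokens, all distinct, two connectives
```

A vector like this can be fed to any regressor. Contextual embeddings from a model such as BERT play the same role but are learned rather than hand-designed.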

Various regression algorithms, including Bayesian Ridge, CatBoost Regressor, and Support Vector Regressor (SVR), along with ensemble methods such as Voting and Stacked Regressors, were trained to predict cohesion scores. Two BERT-based models, BERTimbau-base and DistilBERT, were also fine-tuned for this task.

The core of the IRT-Multiregressor approach lies in its ability to adjust the predictions of these diverse models. After the initial predictions, the IRT framework calculates error expectations for each model and cohesion score range. These error expectations define confidence intervals, which are then used to identify the most frequent and reliable prediction among the models, ultimately leading to a more accurate final cohesion score.
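
The adjustment step can be sketched loosely as follows. This is one possible reading of the description above, not the paper's algorithm. It assumes a hypothetical integer score scale of 0 to 5, uses mean absolute error on validation data as the "error expectation" for each model and score band, and lets each model vote for every score that falls inside its interval. The model names, data, and tie-breaking rule are all made up for illustration.

```python
from collections import defaultdict

def expected_errors(val_preds, val_true):
    """Mean absolute error per (model, rounded predicted score),
    estimated on validation data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model, preds in val_preds.items():
        for p, y in zip(preds, val_true):
            key = (model, round(p))
            sums[key] += abs(p - y)
            counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def combine(preds, errors, scores=range(0, 6), default_error=1.0):
    """Return the score covered by the most per-model intervals
    [p - e, p + e]; ties go to the score nearest the mean prediction."""
    votes = {s: 0 for s in scores}
    for model, p in preds.items():
        e = errors.get((model, round(p)), default_error)
        for s in scores:
            if p - e <= s <= p + e:
                votes[s] += 1
    mean_pred = sum(preds.values()) / len(preds)
    return max(votes, key=lambda s: (votes[s], -abs(s - mean_pred)))

# Hypothetical validation predictions from three models for four essays.
val_true = [1, 2, 3, 4]
val_preds = {
    "ridge": [1.4, 2.4, 2.6, 3.4],
    "svr": [1.1, 2.1, 3.2, 3.9],
    "bert": [0.8, 2.2, 3.1, 4.3],
}
errors = expected_errors(val_preds, val_true)

# Predictions for a new essay, combined into one final score.
final_score = combine({"ridge": 3.4, "svr": 2.9, "bert": 3.2}, errors)
```

In this toy run the intervals of two of the three models contain the score 3 and no other score receives a vote, so the combined score is 3. A model with large historical error in a band gets a wide interval and so constrains the result less than a model that has been precise there.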

The experimental results were promising. On the Essay-BR dataset, the IRT-Multiregressor All BERT approach outperformed traditional machine learning models, ensemble methods, and standalone BERT models on agreement metrics such as Kappa and Quadratic Weighted Kappa (QWK). It achieved a Kappa of 0.516 and a QWK of 0.656, indicating closer agreement with human assessments than the baselines reached. Similar trends were observed on the Brazilian Portuguese Narrative Essays dataset, where the IRT-Multiregressor All BERT approach again showed superior performance.
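
For readers unfamiliar with these agreement metrics, the sketch below implements the standard definitions of Cohen's Kappa and QWK in plain Python. The function name agreement_kappa and the toy labels are illustrative assumptions. The study's own evaluation code is not described in this summary.

```python
def agreement_kappa(y_true, y_pred, n_classes, quadratic=False):
    """Cohen's kappa, or quadratic weighted kappa when quadratic=True.
    Labels must be integers in range(n_classes)."""
    n = n_classes
    # Observed confusion matrix: rows are human scores, columns model scores.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    total = len(y_true)
    row = [sum(observed[i]) for i in range(n)]
    col = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            if quadratic:
                w = (i - j) ** 2 / (n - 1) ** 2   # penalty grows with distance
            else:
                w = 0.0 if i == j else 1.0        # every miss costs the same
            expected = row[i] * col[j] / total    # agreement expected by chance
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

# Toy example with three score levels and a single one-level disagreement.
human = [0, 1, 2, 2, 1, 0]
model = [0, 1, 2, 1, 1, 0]
kappa = agreement_kappa(human, model, n_classes=3)                # 0.75
qwk = agreement_kappa(human, model, n_classes=3, quadratic=True)  # about 0.857
```

QWK is higher than plain Kappa here because the only error is off by a single level, which the quadratic weights penalize lightly. That property is why QWK is a common choice for ordinal essay scores.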

This research has significant implications for education. By providing more accurate and reliable automated cohesion scores, the IRT-Multiregressor approach can enable personalized feedback systems, helping students identify specific writing deficiencies like limited lexical diversity or inconsistent use of referential cohesion. Educators could then recommend targeted exercises, fostering self-regulated learning. Furthermore, insights from this model can inform curriculum design, highlighting common writing challenges across student populations and supporting data-driven pedagogical interventions.

The study also emphasizes the importance of ethical and explainable AI (xAI) in education. By integrating xAI techniques, the model could explain *why* a particular score was given, highlighting factors such as insufficient connectives or limited syntactic diversity. This transparency can build trust among educators and students and, when combined with generative AI tools like ChatGPT within Learning Management Systems, could provide actionable, personalized feedback, empowering students to effectively address their writing challenges.

While the study demonstrates significant advancements, the authors acknowledge limitations, including the focus on Portuguese essays and the exploration of a limited range of large language models. Future research will aim to expand the approach to other text types, languages, and real-world educational environments, potentially leading to the development of practical learning analytics platforms for automated essay assessment and feedback.
