TLDR: A new research paper introduces an Item Response Theory (IRT)-based approach, called IRT-Multiregressor, to enhance the accuracy of automated essay cohesion assessment. By treating machine learning models as ‘respondents’ and essays as ‘items,’ IRT is used to adjust and combine predictions from various AI algorithms, including traditional models and BERT-based deep learning. Tested on two large Portuguese essay datasets, the method demonstrated superior performance in aligning automated scores with human evaluations, offering significant potential for personalized feedback, curriculum design, and explainable AI in educational settings.
Assessing the quality of written essays is a cornerstone of education, helping to evaluate students’ writing abilities and learning outcomes. Among the many facets of a good essay, textual cohesion stands out as crucial. Cohesion ensures that different parts of a text are meaningfully connected, making the writing clear and easy to understand. However, automatically scoring cohesion in essays has long been a significant challenge in educational artificial intelligence.
Traditional machine learning algorithms used for text evaluation often overlook the unique characteristics of individual essays. This can lead to inconsistencies between automated scores and human evaluations, making it difficult to provide accurate and helpful feedback to students. The subjective and time-consuming nature of manual essay assessment further highlights the need for robust automated solutions.
A recent research paper, titled ‘Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach,’ introduces a method to tackle this problem. The study proposes an approach based on Item Response Theory (IRT) to refine and adjust the cohesion scores generated by machine learning models. IRT is a statistical framework commonly used in educational testing to model how individuals respond to test items based on their underlying abilities and the characteristics of the items themselves. In this application, machine learning models are treated as ‘respondents,’ and individual essays are considered ‘test items,’ so that prediction quality can be analyzed per model and per essay rather than only in aggregate.
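For readers unfamiliar with IRT, the following is the standard two-parameter logistic (2PL) form of the model. The article does not state which IRT variant the paper uses, so this is only an illustration of the general idea. Under the respondent/item analogy described above, \theta_j would stand for the ability of model j, b_i for the difficulty of essay i, and a_i for how sharply essay i separates strong models from weak ones; this mapping of symbols is an assumption made here for illustration.

```latex
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}
```

Here X_{ij} = 1 means that respondent j answers item i correctly, which in this setting could be read as model j predicting the cohesion score of essay i acceptably well.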
The researchers, Bruno Alexandre Rosa, Hilário Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, and Rafael Ferreira Mello, aimed to investigate how effectively an IRT-based model could adjust cohesion scores predicted by various AI algorithms. Their approach, named IRT-Multiregressor, integrates IRT concepts to combine predictions from multiple machine learning models, including traditional algorithms and advanced BERT-based deep learning models.
For their experiments, the researchers used two datasets of Portuguese essays: the extended Essay-BR dataset, comprising 6,563 essays in the style of Brazil’s National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays dataset, which includes 1,235 essays written by public school students in the 5th to 9th grades. These datasets provided a rich and diverse corpus for testing the proposed method, although both exhibit imbalances in the distribution of cohesion scores, a common challenge in such data.
To prepare the essays for machine learning, the team extracted a comprehensive set of 325 linguistic features. These features, categorized into 13 groups such as Coh-Metrix, LIWC, Connectives, and Lexical Diversity, transform textual data into numerical vectors that AI models can process. Additionally, the study leveraged BERT, a powerful pre-trained language model, to generate contextual embeddings, capturing deep semantic relationships within the essays.
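To give a sense of what such linguistic features look like, here is a minimal sketch that computes two simple ones: a type-token ratio (a basic lexical diversity measure) and the density of connectives. The function names and the short connective list are hypothetical and chosen for illustration; they are not the paper's 325 features or its actual extraction pipeline.

```python
import re

# Illustrative list of Portuguese connectives (an assumption, not the paper's list).
CONNECTIVES = {"mas", "porém", "portanto", "porque", "contudo",
               "entretanto", "assim", "pois"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cohesion_features(text):
    """Return two toy cohesion-related features as a dict of floats."""
    tokens = tokenize(text)
    if not tokens:
        return {"type_token_ratio": 0.0, "connective_density": 0.0}
    n = len(tokens)
    return {
        # Share of distinct words among all words.
        "type_token_ratio": len(set(tokens)) / n,
        # Share of tokens that appear in the connective list.
        "connective_density": sum(t in CONNECTIVES for t in tokens) / n,
    }


features = cohesion_features(
    "Estudei muito, mas não passei; portanto, vou estudar mais."
)
```

Each essay would be mapped to a vector of such values, which is the numerical form that regression models consume.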
Various regression algorithms, including Bayesian Ridge, CatBoost Regressor, and Support Vector Regressor (SVR), along with ensemble methods like Voting and Stacked Regressors, were trained to predict cohesion scores. Two BERT-based language models, BERTimbau-base and DistilBERT, were also fine-tuned for this task.
The core of the IRT-Multiregressor approach lies in its ability to adjust the predictions of these diverse models. After the initial predictions, the IRT framework calculates error expectations for each model and cohesion score range. These error expectations define confidence intervals, which are then used to identify the most frequent and reliable prediction among the models, ultimately leading to a more accurate final cohesion score.
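The adjustment step described above can be sketched roughly as follows. This is one possible reading of the description, not the authors' implementation: the expected error is approximated here by the mean absolute error of each model within each score level on held-out data, and the final score is the level covered by the most prediction intervals. The score levels, function names, and the default error value are all assumptions.

```python
from collections import defaultdict
from statistics import mean

SCORE_LEVELS = [0, 1, 2, 3, 4, 5]  # hypothetical discrete cohesion levels


def nearest_level(value, levels):
    """Snap a continuous prediction to the closest discrete score level."""
    return min(levels, key=lambda s: abs(s - value))


def expected_errors(val_preds, val_true, levels):
    """Mean absolute error per (model, predicted score level) on validation data."""
    buckets = defaultdict(list)
    for model, preds in val_preds.items():
        for p, y in zip(preds, val_true):
            buckets[(model, nearest_level(p, levels))].append(abs(p - y))
    return {key: mean(errs) for key, errs in buckets.items()}


def combine(preds, errors, levels, default_error=1.0):
    """Pick the score level that falls inside the most model intervals."""
    votes = defaultdict(int)
    closest = {}
    for model, p in preds.items():
        # Interval half-width = expected error of this model at this level.
        e = errors.get((model, nearest_level(p, levels)), default_error)
        for s in levels:
            if p - e <= s <= p + e:
                votes[s] += 1
                closest[s] = min(closest.get(s, float("inf")), abs(p - s))
    if not votes:
        # No interval covers any level: fall back to the rounded mean prediction.
        return nearest_level(mean(preds.values()), levels)
    # Most votes wins; ties go to the level nearest to some raw prediction.
    return max(votes, key=lambda s: (votes[s], -closest[s]))


val_true = [1, 2, 3, 4]
val_preds = {"a": [1.2, 2.1, 2.8, 4.4], "b": [2.0, 2.9, 3.1, 3.0]}
errors = expected_errors(val_preds, val_true, SCORE_LEVELS)
final_score = combine({"a": 2.9, "b": 3.6}, errors, SCORE_LEVELS)
```

In this toy run, model "a" has a narrow interval around 2.9 and model "b" a wide one around 3.6; level 3 is the only level both intervals cover, so it is selected.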
The experimental results were promising. On the Essay-BR dataset, the IRT-Multiregressor All BERT approach significantly outperformed traditional machine learning models, ensemble methods, and standalone BERT models on metrics such as Kappa and Quadratic Weighted Kappa (QWK). For instance, it achieved a Kappa of 0.516 and a QWK of 0.656, a notable improvement in agreement with human assessments. Similar trends were observed on the Brazilian Portuguese Narrative Essays dataset, where the same approach again showed superior performance.
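For context on the QWK metric mentioned above, here is a minimal pure-Python sketch of the standard quadratic weighted kappa computation. It assumes scores are integer-coded from 0 to n_levels - 1; the function name and the toy ratings are illustrative and unrelated to the paper's data.

```python
def quadratic_weighted_kappa(human, machine, n_levels):
    """Agreement between two raters, penalizing large disagreements more.

    Assumes both lists hold integers in range(n_levels) and that the
    ratings are not all identical (otherwise the denominator is zero).
    """
    n = n_levels
    # Confusion matrix of human (rows) versus machine (columns) scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        observed[h][m] += 1
    total = len(human)
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_m[j] / total
    return 1.0 - num / den


qwk = quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 1, 2], n_levels=3)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a higher QWK indicates automated scores that track human scores more closely.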
This research has significant implications for education. By providing more accurate and reliable automated cohesion scores, the IRT-Multiregressor approach can enable personalized feedback systems, helping students identify specific writing deficiencies like limited lexical diversity or inconsistent use of referential cohesion. Educators could then recommend targeted exercises, fostering self-regulated learning. Furthermore, insights from this model can inform curriculum design, highlighting common writing challenges across student populations and supporting data-driven pedagogical interventions.
The study also emphasizes the importance of ethical and explainable AI (xAI) in education. By integrating xAI techniques, the model could explain *why* a particular score was given, highlighting factors such as insufficient connectives or limited syntactic diversity. This transparency can build trust among educators and students and, when combined with generative AI tools like ChatGPT within Learning Management Systems, could provide actionable, personalized feedback, empowering students to effectively address their writing challenges.
While the study demonstrates significant advancements, the authors acknowledge limitations, including the focus on Portuguese essays and the exploration of a limited range of large language models. Future research will aim to expand the approach to other text types, languages, and real-world educational environments, potentially leading to the development of practical learning analytics platforms for automated essay assessment and feedback.


