TLDR: This research introduces AuDITR (Automated Detection of Inauthentic Templated Responses), a machine learning system designed to identify when test takers use memorized templates in high-stakes English Language Assessments to artificially inflate their scores. The study details the development of a feature-based random forest model, the challenges of data annotation, and the critical need for continuous model updates due to “adaptive adversarial drift” where test takers adapt to detection methods. The findings show the model’s effectiveness in maintaining high precision, crucial for fair assessment, and discuss future research directions to enhance detection capabilities.
In the world of high-stakes English Language Proficiency (ELP) tests, where scores can open doors to education, employment, and even national residency, the integrity of assessment is paramount. However, a growing challenge has emerged: low-skill test takers employing memorized “templates” in essay questions to “game” automated scoring systems and achieve higher scores than their true ability warrants.
A new research paper, “Automatic Detection of Inauthentic Templated Responses in English Language Assessments,” introduces a crucial task: the automated detection of inauthentic, templated responses (AuDITR). This study, conducted by Yashad Samant, Lee Becker, Scott Hellman, Bradley Behan, Sarah Hughes, and Joshua Southerland, describes a machine learning-based approach to tackle this problem and highlights the importance of regularly updating these detection models in real-world applications.
The Challenge of Templated Responses
Unlike simply memorizing a high-scoring essay, templates offer a more flexible strategy. They are fixed texts with “slots” that test takers can customize to fit various essay prompts. This allows them to apply a single template to many different questions, making it a popular tactic. A quick online search for terms like “PTE templates” reveals a vast ecosystem of forums, social media influencers, and test preparation companies promoting these templates as a guaranteed path to high scores.
These templates are designed to mimic effective writing, incorporating advanced vocabulary, discourse markers, varied sentence structures, and minimal errors. The key difference is the inclusion of gaps for customization, which allows the test taker to superficially address the prompt without demonstrating genuine language proficiency.
Developing a Detection System: AuDITR
To build a system capable of identifying these templated responses, the researchers first needed data. They developed a corpus of responses from the Pearson Test of English essay items. Initially, they considered a simple binary distinction (templated vs. non-templated), but found it difficult to label consistently, especially for responses that mixed templated and authentic writing. They opted for a three-level scoring scheme: None, Low, or High templating detected. For the machine learning experiments, these levels were collapsed into a binary classification: High templating (labeled 1) versus Low or None (labeled 0).
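The collapse from three annotation levels to a binary training target can be sketched in a few lines. This is only an illustration: the `Templating` enum and the `to_binary` helper are hypothetical names, not taken from the paper.

```python
from enum import Enum


class Templating(Enum):
    """Three-level human annotation (names are illustrative)."""
    NONE = 0
    LOW = 1
    HIGH = 2


def to_binary(label: Templating) -> int:
    """Map High to 1 and both Low and None to 0, as the article describes."""
    return 1 if label is Templating.HIGH else 0


# Made-up annotations for four responses.
human_labels = [Templating.NONE, Templating.LOW, Templating.HIGH, Templating.LOW]
binary_targets = [to_binary(label) for label in human_labels]
print(binary_targets)
```

Note that Low responses end up in the negative class, so partially templated writing is treated the same as authentic writing during training.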
The AuDITR model doesn’t operate in isolation; it works alongside automated essay scoring models. A critical requirement for this system is high precision: when it flags a response as templated, that flag must almost always be correct, so that legitimate test takers are not unfairly penalized. Instead of relying solely on complex deep learning models like Transformers, which can overfit to known templates, the researchers engineered interpretable features. These features quantify how much of a response overlaps with known templates and with the prompt text itself. For instance, features include the “number of non-template tokens” and “percent of authentic tokens,” which measure the parts of the response that are unique to the test taker and not part of a template or the prompt.
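To make the overlap features concrete, here is a minimal sketch of how they might be computed. The article does not give the paper's exact definitions, so everything here is an assumption: the trigram matching, the tokenizer, the function names, and the sample template, prompt, and response are all invented for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; slot markers such as ___ are dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def covered_positions(response_tokens, source_tokens, n=3):
    """Indices of response tokens inside an n-gram that also occurs in the source."""
    source_ngrams = {
        tuple(source_tokens[i:i + n]) for i in range(len(source_tokens) - n + 1)
    }
    covered = set()
    for i in range(len(response_tokens) - n + 1):
        if tuple(response_tokens[i:i + n]) in source_ngrams:
            covered.update(range(i, i + n))
    return covered


def overlap_features(response, templates, prompt, n=3):
    """Assumed feature definitions, loosely following the names in the article."""
    tokens = tokenize(response)
    template_hits = set()
    for template in templates:
        template_hits |= covered_positions(tokens, tokenize(template), n)
    prompt_hits = covered_positions(tokens, tokenize(prompt), n)
    total = len(tokens)
    authentic = total - len(template_hits | prompt_hits)
    return {
        "num_non_template_tokens": total - len(template_hits),
        "pct_authentic_tokens": authentic / total if total else 0.0,
    }


# Invented example data.
template = "In this day and age the issue of ___ has sparked a heated debate"
prompt = "Do you think remote work benefits employees"
response = "In this day and age the issue of remote work has sparked a heated debate"

features = overlap_features(response, [template], prompt)
print(features)
```

In this toy case only the two slot-filling tokens ("remote work") fall outside the template, which is the kind of signal a downstream classifier could use.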
The model uses a random forest classifier, an ensemble of decision trees, to make its predictions. Experiments showed that while there was a slight drop in performance from training to testing, the model achieved 100% precision on the test set, meaning it had no false positives. This is a significant achievement for a system where false accusations could have serious consequences for test takers.
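Since precision is the headline metric, it helps to see how it is computed and why 100% precision does not mean every templated response is caught. The sketch below uses made-up labels and predictions, not the paper's data.

```python
def precision(y_true, y_pred):
    """Fraction of flagged responses that were truly templated: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Fraction of truly templated responses that were flagged: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 1 = High templating, 0 = Low or None.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

print(precision(y_true, y_pred))  # no false positives
print(recall(y_true, y_pred))     # but some templated responses are missed
```

Here the classifier never flags an authentic response, so precision is perfect, yet it misses half of the templated ones. A high-precision system accepts that trade-off to avoid false accusations.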
The “Cat-and-Mouse” Game: Adaptive Adversarial Drift
One of the most critical findings of the research is the phenomenon of “adaptive adversarial drift.” After the initial AuDITR model was deployed, the detection rate began to decline. This wasn’t because test takers stopped using templates, but because they switched to new ones. Online discussions and influencers quickly adapted, advising test takers that the “algorithm had changed” and promoting new strategies.
This dynamic creates a continuous “cat-and-mouse” game: each model release causes a spike in detection rates, followed by a decay as test takers find workarounds, necessitating further model updates. To combat this, the researchers not only manually updated their template lists but also developed new processes for automatically discovering new templates and sub-templates.
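The article does not describe how the automatic template discovery works. One simple heuristic, shown here purely as an assumption, is to look for word sequences that recur across many different responses but do not appear in any known template. The function name, parameters, and sample responses are all invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; slot markers such as ___ are dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def candidate_template_ngrams(responses, known_templates, n=4, min_responses=3):
    """Return (phrase, response_count) pairs for recurring n-grams not yet known."""
    known = set()
    for template in known_templates:
        known |= ngrams(tokenize(template), n)

    # Count each n-gram once per response, so the count is a document frequency.
    doc_freq = Counter()
    for response in responses:
        doc_freq.update(ngrams(tokenize(response), n))

    return [
        (" ".join(gram), count)
        for gram, count in doc_freq.most_common()
        if count >= min_responses and gram not in known
    ]


# Invented responses: three share an unseen template, one is authentic.
responses = [
    "It is widely believed that remote work is a double edged sword for society",
    "It is widely believed that fast food is a double edged sword for society",
    "It is widely believed that tourism is a double edged sword for society",
    "I think tourism helps small towns because visitors spend money there",
]
known_templates = ["In this day and age the issue of ___ has sparked a heated debate"]

found = candidate_template_ngrams(responses, known_templates)
for phrase, count in found:
    print(count, phrase)
```

Candidates surfaced this way would still need human review before being added to a template list, since common idioms can also recur across authentic responses.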
Looking Ahead
Despite the ongoing challenge of adaptive adversarial drift, the researchers are encouraged by a shift in online advice. Videos and forums are increasingly reminding test takers that templates alone are insufficient for high scores; they must write topically, insert more than single words into gaps, and ensure their added text is grammatically correct. Essentially, to succeed, test takers are being advised to demonstrate genuine English proficiency.
Future research aims to further enhance AuDITR by exploring techniques from forensic linguistics, author identification, and plagiarism detection. There’s also promise in using generative language models to understand templated behavior more deeply and in developing “drift-sensitive learners” that can automatically adapt to new adversarial strategies, reducing the need for manual updates. The researchers also plan to return to a ternary labeling system for human scoring, which will provide richer data for model training and evaluation.


