
AI’s Role in Education: A Study on Generating High-Quality Exams

TLDR: A large-scale field study involving nearly 1700 students across 91 college classes found that AI-generated exam questions, created using an iterative refinement strategy, performed comparably to expert-created questions from standardized exams. The AI-generated questions were slightly easier but more discriminating, suggesting that AI can effectively produce high-quality, customized assessments, potentially reducing educator workload and increasing access to tailored instruction.

Artificial intelligence, particularly large language models (LLMs), is rapidly changing how we approach education. While these powerful tools present challenges, they also offer exciting opportunities to make teaching and learning more efficient and accessible. One promising area is the creation of customized exams, tailored to specific course content.

Historically, generating high-quality exam questions has been a time-consuming task for educators, often taking hours or even days. This can detract from their ability to engage with students in other meaningful ways. To address this, researchers have long explored automated methods for question generation. The advent of LLMs marks a new frontier in this field, with many recent efforts focusing on using AI to create exam questions.

A new study introduces and evaluates an innovative strategy for generating exam questions. This method involves an iterative refinement process, where questions are repeatedly produced, assessed, and improved through cycles of AI-generated critique and revision. This approach is similar to a technique called Self-Refine.

The researchers conducted a large-scale field study to evaluate the quality of these AI-generated questions. The study involved 91 classes across various subjects like computer science, mathematics, and chemistry, in dozens of colleges throughout the United States, with nearly 1700 students participating. The analysis, which used a standard method called item response theory (IRT), suggests that for the students in this study, the AI-generated questions performed comparably to questions created by human experts for standardized exams.

The study found that the AI-generated questions were, on average, slightly easier but also more effective at distinguishing between students of different abilities compared to expert-produced questions. This indicates that AI has the potential to make high-quality assessments more readily available, benefiting both teachers and students.

The process for generating AI exams involved two main stages. First, the AI independently generated questions for each class using instructor-provided course materials like descriptions, syllabi, and past assignments. This material was used as context for both the question generator and an AI judge. The system would generate a question, an AI judge would evaluate its quality (labeling it ‘good’ or ‘bad’), and this feedback would then be used to refine future question generation. This cycle continued until 20 ‘good’ questions were produced.
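The generate–judge–refine cycle described above can be sketched in code. This is a minimal illustration, not the study's actual pipeline: `generate_question` and `judge_question` are hypothetical stand-ins for LLM calls, stubbed here so the loop is runnable, and the feedback-accumulation logic is an assumption about how the critique is fed back.

```python
import random

random.seed(0)  # for a reproducible run of this stub

def generate_question(context, feedback):
    """Stand-in for an LLM call that drafts a question from course materials,
    conditioning on the critiques accumulated so far."""
    return {"text": f"Question about {context} (after {len(feedback)} critiques)"}

def judge_question(context, question):
    """Stand-in for the AI judge: labels the draft 'good' or 'bad'
    and returns a critique for refinement."""
    return random.choice(["good", "bad"]), "needs clearer distractors"

def generate_good_questions(context, target=20):
    """Loop until the judge has accepted `target` questions, feeding
    each rejection's critique back into the next generation."""
    good, feedback = [], []
    while len(good) < target:
        question = generate_question(context, feedback)
        label, critique = judge_question(context, question)
        if label == "good":
            good.append(question)
        else:
            feedback.append(critique)  # critique steers the next draft
    return good

questions = generate_good_questions("intro statistics syllabus")
```

In a real system the two stubs would be prompts to the generator and judge models, with the course description, syllabus, and past assignments included as context in both.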

From these 20 questions, a final 10-question exam was assembled. These questions underwent a final round of AI judging to assess difficulty, appropriateness, and correctness of the provided answer. The 10 hardest questions, as evaluated by the AI judge, were selected for the final test. If fewer than 10 appropriate questions remained, more questions were generated and re-evaluated.
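The assembly step can be sketched as follows. Again this is an illustrative reconstruction, assuming the paper's description: `final_judge` is a hypothetical stand-in for the final AI-judging pass, and the candidate difficulties are made-up values.

```python
def final_judge(question):
    """Stand-in for the final AI-judge pass over a candidate question;
    returns whether it is appropriate and its judged difficulty."""
    return True, question["difficulty"]

def assemble_exam(candidates, exam_size=10):
    """Keep appropriate questions, then select the `exam_size` hardest."""
    kept = []
    for question in candidates:
        appropriate, difficulty = final_judge(question)
        if appropriate:
            kept.append((difficulty, question))
    # In the study, fewer than exam_size survivors would trigger another
    # round of generation; this sketch simply requires enough candidates.
    assert len(kept) >= exam_size, "would trigger another generation round"
    kept.sort(key=lambda pair: pair[0], reverse=True)  # hardest first
    return [question for _, question in kept[:exam_size]]

# Twenty candidates with illustrative difficulty scores in [0, 1):
candidates = [{"id": i, "difficulty": i / 20} for i in range(20)]
exam = assemble_exam(candidates)
```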

For benchmarking, the AI-generated questions were compared against high-quality human-generated questions from a publicly available 2012 AP Statistics practice exam. An LLM was used to select the 10 most appropriate questions from this bank for each statistics class in the study, prioritizing more difficult questions to align with college-level material.

The field study involved 182 classes at American colleges. Students took a common pre-test of quantitative reasoning skills at the beginning of the semester and a tailored exam (either AI-generated or standardized) near the end. The final dataset included responses from 91 classes and 1686 students. The results showed that AI-generated exams were more discriminating than standardized tests and were maximally informative for students with slightly below-average ability, though they still provided strong coverage across the ability spectrum.
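To make "more discriminating" and "maximally informative" concrete, here is a minimal sketch of the two-parameter logistic (2PL) IRT model commonly used in such analyses. The parameter values are illustrative only, not estimates from the study: ability is theta, and each item has a discrimination a and a difficulty b.

```python
import math

def p_correct(theta, a, b):
    """2PL model: P(correct | ability theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of an item at ability theta: a^2 * p * (1 - p).
    Higher discrimination a yields a sharper, more informative item;
    information peaks where theta equals the difficulty b."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# An easier but more discriminating item (a=2.0, b=-0.3), echoing the
# study's finding, versus a harder, less discriminating one (a=1.0, b=0.5):
print(p_correct(0.0, 2.0, -0.3))        # > 0.5: easier for an average student
print(item_information(-0.3, 2.0, -0.3))  # peak information at theta = b
print(item_information(-0.3, 1.0, 0.5))   # flatter, less informative item
```

The first item is most informative for slightly below-average ability (theta near -0.3), which matches the pattern the study reports for the AI-generated exams.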

While the study had some limitations, such as evaluating standardized questions only in statistics courses and focusing on multiple-choice questions, it demonstrates the significant potential of AI to create high-quality, customized assessments at scale across various subjects. This could substantially reduce instructor workloads, increase access to quality assessments, and potentially improve learning outcomes through more tailored instruction. For more details, you can read the full research paper here.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
