TLDR: This study explores using language models (Gemma 2B and GPT-3.5 175B) to automatically generate multiple-choice questions for K–12 morphological assessment. It compares a fine-tuned small model (Gemma) with a large untuned model (GPT-3.5) across seven prompting strategies. Results show that structured prompting, especially chain-of-thought and sequential designs, significantly improves Gemma’s output quality, making it comparable to or better than GPT-3.5 in pedagogical alignment despite GPT-3.5’s stronger linguistic fluency. The study highlights the importance of combining automated metrics with expert and LLM-simulated evaluations for domain-specific content.
In the evolving landscape of K–12 education, the demand for high-quality assessment tools is constant. However, creating effective multiple-choice questions (MCQs) manually is a time-consuming and resource-intensive process. This challenge has led researchers to explore Automated Item Generation (AIG) using language models. A recent study delves into how different language models and prompting strategies can be leveraged to create MCQs for morphological assessment, aiming to make test development more efficient and consistent.
The research, titled “Prompting Strategies for Language Model-Based Item Generation in K–12 Education: Bridging the Gap Between Small and Large Language Models,” was conducted by Mohammad Amini, Babak Ahmadi, Xiaomeng Xiong, Yilin Zhang, and Christopher Qiao from the University of Florida. Their work addresses critical limitations in current AIG approaches, such as ensuring questions align with specific learning objectives, generating reliable incorrect options (distractors), and controlling the difficulty of items.
Comparing Language Models and Prompting Techniques
The study adopted a two-pronged approach. First, it compared a smaller, fine-tuned model, Gemma (2B parameters), with a much larger, off-the-shelf model, GPT-3.5 (175B parameters). The goal was to see if a smaller model, with careful tuning, could match or even surpass the performance of a larger model in a specialized domain like morphological assessment.
Second, the researchers evaluated seven structured prompting strategies. These included basic methods like zero-shot (giving only the task instruction) and few-shot (providing a few examples), as well as more advanced techniques such as chain-of-thought (CoT), which guides the model through step-by-step reasoning. They also explored role-based prompting (assigning roles like “teacher” or “psychometrician”) and sequential prompting (breaking the generation process into multiple stages), including combinations of these strategies.
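To make these strategy families concrete, the sketch below shows what such prompts might look like in Python. The wording, the placeholder word, and the grade level are illustrative assumptions for this article, not the templates used in the study.

```python
# Illustrative prompt templates for the strategy families described above.
# The phrasing is a sketch, not the paper's exact prompts; the target word
# and grade level are hypothetical placeholders.

ZERO_SHOT = (
    "Write a multiple-choice question that assesses the morphology of the "
    "word '{word}' for a grade {grade} student. Provide one correct answer "
    "and three distractors."
)

FEW_SHOT = (
    "Here is an example item:\n"
    "Q: Which prefix does the word 'unhappiness' contain?\n"
    "A) un-  B) -ness  C) happi-  D) -s   (Correct: A)\n\n"
    "Now write a similar multiple-choice question for the word '{word}' at "
    "grade {grade}, with one correct answer and three distractors."
)

CHAIN_OF_THOUGHT = (
    "Think step by step. First break the word '{word}' into its root and "
    "affixes. Then decide which morphological feature is appropriate to test "
    "at grade {grade}. Finally, write one multiple-choice question with one "
    "correct answer and three plausible distractors."
)

ROLE_BASED = (
    "You are an experienced K-12 reading teacher and psychometrician. Write a "
    "grade-{grade} multiple-choice question testing the morphology of "
    "'{word}', with one correct answer and three distractors."
)

# Sequential prompting splits generation into stages; each stage's output
# feeds the next prompt.
SEQUENTIAL_STAGES = [
    "Stage 1: List the root and affixes of the word '{word}'.",
    "Stage 2: Using that analysis, draft a question stem suitable for grade {grade}.",
    "Stage 3: Write the correct answer and three distractors, each reflecting a "
    "plausible morphological misconception.",
]

def build_prompt(template: str, word: str, grade: int) -> str:
    """Fill a template with the target word and grade level."""
    return template.format(word=word, grade=grade)

if __name__ == "__main__":
    print(build_prompt(CHAIN_OF_THOUGHT, word="rebuild", grade=4))
```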
Evaluation: Beyond Surface-Level Metrics
To assess the quality of the generated MCQs, the study used a comprehensive evaluation framework. This included automated metrics that measured grammar, complexity, readability, and fluency. However, recognizing that linguistic quality doesn’t always equate to educational effectiveness, the researchers also conducted human expert evaluations. Experts scored items on five key dimensions: instruction clarity, accuracy of the correct answer, quality of distractors, word difficulty appropriateness, and task difficulty alignment. To scale this human-aligned scoring, GPT-4.1, trained on expert-rated samples, was used to simulate human judgment.
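As an illustration of how such LLM-simulated rubric scoring could work in practice, here is a minimal sketch. Only the five dimensions come from the study; the rubric wording, the 1–5 scale, and the score_item helper are assumptions for demonstration, and the call uses the standard OpenAI chat-completions API as one possible backend.

```python
# A minimal sketch of rubric-based scoring in the spirit of the expert /
# LLM-simulated evaluation described above. The scale and prompt wording are
# illustrative assumptions, not the study's actual protocol.
from openai import OpenAI

RUBRIC_DIMENSIONS = [
    "instruction clarity",
    "accuracy of the correct answer",
    "quality of distractors",
    "word difficulty appropriateness",
    "task difficulty alignment",
]

def build_rubric_prompt(item_text: str) -> str:
    """Assemble a scoring prompt listing the five rubric dimensions."""
    dims = "\n".join(f"- {d}" for d in RUBRIC_DIMENSIONS)
    return (
        "Rate the following multiple-choice item on each dimension below "
        "using a 1-5 scale, and return one line per dimension as "
        "'<dimension>: <score>'.\n\n"
        f"Dimensions:\n{dims}\n\nItem:\n{item_text}"
    )

def score_item(item_text: str, model: str = "gpt-4.1") -> str:
    """Ask an LLM judge to score one generated item against the rubric."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_rubric_prompt(item_text)}],
        temperature=0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample_item = (
        "Which prefix in 'disagree' means 'not'?\n"
        "A) dis-  B) -gree  C) agree  D) -ee"
    )
    print(score_item(sample_item))
```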
Key Findings and Insights
The results revealed several important insights. While GPT-3.5 generally performed better in automated metrics like grammar and fluency when used out-of-the-box, Gemma’s performance significantly improved with structured prompting. Specifically, strategies combining chain-of-thought and sequential design led to Gemma producing items that were better aligned with morphological constructs and grade-level appropriateness. This suggests that even mid-sized models, when supported by effective prompting and fine-tuning, can generate high-quality, valid assessment items.
A crucial finding was the discrepancy between the automated metrics and the human and GPT-4.1-simulated evaluations. Automated metrics often favored GPT-3.5 for its linguistic coherence, but expert and simulated-expert scores frequently rated Gemma higher for pedagogical appropriateness and morphological correctness. This highlights that for educational content, relying solely on linguistic metrics is insufficient; domain-specific criteria are paramount.
The study also underscored the importance of prompt design in maintaining construct validity. Prompts that explicitly guided the model to consider morphological details, such as breaking words into affixes or specifying reading levels, consistently reduced errors and improved the educational value of the items.
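A lightweight way to enforce such constraints after generation is to run simple automated checks over each draft. The sketch below, assuming items arrive as plain text, flags drafts whose estimated readability drifts from the target grade or whose text never mentions the intended affix; the tolerance value and the validate_item helper are illustrative, not part of the paper.

```python
# Simple post-generation checks for construct-relevant constraints.
# textstat's flesch_kincaid_grade is a standard readability estimate; the
# grade tolerance and the affix-presence check are illustrative assumptions.
import textstat

def validate_item(item_text: str, target_grade: int, required_affix: str,
                  tolerance: float = 1.5) -> list[str]:
    """Return a list of human-readable problems; an empty list means the draft passes."""
    problems = []
    grade = textstat.flesch_kincaid_grade(item_text)
    if abs(grade - target_grade) > tolerance:
        problems.append(
            f"readability grade {grade:.1f} is far from target {target_grade}"
        )
    if required_affix.lower() not in item_text.lower():
        problems.append(
            f"intended affix '{required_affix}' does not appear in the item"
        )
    return problems

if __name__ == "__main__":
    draft = ("Which part of the word 'rebuild' means 'again'?\n"
             "A) re-  B) build  C) -d  D) -ild")
    print(validate_item(draft, target_grade=4, required_affix="re-"))
```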
Implications for Educational Technology
For real-world deployment, the study offers two practical workflows. One uses a large, off-the-shelf language model like GPT-3.5 for immediate, decent-quality item generation, though the output may lack domain-specific nuance without advanced prompting. The other, more cost-effective in the long run, uses a mid-scale model like Gemma, fine-tuned and guided by carefully designed multi-step prompts, which can outperform larger models on domain-oriented scores.
The researchers emphasize that a human-in-the-loop or LLM-assisted review phase remains vital to ensure appropriate difficulty alignment, morphological accuracy, and minimal distractor confusion. This comprehensive approach, combining automated metrics with expert judgment and large-model simulation, is crucial for developing robust AIG pipelines in K–12 educational settings.
For more detailed information, you can access the full research paper here.


