TLDR: This study explores using language models (Gemma 2B and GPT-3.5 175B) to automatically generate multiple-choice questions for K–12 morphological assessment. It compares a fine-tuned small model (Gemma) with a large untuned model (GPT-3.5) across seven prompting strategies. Results show that structured prompting, especially chain-of-thought and sequential designs, significantly improves Gemma’s output quality, making it comparable to or better than GPT-3.5 in pedagogical alignment despite GPT-3.5’s stronger linguistic fluency. The study highlights the importance of combining automated metrics with expert and LLM-simulated evaluations for domain-specific content.
In the evolving landscape of K–12 education, the demand for high-quality assessment tools is constant. However, creating effective multiple-choice questions (MCQs) manually is a time-consuming and resource-intensive process. This challenge has led researchers to explore Automated Item Generation (AIG) using language models. A recent study delves into how different language models and prompting strategies can be leveraged to create MCQs for morphological assessment, aiming to make test development more efficient and consistent.
The research, titled “Prompting Strategies for Language Model-Based Item Generation in K–12 Education: Bridging the Gap Between Small and Large Language Models,” was conducted by Mohammad Amini, Babak Ahmadi, Xiaomeng Xiong, Yilin Zhang, and Christopher Qiao from the University of Florida. Their work addresses critical limitations in current AIG approaches, such as ensuring questions align with specific learning objectives, generating reliable incorrect options (distractors), and controlling the difficulty of items.
Comparing Language Models and Prompting Techniques
The study adopted a two-pronged approach. First, it compared a smaller, fine-tuned model, Gemma (2B parameters), with a much larger, off-the-shelf model, GPT-3.5 (175B parameters). The goal was to see if a smaller model, with careful tuning, could match or even surpass the performance of a larger model in a specialized domain like morphological assessment.
Second, the researchers evaluated seven structured prompting strategies. These included basic methods like zero-shot (giving only the task instruction) and few-shot (providing a few examples), as well as more advanced techniques such as chain-of-thought (CoT), which guides the model through step-by-step reasoning. They also explored role-based prompting (assigning roles like “teacher” or “psychometrician”) and sequential prompting (breaking the generation process into multiple stages), including combinations of these strategies.
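To make these strategy families concrete, the sketch below shows what such prompts might look like in Python. The wording, the placeholder word, and the grade level are illustrative assumptions for this article, not the templates used in the study.

```python
# Illustrative prompt templates for the strategy families described above.
# The phrasing is a sketch, not the paper's exact prompts; the target word
# and grade level are hypothetical placeholders.

ZERO_SHOT = (
    "Write a multiple-choice question that assesses the morphology of the "
    "word '{word}' for a grade {grade} student. Provide one correct answer "
    "and three distractors."
)

FEW_SHOT = (
    "Here is an example item:\n"
    "Q: Which prefix does the word 'unhappiness' contain?\n"
    "A) un-  B) -ness  C) happi-  D) -s   (Correct: A)\n\n"
    "Now write a similar multiple-choice question for the word '{word}' at "
    "grade {grade}, with one correct answer and three distractors."
)

CHAIN_OF_THOUGHT = (
    "Think step by step. First break the word '{word}' into its root and "
    "affixes. Then decide which morphological feature is appropriate to test "
    "at grade {grade}. Finally, write one multiple-choice question with one "
    "correct answer and three plausible distractors."
)

ROLE_BASED = (
    "You are an experienced K-12 reading teacher and psychometrician. Write a "
    "grade-{grade} multiple-choice question testing the morphology of "
    "'{word}', with one correct answer and three distractors."
)

# Sequential prompting splits generation into stages; each stage's output
# feeds the next prompt.
SEQUENTIAL_STAGES = [
    "Stage 1: List the root and affixes of the word '{word}'.",
    "Stage 2: Using that analysis, draft a question stem suitable for grade {grade}.",
    "Stage 3: Write the correct answer and three distractors, each reflecting a "
    "plausible morphological misconception.",
]

def build_prompt(template: str, word: str, grade: int) -> str:
    """Fill a template with the target word and grade level."""
    return template.format(word=word, grade=grade)

if __name__ == "__main__":
    print(build_prompt(CHAIN_OF_THOUGHT, word="rebuild", grade=4))
```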
Evaluation: Beyond Surface-Level Metrics
To assess the quality of the generated MCQs, the study used a comprehensive evaluation framework. This included automated metrics that measured grammar, complexity, readability, and fluency. However, recognizing that linguistic quality doesn’t always equate to educational effectiveness, the researchers also conducted human expert evaluations. Experts scored items on five key dimensions: instruction clarity, accuracy of the correct answer, quality of distractors, word difficulty appropriateness, and task difficulty alignment. To scale this human-aligned scoring, GPT-4.1, trained on expert-rated samples, was used to simulate human judgment.
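As an illustration of how such LLM-simulated rubric scoring could work in practice, here is a minimal sketch. Only the five dimensions come from the study; the rubric wording, the 1–5 scale, and the score_item helper are assumptions for demonstration, and the call uses the standard OpenAI chat-completions API as one possible backend.

```python
# A minimal sketch of rubric-based scoring in the spirit of the expert /
# LLM-simulated evaluation described above. The scale and prompt wording are
# illustrative assumptions, not the study's actual protocol.
from openai import OpenAI

RUBRIC_DIMENSIONS = [
    "instruction clarity",
    "accuracy of the correct answer",
    "quality of distractors",
    "word difficulty appropriateness",
    "task difficulty alignment",
]

def build_rubric_prompt(item_text: str) -> str:
    """Assemble a scoring prompt listing the five rubric dimensions."""
    dims = "\n".join(f"- {d}" for d in RUBRIC_DIMENSIONS)
    return (
        "Rate the following multiple-choice item on each dimension below "
        "using a 1-5 scale, and return one line per dimension as "
        "'<dimension>: <score>'.\n\n"
        f"Dimensions:\n{dims}\n\nItem:\n{item_text}"
    )

def score_item(item_text: str, model: str = "gpt-4.1") -> str:
    """Ask an LLM judge to score one generated item against the rubric."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_rubric_prompt(item_text)}],
        temperature=0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample_item = (
        "Which prefix in 'disagree' means 'not'?\n"
        "A) dis-  B) -gree  C) agree  D) -ee"
    )
    print(score_item(sample_item))
```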
Key Findings and Insights
The results revealed several important insights. While GPT-3.5 generally performed better in automated metrics like grammar and fluency when used out-of-the-box, Gemma’s performance significantly improved with structured prompting. Specifically, strategies combining chain-of-thought and sequential design led to Gemma producing items that were better aligned with morphological constructs and grade-level appropriateness. This suggests that even mid-sized models, when supported by effective prompting and fine-tuning, can generate high-quality, valid assessment items.
A crucial finding was the discrepancy between the automated metrics and the human and GPT-4.1-simulated evaluations. Automated metrics often favored GPT-3.5 for its linguistic coherence, but expert and simulated-expert scores frequently rated Gemma higher for pedagogical appropriateness and morphological correctness. This highlights that for educational content, relying solely on linguistic metrics is insufficient; domain-specific criteria are paramount.
The study also underscored the importance of prompt design in maintaining construct validity. Prompts that explicitly guided the model to consider morphological details, such as breaking words into affixes or specifying reading levels, consistently reduced errors and improved the educational value of the items.
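A lightweight way to enforce such constraints after generation is to run simple automated checks over each draft. The sketch below, assuming items arrive as plain text, flags drafts whose estimated readability drifts from the target grade or whose text never mentions the intended affix; the tolerance value and the validate_item helper are illustrative, not part of the paper.

```python
# Simple post-generation checks for construct-relevant constraints.
# textstat's flesch_kincaid_grade is a standard readability estimate; the
# grade tolerance and the affix-presence check are illustrative assumptions.
import textstat

def validate_item(item_text: str, target_grade: int, required_affix: str,
                  tolerance: float = 1.5) -> list[str]:
    """Return a list of human-readable problems; an empty list means the draft passes."""
    problems = []
    grade = textstat.flesch_kincaid_grade(item_text)
    if abs(grade - target_grade) > tolerance:
        problems.append(
            f"readability grade {grade:.1f} is far from target {target_grade}"
        )
    if required_affix.lower() not in item_text.lower():
        problems.append(
            f"intended affix '{required_affix}' does not appear in the item"
        )
    return problems

if __name__ == "__main__":
    draft = ("Which part of the word 'rebuild' means 'again'?\n"
             "A) re-  B) build  C) -d  D) -ild")
    print(validate_item(draft, target_grade=4, required_affix="re-"))
```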
Implications for Educational Technology
For real-world deployment, the study offers two practical workflows. One uses a large, off-the-shelf language model like GPT-3.5 for immediate, decent-quality item generation, though the output may lack domain-specific nuance without advanced prompting. The other, more cost-effective in the long run, uses a mid-scale model like Gemma, fine-tuned and guided by carefully designed multi-step prompts, which can outperform larger models on domain-oriented scores.
The researchers emphasize that a human-in-the-loop or LLM-assisted review phase remains vital to ensure appropriate difficulty alignment, morphological accuracy, and minimal distractor confusion. This comprehensive approach, combining automated metrics with expert judgment and large-model simulation, is crucial for developing robust AIG pipelines in K–12 educational settings.
For more detailed information, you can access the full research paper here.


