TL;DR: A new AI framework, YourBench4Edu, generates diverse and difficulty-adapted comprehension questions for K-2 English learners from various learning materials. It leverages large language models and a multi-step process (ingestion, summarization, segmentation, question generation) and shows state-of-the-art performance on the FairytaleQA dataset, aiming to support autonomous AI-driven English instructors.
Assessing how well young children understand what they read is a crucial part of their journey to becoming proficient readers. Traditionally, this might involve adults asking questions during story time, a method known as dialogic reading, which has been shown to significantly boost a child’s language development and comprehension. Building on this idea, researchers have developed an innovative AI-driven approach to generate comprehension questions specifically designed for kindergarten to second-grade English learners.
The new framework, called YourBench4Edu, is an adaptation of an existing system named YourBench, which was originally created for evaluating large language models (LLMs) in question answering. YourBench4Edu focuses on creating high-quality question-and-answer pairs from various learning materials, making it a valuable tool for educators to quickly prepare assessment content or for conversational AI agents to facilitate interactive reading experiences.
How YourBench4Edu Works
The process begins with the ‘ingestion’ component, which takes learning materials in different formats, such as PDFs or HTML files, and converts them into a standardized text format. Next, the ‘summarization’ component breaks down the text into smaller pieces, summarizes each part using a language model, and then integrates these summaries into a well-structured overview. Following this, the ‘segmentation’ step divides the text into relevant chunks, which serve as the basis for generating questions.
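The three preprocessing stages above can be sketched as plain functions. This is a minimal illustration of the ingest → summarize → segment flow, not the framework's actual API: the function names, the word-window chunking, and the `llm` callable are all assumptions made for the sketch.

```python
from typing import Callable, List


def ingest(raw: str) -> str:
    """Normalize source material (text already extracted from PDF/HTML here)
    into a standardized plain-text form. Real ingestion would also parse
    the original file formats; this sketch only normalizes whitespace."""
    return " ".join(raw.split())


def summarize(text: str, llm: Callable[[str], str], window: int = 200) -> str:
    """Summarize fixed-size pieces of the text with a language model,
    then ask the model to merge the partial summaries into one overview."""
    words = text.split()
    pieces = [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
    partials = [llm(f"Summarize for a young reader:\n{p}") for p in pieces]
    return llm("Combine these summaries into one overview:\n" + "\n".join(partials))


def segment(text: str, chunk_words: int = 80) -> List[str]:
    """Split the normalized text into the chunks that later serve as the
    basis for question generation (a simple word-count split here)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
```

In practice the `llm` callable would wrap a real model endpoint; a stub suffices to exercise the plumbing.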
Finally, the ‘question generation’ component uses a language model to create diverse questions. Educators or AI systems can customize the types of questions (e.g., true-false, factual, analytical), the difficulty level, and even the number of questions. The system can generate ‘single-shot’ questions, which are based on a single text chunk, or ‘multi-hop’ questions, which require information from multiple chunks, ensuring a comprehensive evaluation of understanding.
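A hedged sketch of how such a question-generation step might expose those knobs (question type, difficulty, count, single-shot vs. multi-hop). The parameter names and prompt wording are illustrative assumptions, not the framework's real interface:

```python
import random
from typing import Callable, List


def generate_questions(
    chunks: List[str],
    llm: Callable[[str], str],
    qtype: str = "factual",      # e.g. "true-false", "factual", "analytical"
    difficulty: str = "easy",
    n: int = 3,
    multi_hop: bool = False,
) -> List[str]:
    """Prompt a language model for n questions. Single-shot questions use
    one chunk; multi-hop questions combine two chunks so answering
    requires information from both."""
    questions = []
    for _ in range(n):
        if multi_hop and len(chunks) >= 2:
            first, second = random.sample(chunks, 2)
            context = first + "\n---\n" + second
        else:
            context = random.choice(chunks)
        prompt = (
            f"Write one {difficulty} {qtype} question for a K-2 reader "
            f"based on this passage:\n{context}"
        )
        questions.append(llm(prompt))
    return questions
```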
Validating the Approach
To test the effectiveness of YourBench4Edu, the researchers used the FairytaleQA dataset, a collection of narrative comprehension questions for students from kindergarten to eighth grade. The framework was adapted to generate questions based on given answers, a common scenario in assessment. The performance was measured using metrics like MAP@N with Rouge-L F1 and BERTScore F1, which compare the generated questions to human-created ones.
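Rouge-L F1, the core of one of those metrics, scores word overlap via the longest common subsequence (LCS) between a generated question and a human reference; MAP@N then aggregates such scores over the top-N generated candidates. A self-contained sketch of the Rouge-L F1 computation (simple whitespace tokenization, no stemming, which real implementations may add):

```python
from typing import List


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists
    (standard dynamic-programming table)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Rouge-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

An identical candidate and reference score 1.0; disjoint word sets score 0.0.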
The results showed that YourBench4Edu, when powered by various language models such as Llama-3.3-70B-Instruct, Qwen3-235B-A22B, and QwQ-32B, significantly outperformed previous methods on Rouge-L F1 while remaining competitive on BERTScore F1, demonstrating that it can produce state-of-the-art questions for early literacy assessment.
The Future of Reading Assessment
This novel approach holds significant promise for the future of education. By enabling the quick and easy generation of diverse, difficulty-adapted comprehension questions, YourBench4Edu has the potential to become a vital component of autonomous AI-driven English instructors. This could transform how reading comprehension is assessed, making it more dynamic, personalized, and effective for young learners. You can read the full research paper here: Question Generation for Assessing Early Literacy Reading Comprehension.