TLDR: EQGBench is a new benchmark designed to evaluate how well Large Language Models (LLMs) can generate high-quality educational questions in Chinese, especially for middle school mathematics, physics, and chemistry. Unlike previous methods, EQGBench uses a five-dimensional framework focusing on pedagogical value, not just text similarity. The study evaluated 46 LLMs, finding that while models are good at basic question structure, they significantly struggle with generating questions that foster higher-order thinking and real-world application, particularly in mathematics. The benchmark aims to guide future LLM development for educational purposes.
Large Language Models (LLMs) have shown incredible skill in solving complex problems, especially in mathematics. However, a new research paper highlights a significant challenge: moving from simply providing answers to generating high-quality educational questions. This shift is crucial for effective teaching and learning, but it remains an underexplored area for AI.
Traditional methods for automatic question generation (AQG) often focus on creating questions from existing answers or contexts. These methods typically rely on metrics like BLEU and ROUGE, which measure how similar the generated text is to a reference text. The problem is that these metrics don’t capture the true educational value of a question. An effective educational question should guide a student’s thinking, encourage problem-solving, and foster deeper understanding, not just test factual recall. Current evaluation methods can’t tell the difference between a simple recall question and a complex problem requiring multi-step reasoning.
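To make the limitation concrete, here is a minimal sketch (not from the paper) of a BLEU-style unigram-overlap score: a shallow recall question can overlap heavily with a reference question, while a richer multi-step problem overlaps less, so surface similarity says little about pedagogical value.

```python
# Toy illustration (not from the paper): a BLEU-style unigram-overlap score
# cannot tell a recall question from a multi-step reasoning question.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(count, ref[tok]) for tok, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

reference = "What is the area of a triangle with base 6 cm and height 4 cm"
recall_q = "What is the formula for the area of a triangle"
reasoning_q = ("A triangle and a rectangle share a base of 6 cm and have equal "
               "areas; if the rectangle is 2 cm tall, find the triangle's height")

print(unigram_precision(recall_q, reference))     # high overlap, shallow recall
print(unigram_precision(reasoning_q, reference))  # lower overlap, richer reasoning
```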
To address this critical gap, researchers have introduced EQGBench, a comprehensive benchmark specifically designed to evaluate LLMs’ performance in generating educational questions in Chinese. EQGBench is built on a carefully curated dataset of 900 evaluation samples, covering three core middle school subjects: mathematics, physics, and chemistry. The dataset includes diverse user requests, simulating real-world educational scenarios from the perspectives of teachers, students, and parents, with varying knowledge points, difficulty levels, and question types.
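The post doesn’t give the benchmark’s exact data schema, but each evaluation sample can be pictured roughly as a record like the one below; all field names and values are illustrative assumptions, not the actual format.

```python
# Hypothetical sketch of one EQGBench evaluation sample; field names and
# values are illustrative assumptions, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class EvaluationSample:
    subject: str          # "mathematics", "physics", or "chemistry"
    requester: str        # simulated persona: "teacher", "student", or "parent"
    knowledge_point: str  # topic the generated question must cover
    difficulty: str       # requested difficulty level
    question_type: str    # e.g., "multiple-choice", "fill-in-the-blank"
    user_request: str     # natural-language request handed to the LLM

sample = EvaluationSample(
    subject="mathematics",
    requester="teacher",
    knowledge_point="linear equations in one variable",
    difficulty="medium",
    question_type="multiple-choice",
    user_request=("Write a medium-difficulty multiple-choice question on "
                  "linear equations in one variable for an eighth-grade class."),
)
```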
The benchmark employs a unique five-dimensional evaluation framework that aligns deeply with educational objectives. These dimensions include:
- Knowledge Point Alignment (KP): ensures the generated question accurately reflects the specified topic.
- Question Type Alignment (QT): checks whether the question type (e.g., multiple-choice, fill-in-the-blank) matches the user’s request and follows standard formatting.
- Question Item Quality (QQ): assesses the clarity, unambiguous objectives, correct terminology, and solvability of the question.
- Solution Explanation Quality (SQ): evaluates the correctness, rigor, and completeness of the provided explanation, ensuring it’s appropriate for the target academic level.
- Competence-Oriented Guidance (CG): measures whether the question integrates realistic scenarios or cultural contexts, guiding students to apply knowledge and develop higher-order thinking skills.
Each dimension is scored on a three-level scale: Excellent (2), Good (1), and Poor (0).
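As a rough sketch of how the rubric’s scores might be represented and summarized (the 0/1/2 scale comes from the benchmark, but averaging the dimensions into one number is an assumption for illustration, not the paper’s reported aggregation):

```python
# Illustrative representation of the five-dimensional rubric; the 0/1/2 scale
# follows the benchmark, but the averaging step is an assumption.
DIMENSIONS = ("KP", "QT", "QQ", "SQ", "CG")
SCALE = {"Poor": 0, "Good": 1, "Excellent": 2}

def aggregate(scores: dict) -> float:
    """Average the per-dimension scores (each 0, 1, or 2) into one number."""
    assert set(scores) == set(DIMENSIONS), "one score per dimension"
    assert all(v in SCALE.values() for v in scores.values()), "scores must be 0, 1, or 2"
    return sum(scores.values()) / len(DIMENSIONS)

example = {"KP": 2, "QT": 2, "QQ": 1, "SQ": 1, "CG": 0}
print(aggregate(example))  # 1.2 on the 0-2 scale
```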
The study conducted a systematic evaluation of 46 mainstream LLMs, including popular models from the ChatGPT, DeepSeek, and GLM series, with parameter sizes ranging from 7 billion to hundreds of billions. DeepSeek-R1 was used as the evaluator model, with a multi-round voting mechanism to ensure reliability.
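The post doesn’t spell out the voting protocol, but a multi-round voting mechanism of this kind typically means asking the evaluator model to score the same output several times and keeping the majority verdict per dimension; the sketch below assumes a hypothetical `judge_once` helper that wraps one call to the evaluator (e.g., DeepSeek-R1).

```python
# Hedged sketch of multi-round voting for one evaluation dimension; judge_once
# is a hypothetical wrapper around a single evaluator-model call.
from collections import Counter

def judge_once(generated_question: str, dimension: str) -> int:
    """Hypothetical: ask the evaluator LLM to score one dimension (0, 1, or 2)."""
    raise NotImplementedError("call the evaluator model here")

def voted_score(generated_question: str, dimension: str, rounds: int = 3) -> int:
    """Collect several independent judgments and keep the most common score."""
    votes = [judge_once(generated_question, dimension) for _ in range(rounds)]
    return Counter(votes).most_common(1)[0][0]
```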
The experimental results revealed several key insights. Models generally performed well on fundamental understanding tasks like knowledge point and question type alignment, showing strong ability to recognize and map basic question structures. However, in tasks requiring higher reasoning and logical ability, such as question item quality and solution explanation quality, there was a clear performance stratification, with larger general-purpose models like Doubao-1.5-thinking-pro and DeepSeek-R1 outperforming others. A significant finding was that the competence-oriented guidance dimension was the weakest across all models, especially in mathematics. This suggests that LLMs currently lack a strong ability to understand the deeper educational intent behind question design, particularly in more abstract subjects.
A human study involving six experienced middle school mathematics teachers validated the automated evaluation framework, showing high consistency between human and AI scores. This confirms the reliability and effectiveness of EQGBench.
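The summary doesn’t name the agreement statistic used, but one standard way to check this kind of human-model consistency is a rank correlation over matched scores, as in this illustrative sketch (the ratings are made up):

```python
# Illustrative consistency check between human and automated scores using
# Spearman rank correlation; the ratings below are made-up placeholders.
from scipy.stats import spearmanr

human_scores = [2, 1, 2, 0, 1, 2, 1, 0]   # hypothetical teacher ratings
model_scores = [2, 1, 1, 0, 1, 2, 1, 0]   # hypothetical evaluator-model ratings

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```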
In conclusion, EQGBench serves as a valuable resource for the academic community, highlighting the current strengths and significant areas for improvement in LLMs for educational question generation. While leading models possess strong foundational capabilities, they still struggle to generate questions with deep pedagogical intent and real-world applicability. This benchmark is expected to guide future optimization of LLMs for educational purposes. You can read the full paper here.


