
FarsiMCQGen: Advancing Persian Multiple-Choice Question Generation

TLDR: FarsiMCQGen is a new framework for automatically generating high-quality multiple-choice questions (MCQs) in Persian. Developed by researchers at Amirkabir University of Technology, it uses a fine-tuned Transformer model for question generation and a candidate generation, filtering, and ranking pipeline (drawing on language models, word embeddings, and a knowledge graph) to create plausible distractors. The framework also introduces a novel 10,289-question Persian MCQ dataset. Evaluations, both automated with LLMs and human-based, confirm the validity and effectiveness of the generated questions and choices, offering a significant resource for Persian language education and NLP research.

Creating effective multiple-choice questions (MCQs) is a cornerstone of educational assessment, offering an efficient way to gauge a learner’s understanding. However, this task becomes particularly challenging when dealing with low-resource languages like Persian, where specialized tools and datasets are scarce. Manual MCQ generation is also a time-consuming process that demands significant expertise.

Addressing this challenge, researchers Mohammad Heydari Rad, Rezvan Afari, and Saeedeh Momtazi from Amirkabir University of Technology have introduced FarsiMCQGen, an innovative framework designed to automatically generate high-quality Persian-language MCQs. This new approach aims to streamline the creation of educational content and support language learning in Persian.

The FarsiMCQGen framework operates through a multi-stage process that combines Transformer-based natural language processing models with rule-based methods. It generates not just the questions but also credible ‘distractors’, the incorrect answer choices designed to challenge test-takers.

The system’s architecture is divided into two main components: question generation and wrong choice (distractor) generation.

Generating Questions

For question generation, FarsiMCQGen uses a fine-tuned mT5-based model. The model is trained on PQuAD, a large-scale Persian question-answering dataset derived from Wikipedia. Given an answer and the passage it comes from, the model learns to formulate a contextually relevant question.
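As a rough illustration of the answer-plus-passage input described above, the sketch below builds a single source string for a sequence-to-sequence model. The `answer: ... context: ...` layout and the function name `build_qg_input` are assumptions made for this example; the article does not state the prompt format FarsiMCQGen actually uses, and the English sentence stands in for Persian text.

```python
def build_qg_input(context: str, answer: str) -> str:
    """Build one source string for answer-aware question generation.

    The "answer: ... context: ..." layout is an assumed convention,
    not the format documented for FarsiMCQGen.
    """
    if answer not in context:
        # The answer is expected to be a span taken from the passage.
        raise ValueError("answer must appear verbatim in the context")
    return f"answer: {answer} context: {context}"


# Illustrative English example; the real system works on Persian text.
qg_input = build_qg_input("Tehran is the capital of Iran.", "Tehran")
print(qg_input)
```

A fine-tuned seq2seq model would take a string like this as input and be trained to output the matching question from the dataset.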

Crafting Distractors

The generation of wrong choices is a critical aspect of creating effective MCQs. FarsiMCQGen employs a three-step process for this:

1. Candidate Generation: This involves two methods. The first uses a ‘fill-mask’ technique with various Transformer-based language models (like ParsBERT and ALBERT-Persian). It takes a complete answer sentence, masks the correct answer, and then predicts plausible alternatives. The second method identifies words semantically similar to the correct answer using GloVe and Word2Vec embeddings, which are trained on a Persian Wikipedia corpus.

2. Filtering: To ensure quality and efficiency, unsuitable candidates are filtered out. This includes a Part-Of-Speech (POS) filter to ensure grammatical consistency, a Written Form filter to standardize numerical representations (e.g., converting ‘2’ to ‘two’ or vice versa), and a Named Entity Recognition (NER) filter to match entity types (e.g., when the correct answer is a person’s name, the distractors must also be person names).

3. Ranking and Selection: The filtered candidates are then ranked using two similarity measures. The Knowledge Graph Embedding Similarity measure uses FarsWikiKG, a Persian knowledge graph, to assess relationships between entities. The BERT Similarity measure calculates the semantic similarity between the correct answer and each distractor within the given context. The top three candidates, based on a combined score from these two measures, are selected as the final wrong choices.
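The three steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the hand-written tags and 2-dimensional vectors stand in for real POS/NER taggers, FarsWikiKG embeddings, and BERT contextual embeddings; the equal weighting of the two similarity scores is assumed, since the article only says the scores are combined; the Written Form filter is omitted; and English words are used for readability.

```python
import math


def mask_answer(sentence, answer, mask_token="[MASK]"):
    """Step 1 (fill-mask input): replace the first occurrence of the answer."""
    if answer not in sentence:
        raise ValueError("answer must appear in the sentence")
    return sentence.replace(answer, mask_token, 1)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(candidates, answer, tags):
    """Step 2: keep candidates whose (POS, NER) tags match the answer's."""
    target = tags[answer]
    return [c for c in candidates if c != answer and tags.get(c) == target]


def rank_candidates(candidates, answer, kg_vecs, ctx_vecs, weight=0.5, top_k=3):
    """Step 3: rank by a weighted sum of two similarities (weight is assumed)."""
    def score(c):
        kg = cosine(kg_vecs[answer], kg_vecs[c])
        ctx = cosine(ctx_vecs[answer], ctx_vecs[c])
        return weight * kg + (1 - weight) * ctx
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Hypothetical toy data standing in for tagger output and learned embeddings.
answer = "Tehran"
candidates = ["Shiraz", "Hafez", "Isfahan", "large", "Tabriz", "Mashhad", "Tehran"]
tags = {
    "Tehran": ("NOUN", "LOC"), "Shiraz": ("NOUN", "LOC"),
    "Isfahan": ("NOUN", "LOC"), "Tabriz": ("NOUN", "LOC"),
    "Mashhad": ("NOUN", "LOC"), "Hafez": ("NOUN", "PER"),
    "large": ("ADJ", "O"),
}
kg_vecs = {"Tehran": [1.0, 0.0], "Shiraz": [0.9, 0.1], "Isfahan": [0.7, 0.3],
           "Tabriz": [0.5, 0.5], "Mashhad": [0.1, 0.9]}
ctx_vecs = {"Tehran": [1.0, 0.0], "Shiraz": [0.8, 0.2], "Isfahan": [0.9, 0.1],
            "Tabriz": [0.5, 0.5], "Mashhad": [0.2, 0.8]}

masked = mask_answer("Tehran is the capital of Iran.", answer)
kept = filter_candidates(candidates, answer, tags)
distractors = rank_candidates(kept, answer, kg_vecs, ctx_vecs)
print(masked)
print(kept)
print(distractors)
```

In this toy run, the person name and the adjective are dropped by the tag filter, and the three location names closest to the answer under the combined score are kept as distractors. In a real pipeline, the masked sentence would be passed to a fill-mask language model to propose candidates, and the vectors would come from trained models rather than being written by hand.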

The research also introduces a new Persian MCQ dataset comprising 10,289 questions, categorized by both type (e.g., ‘What’, ‘When’, ‘Where’) and content (e.g., History, Technology, Science, Politics). This dataset serves as a valuable resource for further research and development in Persian NLP.

To validate the quality of the generated questions and distractors, both automated and human evaluations were conducted. Several state-of-the-art large language models (LLMs) were tested, with models like Qwen2.5-14B-Instruct and Meta-Llama-3.1-8B-Instruct showing strong performance. Human evaluators assessed a sample of 200 questions, confirming that 97.5% of the questions and options were logically valid, and 94.5% of the wrong choices were effectively distractive.

This work marks a significant step forward in automatic MCQ generation for the Persian language, offering a robust framework and a high-quality dataset that can inspire future advancements in educational technology and language processing. For more details, you can refer to the full research paper: FARSIMCQGEN: A PERSIAN MULTIPLE-CHOICE QUESTION GENERATION FRAMEWORK.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
