AI-Driven Question Generation: A New Era for Text-Based Evaluation

TLDR: This research introduces an Automatic Question Answer Generation (AQAG) system using a fine-tuned Meta-Llama 2-7B Large Language Model. The system aims to simplify the creation of diverse text-based assessments for educators and provide self-evaluation tools for students. By leveraging prompt engineering and training on the RACE dataset, the model generates various question types (MCQ, conceptual, factual) and their answers, demonstrating its potential to save time and resources in educational and professional evaluation processes, despite facing hardware limitations and some biases in question generation.

In today’s fast-paced educational landscape, the process of creating effective and fair assessments for students is more crucial than ever. However, educators often face significant challenges in manually developing diverse sets of questions from extensive lecture materials. This time-consuming task can limit the variety of questions and delay valuable feedback for students. Recognizing this critical need, a recent research paper explores an innovative solution: Automatic Question Answer Generation (AQAG) powered by advanced Artificial Intelligence.

Authored by A.S.M Mehedi Hasan, Md. Alvee Ehsan, Kefaya Benta Shahnoor, and Syeda Sumaiya Tasneem from Brac University, this study introduces a system designed to streamline the assessment process. Their paper, titled “Automatic Question & Answer Generation Using Generative Large Language Model (LLM)”, aims to free up time and resources for educators and others involved in text-based evaluations.

The Challenge of Manual Assessment

The traditional method of creating questions, whether multiple-choice, conceptual, or factual, requires instructors to sift through numerous lecture materials. This manual effort often leads to a lack of diversity in questions and can be a significant burden. From a student’s perspective, timely feedback is essential for self-evaluation and identifying areas for improvement before formal assessments. Beyond education, the corporate world also grapples with the challenge of designing unbiased and effective assessments for hiring, where prioritizing skills over CV information is key.

AI’s Role in Transforming Q&A Generation

The researchers propose using Large Language Models (LLMs) to automate this process. Their system is built on a fine-tuned generative LLM, specifically the Meta-Llama 2-7B model. The core idea is to fine-tune this model on a dataset of reading comprehension passages and their questions, enabling it to create new, contextually relevant questions and their corresponding answers.

A key technique employed in this research is Prompt Engineering. This involves carefully crafting instructions and examples for the AI model to guide it towards generating questions in a preferred style, such as multiple-choice questions (MCQs), or questions that test conceptual or factual understanding. This ensures that the generated questions align with the instructor’s specific requirements.
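
To make the idea concrete, here is a minimal sketch of how a style-specific prompt might be assembled. The section markers, the wording in `STYLE_INSTRUCTIONS`, and the `build_prompt` helper are illustrative assumptions; the article does not reproduce the prompts used in the paper.

```python
# Hypothetical instruction text per question style (not taken from the paper).
STYLE_INSTRUCTIONS = {
    "mcq": "Write one multiple-choice question with four options labelled A-D, "
           "then state the correct option.",
    "conceptual": "Write one question that tests understanding of the main idea, "
                  "then give a short answer.",
    "factual": "Write one question about a specific fact stated in the passage, "
               "then give the answer.",
}


def build_prompt(passage: str, style: str) -> str:
    """Combine a style instruction and a source passage into one prompt string."""
    if style not in STYLE_INSTRUCTIONS:
        raise ValueError(f"unknown question style: {style!r}")
    return (
        "### Instruction:\n"
        f"{STYLE_INSTRUCTIONS[style]}\n\n"
        "### Passage:\n"
        f"{passage.strip()}\n\n"
        "### Response:\n"
    )


if __name__ == "__main__":
    print(build_prompt("Water boils at 100 degrees Celsius at sea level.", "factual"))
```

Keeping the instruction separate from the passage means an instructor could switch question styles without touching the source text.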

Behind the Scenes: How the Model Works

The system’s development involved several steps. First, the Meta-Llama 2-7B model, which has 7 billion parameters, was chosen as the foundation. This model was then fine-tuned using the RACE dataset, which here comprises 10,000 reading comprehension passages and 40,000 questions derived from English tests given to middle and high school students in China. This dataset helped the model learn how questions and answers relate to their source passages.

To let the model run on systems with limited resources, the researchers applied a technique called 4-bit quantization. It stores the model’s weights at much lower numerical precision, significantly reducing the memory footprint without a major loss in performance. Additionally, the input text was split into smaller units called “tokens” through tokenization, which is the form in which a language model reads text.
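
The effect of 4-bit quantization can be illustrated with a small sketch. It uses a simple uniform "absmax" scheme with one shared scale, which is an assumption chosen for readability; the article does not say which quantization scheme the researchers used, and production libraries typically use more refined variants. The sample weights are made up, and the memory figures are back-of-the-envelope estimates that ignore any overhead.

```python
def quantize_4bit(weights):
    """Map floats to signed integers in [-7, 7] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


# Made-up example weights, for illustration only.
weights = [0.12, -0.53, 0.98, -1.40, 0.03]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Rough weight storage for 7 billion parameters.
params = 7_000_000_000
fp16_gb = params * 2 / 1e9    # 16 bits = 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight

if __name__ == "__main__":
    print("codes:", codes)
    print("max rounding error:", max_error)
    print(f"fp16: {fp16_gb} GB, 4-bit: {int4_gb} GB")
```

Each weight is rounded to the nearest of 15 levels, so the rounding error stays within half a step (`scale / 2`), while storage per weight drops to a quarter of the 16-bit size.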

Evaluating the System’s Performance

The fine-tuned model’s performance was evaluated using several metrics. One important measure was “perplexity,” which assesses how well a language model predicts a text sample. A lower perplexity score means the model assigns higher probability to the text it is shown, so it is less “surprised” by it. The custom fine-tuned Llama-2 model achieved a perplexity score of 6.43 on a standard test set, indicating that it models this kind of text reasonably well.
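
As a rough illustration, perplexity is the exponential of the average negative log-probability the model assigns to each token. The sketch below assumes natural-log probabilities are already available; the probability values are invented for the example and do not come from the paper.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean(log p)) for a sequence of per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


# Invented per-token probabilities for two hypothetical models on the same text.
confident = [math.log(p) for p in (0.5, 0.4, 0.6, 0.5)]
uncertain = [math.log(p) for p in (0.1, 0.05, 0.2, 0.1)]

if __name__ == "__main__":
    print("confident model:", perplexity(confident))
    print("uncertain model:", perplexity(uncertain))
```

A model that spreads its probability evenly over four candidate tokens at every step has a perplexity of exactly 4, so a score of 6.43 can be read loosely as the model being about as uncertain as a choice among six or seven equally likely tokens.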

The researchers also measured the relevance of the generated questions to their source articles using a “cosine similarity” score, which quantifies how semantically similar two pieces of text are. Questions consistently showed good relevance scores, indicating they were pertinent to the context. Furthermore, the quality of multiple-choice options was assessed by measuring their similarity to the correct answer, as a check that the incorrect options were plausible rather than obviously wrong.
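
The relevance check can be sketched as follows. This example represents each text as a simple word-count vector, which is an assumption made to keep the code dependency-free; the article does not say how the researchers turned text into vectors, and a real system would more likely use learned sentence embeddings. The sample passage and questions are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example texts.
article = "The river floods every spring when mountain snow melts."
on_topic = "Why does the river flood every spring?"
off_topic = "Who invented the telephone?"

if __name__ == "__main__":
    art = bag_of_words(article)
    print("on-topic: ", cosine_similarity(art, bag_of_words(on_topic)))
    print("off-topic:", cosine_similarity(art, bag_of_words(off_topic)))
```

The on-topic question shares more words with the passage and therefore scores higher; the same comparison could be applied between each multiple-choice option and the correct answer.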

Impact and Future Directions

This AQAG system has potential in several sectors. In education, it can help faculty members quickly generate diverse assessments and give students tools for self-practice and evaluation. In the corporate world, it could assist in creating unbiased skill-based assessments for job applicants. The research also contributes to a deeper understanding of the capabilities of newer LLMs like Llama-2, which Meta released openly in July 2023.

While promising, the project encountered challenges, including hardware limitations that necessitated the use of a 4-bit quantized model, potentially affecting peak performance. The study also noted some limitations, such as a bias towards generating conceptual questions due to the training data and the current inability to handle analytical questions or to take PDF documents directly as input. Future work aims to address these by exploring newer LLMs, investigating biases using Explainable AI (XAI) techniques, and expanding the model’s capabilities to include multilingual support and analytical question generation.

Ultimately, this research paves the way for a more dynamic and efficient approach to text-based evaluation, promising significant benefits across educational and professional domains.
