TLDR: This research explores how Large Language Models (LLMs) can automatically score and provide feedback for student text answers in academic settings. Testing five different evaluation methods, the study found that “Reference Aided Evaluation,” which uses a correct answer as a guide, performed best, closely matching human evaluations. Other methods struggled with consistency, context, or flexibility, highlighting the importance of providing LLMs with clear reference information for accurate and insightful student assessment.
Large Language Models (LLMs) are rapidly changing how we think about automation, and one area showing significant promise is academic evaluation. A recent research paper examines how well instruction-based LLMs can score and judge student answers to open-ended text questions, known as Text-Input Problems (TIPs), in an academic setting. The goal is to give students useful, automated, and personalized feedback.
Exploring Different Evaluation Systems
The researchers proposed and tested five distinct evaluation systems using a custom dataset of 110 computer science answers from higher education students. These systems were compared against evaluations provided by a human expert. The models primarily used were Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B, alongside a fine-tuned model called JudgeLM.
Here’s a breakdown of the evaluation methods:
- JudgeLM Evaluation: This method used a model specifically fine-tuned for judging, employing a single-answer prompt to obtain a score.
- Reference Aided Evaluation: This system used a correct answer as a guide, in addition to the original context of the question, to generate an evaluation (a prompt-construction sketch follows this list).
- No Reference Evaluation: This approach relied solely on the question’s context to produce an evaluation, without a reference answer.
- Additive Evaluation: This method involved adding up points based on whether atomic (individual) criteria were met.
- Adaptive Evaluation: This was a two-step process where criteria were first generated specifically for each question, and then the evaluation was performed using these custom criteria.
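The paper's exact prompts are not reproduced in this article, so the following is a minimal sketch, assuming hypothetical template strings and a hypothetical `build_prompt` helper, of how a Reference Aided prompt differs from a No Reference prompt: both include the question and its context, but only the former supplies the correct answer as a grading guide.

```python
# Minimal sketch (assumed templates, not the paper's actual prompts) of how
# the Reference Aided and No Reference evaluation prompts might be assembled.

REFERENCE_AIDED_TEMPLATE = """You are grading a student's answer.
Question: {question}
Course context: {context}
Reference (correct) answer: {reference_answer}
Student answer: {student_answer}

Give a score from 0 to 10 and a short justification."""

NO_REFERENCE_TEMPLATE = """You are grading a student's answer.
Question: {question}
Course context: {context}
Student answer: {student_answer}

Give a score from 0 to 10 and a short justification."""

def build_prompt(question, context, student_answer, reference_answer=None):
    """Assemble the evaluation prompt; include the reference answer when one is available."""
    if reference_answer is not None:
        return REFERENCE_AIDED_TEMPLATE.format(
            question=question,
            context=context,
            reference_answer=reference_answer,
            student_answer=student_answer,
        )
    return NO_REFERENCE_TEMPLATE.format(
        question=question, context=context, student_answer=student_answer
    )

# The resulting prompt would then be sent to an instruction-tuned model such as
# Llama-3.1-8B through whatever inference client is in use, e.g. (placeholder):
# response = llm_client.generate(build_prompt(q, ctx, student_ans, ref_ans))
```

In the same spirit, the Adaptive method would add an initial call that asks the model to generate grading criteria for the specific question, and the Additive method would score each atomic criterion separately and sum the points, though neither variant is shown here.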
Key Findings: The Power of Reference Aided Evaluation
The study concluded that the Reference Aided Evaluation method was the most effective for automatically evaluating and scoring Text-Input Problems using LLMs. This method achieved the lowest median absolute deviation (0.945) and root mean square deviation (1.214) when compared to human evaluation, indicating fair scoring and comprehensive feedback. The inclusion of a reference answer proved vital for accurate assessments, guiding the LLM on what constitutes a good answer and the expected depth of understanding.
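As a rough illustration of how such agreement figures are computed (the paper's evaluation code is not part of this article, and the scores below are made up), the deviations compare each LLM-assigned score with the human score for the same answer:

```python
# Minimal sketch of the two agreement metrics named above: median absolute
# deviation and root mean square deviation between LLM and human scores.
import numpy as np

def median_absolute_deviation(llm_scores, human_scores):
    """Median of |llm - human| over all graded answers."""
    diffs = np.abs(np.asarray(llm_scores) - np.asarray(human_scores))
    return float(np.median(diffs))

def root_mean_square_deviation(llm_scores, human_scores):
    """sqrt(mean((llm - human)^2)) over all graded answers."""
    diffs = np.asarray(llm_scores) - np.asarray(human_scores)
    return float(np.sqrt(np.mean(diffs ** 2)))

# Hypothetical scores on a 0-10 scale for four answers.
llm = [7.0, 4.5, 9.0, 6.0]
human = [8.0, 5.0, 9.0, 4.5]
print(median_absolute_deviation(llm, human))   # 0.75
print(root_mean_square_deviation(llm, human))  # ~0.94
```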
The other methods faced various challenges. JudgeLM struggled because of its limited context length and its difficulty adapting a judging framework built for comparing LLM outputs to the task of assessing student answers. The Additive and Adaptive Evaluation methods, while aiming for customization, produced poor results for short, concise answers, often being too restrictive or over-reliant on the exact phrasing of a single reference. No Reference Evaluation, as expected, lacked the information needed for consistently correct assessments.
Implications for Education
The research highlights the significant potential of AI-driven automatic evaluation systems, particularly when supported by robust methodologies like Reference Aided Evaluation. These systems can serve as valuable complementary tools in academic settings, enhancing the learning experience by providing timely and consistent feedback to students. While Llama-3.1-8B generally yielded the more satisfactory results of the two models tested, the study suggests that larger and more specialized LLMs could offer even greater performance.
The findings underscore the importance of careful prompt design and the provision of relevant contextual information, such as reference answers, to maximize the accuracy and utility of LLM-based evaluators. This work paves the way for future advancements in automated academic assessment, potentially transforming how educators provide feedback and how students learn. For more detailed insights, you can read the full research paper here.


