
AI Assists in Grading Handwritten Chemistry Exams: A Hybrid Approach for Efficiency and Accuracy

TLDR: A study explored using AI to grade handwritten general chemistry exams, comparing AI scores to human grading across various question types. The AI system, gpt-o4-mini/high-vision, proved highly efficient and cost-effective, excelling in grading textual and chemical reaction questions. However, it showed lower reliability for numerical and graphical tasks, necessitating human oversight. The research proposes filtering strategies, including partial-credit thresholds, risk-based Bayesian methods, and problem-type exclusions, to optimize AI-human grading alignment. The findings suggest a hybrid approach where AI handles routine tasks, freeing human graders for complex cases, while emphasizing transparency and student trust in AI-enhanced assessment.

The challenge of grading handwritten, open-ended exams in large university courses is significant, often leading instructors to opt for simpler, closed-answer formats. However, these formats can limit the assessment of higher-order thinking and reasoning skills crucial in subjects like chemistry. A recent study explores how artificial intelligence (AI) can assist in grading these complex exams, aiming to balance efficiency with accuracy.

Researchers Jan Cvengros and Gerd Kortemeyer investigated the effectiveness and reliability of an AI-based grading system for a handwritten general chemistry exam at ETH Zurich. The exam included a variety of question types, such as chemical reaction equations, short and long open-ended answers, numerical and symbolic derivations, and even drawing and sketching tasks. Both student exam pages and grading rubrics were uploaded as images to the AI system.

The study utilized OpenAI’s multimodal reasoning model, gpt-o4-mini/high-vision, to grade the exams page by page. This process involved feeding the AI images of student work alongside the corresponding rubric. The efficiency gains were remarkable: grading 296 student exams took the AI approximately three hours and cost about $100 in tokens, compared to an estimated $2,250 for human teaching assistants (TAs) to grade the same number of exams. This highlights AI’s potential to make frequent formative and summative assessments more feasible without overwhelming instructional resources.
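The reported figures imply a large per-exam cost gap; a quick back-of-the-envelope check using only the numbers quoted above (the token cost is the study's estimate, not a fixed API rate):

```python
# Per-exam cost comparison from the figures reported in the article.
n_exams = 296
ai_cost_total = 100.0    # USD in API tokens (approximate, per the study)
ta_cost_total = 2250.0   # USD, estimated human TA grading cost

ai_per_exam = ai_cost_total / n_exams   # ~$0.34 per exam
ta_per_exam = ta_cost_total / n_exams   # ~$7.60 per exam

print(f"AI: ${ai_per_exam:.2f}/exam, TAs: ${ta_per_exam:.2f}/exam, "
      f"ratio: {ta_per_exam / ai_per_exam:.1f}x")
```

On these numbers, AI grading comes out roughly 22 times cheaper per exam, before accounting for the time needed for human review of flagged items.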

AI’s Performance: Strengths and Weaknesses

The AI system showed high agreement with human graders for textual questions and chemical reaction equations. This suggests that AI can effectively handle routine assessment tasks, potentially freeing up human graders to focus on more nuanced instructional activities. However, the AI demonstrated lower reliability for numerical and graphical tasks. For instance, it sometimes misread sketches, misinterpreted labels on diagrams, or failed to correctly identify markings in multiple-choice questions. There were also instances of “false positives,” where the AI awarded points for incorrect answers, and “false negatives,” where it incorrectly deducted points.

Interestingly, the AI was sometimes stricter or more consistent than human TAs. In one case, it correctly flagged an incorrect answer that a TA might have misread due to handwriting, highlighting a potential advantage in consistency and reduced susceptibility to perceptual errors.

Improving Reliability with Filtering Strategies

While the raw performance of the AI might be acceptable for low-stakes quizzes, it was deemed insufficient for high-stakes exams. To address this, the researchers explored several filtering strategies to improve the reliability of AI-assigned scores:

  • Partial-Credit Threshold: By setting a minimum threshold for accepting AI-assigned partial credit (e.g., only trusting scores above 50% or full credit), the precision of the AI’s grading improved significantly. This means that when the AI did assign a score, it was more likely to be correct, though it also meant more items were flagged for human review.

  • Risk-Based Filter: This method used Bayesian statistics and Item Response Theory (IRT) to estimate the probability of a student correctly solving a problem part. The AI’s score was then compared to this expectation, and only judgments within a certain “risk” tolerance were accepted. This approach proved efficient in aligning scores and reducing TA workload, though its complexity might be less transparent to students.

  • Problem-Type Filter: The most straightforward approach involved simply excluding problem parts with graphical components (drawing and graphing) from AI grading, leaving them entirely to human graders. This significantly improved overall reliability, especially when combined with other filters.

The study found that combining these filters, particularly excluding graphical problems and applying a partial-credit or risk-based threshold, led to substantial improvements in the alignment between AI and human grading. For instance, when only textual problems were considered and a 50% partial-credit threshold was applied, the AI’s total scores aligned much more faithfully with human scores.
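The combined accept-or-flag logic of these filters can be sketched as a single function. This is a minimal illustration, not the authors' implementation: the type names, thresholds, and the simplified risk check (a deviation bound standing in for the paper's Bayesian IRT estimates) are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AIJudgment:
    problem_type: str         # e.g. "textual", "reaction", "numerical", "graphical"
    ai_fraction: float        # AI-assigned credit as a fraction of full points (0.0-1.0)
    expected_fraction: float  # prior expectation of credit (e.g. from an IRT model)

def filter_judgment(j: AIJudgment,
                    partial_threshold: float = 0.5,
                    risk_tolerance: float = 0.4,
                    excluded_types: frozenset = frozenset({"graphical"})) -> Optional[float]:
    """Return the accepted AI score, or None to flag the item for human review."""
    # Problem-type filter: never auto-accept excluded (e.g. graphical) items.
    if j.problem_type in excluded_types:
        return None
    # Partial-credit threshold: only trust scores at or above the cutoff
    # (full-credit answers always pass this check).
    if j.ai_fraction < partial_threshold:
        return None
    # Risk-based filter (simplified): flag judgments that deviate too far
    # from the model-based expectation for this student and item.
    if abs(j.ai_fraction - j.expected_fraction) > risk_tolerance:
        return None
    return j.ai_fraction

# A full-credit textual answer close to expectation is accepted;
# a graphical item is always routed to a human grader.
print(filter_judgment(AIJudgment("textual", 1.0, 0.8)))    # 1.0
print(filter_judgment(AIJudgment("graphical", 1.0, 0.9)))  # None
```

Anything returning `None` lands in the human-review queue, which matches the hybrid workflow the study recommends: the stricter the thresholds, the higher the precision of accepted scores, at the cost of more items flagged for TAs.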


Implications for Education

The findings emphasize the necessity for human oversight to ensure grading accuracy, especially for complex or ambiguous student responses. The researchers recommend a hybrid approach: initially deploying AI grading for low-stakes assessments to build confidence, and for high-stakes exams, confidently accepting AI-graded full-credit responses while reserving partially correct or ambiguous answers for human review. This allows teaching assistants to focus their expertise on nuanced student misconceptions and provide deeper feedback.

Integrating AI into grading workflows also raises important considerations about student perceptions of fairness and trust. Clear communication with students about the role of AI in grading, along with continuous monitoring and recalibration of AI models, will be crucial for successful adoption. Ultimately, AI-assisted grading offers a promising pathway to maintain open-ended exam questions even with increasing student numbers and stagnant resources, enhancing overall educational quality by freeing up instructors for more meaningful interactions. For more details, you can read the full paper here.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
