TLDR: A study explored using AI to grade handwritten general chemistry exams, comparing AI scores to human grading across various question types. The AI system, gpt-o4-mini/high-vision, proved highly efficient and cost-effective, excelling in grading textual and chemical reaction questions. However, it showed lower reliability for numerical and graphical tasks, necessitating human oversight. The research proposes filtering strategies, including partial-credit thresholds, risk-based Bayesian methods, and problem-type exclusions, to optimize AI-human grading alignment. The findings suggest a hybrid approach where AI handles routine tasks, freeing human graders for complex cases, while emphasizing transparency and student trust in AI-enhanced assessment.
The challenge of grading handwritten, open-ended exams in large university courses is significant, often leading instructors to opt for simpler, closed-answer formats. However, these formats can limit the assessment of higher-order thinking and reasoning skills crucial in subjects like chemistry. A recent study explores how artificial intelligence (AI) can assist in grading these complex exams, aiming to balance efficiency with accuracy.
Researchers Jan Cvengros and Gerd Kortemeyer investigated the effectiveness and reliability of an AI-based grading system for a handwritten general chemistry exam at ETH Zurich. The exam included a variety of question types, such as chemical reaction equations, short and long open-ended answers, numerical and symbolic derivations, and even drawing and sketching tasks. Both student exam pages and grading rubrics were uploaded as images to the AI system.
The study utilized OpenAI’s multimodal reasoning model, gpt-o4-mini/high-vision, to grade the exams page by page. This process involved feeding the AI images of student work alongside the corresponding rubric. The efficiency gains were remarkable: grading 296 student exams took the AI approximately three hours and cost about $100 in tokens, compared to an estimated $2,250 for human teaching assistants (TAs) to grade the same number of exams. This highlights AI’s potential to make frequent formative and summative assessments more feasible without overwhelming instructional resources.
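To make the page-by-page workflow concrete, here is a minimal sketch of how a scanned exam page and its rubric could be sent to a multimodal model through the OpenAI Python SDK. The model identifier, reasoning-effort setting, prompt wording, and file names are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: send one rubric page and one scanned student page to a
# vision-capable reasoning model. Model name, prompt, and file names are
# assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def to_data_url(path: str) -> str:
    """Encode a scanned page as a base64 data URL for the vision input."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


response = client.chat.completions.create(
    model="o4-mini",            # assumed model id corresponding to the study's gpt-o4-mini
    reasoning_effort="high",    # assumed to mirror the "/high" setting reported in the study
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Grade this exam page against the attached rubric. "
                         "Report the points awarded for each problem part and a brief justification."},
                {"type": "image_url", "image_url": {"url": to_data_url("rubric_page3.png")}},
                {"type": "image_url", "image_url": {"url": to_data_url("student_042_page3.png")}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```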
AI’s Performance: Strengths and Weaknesses
The AI system showed high agreement with human graders for textual questions and chemical reaction equations. This suggests that AI can effectively handle routine assessment tasks, potentially freeing up human graders to focus on more nuanced instructional activities. However, the AI demonstrated lower reliability for numerical and graphical tasks. For instance, it sometimes misread sketches, misinterpreted labels on diagrams, or failed to correctly identify markings in multiple-choice questions. There were also instances of “false positives,” where the AI awarded points for incorrect answers, and “false negatives,” where it incorrectly deducted points.
Interestingly, the AI was sometimes stricter or more consistent than human TAs. In one case, it correctly flagged an incorrect answer that a TA might have misread due to handwriting, highlighting a potential advantage in consistency and reduced susceptibility to perceptual errors.
Improving Reliability with Filtering Strategies
While the raw performance of the AI might be acceptable for low-stakes quizzes, it was deemed insufficient for high-stakes exams. To address this, the researchers explored several filtering strategies to improve the reliability of AI-assigned scores:
- Partial-Credit Threshold: By setting a minimum threshold for accepting AI-assigned partial credit (e.g., only trusting scores above 50% or full credit), the precision of the AI’s grading improved significantly. This means that when the AI did assign a score, it was more likely to be correct, though it also meant more items were flagged for human review. (A minimal code sketch of this filter, together with the problem-type filter, follows this list.)
- Risk-Based Filter: This method used Bayesian statistics and Item Response Theory (IRT) to estimate the probability of a student correctly solving a problem part. The AI’s score was then compared to this expectation, and only judgments within a certain “risk” tolerance were accepted. This approach proved efficient in aligning scores and reducing TA workload, though its complexity might be less transparent to students. (An illustrative sketch of this idea appears after the next paragraph.)
- Problem-Type Filter: The most straightforward approach involved simply excluding problem parts with graphical components (drawing and graphing) from AI grading, leaving them entirely to human graders. This significantly improved overall reliability, especially when combined with other filters.
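The partial-credit threshold and problem-type filters are simple enough to sketch directly. In the following Python sketch, the data structure, field names, and the 50% threshold are assumptions chosen to match the thresholds discussed in the study, not code from the paper.

```python
# Illustrative sketch of the partial-credit-threshold and problem-type filters.
from dataclasses import dataclass


@dataclass
class AIGrade:
    problem_id: str
    problem_type: str       # e.g. "text", "reaction", "numeric", "graphical"
    points_awarded: float
    points_possible: float


def accept_ai_grade(grade: AIGrade,
                    credit_threshold: float = 0.5,
                    excluded_types: tuple = ("graphical",)) -> bool:
    """Return True if the AI score can be auto-accepted, False if it should
    be routed to a human grader."""
    # Problem-type filter: never auto-accept drawing/graphing parts.
    if grade.problem_type in excluded_types:
        return False
    # Partial-credit threshold: only trust scores at or above the threshold
    # fraction of the available points (full credit always passes).
    return grade.points_awarded / grade.points_possible >= credit_threshold


# Example: split a batch of AI grades into auto-accepted and flagged items.
grades = [
    AIGrade("P1a", "text", 3.0, 3.0),       # full credit on a textual part
    AIGrade("P2b", "graphical", 2.0, 2.0),  # graphical part, always reviewed
    AIGrade("P3c", "numeric", 1.0, 4.0),    # low partial credit, reviewed
]
accepted = [g for g in grades if accept_ai_grade(g)]
flagged = [g for g in grades if not accept_ai_grade(g)]
print(len(accepted), "auto-accepted;", len(flagged), "flagged for human review")
```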
The study found that combining these filters, particularly excluding graphical problems and applying a partial-credit or risk-based threshold, led to substantial improvements in the alignment between AI and human grading. For instance, when only textual problems were considered and a 50% partial-credit threshold was applied, the AI’s total scores aligned much more faithfully with human scores.
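The risk-based filter cannot be fully reconstructed from this summary, but its core idea can be sketched: an IRT model predicts how likely a given student is to solve a given problem part, and the AI's judgment is auto-accepted only when it does not deviate too far from that expectation. The two-parameter-logistic response function and the simple absolute-deviation "risk" measure below are illustrative assumptions, not the paper's exact Bayesian formulation.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter-logistic IRT probability that a student of ability
    theta solves an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def accept_risk_based(ai_score_fraction: float,
                      theta: float, a: float, b: float,
                      risk_tolerance: float = 0.3) -> bool:
    """Accept the AI's score only if it lies close enough to the
    IRT-expected performance for this student on this item."""
    expected = p_correct_2pl(theta, a, b)
    # "Risk" here is simply the absolute deviation between the AI-awarded
    # score fraction and the model's expectation (an assumption for this sketch).
    return abs(ai_score_fraction - expected) <= risk_tolerance


# Example: a strong student (theta = 1.5) on an average-difficulty item.
print(accept_risk_based(1.0, theta=1.5, a=1.0, b=0.0))  # True: full credit matches expectation
print(accept_risk_based(0.0, theta=1.5, a=1.0, b=0.0))  # False: zero credit is surprising, route to a human
```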
Implications for Education
The findings emphasize the necessity for human oversight to ensure grading accuracy, especially for complex or ambiguous student responses. The researchers recommend a hybrid approach: initially deploying AI grading for low-stakes assessments to build confidence, and for high-stakes exams, confidently accepting AI-graded full-credit responses while reserving partially correct or ambiguous answers for human review. This allows teaching assistants to focus their expertise on nuanced student misconceptions and provide deeper feedback.
Integrating AI into grading workflows also raises important considerations about student perceptions of fairness and trust. Clear communication with students about the role of AI in grading, along with continuous monitoring and recalibration of AI models, will be crucial for successful adoption. Ultimately, AI-assisted grading offers a promising pathway to maintain open-ended exam questions even with increasing student numbers and stagnant resources, enhancing overall educational quality by freeing up instructors for more meaningful interactions. For more details, you can read the full paper here.


