TLDR: A study explored using AI to grade handwritten general chemistry exams, comparing AI scores to human grading across various question types. The AI system, gpt-o4-mini/high-vision, proved highly efficient and cost-effective, excelling in grading textual and chemical reaction questions. However, it showed lower reliability for numerical and graphical tasks, necessitating human oversight. The research proposes filtering strategies, including partial-credit thresholds, risk-based Bayesian methods, and problem-type exclusions, to optimize AI-human grading alignment. The findings suggest a hybrid approach where AI handles routine tasks, freeing human graders for complex cases, while emphasizing transparency and student trust in AI-enhanced assessment.
The challenge of grading handwritten, open-ended exams in large university courses is significant, often leading instructors to opt for simpler, closed-answer formats. However, these formats can limit the assessment of higher-order thinking and reasoning skills crucial in subjects like chemistry. A recent study explores how artificial intelligence (AI) can assist in grading these complex exams, aiming to balance efficiency with accuracy.
Researchers Jan Cvengros and Gerd Kortemeyer investigated the effectiveness and reliability of an AI-based grading system for a handwritten general chemistry exam at ETH Zurich. The exam included a variety of question types, such as chemical reaction equations, short and long open-ended answers, numerical and symbolic derivations, and even drawing and sketching tasks. Both student exam pages and grading rubrics were uploaded as images to the AI system.
The study utilized OpenAI’s multimodal reasoning model, gpt-o4-mini/high-vision, to grade the exams page by page. This process involved feeding the AI images of student work alongside the corresponding rubric. The efficiency gains were remarkable: grading 296 student exams took the AI approximately three hours and cost about $100 in tokens, compared to an estimated $2,250 for human teaching assistants (TAs) to grade the same number of exams. This highlights AI’s potential to make frequent formative and summative assessments more feasible without overwhelming instructional resources.
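To make the page-by-page workflow concrete, here is a minimal sketch of how a scanned exam page and its rubric could be sent to a multimodal model through the OpenAI Python SDK. The model identifier, reasoning-effort setting, prompt wording, and file names are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: send one rubric page and one scanned student page to a
# vision-capable reasoning model. Model name, prompt, and file names are
# assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def to_data_url(path: str) -> str:
    """Encode a scanned page as a base64 data URL for the vision input."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


response = client.chat.completions.create(
    model="o4-mini",            # assumed model id corresponding to the study's gpt-o4-mini
    reasoning_effort="high",    # assumed to mirror the "/high" setting reported in the study
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Grade this exam page against the attached rubric. "
                         "Report the points awarded for each problem part and a brief justification."},
                {"type": "image_url", "image_url": {"url": to_data_url("rubric_page3.png")}},
                {"type": "image_url", "image_url": {"url": to_data_url("student_042_page3.png")}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```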
AI’s Performance: Strengths and Weaknesses
The AI system showed high agreement with human graders for textual questions and chemical reaction equations. This suggests that AI can effectively handle routine assessment tasks, potentially freeing up human graders to focus on more nuanced instructional activities. However, the AI demonstrated lower reliability for numerical and graphical tasks. For instance, it sometimes misread sketches, misinterpreted labels on diagrams, or failed to correctly identify markings in multiple-choice questions. There were also instances of “false positives,” where the AI awarded points for incorrect answers, and “false negatives,” where it incorrectly deducted points.
Interestingly, the AI was sometimes stricter or more consistent than human TAs. In one case, it correctly flagged an incorrect answer that a TA might have misread due to handwriting, highlighting a potential advantage in consistency and reduced susceptibility to perceptual errors.
Improving Reliability with Filtering Strategies
While the raw performance of the AI might be acceptable for low-stakes quizzes, it was deemed insufficient for high-stakes exams. To address this, the researchers explored several filtering strategies to improve the reliability of AI-assigned scores:
- Partial-Credit Threshold: By setting a minimum threshold for accepting AI-assigned partial credit (e.g., only trusting scores above 50% or full credit), the precision of the AI’s grading improved significantly. This means that when the AI did assign a score, it was more likely to be correct, though it also meant more items were flagged for human review. (A minimal code sketch of this filter, together with the problem-type filter, follows this list.)
- Risk-Based Filter: This method used Bayesian statistics and Item Response Theory (IRT) to estimate the probability of a student correctly solving a problem part. The AI’s score was then compared to this expectation, and only judgments within a certain “risk” tolerance were accepted. This approach proved efficient in aligning scores and reducing TA workload, though its complexity might be less transparent to students. (An illustrative sketch of this idea appears after the next paragraph.)
- Problem-Type Filter: The most straightforward approach involved simply excluding problem parts with graphical components (drawing and graphing) from AI grading, leaving them entirely to human graders. This significantly improved overall reliability, especially when combined with other filters.
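The partial-credit threshold and problem-type filters are simple enough to sketch directly. In the following Python sketch, the data structure, field names, and the 50% threshold are assumptions chosen to match the thresholds discussed in the study, not code from the paper.

```python
# Illustrative sketch of the partial-credit-threshold and problem-type filters.
from dataclasses import dataclass


@dataclass
class AIGrade:
    problem_id: str
    problem_type: str       # e.g. "text", "reaction", "numeric", "graphical"
    points_awarded: float
    points_possible: float


def accept_ai_grade(grade: AIGrade,
                    credit_threshold: float = 0.5,
                    excluded_types: tuple = ("graphical",)) -> bool:
    """Return True if the AI score can be auto-accepted, False if it should
    be routed to a human grader."""
    # Problem-type filter: never auto-accept drawing/graphing parts.
    if grade.problem_type in excluded_types:
        return False
    # Partial-credit threshold: only trust scores at or above the threshold
    # fraction of the available points (full credit always passes).
    return grade.points_awarded / grade.points_possible >= credit_threshold


# Example: split a batch of AI grades into auto-accepted and flagged items.
grades = [
    AIGrade("P1a", "text", 3.0, 3.0),       # full credit on a textual part
    AIGrade("P2b", "graphical", 2.0, 2.0),  # graphical part, always reviewed
    AIGrade("P3c", "numeric", 1.0, 4.0),    # low partial credit, reviewed
]
accepted = [g for g in grades if accept_ai_grade(g)]
flagged = [g for g in grades if not accept_ai_grade(g)]
print(len(accepted), "auto-accepted;", len(flagged), "flagged for human review")
```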
The study found that combining these filters, particularly excluding graphical problems and applying a partial-credit or risk-based threshold, led to substantial improvements in the alignment between AI and human grading. For instance, when only textual problems were considered and a 50% partial-credit threshold was applied, the AI’s total scores aligned much more faithfully with human scores.
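The risk-based filter cannot be fully reconstructed from this summary, but its core idea can be sketched: an IRT model predicts how likely a given student is to solve a given problem part, and the AI's judgment is auto-accepted only when it does not deviate too far from that expectation. The two-parameter-logistic response function and the simple absolute-deviation "risk" measure below are illustrative assumptions, not the paper's exact Bayesian formulation.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter-logistic IRT probability that a student of ability
    theta solves an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def accept_risk_based(ai_score_fraction: float,
                      theta: float, a: float, b: float,
                      risk_tolerance: float = 0.3) -> bool:
    """Accept the AI's score only if it lies close enough to the
    IRT-expected performance for this student on this item."""
    expected = p_correct_2pl(theta, a, b)
    # "Risk" here is simply the absolute deviation between the AI-awarded
    # score fraction and the model's expectation (an assumption for this sketch).
    return abs(ai_score_fraction - expected) <= risk_tolerance


# Example: a strong student (theta = 1.5) on an average-difficulty item.
print(accept_risk_based(1.0, theta=1.5, a=1.0, b=0.0))  # True: full credit matches expectation
print(accept_risk_based(0.0, theta=1.5, a=1.0, b=0.0))  # False: zero credit is surprising, route to a human
```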
Implications for Education
The findings emphasize the necessity for human oversight to ensure grading accuracy, especially for complex or ambiguous student responses. The researchers recommend a hybrid approach: initially deploying AI grading for low-stakes assessments to build confidence, and for high-stakes exams, confidently accepting AI-graded full-credit responses while reserving partially correct or ambiguous answers for human review. This allows teaching assistants to focus their expertise on nuanced student misconceptions and provide deeper feedback.
Integrating AI into grading workflows also raises important considerations about student perceptions of fairness and trust. Clear communication with students about the role of AI in grading, along with continuous monitoring and recalibration of AI models, will be crucial for successful adoption. Ultimately, AI-assisted grading offers a promising pathway to maintain open-ended exam questions even with increasing student numbers and stagnant resources, enhancing overall educational quality by freeing up instructors for more meaningful interactions. For more details, you can read the full paper here.


