
AI Steps In: Enhancing Calculus Exam Grading with Multimodal Language Models

TLDR: A study investigated using GPT-5 to grade handwritten calculus exams. While unfiltered AI-TA agreement was moderate, a human-in-the-loop system combining partial-credit and IRT-based risk filtering significantly improved accuracy (R² up to 0.95) but required human review for about 70% of items under strict settings. The research highlights a workload-quality trade-off and suggests practical adjustments to exam design to optimize AI grading efficiency and reliability for routine cases, reserving expert judgment for complex responses.

In higher education, particularly in large-enrollment STEM courses like calculus, grading open-ended, handwritten student work at scale is a significant challenge. Traditional machine-grading systems often fall short in evaluating the complex multi-step reasoning, symbolic derivations, and graphical representations that are crucial for understanding mathematical concepts. This often pushes assessments towards closed-answer formats, which may not fully capture students’ true understanding.

A recent study by Gerd Kortemeyer, Alexander Caspar, and Daria Horica explores the potential of contemporary multimodal Large Language Models (LLMs) to assist in grading these intricate handwritten components of calculus exams. The research, detailed in their paper “Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam”, investigates whether AI can provide reliable grading assistance without compromising the validity of assessment.

The researchers conducted their study using a large first-year university calculus exam. Students’ handwritten solutions were graded by GPT-5, an advanced multimodal LLM, using the same rubric employed by human teaching assistants (TAs). Unlike the TAs, who typically assign whole points, the AI was allowed to assign fractional credit. The TAs’ rubric decisions served as the “ground truth” for comparison.

A crucial aspect of their methodology involved developing a “human-in-the-loop” filter. This filter combined two main components: a partial-credit threshold and an Item Response Theory (IRT) risk measure. The partial-credit threshold flagged items with very low AI scores for human review, acting as a conservative safeguard. The IRT risk measure assessed the deviation between the AI’s score and the score expected based on a student’s overall ability and the item’s difficulty. If the AI’s decision significantly diverged from the expected outcome, it was flagged as “high risk” and routed for human judgment.
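To make the mechanics concrete, here is a minimal Python sketch of such a two-part filter. It assumes a Rasch-style IRT model for the expected score; the function names, parameters, and default thresholds are illustrative, not taken from the paper.

```python
import math

def irt_expected_score(theta, b):
    """Expected normalized score on an item under a Rasch-style IRT model,
    given student ability theta and item difficulty b (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def needs_human_review(ai_score, max_points, theta, b,
                       credit_threshold=0.3, risk_tolerance=0.35):
    """Route an AI-graded item to a human if either safeguard fires.
    Parameter names and default thresholds are illustrative."""
    normalized = ai_score / max_points
    # Safeguard 1: partial-credit threshold -- very low AI scores
    # are always routed to a human grader as a conservative check.
    if normalized < credit_threshold:
        return True
    # Safeguard 2: IRT risk -- flag items where the AI's score deviates
    # strongly from what the student's overall ability and the item's
    # difficulty would predict.
    return abs(normalized - irt_expected_score(theta, b)) > risk_tolerance
```

In practice, the ability and difficulty estimates would be fitted from the full score matrix before the filter is applied to individual items.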

The initial findings without any filtering showed a moderate agreement between AI and TA grades, with a coefficient of determination (R²) of approximately 0.85. While this level of agreement might be acceptable for low-stakes feedback, it was deemed insufficient for high-stakes examinations where precision is paramount. The AI also tended to be slightly more generous with total points overall, while being more conservative on individual rubric decisions.
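For context, R² here measures how much of the variance in the TA-assigned scores the AI’s scores explain, with 1.0 meaning perfect agreement. A minimal computation (our own illustration, not code from the study):

```python
import numpy as np

def r_squared(ta_scores, ai_scores):
    """Coefficient of determination of AI scores against TA 'ground truth'."""
    ta = np.asarray(ta_scores, dtype=float)
    ai = np.asarray(ai_scores, dtype=float)
    ss_res = np.sum((ta - ai) ** 2)           # residual sum of squares
    ss_tot = np.sum((ta - ta.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot
```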

However, the introduction of the confidence filter dramatically improved the accuracy. The study demonstrated a clear trade-off between the quality of AI grading and the human workload. Under stricter filter settings (e.g., a mild partial-credit threshold and a low-risk tolerance), the AI achieved near human-level accuracy, with R² values rising to approximately 0.95. The cost, however, was that roughly 70% of the items needed to be reviewed by human graders. Conversely, looser settings allowed for a higher auto-acceptance rate (around 81%) but with slightly lower accuracy (R² ≈ 0.89).
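One way to explore this trade-off is to sweep the two filter settings and record, for each combination, the auto-acceptance rate and the agreement on the auto-accepted items. A hedged sketch, reusing the illustrative functions above; the item fields and sweep grid are our assumptions:

```python
def sweep_filter(items, credit_thresholds, risk_tolerances):
    """For each filter setting, report the auto-acceptance rate and R^2
    on auto-accepted items (requires TA scores, i.e. a calibration set).

    Each item is a dict with illustrative keys:
    'ai', 'ta', 'max', 'theta', 'b'.
    """
    rows = []
    for ct in credit_thresholds:
        for rt in risk_tolerances:
            accepted = [it for it in items
                        if not needs_human_review(it["ai"], it["max"],
                                                  it["theta"], it["b"],
                                                  credit_threshold=ct,
                                                  risk_tolerance=rt)]
            if len(accepted) < 2:
                continue  # R^2 needs variance in the accepted subset
            rows.append({"credit": ct, "risk": rt,
                         "auto_accept": len(accepted) / len(items),
                         "r2": r_squared([it["ta"] for it in accepted],
                                         [it["ai"] for it in accepted])})
    return rows
```

Stricter settings shrink the auto-accepted subset but raise its agreement with the TAs, which is exactly the pattern the study reports.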

The researchers also identified several practical factors that influenced the AI’s performance and the effectiveness of the confidence filter. These included the relatively low weight of the open-ended portion of the exam, which led to inconsistent student effort; a small number of rubric checkpoints, which limited the granularity of assessment; and issues with exam layout, such as students writing outside designated answer regions or on loose sheets, which could cause the AI to miss relevant work.

To enhance the ceiling performance of AI-assisted grading, the study proposes several practical adjustments. These include increasing the assessment weight and protected time for open-ended items to encourage more consistent student effort, adding more granular rubric-visible substeps to improve assessment detail, and implementing stronger spatial anchoring on exam papers with clearly designated answer regions and registration marks. Improving the cleanliness of submissions, such as avoiding background grids and encouraging the use of pencils and erasers, was also suggested to aid OCR (Optical Character Recognition).


In conclusion, this research offers a pragmatic and optimistic outlook on AI’s role in educational assessment. It suggests that while AI may not fully replace human graders, a calibrated human-in-the-loop system can reliably manage a substantial portion of routine grading tasks for open-ended calculus problems. This approach frees up expert human judgment for more ambiguous, complex, or pedagogically rich student responses, ultimately making the grading process more scalable and efficient while preserving the educational value of assessing authentic mathematical reasoning.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
