TLDR: A new study investigates multimodal large language models (MLLMs) for grading handwritten student math work. It finds that the strongest model tested reaches near-human accuracy (95%) on routine arithmetic problems with objective answers. However, MLLMs struggle significantly with mathematical illustrations, even when given human descriptions that remove the visual interpretation challenge, plateauing at moderate agreement levels (Kappa ≈ 0.47). This suggests MLLMs currently lack the ‘tacit knowledge’ and pedagogical judgment that human educators use to interpret nuanced student thinking, highlighting the need for hybrid human-AI systems in educational assessment.
Recent advancements in artificial intelligence, particularly with multimodal large language models (MLLMs), have opened up exciting possibilities for automating tasks that traditionally require human interpretation. One such area is the grading and analysis of handwritten student work, especially in subjects like elementary and middle-school mathematics where most assignments are still completed by hand. This capability could significantly reduce the time teachers spend on grading, allowing them to focus more on providing personalized feedback and understanding students’ learning processes.
A new study by Owen Henkel, Bill Roberts, Doug Jaffe, and Laurence Holt explores the effectiveness of MLLMs in interpreting and grading handwritten student mathematics. The researchers aimed to answer three key questions: how accurately can MLLMs assess handwritten arithmetic with objective answers, how does their performance change when evaluating mathematical illustrations, and can we distinguish between the models’ visual and pedagogical capabilities?
Experiment A: Assessing Numerical Calculations
The first experiment focused on evaluating MLLMs’ ability to grade routine arithmetic problems with clear, objective answers. The researchers used a dataset of 288 handwritten responses from middle school students in Ghana, involving fractions, percentages, and basic algebra. These problems required students to show their work and provide final answers. The assessment task was broken down into two parts: a ‘vision task’ (identifying the numerical answer written by the student) and a ‘grading task’ (determining if the identified answer was mathematically correct).
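To make this two-part decomposition concrete, here is a minimal Python sketch of how such a pipeline could be wired up. The `call_mllm` wrapper, the prompt wording, and the return formats are illustrative assumptions, not the study's actual implementation.

```python
# A minimal sketch of the vision/grading decomposition, assuming a
# hypothetical `call_mllm` wrapper; prompts and formats are illustrative.

def call_mllm(prompt: str, image_path: str) -> str:
    """Hypothetical stand-in for any multimodal chat-completion API."""
    raise NotImplementedError("plug in your MLLM client here")

def vision_task(image_path: str) -> str:
    # Step 1: transcribe only the final numerical answer the student wrote.
    return call_mllm(
        "Transcribe the final numerical answer written by the student. "
        "Return only the number.",
        image_path,
    )

def grading_task(transcribed_answer: str, correct_answer: str) -> bool:
    # Step 2: judge whether the transcribed answer is mathematically
    # correct. For plain arithmetic a direct comparison suffices; in the
    # study, the model performed this judgment as well.
    return transcribed_answer.strip() == correct_answer.strip()
```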
Four state-of-the-art MLLMs were tested: Claude 3.5 Sonnet, Claude 3.7, Gemini 2.5 Pro, and GPT-4.1. The results showed that Gemini 2.5 Pro significantly outperformed the other models, achieving an impressive 95% grading accuracy and a high agreement level (Kappa = 0.90) with human experts. This suggests that for straightforward arithmetic, MLLMs, especially the more advanced ones, can achieve near-human accuracy. Interestingly, some models even showed grading performance exceeding their visual interpretation accuracy, implying they might use mathematical context to compensate for imperfect handwriting recognition. However, the study also noted some puzzling errors, such as models penalizing correct final answers due to untidy or flawed intermediate steps, which human educators would be unlikely to do.
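For readers unfamiliar with the agreement statistic quoted here, Cohen's kappa measures inter-rater agreement corrected for chance and can be computed with scikit-learn. The grade labels below are invented for illustration; only the metric itself comes from the study.

```python
# Cohen's kappa between model grades and human expert grades.
# The toy labels below are made up purely for illustration.
from sklearn.metrics import cohen_kappa_score

human_grades = [1, 0, 1, 1, 0, 1, 1, 0]   # 1 = correct, 0 = incorrect
model_grades = [1, 0, 1, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(human_grades, model_grades)
print(f"Cohen's kappa: {kappa:.2f}")  # ≈ 0.71 for these toy labels
```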
Experiment B: Interpreting Mathematical Illustrations
The second experiment presented a more complex challenge: evaluating mathematical illustrations and diagrams. This involved 150 student-drawn responses from American elementary students, featuring number lines, geometric shapes, and other visual representations where the drawing itself is the answer. Unlike arithmetic, interpreting these illustrations often requires pedagogical judgment, as students might use non-standard notation or partially correct approaches.
To understand the impact of visual interpretation challenges, two conditions were set up: ‘Model-Only’ (models viewed only the image) and ‘Human-Enhanced’ (models received high-quality human descriptions of the visual content alongside the image). When models had to interpret the images directly, their performance was considerably lower. However, when provided with human descriptions, all models showed substantial performance gains. Claude 3.7, for instance, saw a significant improvement in its agreement score (Kappa gain of +0.32). This indicates that visual interpretation is a major hurdle for MLLMs when dealing with complex student drawings.
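A rough sketch of how the two conditions might differ in practice, assuming the same hypothetical `call_mllm` wrapper as in the earlier sketch; the rubric and prompt wording are paraphrased, not the paper's actual prompts.

```python
# Sketch of the two experimental conditions in Experiment B.

def call_mllm(prompt: str, image_path: str) -> str:
    """Hypothetical multimodal API wrapper (see the earlier sketch)."""
    raise NotImplementedError

RUBRIC = "Grade the student's drawing as correct, partial, or incorrect."

def model_only(image_path: str) -> str:
    # Condition 1: the model must interpret the drawing unaided.
    return call_mllm(RUBRIC, image_path)

def human_enhanced(image_path: str, human_description: str) -> str:
    # Condition 2: a human description of the visual content accompanies
    # the image, isolating grading ability from visual interpretation.
    prompt = (
        f"{RUBRIC}\n\n"
        f"A human rater describes the drawing as follows:\n{human_description}"
    )
    return call_mllm(prompt, image_path)
```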
Despite the improvements with human descriptions, the models’ performance plateaued at an agreement level (Kappa ≈ 0.43-0.47) similar to initial human inter-rater agreement before calibration. While this shows progress, it’s still considered insufficient for autonomous deployment in real-world educational settings.
The Tacit Knowledge Gap
The study highlights a significant divide in MLLM capabilities: they perform well on routine arithmetic but struggle with mathematical illustrations. This struggle persists even when visual challenges are removed, pointing to a deeper issue: current MLLMs appear to lack the ‘tacit knowledge’ that experienced educators possess. This includes an understanding of how mathematical representations evolve in children’s thinking, awareness of classroom-specific methods, and the ability to recognize sophisticated reasoning in imprecise drawings. An experienced teacher can look at a hastily drawn number line or an unconventional diagram and still recognize the depth of a student’s understanding, something MLLMs currently cannot replicate.
Implications for Educational Technology
These findings have important implications for designing AI-powered educational tools. For routine arithmetic, MLLMs could be valuable for automated data collection, identifying struggling students, and tracking trends. This could enable more frequent formative assessments without increasing teacher workload. However, for complex mathematical illustrations, a ‘human-in-the-loop’ approach is essential. MLLMs could act as intelligent filters, processing large volumes of work and flagging cases that require expert human interpretation.
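As a sketch of what such an intelligent filter could look like, the routing rule below auto-records high-confidence arithmetic grades and flags everything else for human review. The `GradedItem` fields and the confidence threshold are invented for illustration; they are not from the study.

```python
# Toy routing rule for the "intelligent filter" idea: auto-record
# high-confidence arithmetic grades, flag everything else for a human.
from dataclasses import dataclass

@dataclass
class GradedItem:
    item_type: str      # "arithmetic" or "illustration"
    model_grade: str    # e.g. "correct", "partial", "incorrect"
    confidence: float   # model-reported or externally calibrated

def route(item: GradedItem, threshold: float = 0.9) -> str:
    if item.item_type == "arithmetic" and item.confidence >= threshold:
        return "auto-record"           # the near-human accuracy regime
    return "flag-for-human-review"     # preserve pedagogical judgment

# Illustrations always go to a human, regardless of model confidence.
print(route(GradedItem("illustration", "partial", 0.95)))
```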
Future systems should prioritize transparency, showing educators not just grades but also how the model arrived at its conclusions, allowing teachers to quickly identify and correct errors. Professional development will also be crucial to help educators understand both the capabilities and limitations of these tools, ensuring they enhance rather than replace pedagogical expertise. The full research paper can be found here.
Conclusion
This research provides a comprehensive view of MLLMs’ potential and limitations in interpreting handwritten student mathematical work. While promising for objective arithmetic tasks, current models still face significant challenges with the nuanced interpretation required for mathematical illustrations. Bridging this gap will require continued research into integrating visual and conceptual reasoning, as well as efforts to computationally capture the invaluable tacit knowledge of expert educators. The ultimate goal is to create hybrid systems that amplify educator expertise through computational methods for routine analysis, while preserving human insight for the complex interpretive work essential to understanding student learning.


