TLDR: A new study investigates multimodal large language models (MLLMs) for grading handwritten student math work. It finds that the strongest model tested reaches near-human accuracy (95%) on routine arithmetic problems with objective answers. However, MLLMs struggle significantly with mathematical illustrations, even when given human descriptions that remove the visual interpretation challenge, plateauing at moderate agreement levels (Kappa ≈ 0.47). This suggests MLLMs currently lack the ‘tacit knowledge’ and pedagogical judgment that human educators use to interpret nuanced student thinking, highlighting the need for hybrid human-AI systems in educational assessment.
Recent advancements in artificial intelligence, particularly with multimodal large language models (MLLMs), have opened up exciting possibilities for automating tasks that traditionally require human interpretation. One such area is the grading and analysis of handwritten student work, especially in subjects like elementary and middle-school mathematics where most assignments are still completed by hand. This capability could significantly reduce the time teachers spend on grading, allowing them to focus more on providing personalized feedback and understanding students’ learning processes.
A new study by Owen Henkel, Bill Roberts, Doug Jaffe, and Laurence Holt explores the effectiveness of MLLMs in interpreting and grading handwritten student mathematics. The researchers aimed to answer three key questions: how accurately can MLLMs assess handwritten arithmetic with objective answers, how does their performance change when evaluating mathematical illustrations, and can we distinguish between the models’ visual and pedagogical capabilities?
Experiment A: Assessing Numerical Calculations
The first experiment focused on evaluating MLLMs’ ability to grade routine arithmetic problems with clear, objective answers. The researchers used a dataset of 288 handwritten responses from middle school students in Ghana, involving fractions, percentages, and basic algebra. These problems required students to show their work and provide final answers. The assessment task was broken down into two parts: a ‘vision task’ (identifying the numerical answer written by the student) and a ‘grading task’ (determining if the identified answer was mathematically correct).
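To make this two-part decomposition concrete, here is a minimal Python sketch of how such a pipeline could be wired up. The `call_mllm` wrapper, the prompt wording, and the return formats are illustrative assumptions, not the study's actual implementation.

```python
# A minimal sketch of the vision/grading decomposition, assuming a
# hypothetical `call_mllm` wrapper; prompts and formats are illustrative.

def call_mllm(prompt: str, image_path: str) -> str:
    """Hypothetical stand-in for any multimodal chat-completion API."""
    raise NotImplementedError("plug in your MLLM client here")

def vision_task(image_path: str) -> str:
    # Step 1: transcribe only the final numerical answer the student wrote.
    return call_mllm(
        "Transcribe the final numerical answer written by the student. "
        "Return only the number.",
        image_path,
    )

def grading_task(transcribed_answer: str, correct_answer: str) -> bool:
    # Step 2: judge whether the transcribed answer is mathematically
    # correct. For plain arithmetic a direct comparison suffices; in the
    # study, the model performed this judgment as well.
    return transcribed_answer.strip() == correct_answer.strip()
```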
Four state-of-the-art MLLMs were tested: Claude 3.5 Sonnet, Claude 3.7, Gemini 2.5 Pro, and GPT-4.1. The results showed that Gemini 2.5 Pro significantly outperformed the other models, achieving an impressive 95% grading accuracy and a high agreement level (Kappa = 0.90) with human experts. This suggests that for straightforward arithmetic, MLLMs, especially the more advanced ones, can achieve near-human accuracy. Interestingly, some models even showed grading performance exceeding their visual interpretation accuracy, implying they might use mathematical context to compensate for imperfect handwriting recognition. However, the study also noted some puzzling errors, such as models penalizing correct final answers due to untidy or flawed intermediate steps, which human educators would be unlikely to do.
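For readers unfamiliar with the agreement statistic quoted here, Cohen's kappa measures inter-rater agreement corrected for chance and can be computed with scikit-learn. The grade labels below are invented for illustration; only the metric itself comes from the study.

```python
# Cohen's kappa between model grades and human expert grades.
# The toy labels below are made up purely for illustration.
from sklearn.metrics import cohen_kappa_score

human_grades = [1, 0, 1, 1, 0, 1, 1, 0]   # 1 = correct, 0 = incorrect
model_grades = [1, 0, 1, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(human_grades, model_grades)
print(f"Cohen's kappa: {kappa:.2f}")  # ≈ 0.71 for these toy labels
```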
Experiment B: Interpreting Mathematical Illustrations
The second experiment presented a more complex challenge: evaluating mathematical illustrations and diagrams. This involved 150 student-drawn responses from American elementary students, featuring number lines, geometric shapes, and other visual representations where the drawing itself is the answer. Unlike arithmetic, interpreting these illustrations often requires pedagogical judgment, as students might use non-standard notation or partially correct approaches.
To understand the impact of visual interpretation challenges, two conditions were set up: ‘Model-Only’ (models viewed only the image) and ‘Human-Enhanced’ (models received high-quality human descriptions of the visual content alongside the image). When models had to interpret the images directly, their performance was considerably lower. However, when provided with human descriptions, all models showed substantial performance gains. Claude 3.7, for instance, saw a significant improvement in its agreement score (Kappa gain of +0.32). This indicates that visual interpretation is a major hurdle for MLLMs when dealing with complex student drawings.
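A rough sketch of how the two conditions might differ in practice, assuming the same hypothetical `call_mllm` wrapper as in the earlier sketch; the rubric and prompt wording are paraphrased, not the paper's actual prompts.

```python
# Sketch of the two experimental conditions in Experiment B.

def call_mllm(prompt: str, image_path: str) -> str:
    """Hypothetical multimodal API wrapper (see the earlier sketch)."""
    raise NotImplementedError

RUBRIC = "Grade the student's drawing as correct, partial, or incorrect."

def model_only(image_path: str) -> str:
    # Condition 1: the model must interpret the drawing unaided.
    return call_mllm(RUBRIC, image_path)

def human_enhanced(image_path: str, human_description: str) -> str:
    # Condition 2: a human description of the visual content accompanies
    # the image, isolating grading ability from visual interpretation.
    prompt = (
        f"{RUBRIC}\n\n"
        f"A human rater describes the drawing as follows:\n{human_description}"
    )
    return call_mllm(prompt, image_path)
```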
Despite the improvements with human descriptions, the models’ performance plateaued at an agreement level (Kappa ≈ 0.43-0.47) similar to initial human inter-rater agreement before calibration. While this shows progress, it’s still considered insufficient for autonomous deployment in real-world educational settings.
The Tacit Knowledge Gap
The study highlights a significant divide in MLLM capabilities: they perform well on routine arithmetic but struggle with mathematical illustrations. This struggle persists even when visual challenges are removed, pointing to a deeper issue: current MLLMs appear to lack the ‘tacit knowledge’ that experienced educators possess. This includes an understanding of how mathematical representations evolve in children’s thinking, awareness of classroom-specific methods, and the ability to recognize sophisticated reasoning in imprecise drawings. An experienced teacher can look at a hastily drawn number line or an unconventional diagram and still recognize the depth of a student’s understanding, something MLLMs currently cannot replicate.
Implications for Educational Technology
These findings have important implications for designing AI-powered educational tools. For routine arithmetic, MLLMs could be valuable for automated data collection, identifying struggling students, and tracking trends. This could enable more frequent formative assessments without increasing teacher workload. However, for complex mathematical illustrations, a ‘human-in-the-loop’ approach is essential. MLLMs could act as intelligent filters, processing large volumes of work and flagging cases that require expert human interpretation.
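As a sketch of what such an intelligent filter could look like, the routing rule below auto-records high-confidence arithmetic grades and flags everything else for human review. The `GradedItem` fields and the confidence threshold are invented for illustration; they are not from the study.

```python
# Toy routing rule for the "intelligent filter" idea: auto-record
# high-confidence arithmetic grades, flag everything else for a human.
from dataclasses import dataclass

@dataclass
class GradedItem:
    item_type: str      # "arithmetic" or "illustration"
    model_grade: str    # e.g. "correct", "partial", "incorrect"
    confidence: float   # model-reported or externally calibrated

def route(item: GradedItem, threshold: float = 0.9) -> str:
    if item.item_type == "arithmetic" and item.confidence >= threshold:
        return "auto-record"           # the near-human accuracy regime
    return "flag-for-human-review"     # preserve pedagogical judgment

# Illustrations always go to a human, regardless of model confidence.
print(route(GradedItem("illustration", "partial", 0.95)))
```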
Future systems should prioritize transparency, showing educators not just grades but also how the model arrived at its conclusions, allowing teachers to quickly identify and correct errors. Professional development will also be crucial to help educators understand both the capabilities and limitations of these tools, ensuring they enhance rather than replace pedagogical expertise. The full research paper can be found here.
Conclusion
This research provides a comprehensive view of MLLMs’ potential and limitations in interpreting handwritten student mathematical work. While promising for objective arithmetic tasks, current models still face significant challenges with the nuanced interpretation required for mathematical illustrations. Bridging this gap will require continued research into integrating visual and conceptual reasoning, as well as efforts to computationally capture the invaluable tacit knowledge of expert educators. The ultimate goal is to create hybrid systems that amplify educator expertise through computational methods for routine analysis, while preserving human insight for the complex interpretive work essential to understanding student learning.


