AI’s Report Card: How Well Can Machines Grade Handwritten Math Exams?

TLDR: A new benchmark, CHECK-MAT, evaluates Vision-Language Models (VLMs) on their ability to assess handwritten mathematical solutions from the Russian Unified State Exam (EGE). Unlike previous benchmarks, CHECK-MAT focuses on understanding student solutions, identifying mistakes, and assigning grades based on fixed criteria. The study tested seven VLMs, finding that OpenAI o4-mini performed best, but highlighted significant challenges in AI’s mathematical reasoning, particularly with geometry, and its alignment with human grading rubrics, indicating a substantial gap between current AI and expert human performance.

Artificial intelligence is making strides in many fields, and education is no exception. A new research paper, titled “CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam,” introduces a benchmark designed to evaluate how well AI can grade handwritten math solutions. This work, led by Khrulev Ruslan from Lomonosov Moscow State University, addresses a gap in existing AI evaluation: understanding and assessing human-generated solutions rather than just solving problems.

Traditional AI benchmarks for mathematics often focus on whether a model can arrive at the correct answer to a problem. However, real-world educational assessment, especially for high-stakes exams like the Russian Unified State Exam (EGE), requires a deeper understanding. Expert teachers don’t just look at the final answer; they evaluate the entire problem-solving process, including intermediate steps, reasoning, and adherence to specific grading criteria. The CHECK-MAT benchmark is designed to test Vision-Language Models (VLMs) on this complex task.

The EGE-Math Solutions Assessment Benchmark

The core of this research is a unique dataset compiled from 122 scanned solutions from the official EGE expert guide. Each entry includes an image of the student’s handwritten solution, the original problem statement, and the official grade assigned by human experts, along with detailed justifications. This rich dataset covers various mathematical topics like algebra, geometry, trigonometry, and calculus, presenting diverse challenges due to different handwriting styles and layouts.
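
To make the shape of one dataset entry concrete, here is a minimal sketch of a record holding the pieces listed above. The class name, field names, and the optional topic label are assumptions made for illustration; the article does not describe the actual file format or schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GradedSolution:
    """One benchmark entry. All names here are hypothetical, not the dataset's real schema."""

    image_path: str            # scan of the student's handwritten solution
    problem_statement: str     # original task text
    expert_score: int          # official grade assigned by human experts
    expert_justification: str  # experts' explanation of that grade
    topic: Optional[str] = None  # assumed label, e.g. "geometry"

    def __post_init__(self) -> None:
        # A grade below zero would indicate a data-entry error.
        if self.expert_score < 0:
            raise ValueError("expert_score must be non-negative")


example = GradedSolution(
    image_path="scans/solution_001.png",  # placeholder path
    problem_statement="Solve the equation ...",
    expert_score=1,
    expert_justification="Correct method, arithmetic slip in the last step.",
    topic="algebra",
)
```

Keeping the expert justification next to the score reflects the article's point that the grade alone is not the whole label: the reasoning behind it is part of what a grader model is compared against.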

The benchmark’s primary focus is on assessing the VLM’s ability to:

  • Understand the Solution Flow: Comprehend the logical progression of a student’s work.
  • Identify Errors: Pinpoint mathematical errors, logical flaws, or omissions.
  • Apply Grading Rubrics: Assess identified errors against specific EGE criteria to assign an appropriate score.

How the Models Were Evaluated

Seven state-of-the-art VLMs from major providers like Google, OpenAI, Arcee AI, and Alibaba Cloud were tested across three distinct evaluation modes:

  1. Without Answer: The model received only the handwritten solution image and the problem statement, relying solely on its internal understanding of the grading rubric.
  2. With Answer: The model was given the handwritten solution, problem statement, and the correct final numerical answer. This tested whether external context improved error identification.
  3. With True Solution: The most informative mode, where the model received the handwritten solution, problem statement, and a complete, correct reference solution. This allowed evaluation of the model’s ability to compare student work with a gold standard.
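
The three modes differ only in how much text accompanies the solution image. The sketch below shows one way to assemble that text; the enum, the function name, and the section labels are assumptions for illustration and are not taken from the paper's code. The image itself would be passed to the model separately.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    WITHOUT_ANSWER = "without_answer"
    WITH_ANSWER = "with_answer"
    WITH_TRUE_SOLUTION = "with_true_solution"


def build_text_context(
    mode: Mode,
    problem: str,
    final_answer: Optional[str] = None,
    reference_solution: Optional[str] = None,
) -> str:
    """Return the text sent alongside the handwritten-solution image (hypothetical helper)."""
    parts = [f"Problem statement:\n{problem}"]
    if mode is Mode.WITH_ANSWER:
        if final_answer is None:
            raise ValueError("WITH_ANSWER mode needs final_answer")
        parts.append(f"Correct final answer:\n{final_answer}")
    elif mode is Mode.WITH_TRUE_SOLUTION:
        if reference_solution is None:
            raise ValueError("WITH_TRUE_SOLUTION mode needs reference_solution")
        parts.append(f"Reference solution:\n{reference_solution}")
    # WITHOUT_ANSWER adds nothing beyond the problem statement.
    return "\n\n".join(parts)


context = build_text_context(Mode.WITH_ANSWER, "Solve x^2 = 4.", final_answer="x = -2 or x = 2")
```

Raising an error when the required extra context is missing keeps a run from silently falling back to the least informative mode, which would blur comparisons between modes.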

Key Findings and Limitations

The evaluation revealed that OpenAI’s o4-mini model consistently performed the best across all modes, showing superior capabilities in understanding handwritten solutions and applying grading criteria. Google Gemini 2.0 Flash also performed strongly, especially when provided with additional context like the correct answer or a true solution.

However, the research also highlighted significant limitations. Models struggled more with geometry tasks (stereometry and planimetry) than with algebraic ones, suggesting difficulties both in interpreting free-hand diagrams and in rigorous spatial reasoning. The highest accuracy achieved was 56.56%, indicating a substantial gap between current AI performance and human expert-level grading.
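
For a sense of scale, the sketch below computes accuracy as the share of solutions where the model's score equals the expert's score. That exact-match definition, and the idea that it is taken over all 122 solutions, are assumptions; the article does not spell out how the paper computes the metric. Under those assumptions, 56.56% corresponds to 69 matching grades out of 122.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[int], expert: Sequence[int]) -> float:
    """Fraction of solutions where the predicted score equals the expert score."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


# Illustrative check only: 69 agreements out of 122 solutions.
best_percent = round(69 / 122 * 100, 2)
print(best_percent)  # 56.56
```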

Several factors contribute to these limitations:

  • Visual Interpretation: Diverse handwriting styles and layouts pose challenges, leading to errors in initial visual recognition that propagate through the reasoning process.
  • Deep Reasoning: Models often struggle with complex symbolic and logical reasoning, especially for non-standard solution paths or subtle errors.
  • Dataset Size: The current benchmark uses 122 solutions; a larger, more diverse dataset could enable more comprehensive evaluation and fine-tuning.
  • Contextual Reasoning: While some models leverage additional context effectively, others struggle to integrate this information robustly.

This research paves the way for more sophisticated and human-centric AI assessment tools. The source code and dataset are available for further research and development, encouraging the community to build upon these findings. You can find more details about the paper at https://arxiv.org/pdf/2507.22958.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
