TLDR: A new benchmark, CHECK-MAT, evaluates Vision-Language Models (VLMs) on their ability to assess handwritten mathematical solutions from the Russian Unified State Exam (EGE). Unlike previous benchmarks, CHECK-MAT focuses on understanding student solutions, identifying mistakes, and assigning grades based on fixed criteria. The study tested seven VLMs, finding that OpenAI o4-mini performed best, but highlighted significant challenges in AI’s mathematical reasoning, particularly with geometry, and its alignment with human grading rubrics, indicating a substantial gap between current AI and expert human performance.
Artificial intelligence is making strides in many fields, and education is no exception. A new research paper, titled “CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam,” introduces a benchmark designed to evaluate how well AI models can grade handwritten math solutions. The work, led by Khrulev Ruslan from Lomonosov Moscow State University, addresses a gap in how AI is evaluated: understanding and assessing human-written solutions rather than just solving problems.
Traditional AI benchmarks for mathematics often focus on whether a model can arrive at the correct answer to a problem. However, real-world educational assessment, especially for high-stakes exams like the Russian Unified State Exam (EGE), requires a deeper understanding. Expert teachers don’t just look at the final answer; they evaluate the entire problem-solving process, including intermediate steps, reasoning, and adherence to specific grading criteria. The CHECK-MAT benchmark tests Vision-Language Models (VLMs) on this more complex task.
The EGE-Math Solutions Assessment Benchmark
The core of this research is a unique dataset compiled from 122 scanned solutions from the official EGE expert guide. Each entry includes an image of the student’s handwritten solution, the original problem statement, and the official grade assigned by human experts, along with detailed justifications. This rich dataset covers various mathematical topics like algebra, geometry, trigonometry, and calculus, presenting diverse challenges due to different handwriting styles and layouts.
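To make the shape of an entry concrete, here is a minimal sketch of how one record could be represented. The class name, field names, and sample values are hypothetical illustrations based only on the description above; the actual dataset format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedSolution:
    """One benchmark entry as described in the article (hypothetical field names)."""

    image_path: str            # scan of the student's handwritten solution
    problem_statement: str     # original task text
    expert_score: int          # official grade assigned by human experts
    expert_justification: str  # experts' written reasoning for the grade
    topic: str                 # e.g. "algebra", "geometry", "trigonometry"


# Illustrative placeholder values, not taken from the real dataset.
example = GradedSolution(
    image_path="scans/example_001.png",
    problem_statement="Solve the equation 2x + 3 = 7.",
    expert_score=1,
    expert_justification="Correct method, arithmetic slip in the last step.",
    topic="algebra",
)

print(example.topic, example.expert_score)
```

Keeping the expert justification next to the score would allow a model's explanation, and not only its final grade, to be compared against the human one.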
The benchmark’s primary focus is on assessing the VLM’s ability to:
- Understand the Solution Flow: Comprehend the logical progression of a student’s work.
- Identify Errors: Pinpoint mathematical errors, logical flaws, or omissions.
- Apply Grading Rubrics: Assess identified errors against specific EGE criteria to assign an appropriate score.
How the Models Were Evaluated
Seven state-of-the-art VLMs from major providers like Google, OpenAI, Arcee AI, and Alibaba Cloud were tested across three distinct evaluation modes:
- Without Answer: The model received only the handwritten solution image and the problem statement, relying solely on its internal understanding of the grading rubric.
- With Answer: The model was given the handwritten solution, the problem statement, and the correct final numerical answer. This tested whether external context improved error identification.
- With True Solution: The most informative mode, where the model received the handwritten solution, problem statement, and a complete, correct reference solution. This allowed evaluation of the model’s ability to compare student work with a gold standard.
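The three modes differ only in how much reference material accompanies the student's image. A rough sketch of that difference, assuming the text context is assembled separately from the image, might look like this. The `Mode` enum, the `build_text_context` function, and the prompt wording are all hypothetical and are not taken from the paper's code.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    WITHOUT_ANSWER = "without_answer"
    WITH_ANSWER = "with_answer"
    WITH_TRUE_SOLUTION = "with_true_solution"


def build_text_context(
    mode: Mode,
    problem: str,
    final_answer: Optional[str] = None,
    reference_solution: Optional[str] = None,
) -> str:
    """Assemble the text sent alongside the handwritten-solution image."""
    parts = [f"Problem statement:\n{problem}"]

    if mode is Mode.WITH_ANSWER:
        if final_answer is None:
            raise ValueError("WITH_ANSWER mode needs the correct final answer")
        parts.append(f"Correct final answer:\n{final_answer}")
    elif mode is Mode.WITH_TRUE_SOLUTION:
        if reference_solution is None:
            raise ValueError("WITH_TRUE_SOLUTION mode needs a reference solution")
        parts.append(f"Reference solution:\n{reference_solution}")

    # Hypothetical instruction text; the real prompts are not given in the article.
    parts.append("Grade the attached handwritten solution according to the exam criteria.")
    return "\n\n".join(parts)


context = build_text_context(Mode.WITH_ANSWER, "Solve 2x + 3 = 7.", final_answer="x = 2")
print(context)
```

Under this sketch, each mode adds at most one extra block of context, so differences in a model's scores across modes can be attributed to that added material.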
Key Findings and Limitations
The evaluation revealed that OpenAI’s o4-mini model consistently performed the best across all modes, showing superior capabilities in understanding handwritten solutions and applying grading criteria. Google Gemini 2.0 Flash also performed strongly, especially when provided with additional context like the correct answer or a true solution.
However, the research also highlighted significant limitations. Models struggled more with geometry tasks (stereometry and planimetry) than with algebraic ones, suggesting difficulties both in interpreting free-hand diagrams and in rigorous spatial reasoning. The highest accuracy achieved was 56.56%, indicating a substantial gap between current AI performance and human expert-level grading.
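For a sense of scale, the sketch below assumes accuracy means the share of solutions where the model's score exactly matches the expert's score over all 122 entries. The article does not spell out the metric, so this is an assumption, and `exact_match_accuracy` is a hypothetical helper. Under that assumption, 56.56% is what 69 matching grades out of 122 would give.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[int], expert: Sequence[int]) -> float:
    """Fraction of solutions where the model's score equals the expert's score."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    if not expert:
        raise ValueError("cannot compute accuracy on an empty set")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


# Toy scores for illustration only: 3 of the 4 grades match.
toy_accuracy = exact_match_accuracy([2, 0, 1, 1], [2, 0, 1, 0])
print(f"{toy_accuracy:.2%}")

# Assumed reading of the reported figure: 69 matches out of 122 solutions.
best_reported = 69 / 122
print(f"{best_reported:.2%}")
```

With only 122 solutions, each additional correct grade moves accuracy by roughly 0.8 percentage points, so small differences between models should be read with caution.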
Several factors contribute to these limitations:
- Visual Interpretation: Diverse handwriting styles and layouts pose challenges, leading to errors in initial visual recognition that propagate through the reasoning process.
- Deep Reasoning: Models often struggle with complex symbolic and logical reasoning, especially for non-standard solution paths or subtle errors.
- Dataset Size: The current benchmark uses 122 solutions; a larger, more diverse dataset could enable more comprehensive evaluation and fine-tuning.
- Contextual Reasoning: While some models leverage additional context effectively, others struggle to integrate this information robustly.
This research paves the way for more sophisticated and human-centric AI assessment tools. The source code and dataset are available for further research and development, encouraging the community to build upon these findings. You can find more details about the paper at https://arxiv.org/pdf/2507.22958.


