AI's Report Card: How Well Can Machines Grade Handwritten Math Exams?

TLDR: A new benchmark, CHECK-MAT, evaluates Vision-Language Models (VLMs) on their ability to assess handwritten mathematical solutions from the Russian Unified State Exam (EGE). Unlike previous benchmarks, CHECK-MAT focuses on understanding student solutions, identifying mistakes, and assigning grades based on fixed criteria. The study tested seven VLMs, finding that OpenAI o4-mini performed best, but highlighted significant challenges in AI’s mathematical reasoning, particularly with geometry, and its alignment with human grading rubrics, indicating a substantial gap between current AI and expert human performance.

Artificial intelligence is making strides in many fields, and education is no exception. A new research paper, titled “CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam,” introduces a benchmark designed to evaluate how well AI can grade handwritten math solutions. This work, led by Khrulev Ruslan from Lomonosov Moscow State University, addresses a gap in how AI is usually evaluated: understanding and assessing human-written solutions rather than just solving problems.

Traditional AI benchmarks for mathematics often focus on whether a model can arrive at the correct answer to a problem. However, real-world educational assessment, especially for high-stakes exams like the Russian Unified State Exam (EGE), requires a deeper understanding. Expert teachers don’t just look at the final answer; they evaluate the entire problem-solving process, including intermediate steps, reasoning, and adherence to specific grading criteria. The CHECK-MAT benchmark is designed to test Vision-Language Models (VLMs) on this complex task.

The EGE-Math Solutions Assessment Benchmark

The core of this research is a unique dataset compiled from 122 scanned solutions from the official EGE expert guide. Each entry includes an image of the student’s handwritten solution, the original problem statement, and the official grade assigned by human experts, along with detailed justifications. This rich dataset covers various mathematical topics like algebra, geometry, trigonometry, and calculus, presenting diverse challenges due to different handwriting styles and layouts.

The benchmark’s primary focus is on assessing the VLM’s ability to:

Understand the Solution Flow: Comprehend the logical progression of a student’s work.
Identify Errors: Pinpoint mathematical errors, logical flaws, or omissions.
Apply Grading Rubrics: Assess identified errors against specific EGE criteria to assign an appropriate score.

How the Models Were Evaluated

Seven state-of-the-art VLMs from major providers like Google, OpenAI, Arcee AI, and Alibaba Cloud were tested across three distinct evaluation modes:

Without Answer: The model received only the handwritten solution image and the problem statement, relying solely on its internal understanding of the grading rubric.
With Answer: The model was given the handwritten solution, the problem statement, and the correct final numerical answer. This tested whether external context improved error identification.
With True Solution: The most informative mode, where the model received the handwritten solution, problem statement, and a complete, correct reference solution. This allowed evaluation of the model’s ability to compare student work with a gold standard.
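
The three modes differ only in how much reference material accompanies the student's work. A minimal sketch of that difference is shown below; the names `GradingInput`, `build_input`, and the mode strings are hypothetical and chosen for illustration, not taken from the CHECK-MAT code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradingInput:
    """What a model sees for one graded solution (hypothetical structure)."""
    image_path: str                           # scan of the handwritten solution
    problem_statement: str                    # original task text
    final_answer: Optional[str] = None        # set only in "with_answer" mode
    reference_solution: Optional[str] = None  # set only in "with_true_solution" mode


def build_input(mode: str, image_path: str, problem: str,
                answer: str, solution: str) -> GradingInput:
    """Select which reference material is exposed to the model in each mode."""
    if mode == "without_answer":
        return GradingInput(image_path, problem)
    if mode == "with_answer":
        return GradingInput(image_path, problem, final_answer=answer)
    if mode == "with_true_solution":
        return GradingInput(image_path, problem, reference_solution=solution)
    raise ValueError(f"unknown evaluation mode: {mode}")


# Illustrative values only; not an item from the benchmark.
example = build_input("with_answer", "scan_001.png", "Solve x^2 = 4.",
                      answer="x = -2 or x = 2", solution="(full worked solution)")
print(example.final_answer, example.reference_solution)
```

In this sketch the same image and problem statement are passed in every mode, so any difference in grading quality between modes can be attributed to the extra reference material.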

Key Findings and Limitations

The evaluation revealed that OpenAI’s o4-mini model consistently performed the best across all modes, showing superior capabilities in understanding handwritten solutions and applying grading criteria. Google Gemini 2.0 Flash also performed strongly, especially when provided with additional context like the correct answer or a true solution.

However, the research also highlighted significant limitations. Models struggled more with geometry tasks (stereometry and planimetry) than with algebraic ones, suggesting difficulty interpreting free-hand diagrams and carrying out rigorous spatial reasoning. The highest accuracy achieved was 56.56%, indicating a substantial gap between current AI performance and human expert-level grading.
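
To put the 56.56% figure in concrete terms, the sketch below assumes that accuracy means the share of solutions where the model's score exactly matches the expert's score, computed over all 122 solutions. That definition is an assumption made here for illustration; under it, 56.56% would correspond to 69 matching grades. The function name and the score lists are hypothetical.

```python
def exact_match_accuracy(predicted, expert):
    """Fraction of solutions where the predicted score equals the expert score."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


# Made-up scores, not benchmark data: 69 agreements out of 122 solutions.
expert_scores = [2] * 122
model_scores = [2] * 69 + [0] * 53

print(f"{exact_match_accuracy(model_scores, expert_scores):.2%}")  # prints 56.56%
```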

Several factors contribute to these limitations:

Visual Interpretation: Diverse handwriting styles and layouts pose challenges, leading to errors in initial visual recognition that propagate through the reasoning process.
Deep Reasoning: Models often struggle with complex symbolic and logical reasoning, especially for non-standard solution paths or subtle errors.
Dataset Size: The current benchmark uses 122 solutions; a larger, more diverse dataset could enable more comprehensive evaluation and fine-tuning.
Contextual Reasoning: While some models leverage additional context effectively, others struggle to integrate this information robustly.

This research paves the way for more sophisticated and human-centric AI assessment tools. The source code and dataset are available for further research and development, encouraging the community to build upon these findings. You can find more details about the paper at https://arxiv.org/pdf/2507.22958.
