
Automated Proof Grading: Agentic Workflows Enhance Mathematical Competition Assessment

TLDR: A research paper introduces RefGrader, an automated system that uses agentic workflows to grade mathematical competition proofs. It addresses the challenge of assigning partial credit by clustering reference solutions, matching them to the student's proof, analyzing solution steps, automatically designing problem-specific rubrics, and then grading against those rubrics. This multi-step approach significantly improves agreement with human grades and yields more consistent partial-credit assignment than single-turn LLM grading.

Recent advances in large language models (LLMs) have demonstrated a growing ability to solve complex mathematical problems, even reaching gold-medal performance at the 2025 International Mathematical Olympiad (IMO). A significant challenge remains, however: reliably grading mathematical proofs, especially assigning partial credit beyond a simple correct/incorrect judgment.

Traditional LLM-based grading often struggles with the nuances of partial credit, frequently overscoring incomplete solutions or failing to consistently apply rubrics. This issue is particularly critical as LLMs become more adept at generating proofs, necessitating robust automated assessment tools.

A new research paper, “RefGrader: Automated Grading of Mathematical Competition Proofs Using Agentic Workflows,” addresses this challenge by introducing an innovative approach to automated proof grading. Authored by Hamed Mahdavi, Pouria Mahdavinia, Samira Malek, Pegah Mohammadipour, Alireza Hashemi, Majid Daliri, Alireza Farhadi, Amir Khasahmadi, Niloofar Mireshghallah, and Vasant Honavar, the paper explores how agentic workflows can significantly improve the accuracy and consistency of grading mathematical competition proofs. You can find the full paper at arXiv:2510.09021.

The Problem with Single-Turn Grading

The researchers first evaluated LLMs in a “single-turn” grading setting, where the model is given a problem and a solution and asked to grade it. While these models could often distinguish between perfectly correct and incorrect solutions, they showed significant “calibration gaps” in assigning partial credit. They tended to overscore low-grade and partially correct solutions, indicating an optimistic bias and difficulty in assessing genuine progress in incomplete proofs.
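To make that baseline concrete, the sketch below shows the shape of a single-turn grading call. The `call_llm` helper and the prompt wording are illustrative stand-ins, not the paper's actual prompts:

```python
# Minimal sketch of the single-turn grading baseline. `call_llm` is a
# hypothetical stand-in for any chat-completion API, and the prompt wording
# is illustrative, not the paper's actual prompt.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to an LLM provider")

def grade_single_turn(problem: str, solution: str, max_score: int = 7) -> int:
    prompt = (
        "You are grading a mathematical competition proof.\n"
        f"Problem:\n{problem}\n\n"
        f"Candidate solution:\n{solution}\n\n"
        f"Reply with a single integer score from 0 to {max_score}."
    )
    return int(call_llm(prompt).strip())
```

Everything the model knows about partial credit has to fit into this one prompt, which is exactly where the calibration gaps show up.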

Introducing RefGrader: A Multi-Step Agentic Workflow

To overcome these limitations, the paper proposes RefGrader, a multi-step agentic workflow that leverages reference solutions to improve grading quality and calibration. The workflow is designed to mirror a human grader's process, breaking the complex task into manageable, analytical steps (a code sketch of the full pipeline follows the list):

1. Reference Solution Clustering: The system first groups available reference solutions based on their strategic similarities.

2. Solution Matching: It then identifies the most similar group of reference solutions to the student’s submitted proof.

3. Solution Analysis: The model analyzes the chosen reference solution, breaking it down into main steps (often called “aha moments” or key ideas) and their substeps.

4. Rubric Design: Crucially, the system automatically designs a problem-specific grading rubric. It allocates points among the main steps and defines rules for assigning points to substeps. The paper explores different rubric design choices, including approachability-based weighting (where harder steps get more points) and milestone-based rubrics.

5. Grading: Finally, the model detects errors in the student’s solution, either directly or by identifying contradictions with the reference solution. It then matches the correct and erroneous parts of the student’s work against the derived rubric to assign a final grade.
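The sketch below lays out this five-step pipeline in code. The function names, intermediate data shapes, and the specific point-allocation rule are illustrative assumptions; in the paper, each step is carried out by prompted LLM calls:

```python
# Structural sketch of the five-step RefGrader workflow. Function names, data
# shapes, and the point-allocation rule are illustrative assumptions; in the
# paper each step is carried out by prompted LLM calls.

def cluster_references(references: list[str]) -> list[list[str]]:
    """Step 1: group reference solutions that share the same overall strategy."""
    ...  # LLM call: cluster by strategic similarity

def match_cluster(clusters: list[list[str]], student_proof: str) -> list[str]:
    """Step 2: pick the cluster whose strategy best matches the student proof."""
    ...  # LLM call: compare the proof against a representative of each cluster

def analyze_reference(reference: str) -> list[dict]:
    """Step 3: decompose a reference into main steps ('key ideas') and substeps,
    with an LLM-estimated difficulty per step, e.g.
    [{"name": "key lemma", "difficulty": 0.7, "substeps": [...]}, ...]."""
    ...  # LLM call

def design_rubric(steps: list[dict], total_points: int = 7) -> dict[str, float]:
    """Step 4, approachability-based variant: harder (less approachable) steps
    receive proportionally more of the available points."""
    difficulty_sum = sum(s["difficulty"] for s in steps)
    return {s["name"]: total_points * s["difficulty"] / difficulty_sum
            for s in steps}

def grade(student_proof: str, reference: str, rubric: dict[str, float]) -> float:
    """Step 5: detect errors directly or via contradictions with the reference,
    then credit only the rubric items the student demonstrably completed."""
    ...  # LLM call

def ref_grader(references: list[str], student_proof: str) -> float:
    clusters = cluster_references(references)
    cluster = match_cluster(clusters, student_proof)
    steps = analyze_reference(cluster[0])
    rubric = design_rubric(steps)
    return grade(student_proof, cluster[0], rubric)
```

Note how only the final step ever sees the student's proof together with a rubric; everything before it is about constructing the right rubric to grade against.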

Improved Performance and Robustness

The evaluation of RefGrader was conducted using a corpus of 90 Gemini 2.5 Pro–generated solutions for IMO Shortlist problems (graded on a 1–4 scale) and MathArena solution sets for IMO/USAMO 2025 (scored on a 0–7 scale). The results demonstrated that the proposed agentic workflows substantially improved upon single-turn grading across various metrics, including Pearson/Spearman correlations, MAE/RMSE, QWK, and AC2, indicating higher agreement with human grades and more consistent handling of partial credit.
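For readers who want to compute the same agreement metrics on their own grading runs, most are one-liners in standard Python libraries. Here is a toy example with made-up grades; AC2 (Gwet's agreement coefficient) is omitted because it is not in scipy or sklearn and typically requires a dedicated package such as irrCAC:

```python
# Toy example of the agreement metrics used in the evaluation, computed
# between human grades and model grades on an illustrative 0-7 scale.

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

human = np.array([4, 2, 0, 7, 3, 5])   # illustrative grades, not paper data
model = np.array([4, 3, 1, 7, 2, 5])

print("Pearson r :", pearsonr(human, model)[0])
print("Spearman r:", spearmanr(human, model)[0])
print("MAE       :", np.mean(np.abs(human - model)))
print("RMSE      :", np.sqrt(np.mean((human - model) ** 2)))
# QWK treats grades as ordinal and penalizes large disagreements more heavily.
print("QWK       :", cohen_kappa_score(human, model, weights="quadratic"))
```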

The study also found that even simply adding a similar reference solution (a three-step variant without explicit rubric generation) improved performance, but the full five-step workflows with detailed solution analysis and rubric design yielded the largest gains. The approachability-based and milestone-based rubric designs performed particularly well.

Beyond Grading: Future Implications

The RefGrader system offers several broader applications beyond automated grading. It can serve as an “LLM-as-a-judge” tool, providing transparent, step-referenced rationales for its scores. It could also function as a generative reward model for reinforcement learning, guiding LLMs toward more correct and complete proofs. In educational settings, given appropriate reference solutions, the approach could grade student work and provide interpretable feedback on missing steps and error types.

While these workflows might incur higher token consumption, many steps (like reference clustering, solution analysis, and rubric design) can be cached offline, making the overall cost manageable for online grading.
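As a rough illustration of that amortization, everything except matching and grading can be keyed by problem ID and computed once. The cache layout below, reusing the hypothetical helpers from the pipeline sketch above, is an assumption for illustration rather than the paper's implementation:

```python
# Sketch of offline caching: steps 1-4 depend only on the problem, not the
# submission, so their outputs can be computed once and reused. The on-disk
# layout and the helpers (from the pipeline sketch above) are assumptions.

import json
from pathlib import Path

CACHE_DIR = Path("refgrader_cache")

def problem_artifacts(problem_id: str, references: list[str]) -> dict:
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{problem_id}.json"
    if path.exists():                       # cache hit: no LLM calls needed
        return json.loads(path.read_text())
    clusters = cluster_references(references)            # step 1
    steps = [analyze_reference(c[0]) for c in clusters]  # step 3
    rubrics = [design_rubric(s) for s in steps]          # step 4
    artifacts = {"clusters": clusters, "steps": steps, "rubrics": rubrics}
    path.write_text(json.dumps(artifacts))
    return artifacts

# Per submission, only matching (step 2) and grading (step 5) need fresh calls.
```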

