
Automated Proof Grading: Agentic Workflows Enhance Mathematical Competition Assessment

TLDR: A research paper introduces RefGrader, an automated system that uses agentic workflows to grade mathematical competition proofs. It addresses the challenge of assigning partial credit by clustering reference solutions, matching them to the student's proof, analyzing solution steps, automatically designing problem-specific rubrics, and then grading against those rubrics. This multi-step approach significantly improves agreement with human grades and yields more consistent partial-credit assignment than single-turn LLM grading.

Recent advances in large language models (LLMs) have demonstrated a growing ability to solve complex mathematical problems, even reaching gold-medal performance at the 2025 International Mathematical Olympiad (IMO). A significant challenge remains, however: reliably grading mathematical proofs, especially assigning partial credit beyond a simple correct/incorrect judgment.

Traditional LLM-based grading often struggles with the nuances of partial credit, frequently overscoring incomplete solutions or failing to consistently apply rubrics. This issue is particularly critical as LLMs become more adept at generating proofs, necessitating robust automated assessment tools.

A new research paper, “RefGrader: Automated Grading of Mathematical Competition Proofs Using Agentic Workflows,” addresses this challenge by introducing an innovative approach to automated proof grading. Authored by Hamed Mahdavi, Pouria Mahdavinia, Samira Malek, Pegah Mohammadipour, Alireza Hashemi, Majid Daliri, Alireza Farhadi, Amir Khasahmadi, Niloofar Mireshghallah, and Vasant Honavar, the paper explores how agentic workflows can significantly improve the accuracy and consistency of grading mathematical competition proofs. You can find the full paper at arXiv:2510.09021.

The Problem with Single-Turn Grading

The researchers first evaluated LLMs in a “single-turn” grading setting, where the model is given a problem and a solution and asked to grade it. While these models could often distinguish between perfectly correct and incorrect solutions, they showed significant “calibration gaps” in assigning partial credit. They tended to overscore low-grade and partially correct solutions, indicating an optimistic bias and difficulty in assessing genuine progress in incomplete proofs.
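To make that baseline concrete, the sketch below shows the shape of a single-turn grading call. The `call_llm` helper and the prompt wording are illustrative stand-ins, not the paper's actual prompts:

```python
# Minimal sketch of the single-turn grading baseline. `call_llm` is a
# hypothetical stand-in for any chat-completion API, and the prompt wording
# is illustrative, not the paper's actual prompt.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to an LLM provider")

def grade_single_turn(problem: str, solution: str, max_score: int = 7) -> int:
    prompt = (
        "You are grading a mathematical competition proof.\n"
        f"Problem:\n{problem}\n\n"
        f"Candidate solution:\n{solution}\n\n"
        f"Reply with a single integer score from 0 to {max_score}."
    )
    return int(call_llm(prompt).strip())
```

Everything the model knows about partial credit has to fit into this one prompt, which is exactly where the calibration gaps show up.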

Introducing RefGrader: A Multi-Step Agentic Workflow

To overcome these limitations, the paper proposes RefGrader, a multi-step agentic workflow that leverages reference solutions to improve grading quality and calibration. The workflow is designed to mirror a human grader's process, breaking the complex task into manageable, analytical steps (a code sketch of the full pipeline follows the list):

1. Reference Solution Clustering: The system first groups available reference solutions based on their strategic similarities.

2. Solution Matching: It then identifies the most similar group of reference solutions to the student’s submitted proof.

3. Solution Analysis: The model analyzes the chosen reference solution, breaking it down into main steps (often called “aha moments” or key ideas) and their substeps.

4. Rubric Design: Crucially, the system automatically designs a problem-specific grading rubric. It allocates points among the main steps and defines rules for assigning points to substeps. The paper explores different rubric design choices, including approachability-based weighting (where harder steps get more points) and milestone-based rubrics.

5. Grading: Finally, the model detects errors in the student’s solution, either directly or by identifying contradictions with the reference solution. It then matches the correct and erroneous parts of the student’s work against the derived rubric to assign a final grade.
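The sketch below lays out this five-step pipeline in code. The function names, intermediate data shapes, and the specific point-allocation rule are illustrative assumptions; in the paper, each step is carried out by prompted LLM calls:

```python
# Structural sketch of the five-step RefGrader workflow. Function names, data
# shapes, and the point-allocation rule are illustrative assumptions; in the
# paper each step is carried out by prompted LLM calls.

def cluster_references(references: list[str]) -> list[list[str]]:
    """Step 1: group reference solutions that share the same overall strategy."""
    ...  # LLM call: cluster by strategic similarity

def match_cluster(clusters: list[list[str]], student_proof: str) -> list[str]:
    """Step 2: pick the cluster whose strategy best matches the student proof."""
    ...  # LLM call: compare the proof against a representative of each cluster

def analyze_reference(reference: str) -> list[dict]:
    """Step 3: decompose a reference into main steps ('key ideas') and substeps,
    with an LLM-estimated difficulty per step, e.g.
    [{"name": "key lemma", "difficulty": 0.7, "substeps": [...]}, ...]."""
    ...  # LLM call

def design_rubric(steps: list[dict], total_points: int = 7) -> dict[str, float]:
    """Step 4, approachability-based variant: harder (less approachable) steps
    receive proportionally more of the available points."""
    difficulty_sum = sum(s["difficulty"] for s in steps)
    return {s["name"]: total_points * s["difficulty"] / difficulty_sum
            for s in steps}

def grade(student_proof: str, reference: str, rubric: dict[str, float]) -> float:
    """Step 5: detect errors directly or via contradictions with the reference,
    then credit only the rubric items the student demonstrably completed."""
    ...  # LLM call

def ref_grader(references: list[str], student_proof: str) -> float:
    clusters = cluster_references(references)
    cluster = match_cluster(clusters, student_proof)
    steps = analyze_reference(cluster[0])
    rubric = design_rubric(steps)
    return grade(student_proof, cluster[0], rubric)
```

Note how only the final step ever sees the student's proof together with a rubric; everything before it is about constructing the right rubric to grade against.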

Improved Performance and Robustness

The evaluation of RefGrader was conducted using a corpus of 90 Gemini 2.5 Pro–generated solutions for IMO Shortlist problems (graded on a 1–4 scale) and MathArena solution sets for IMO/USAMO 2025 (scored on a 0–7 scale). The results demonstrated that the proposed agentic workflows substantially improved upon single-turn grading across various metrics, including Pearson/Spearman correlations, MAE/RMSE, QWK, and AC2, indicating higher agreement with human grades and more consistent handling of partial credit.
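For readers who want to compute the same agreement metrics on their own grading runs, most are one-liners in standard Python libraries. Here is a toy example with made-up grades; AC2 (Gwet's agreement coefficient) is omitted because it is not in scipy or sklearn and typically requires a dedicated package such as irrCAC:

```python
# Toy example of the agreement metrics used in the evaluation, computed
# between human grades and model grades on an illustrative 0-7 scale.

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

human = np.array([4, 2, 0, 7, 3, 5])   # illustrative grades, not paper data
model = np.array([4, 3, 1, 7, 2, 5])

print("Pearson r :", pearsonr(human, model)[0])
print("Spearman r:", spearmanr(human, model)[0])
print("MAE       :", np.mean(np.abs(human - model)))
print("RMSE      :", np.sqrt(np.mean((human - model) ** 2)))
# QWK treats grades as ordinal and penalizes large disagreements more heavily.
print("QWK       :", cohen_kappa_score(human, model, weights="quadratic"))
```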

The study also found that even simply adding a similar reference solution (a three-step variant without explicit rubric generation) improved performance, but the full five-step workflows with detailed solution analysis and rubric design yielded the largest gains. The approachability-based and milestone-based rubric designs performed particularly well.

Beyond Grading: Future Implications

The RefGrader system offers several broader applications beyond automated grading. It can serve as an “LLM-as-a-judge” tool, providing transparent, step-referenced rationales for its scores. It could also function as a generative reward model for reinforcement learning, guiding LLMs toward more correct and complete proofs. In educational settings, given appropriate reference solutions, the approach could grade student work and provide interpretable feedback on missing steps and error types.

While these workflows might incur higher token consumption, many steps (like reference clustering, solution analysis, and rubric design) can be cached offline, making the overall cost manageable for online grading.
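As a rough illustration of that amortization, everything except matching and grading can be keyed by problem ID and computed once. The cache layout below, reusing the hypothetical helpers from the pipeline sketch above, is an assumption for illustration rather than the paper's implementation:

```python
# Sketch of offline caching: steps 1-4 depend only on the problem, not the
# submission, so their outputs can be computed once and reused. The on-disk
# layout and the helpers (from the pipeline sketch above) are assumptions.

import json
from pathlib import Path

CACHE_DIR = Path("refgrader_cache")

def problem_artifacts(problem_id: str, references: list[str]) -> dict:
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{problem_id}.json"
    if path.exists():                       # cache hit: no LLM calls needed
        return json.loads(path.read_text())
    clusters = cluster_references(references)            # step 1
    steps = [analyze_reference(c[0]) for c in clusters]  # step 3
    rubrics = [design_rubric(s) for s in steps]          # step 4
    artifacts = {"clusters": clusters, "steps": steps, "rubrics": rubrics}
    path.write_text(json.dumps(artifacts))
    return artifacts

# Per submission, only matching (step 2) and grading (step 5) need fresh calls.
```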

