TL;DR: Hard2Verify is a new, human-annotated benchmark designed to rigorously assess how well AI models verify mathematical proofs at the step level, especially on difficult, open-ended problems. Created with over 500 hours of expert labor, it evaluates verifiers on responses generated by frontier LLMs. The research found that proprietary models generally outperform open-source ones and that weaker verifiers struggle to identify errors at all. It also suggests that verifying a solution is often easier for LLMs than generating one, offering optimism for future progress in AI mathematical reasoning.
In the rapidly evolving world of artificial intelligence, large language models (LLMs) are achieving remarkable feats, even reaching gold medal-level performance in prestigious competitions such as the 2025 International Mathematical Olympiad (IMO). However, for these advanced AI systems to truly excel at complex, open-ended mathematical reasoning, they need equally sophisticated tools to verify their work, step by step. This is where the new Hard2Verify benchmark comes into play, aiming to rigorously test the capabilities of these AI verifiers.
Developed by Salesforce AI Research, Hard2Verify is a meticulously human-annotated benchmark designed to assess how well LLMs can identify errors in mathematical proofs generated by other frontier LLMs. The creation of this benchmark was a monumental effort, requiring over 500 hours of human labor from PhD-level math experts to annotate each step of model-generated solutions.
What makes Hard2Verify unique and particularly challenging? Firstly, it focuses on extremely difficult, open-ended math questions sourced from recent international competitions like the IMO and Putnam. Unlike simpler problems with single, easily verifiable answers, open-ended problems demand a deep understanding and rigorous step-by-step validation, making it harder for verifiers to ‘cheat’ by simply knowing the final answer.
Secondly, the responses evaluated in Hard2Verify are not artificially constructed or error-injected. Instead, they are naturally occurring outputs from highly capable, frontier-level LLMs such as GPT-5 (high), Gemini 2.5 Pro, and Claude Sonnet 4 (thinking). This ensures that the benchmark accurately reflects the types of mistakes verifiers would encounter in real-world applications.
Thirdly, the annotation process employs a strict grading philosophy: any step containing a mistake, or even derived from a previous incorrect step, is marked as incorrect. This mirrors the stringent standards of competitive mathematics, where an entire solution must be flawless to receive full credit.
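To make the rule concrete, here is a minimal Python sketch of how such forward error propagation could work, assuming each step records whether it is locally sound and which earlier steps it builds on (the function and field names are illustrative, not the benchmark's actual annotation schema):

```python
# Illustrative sketch of the grading rule: a step is graded correct only if
# it is locally sound AND every earlier step it builds on was graded correct.

def grade_steps(locally_sound: list[bool], builds_on: list[list[int]]) -> list[bool]:
    """builds_on[i] lists the indices (< i) of earlier steps that step i uses;
    errors propagate forward through those dependencies."""
    grades: list[bool] = []
    for i, sound in enumerate(locally_sound):
        inherited_ok = all(grades[j] for j in builds_on[i])
        grades.append(sound and inherited_ok)
    return grades

# Step 2 contains a mistake, so step 3, which builds on it, is also incorrect.
print(grade_steps([True, True, False, True], [[], [0], [1], [2]]))
# -> [True, True, False, False]
```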
The research evaluated 29 different generative critics and process reward models on Hard2Verify across three key tasks: step-level correctness, response-level correctness, and first error identification. The findings revealed a significant gap: proprietary models like GPT-5 and Gemini 2.5 Pro generally outperformed open-source verifiers. A striking observation was that weaker verifiers often struggled to identify errors, frequently marking almost every step as correct, indicating a fundamental inability to catch subtle mistakes.
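As a rough illustration of how the three tasks relate, the following hypothetical sketch scores a verifier's per-step predictions against gold step labels (the function and metric names are assumptions made for exposition, not the paper's actual evaluation code):

```python
# Toy scoring of one annotated response: gold[i] / pred[i] say whether step i
# is correct according to the human annotators / the verifier, respectively.

def score_response(gold: list[bool], pred: list[bool]) -> dict:
    # Step-level: fraction of steps where the verifier's label matches gold.
    step_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    # Response-level: does the verifier's overall verdict (flawless or not)
    # agree with the gold verdict?
    response_correct = all(pred) == all(gold)
    # First-error identification: only meaningful when the gold annotation
    # actually contains an error.
    first_gold = next((i for i, g in enumerate(gold) if not g), None)
    first_pred = next((i for i, p in enumerate(pred) if not p), None)
    return {
        "step_level_accuracy": step_acc,
        "response_level_correct": response_correct,
        "first_error_found": first_gold is not None and first_pred == first_gold,
    }

print(score_response(gold=[True, True, False, False], pred=[True, True, False, False]))
# -> {'step_level_accuracy': 1.0, 'response_level_correct': True, 'first_error_found': True}
```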
The study also explored how to scale verifier performance at test time: giving models a larger 'thinking' token budget in a single sequential pass significantly improved results, whereas parallel decoding (sampling multiple verdicts independently and aggregating them) had little impact. This suggests that deep, sequential inspection of a proof is more effective for verification than combining many shallow, independent judgments.
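The contrast between the two strategies can be sketched as follows. Here `verify_llm` is a mocked stand-in for a real verifier call, and the accuracy curve it uses is a toy assumption, so the snippet only illustrates the shape of each strategy, not the paper's measurements:

```python
import random
from collections import Counter

TRUE_VERDICT = "incorrect"  # pretend the solution under review contains an error

def verify_llm(problem: str, solution: str, max_tokens: int) -> str:
    """Mocked verifier call: a larger thinking budget makes the (simulated)
    judgment more likely to match the true verdict. Purely a toy model."""
    accuracy = min(0.95, 0.55 + max_tokens / 50_000)
    return TRUE_VERDICT if random.random() < accuracy else "correct"

def sequential_scaling(problem: str, solution: str, budget: int = 16_384) -> str:
    # One long pass: the verifier inspects the proof with a large token budget.
    return verify_llm(problem, solution, max_tokens=budget)

def parallel_scaling(problem: str, solution: str, n: int = 8) -> str:
    # Many short, independent passes, combined by majority vote per verdict.
    votes = [verify_llm(problem, solution, max_tokens=2_048) for _ in range(n)]
    return Counter(votes).most_common(1)[0][0]
```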
Intriguingly, the research also touched on the dynamics of self-verification and the fundamental question of whether verifying a solution is easier than producing one. The results indicated that, in general, LLMs are more successful at catching mistakes in a given solution than at generating an entirely error-free solution themselves. This offers a hopeful outlook for the future of AI in mathematics, suggesting that verifiers may not need to be as powerful as generators to reliably identify errors.
Hard2Verify represents a crucial step forward in developing more reliable and robust AI systems for complex mathematical reasoning. By providing a challenging and meticulously annotated benchmark, it pushes the frontier of what’s possible in AI verification. You can find more details about this research paper here: Hard2Verify Research Paper.