TLDR: A new benchmark series, FATE (comprising FATE-H and FATE-X), evaluates large language models (LLMs) on advanced formal algebra, moving beyond contest-style math to problems at and above PhD qualifying-exam difficulty. Current LLM provers fall far short: the best model reaches only 3% (pass@64) on FATE-H and 0% on FATE-X. The main bottleneck is the translation from natural language reasoning to formal code, where the most common errors are Mathlib hallucinations and Lean proficiency issues. General reasoning models sometimes outperform specialized provers thanks to better “effective reflection.” The research suggests decoupling natural language reasoning from formalization and improving meta-reasoning capabilities in AI.
Recent advancements in large language models (LLMs) have shown impressive capabilities in formal theorem proving, especially in contest-based mathematical benchmarks. However, these existing benchmarks often fall short in reflecting the complexity, breadth, and abstract nature of modern mathematical research.
To address this crucial gap, a new benchmark series called FATE (Formal Algebra Theorem Evaluation) has been introduced. This series aims to guide the development of AI towards advanced mathematical reasoning. The FATE series includes two new components: FATE-H (Formal Algebra Theorem Evaluation – Hard) and FATE-X (Formal Algebra Theorem Evaluation – Expert), each comprising 100 problems in abstract and commutative algebra. These benchmarks span a wide range of difficulty, from undergraduate exercises to problems that exceed PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library, a comprehensive mathematical library for the Lean proof assistant.
Evaluations of state-of-the-art LLM provers on the FATE benchmarks reveal a large gap relative to their success in contest-level mathematics. The best-performing model achieved only 3% accuracy (pass@64) on FATE-H and 0% on FATE-X. This drop shows how far current systems remain from research-level mathematical reasoning.
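To make the pass@64 figure concrete, here is a minimal sketch of the unbiased pass@k estimator commonly used in code-generation and theorem-proving evaluations. Whether the FATE authors computed their numbers exactly this way is an assumption; the function names and the per-problem counts in the usage note are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled attempts, c of them correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k):
    the probability that a random size-k subset of the n attempts
    contains at least one correct attempt.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem pass@k over a whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

When n equals k, as with 64 samples scored at pass@64, the estimator reduces to "was any attempt correct?". On a 100-problem benchmark, a model that solves exactly three problems at least once therefore scores 0.03, regardless of how many of its 64 attempts on those problems succeeded.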
The research employed a two-stage evaluation process, mirroring how models typically operate: first generating a natural language Chain-of-Thought (CoT), and then formalizing it into Lean code. The analysis showed that models are considerably more accurate in their natural-language reasoning than in formalizing that reasoning. This suggests that the primary bottleneck is not necessarily the mathematical understanding itself, but rather the translation from informal reasoning to precise formal code.
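The two-stage comparison can be sketched as a simple tally. This is an illustration only: the `Attempt` record, its field names, and the `stage_accuracies` helper are hypothetical, and the grading of each stage (human or automated review for the natural-language proof, the Lean checker for the formal one) is assumed to have happened elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    nl_correct: bool      # stage 1: natural-language proof judged correct
    formal_correct: bool  # stage 2: Lean code accepted by the proof checker


def stage_accuracies(attempts: list[Attempt]) -> dict[str, float]:
    """Compare accuracy at the reasoning stage with accuracy at the formal stage."""
    total = len(attempts)
    nl = sum(a.nl_correct for a in attempts)
    formal = sum(a.formal_correct for a in attempts)
    # Attempts whose reasoning was sound but whose Lean code failed.
    lost = sum(a.nl_correct and not a.formal_correct for a in attempts)
    return {
        "nl_accuracy": nl / total,
        "formal_accuracy": formal / total,
        "lost_in_formalization": lost / total,
    }
```

A large `lost_in_formalization` share relative to `nl_accuracy` is the pattern the paragraph above describes: the argument is right, but it does not survive translation into Lean.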
A systematic classification of common errors during the formalization process identified Mathlib hallucinations (generating non-existent or incorrectly used Lean theorems/definitions) and Lean proficiency issues (lack of understanding of Lean’s syntax, type system, or idiomatic proof structures) as the most frequent problems. Misalignment, where the formal proof contradicts the natural language reasoning, was found to be remarkably infrequent, further supporting the idea that the core mathematical reasoning is often sound.
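As a small illustration of what a Mathlib hallucination looks like, consider the Lean 4 sketch below. It is far simpler than any FATE problem and is not taken from the benchmark. The first proof cites the real Mathlib lemma `mul_comm`; the commented-out line cites a made-up name, chosen here only for illustration and assumed not to exist in Mathlib.

```lean
import Mathlib

-- Correct: `mul_comm` is a real Mathlib lemma and closes the goal.
example {G : Type*} [CommGroup G] (a b : G) : a * b = b * a :=
  mul_comm a b

-- Hallucination (illustrative): a plausible-sounding but invented lemma name.
-- Lean rejects the proof because the name cannot be resolved.
-- example {G : Type*} [CommGroup G] (a b : G) : a * b = b * a :=
--   CommGroup.mul_commutes a b
```

The mathematics in both versions is identical and correct. Only the reference to the library fails, which is why this class of error counts against formalization rather than against reasoning.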
A comparative study also explored the differences between general reasoning models (like DeepSeek-R1) and specialized theorem provers (like DeepSeek-Prover-V2). Specialized provers sometimes exhibit less effective “reflection” (the ability to locate, diagnose, and repair flaws in an argument) than general-purpose models, which can reduce their accuracy even at the natural-language reasoning stage. General models, while not perfect, showed a greater capacity for iterative correction and for adapting their reasoning paths.
The findings suggest two key directions for future research in automated theorem proving. First, an explicitly decoupled approach, in which a natural language prover is developed separately from an autoformalizer, could lead to significant improvements, since reasoning and formalization already appear to function as separate skills. Second, a critical open challenge is designing training methodologies that exploit the precise reward signals from formalization while also fostering essential meta-reasoning capabilities like “effective reflection.”
For more detailed information, you can read the full research paper here.


