TLDR: FormalML is a new Lean 4 benchmark of 4,937 problems from machine learning theory (optimization, probability) designed to evaluate large language models (LLMs) on “subgoal completion”: filling in missing steps in complex mathematical proofs. It exposes current LLM limitations in accuracy, efficiency, and premise retrieval in research-level contexts, while showing that expert-iteration training is a promising direction.
Large language models, or LLMs, have shown impressive capabilities in formal theorem proving, even achieving results comparable to human performance in some mathematical competitions. However, their role as practical assistants for mathematicians, particularly in filling in the missing steps within complex proofs—a task known as “subgoal completion”—has remained largely unexplored.
To address this gap, a new research paper introduces FormalML, a benchmark specifically designed to evaluate LLMs on this critical task. Built using Lean 4, a popular proof assistant, FormalML focuses on foundational theories in machine learning, including optimization and probability inequalities. The benchmark comprises 4,937 problems, each representing a short but non-trivial proof obligation that an LLM must complete.
What is Subgoal Completion?
Imagine a mathematician outlining the main steps of a complex proof. While the high-level reasoning might seem complete, formalizing it in a system like Lean 4 often reveals “holes” or missing technical details that the automated prover cannot bridge on its own. Subgoal completion is precisely about an AI system stepping in to fill these gaps, aligning the informal human sketch with the rigorous formal proof.
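To make this concrete, here is a minimal illustrative Lean 4 sketch (not taken from FormalML; it assumes Mathlib is available): the proof outline is in place, but one technical step is left as a hole that a prover would be asked to close.

```lean
import Mathlib

-- The outline of the proof is written out, but one step is left as `sorry`.
-- Subgoal completion means producing a proof for exactly that hole, given the
-- hypotheses accumulated at that point (e.g. `exact mul_nonneg hx hy` here).
example (x y : ℝ) (hx : 0 ≤ x) (hy : 0 ≤ y) : 0 ≤ x * y + x := by
  have h1 : 0 ≤ x * y := by
    sorry
  linarith
```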
How FormalML Was Created
The creators of FormalML developed a dedicated “translation tactic” within Lean 4. This tactic extracts subgoal problems from existing procedural-style proofs by recording the proof state before and after a specific reasoning step. Because the recorded states come from many different points in many different proofs, the resulting problems span a wide range of difficulty.
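As a hedged illustration of the idea (the lemma name and statement below are hypothetical, not drawn from the benchmark), a single `have` step inside a larger proof can be turned into a standalone problem by recording the hypotheses and the goal in force just before that step:

```lean
import Mathlib

-- Inside some larger proof, suppose the step
--   `have key : 0 ≤ x * y := mul_nonneg hx hy`
-- occurs. Recording the proof state just before it yields a self-contained
-- benchmark problem: the prover sees only the hypotheses and the goal.
theorem extracted_subgoal (x y : ℝ) (hx : 0 ≤ x) (hy : 0 ≤ y) : 0 ≤ x * y := by
  sorry
```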
The dataset itself is derived from two Lean 4 libraries: Optlib, which formalizes various optimization algorithms like gradient descent, and FoML, which focuses on generalization error bounds and probability inequalities. By leveraging these established libraries, FormalML provides a rich and relevant context for evaluating AI provers.
Key Challenges for AI Provers
The research highlights several persistent challenges for current LLM-based theorem provers when tackling subgoal completion:
- Complex Proof Context: LLMs struggle to fully grasp all accumulated hypotheses and intricate goals within lengthy proofs.
- Premise Retrieval: Accurately finding and applying relevant theorems and lemmas from vast mathematical libraries remains a significant hurdle (see the sketch after this list).
- Efficiency and “Overthinking”: Models trained for highly complex, competition-style problems sometimes “overthink” simpler subgoal tasks, leading to verbose and inefficient solutions.
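As a hedged example of the premise-retrieval challenge (assuming Mathlib is in scope; this problem is not from FormalML), closing even a one-line goal requires locating the right library lemma among many thousands:

```lean
import Mathlib

-- Closing this goal requires retrieving the triangle-inequality lemma
-- `abs_add` from Mathlib; search tactics such as `exact?` attempt this
-- lookup automatically, and LLM provers must do it implicitly.
example (a b : ℝ) : |a + b| ≤ |a| + |b| := by
  exact abs_add a b
```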
Evaluation and Findings
The paper presents a systematic evaluation of state-of-the-art LLM-based provers on FormalML. The results indicate that while models like DeepSeek-Prover-V2 show improved retrieval capabilities, their overall performance sharply declines on more difficult problems. Interestingly, “chain-of-thought” prompting, often effective in natural language reasoning, did not improve proof completion in this context and even reduced efficiency.
A promising finding was the effectiveness of “expert iteration,” a training approach that significantly enhanced performance, especially for simpler problems. This suggests that continuous refinement and specialized training can improve AI’s ability to assist in formal mathematics.
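In rough outline, expert iteration alternates between sampling proof attempts, verifying them, and training on the verified ones. The Python sketch below is a hypothetical skeleton of that loop, not the paper's implementation; every function name in it is a placeholder.

```python
from typing import List, Tuple

def sample_proofs(model, problem: str, n: int) -> List[str]:
    """Placeholder: ask the model for n candidate proof scripts for one problem."""
    return []

def lean_verifies(problem: str, proof: str) -> bool:
    """Placeholder: run the Lean 4 checker on the completed proof and report success."""
    return False

def fine_tune(model, data: List[Tuple[str, str]]):
    """Placeholder: fine-tune the model on verified (problem, proof) pairs."""
    return model

def expert_iteration(model, problems: List[str], rounds: int = 3, samples: int = 8):
    """Alternate between generating proofs, filtering by verification, and training."""
    for _ in range(rounds):
        verified = [
            (p, proof)
            for p in problems
            for proof in sample_proofs(model, p, samples)
            if lean_verifies(p, proof)
        ]
        model = fine_tune(model, verified)  # train only on the model's own verified proofs
    return model
```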
Looking Ahead
FormalML represents a crucial step forward in developing AI systems that can genuinely act as copilots for mathematicians. By focusing on the practical task of subgoal completion within research-level contexts, this benchmark aims to accelerate the development of more accurate, efficient, and context-aware LLM-based theorem provers, ultimately easing the burden on human reviewers and advancing scientific discovery in machine learning and beyond.


