TLDR: FormalML is a new Lean 4 benchmark of 4,937 problems from machine learning theory (optimization, probability) designed to evaluate large language models (LLMs) on “subgoal completion”: filling in missing steps in complex mathematical proofs. It exposes current LLM limitations in accuracy, efficiency, and premise retrieval in research-level contexts, while showing that expert-iteration training is a promising direction.
Large language models, or LLMs, have shown impressive capabilities in formal theorem proving, even achieving results comparable to human performance in some mathematical competitions. However, their role as practical assistants for mathematicians, particularly in filling in the missing steps within complex proofs—a task known as “subgoal completion”—has remained largely unexplored.
To address this gap, a new research paper introduces FormalML, a benchmark specifically designed to evaluate LLMs on this critical task. Built using Lean 4, a popular proof assistant, FormalML focuses on foundational theories in machine learning, including optimization and probability inequalities. The benchmark comprises 4,937 problems, each representing a short but non-trivial proof obligation that an LLM must complete.
What is Subgoal Completion?
Imagine a mathematician outlining the main steps of a complex proof. While the high-level reasoning might seem complete, formalizing it in a system like Lean 4 often reveals “holes” or missing technical details that the automated prover cannot bridge on its own. Subgoal completion is precisely about an AI system stepping in to fill these gaps, aligning the informal human sketch with the rigorous formal proof.
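To make this concrete, here is a minimal illustrative Lean 4 sketch (not taken from FormalML; it assumes Mathlib is available): the proof outline is in place, but one technical step is left as a hole that a prover would be asked to close.

```lean
import Mathlib

-- The outline of the proof is written out, but one step is left as `sorry`.
-- Subgoal completion means producing a proof for exactly that hole, given the
-- hypotheses accumulated at that point (e.g. `exact mul_nonneg hx hy` here).
example (x y : ℝ) (hx : 0 ≤ x) (hy : 0 ≤ y) : 0 ≤ x * y + x := by
  have h1 : 0 ≤ x * y := by
    sorry
  linarith
```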
How FormalML Was Created
The creators of FormalML developed a dedicated “translation tactic” within Lean 4. This tactic extracts subgoal problems from existing procedural-style proofs by recording the proof state before and after a specific reasoning step. Because the recorded states come from many different points in many different proofs, the resulting problems span a wide range of difficulty.
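As a hedged illustration of the idea (the lemma name and statement below are hypothetical, not drawn from the benchmark), a single `have` step inside a larger proof can be turned into a standalone problem by recording the hypotheses and the goal in force just before that step:

```lean
import Mathlib

-- Inside some larger proof, suppose the step
--   `have key : 0 ≤ x * y := mul_nonneg hx hy`
-- occurs. Recording the proof state just before it yields a self-contained
-- benchmark problem: the prover sees only the hypotheses and the goal.
theorem extracted_subgoal (x y : ℝ) (hx : 0 ≤ x) (hy : 0 ≤ y) : 0 ≤ x * y := by
  sorry
```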
The dataset itself is derived from two Lean 4 libraries: Optlib, which formalizes various optimization algorithms like gradient descent, and FoML, which focuses on generalization error bounds and probability inequalities. By leveraging these established libraries, FormalML provides a rich and relevant context for evaluating AI provers.
Key Challenges for AI Provers
The research highlights several persistent challenges for current LLM-based theorem provers when tackling subgoal completion:
- Complex Proof Context: LLMs struggle to fully grasp all accumulated hypotheses and intricate goals within lengthy proofs.
- Premise Retrieval: Accurately finding and applying relevant theorems and lemmas from vast mathematical libraries remains a significant hurdle (see the sketch after this list).
- Efficiency and “Overthinking”: Models trained for highly complex, competition-style problems sometimes “overthink” simpler subgoal tasks, leading to verbose and inefficient solutions.
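As a hedged example of the premise-retrieval challenge (assuming Mathlib is in scope; this problem is not from FormalML), closing even a one-line goal requires locating the right library lemma among many thousands:

```lean
import Mathlib

-- Closing this goal requires retrieving the triangle-inequality lemma
-- `abs_add` from Mathlib; search tactics such as `exact?` attempt this
-- lookup automatically, and LLM provers must do it implicitly.
example (a b : ℝ) : |a + b| ≤ |a| + |b| := by
  exact abs_add a b
```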
Evaluation and Findings
The paper presents a systematic evaluation of state-of-the-art LLM-based provers on FormalML. The results indicate that while models like DeepSeek-Prover-V2 show improved retrieval capabilities, their overall performance sharply declines on more difficult problems. Interestingly, “chain-of-thought” prompting, often effective in natural language reasoning, did not improve proof completion in this context and even reduced efficiency.
A promising finding was the effectiveness of “expert iteration,” a training approach that significantly enhanced performance, especially for simpler problems. This suggests that continuous refinement and specialized training can improve AI’s ability to assist in formal mathematics.
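In rough outline, expert iteration alternates between sampling proof attempts, verifying them, and training on the verified ones. The Python sketch below is a hypothetical skeleton of that loop, not the paper's implementation; every function name in it is a placeholder.

```python
from typing import List, Tuple

def sample_proofs(model, problem: str, n: int) -> List[str]:
    """Placeholder: ask the model for n candidate proof scripts for one problem."""
    return []

def lean_verifies(problem: str, proof: str) -> bool:
    """Placeholder: run the Lean 4 checker on the completed proof and report success."""
    return False

def fine_tune(model, data: List[Tuple[str, str]]):
    """Placeholder: fine-tune the model on verified (problem, proof) pairs."""
    return model

def expert_iteration(model, problems: List[str], rounds: int = 3, samples: int = 8):
    """Alternate between generating proofs, filtering by verification, and training."""
    for _ in range(rounds):
        verified = [
            (p, proof)
            for p in problems
            for proof in sample_proofs(model, p, samples)
            if lean_verifies(p, proof)
        ]
        model = fine_tune(model, verified)  # train only on the model's own verified proofs
    return model
```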
Looking Ahead
FormalML represents a crucial step forward in developing AI systems that can genuinely act as copilots for mathematicians. By focusing on the practical task of subgoal completion within research-level contexts, this benchmark aims to accelerate the development of more accurate, efficient, and context-aware LLM-based theorem provers, ultimately easing the burden on human reviewers and advancing scientific discovery in machine learning and beyond.


