TLDR: A new research paper introduces a scalable method to create challenging formal theorem proving benchmarks for large language models (LLMs) using theoretical computer science (TCS). By generating problems from Busy Beaver and Mixed Boolean Arithmetic domains, the framework provides rigorously aligned formal (Lean4) and informal (Markdown) proof challenges. Evaluations reveal significant limitations in current LLMs, particularly their struggle with long-form proof generation and over-reliance on automated tactics, despite good performance on individual reasoning steps.
A recent research paper, titled Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs, introduces a novel approach to evaluating the reasoning capabilities of large language models (LLMs) in formal theorem proving. Authored by Terry Jingchen Zhang, Wenyuan Jiang, Rongchuan Liu, Yisong Wang, Junran Yang, Ning Wang, Nicole Ni, Yinya Huang, and Mrinmaya Sachan, the paper addresses the critical need for more robust and scalable benchmarks in this rapidly evolving field.
Formal theorem proving (FTP) is becoming increasingly vital for assessing how well LLMs can reason, because it allows mathematical proofs to be verified automatically, step by step. However, progress has been hampered by a shortage of suitable datasets: existing ones are often narrow in scope, expensive to create manually, and short on challenging problems that pair verified formal proofs with matching informal explanations.
Leveraging Theoretical Computer Science for New Challenges
The researchers propose a groundbreaking solution: using theoretical computer science (TCS) as a rich and scalable source of rigorous proof problems. TCS domains, with their precise algorithmic definitions, allow for the automated generation of an endless supply of challenging theorem-proof pairs. This method ensures that the problems are inherently rigorous and can be scaled to arbitrary complexity.
The paper demonstrates this approach using two specific TCS domains: Busy Beaver problems and Mixed Boolean Arithmetic (MBA) problems. Busy Beaver problems involve proving bounds on the behavior of halting Turing machines, such as how many steps a small machine can take before it halts, while MBA problems mix bitwise logical operations with ordinary arithmetic in a single expression, a combination widely used in software obfuscation.
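For intuition, here is the kind of identity an MBA generator can target: the classic rewrite of addition into XOR (the carry-free sum) plus the carry bits. The Lean 4 statement below is an illustrative sketch rather than an instance from the paper's benchmark; the `sorry` marks the proof obligation a prover model would have to discharge.

```lean
-- Illustrative MBA-style challenge (not from the paper's dataset):
-- addition rewritten through XOR (the carry-free sum) and AND (the carry bits).
-- The `sorry` is the proof obligation a prover model would need to fill in.
theorem add_as_xor_and (x y : Nat) :
    x + y = (x ^^^ y) + 2 * (x &&& y) := by
  sorry
```

Applying the same rewrite again to the terms on the right-hand side yields ever longer identities, which is one way a generator of this kind can scale difficulty.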
A Scalable Framework for Formal-Informal Pairs
The core of their work is a framework that automatically synthesizes these problems with parallel formal (Lean4) and informal (Markdown) specifications. This creates a streamlined pipeline for generating verified proof challenges that are resistant to data contamination, a common issue with benchmarks derived from public sources.
This framework offers fine-grained control over problem difficulty, enabling researchers to study how models generalize their reasoning across varying levels of complexity. It also helps eliminate the risk of models ‘hallucinating’ answers by grounding all problems in rigorous semantics validated by proof assistants.
Unveiling Gaps in Frontier Models
The evaluation of several advanced LLMs, including DeepSeekProver-V2-671B, Kimina-Prover-Preview, Goedel-Prover-SFT, and OpenAI-o4-mini, revealed significant limitations in their automated theorem proving abilities. For instance, DeepSeekProver-V2-671B achieved a 57.5% success rate on Busy Beaver problems but only 12% on Mixed Boolean Arithmetic problems. Smaller theorem provers often failed to solve even a single instance of these challenges.
These results highlight a substantial gap in the ability of LLMs to generate long-form proofs, even for problems that are computationally easy to verify. Paradoxically, some general-purpose LLMs, such as QwQ-32B, performed strongly compared to dedicated theorem provers, raising questions about how scaling affects different types of reasoning models.
Understanding Failure Modes
The research identified several key failure modes:
- Hallucination: Especially prevalent in smaller models, which attempted to invoke non-existent theorems or tactics.
- Tactic Misuse: Models often blindly applied automated proof tactics without fully understanding their applicability, leading to failed searches.
- Type Mismatch: Less common, but these fundamental errors showed that models failed to apply Lean’s typing rules correctly.
- Voluntary Give-Up: Models sometimes simply left a ‘sorry’ placeholder, indicating a failure to even attempt a solution.
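The last of these patterns is easy to picture. The fragment below is an illustrative sketch rather than actual model output from the paper: the statement is well-formed, but the proof is abandoned with `sorry`, and a hallucinated lemma name appears only in a comment because it would not elaborate.

```lean
-- Sketch (not model output from the paper): the statement is well-formed,
-- but the proof is abandoned with `sorry`, the "voluntary give-up" pattern.
theorem or_plus_and (x y : Nat) : (x ||| y) + (x &&& y) = x + y := by
  -- exact Nat.or_add_and x y   -- a hallucinated lemma name like this would fail to elaborate
  sorry
```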
The Reasoning Gap: Step-Level vs. Holistic Proofs
A crucial finding was the significant discrepancy between step-level performance and the ability to construct a complete proof. Models demonstrated near-perfect accuracy (98.88% for OpenAI-o4-mini) in selecting appropriate lemmas for individual reasoning steps. However, they exhibited catastrophic failure rates when trying to compose these steps into a complete, coherent proof.
This suggests that the primary bottleneck for LLMs lies not in performing local operations but in global proof planning and strategic integration. The inability to bridge stepwise correctness with end-to-end proof generation indicates that current models struggle with strategic planning in long contexts, particularly when orchestrating atomic transformations into a coherent proof strategy.
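The contrast is visible even on a toy example. The Lean 4 snippet below is an illustrative sketch, not material from the paper: naming the single lemma that closes one goal is routine, but a complete proof requires choosing and ordering several such steps.

```lean
-- Step-level skill: name the one lemma that closes a single goal.
example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Holistic skill: order and chain several such steps into a complete argument.
example (a b c : Nat) : (a + b) + c = (c + b) + a :=
  calc (a + b) + c = c + (a + b) := Nat.add_comm (a + b) c
    _ = c + (b + a) := by rw [Nat.add_comm a b]
    _ = (c + b) + a := (Nat.add_assoc c b a).symm
```

The generated challenges involve far longer chains of such steps, which is precisely where the evaluated models break down.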
Future Directions
The paper concludes by emphasizing the potential of theoretical computer science as a robust testbed for formal theorem proving. This new methodology offers infinite scalability and intrinsic resistance to data contamination, providing a sustainable evaluation ecosystem that can evolve with advancements in both AI and TCS. The authors call upon the Lean community to further explore this synergy to advance automated reasoning research.


