Leanabell-Prover-V2: Advancing Formal Theorem Proving with AI Self-Correction

TLDR: Leanabell-Prover-V2 is a new 7B large language model that significantly enhances formal theorem proving in Lean 4. It achieves this by integrating reinforcement learning with direct feedback from the Lean 4 verifier, allowing the model to identify and correct its own errors. This verifier-integrated approach leads to notable performance improvements on various mathematical benchmarks, particularly for smaller LLMs, making formal reasoning more accessible and reliable.

Formal theorem proving is a powerful method for ensuring the correctness of complex mathematical theorems, offering a level of certainty that traditional testing cannot match. Tools like Coq, Isabelle/HOL, and Lean are fundamental in this field, allowing mathematicians and engineers to create proofs that can be checked by machines. However, the process of formalizing and constructing these proofs often demands significant expertise and manual effort, which limits their broader adoption.

The emergence of Large Language Models (LLMs) with their reasoning capabilities has transformed many areas, including informal mathematical reasoning. Building on previous work, a new model called Leanabell-Prover-V2 has been introduced. This 7-billion-parameter LLM is designed to produce formal theorem proofs in Lean 4, incorporating a novel approach called verifier-integrated Long Chain-of-Thoughts (CoT).

How Leanabell-Prover-V2 Works

The core innovation in Leanabell-Prover-V2 lies in its upgraded Reinforcement Learning (RL) framework, which uses direct feedback from the Lean 4 verifier. This verifier feedback, whether indicating success or detailing specific errors, is crucial. It allows the LLM to become “self-aware” of the correctness of its own reasoning process and learn to correct errors reflexively. The model directly optimizes its reasoning pathways through multiple interactions with the verifier. To ensure stable RL training, a technique called feedback token masking is used, along with a simple reward strategy.

The development of Leanabell-Prover-V2 involved a “cold-start” training phase to equip the model with the ability to generate formatted CoT outputs for verifier interaction. This involved four main steps: generating initial proofs using existing prover LLMs, validating these proofs with the Lean 4 verifier, automatically correcting incorrect proofs using another advanced AI model (Claude-3.7-Sonnet) based on error logs, and then re-verifying the corrected proofs. This process created a high-quality dataset of “incorrect-corrected” pairs and “correct” data with synthesized long CoTs.

In the RL training phase, the model interacts with the Lean 4 environment. It generates a long CoT reasoning process, including an initial proof. If the proof compiles successfully, it’s considered correct. If it fails, the compilation error information is fed back to the model as a “reflection prompt,” prompting a new round of regeneration. Rewards are given based on compilation success or failure, guiding the model to learn effective proof generation and correction strategies.

Performance and Impact

Experiments show that Leanabell-Prover-V2 significantly improves performance. On the MiniF2F test set, it boosted performance by 3.2% (pass@128) when starting from Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) when starting from DeepSeek-Prover-V2-7B. This demonstrates that the verifier-integrated long CoT reasoning greatly enhances the model’s capabilities in formal theorem proving.

The model also showed strong generalization abilities, improving by 6.4% over the Kimina-Prover-Preview-Distill-7B baseline on college-level theorem proving problems in the ProofNet-test set. While it didn’t surpass DeepSeek-Prover-V2-7B’s overall performance on ProverBench, it successfully solved an additional challenging AIME 24&25 problem, indicating its potential for complex mathematical reasoning tasks.

The research highlights that integrating instant verification feedback within a multi-turn interaction framework is a promising approach for improving the reliability of formal proofs generated by smaller LLMs. This work marks a significant step towards making formal reasoning more accessible and dependable within language models. The source code, curated data, and models are openly available for the research community.

Also Read:

Looking Ahead

Despite its successes, the authors acknowledge limitations, particularly that the benefits may diminish for very strong base models or when dealing with extremely difficult problems due to limited active training data. Future work aims to explore subgoal decomposition via RL, partial rollout strategies for harder problems, and more fine-grained reward designs based on detailed verification feedback. Additionally, optimizing context pruning strategies in the iterative reasoning cycles could further enhance efficiency and reflective performance.

For more in-depth information, you can read the full research paper here: Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning.

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Leanabell-Prover-V2: Advancing Formal Theorem Proving with AI Self-Correction

How Leanabell-Prover-V2 Works

Performance and Impact

Looking Ahead

Gen AI News and Updates

Amazon Bedrock’s A2A Protocol: The Catalyst for Next-Gen Cross-Framework Multi-Agent AI Systems

AZTECH Introduces Comprehensive AI Training Series to Propel Regional Digital Transformation

Vesl AI Recognized for AI Infrastructure Innovation with ASOCIO Digital Summit Award

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Enhancing AI Reasoning: How Recursive Refinement and Multi-Agent Systems Improve Language Model Performance

ARGUS: A Proactive Framework for Enhancing Autonomous Driving Safety

Generative AI Powers Next-Gen Autonomous Emergency Response

OR-R1: Advancing Automated Optimization with Smart, Data-Efficient AI

Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

ProBench: A Deeper Look into How We Evaluate AI Agents for Mobile Apps

Enhancing Large Language Model Reasoning with Concise Outputs

Ensuring Trust in Autonomous AI: A Two-Layered Monitoring Approach for Agentic Systems

MedFuse: A Multiplicative Approach to Understanding Irregular Clinical Time Series Data

HyperD: A New Framework for More Accurate and Robust Traffic Predictions

Beyond Training: Researchers Propose ‘Model Raising’ for AI with Intrinsic Values

Bridging the Divide: Why AI Needs a Qualitative Revolution

Language Models Enhance Safety Certificate Synthesis for Dynamic Systems

Subscribe to get the latest news and updates