TLDR: Leanabell-Prover-V2 is a new 7B large language model that significantly enhances formal theorem proving in Lean 4. It achieves this by integrating reinforcement learning with direct feedback from the Lean 4 verifier, allowing the model to identify and correct its own errors. This verifier-integrated approach leads to notable performance improvements on various mathematical benchmarks, particularly for smaller LLMs, making formal reasoning more accessible and reliable.
Formal theorem proving is a powerful method for ensuring the correctness of complex mathematical theorems, offering a level of certainty that traditional testing cannot match. Tools like Coq, Isabelle/HOL, and Lean are fundamental in this field, allowing mathematicians and engineers to create proofs that can be checked by machines. However, the process of formalizing and constructing these proofs often demands significant expertise and manual effort, which limits their broader adoption.
The emergence of Large Language Models (LLMs) with their reasoning capabilities has transformed many areas, including informal mathematical reasoning. Building on previous work, a new model called Leanabell-Prover-V2 has been introduced. This 7-billion-parameter LLM is designed to produce formal theorem proofs in Lean 4, incorporating a novel approach called verifier-integrated Long Chain-of-Thoughts (CoT).
How Leanabell-Prover-V2 Works
The core innovation in Leanabell-Prover-V2 lies in its upgraded Reinforcement Learning (RL) framework, which uses direct feedback from the Lean 4 verifier. This verifier feedback, whether indicating success or detailing specific errors, is crucial. It allows the LLM to become “self-aware” of the correctness of its own reasoning process and learn to correct errors reflexively. The model directly optimizes its reasoning pathways through multiple interactions with the verifier. To ensure stable RL training, a technique called feedback token masking is used, along with a simple reward strategy.
The development of Leanabell-Prover-V2 involved a “cold-start” training phase to equip the model with the ability to generate formatted CoT outputs for verifier interaction. This involved four main steps: generating initial proofs using existing prover LLMs, validating these proofs with the Lean 4 verifier, automatically correcting incorrect proofs using another advanced AI model (Claude-3.7-Sonnet) based on error logs, and then re-verifying the corrected proofs. This process created a high-quality dataset of “incorrect-corrected” pairs and “correct” data with synthesized long CoTs.
In the RL training phase, the model interacts with the Lean 4 environment. It generates a long CoT reasoning process, including an initial proof. If the proof compiles successfully, it’s considered correct. If it fails, the compilation error information is fed back to the model as a “reflection prompt,” prompting a new round of regeneration. Rewards are given based on compilation success or failure, guiding the model to learn effective proof generation and correction strategies.
Performance and Impact
Experiments show that Leanabell-Prover-V2 significantly improves performance. On the MiniF2F test set, it boosted performance by 3.2% (pass@128) when starting from Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) when starting from DeepSeek-Prover-V2-7B. This demonstrates that the verifier-integrated long CoT reasoning greatly enhances the model’s capabilities in formal theorem proving.
The model also showed strong generalization abilities, improving by 6.4% over the Kimina-Prover-Preview-Distill-7B baseline on college-level theorem proving problems in the ProofNet-test set. While it didn’t surpass DeepSeek-Prover-V2-7B’s overall performance on ProverBench, it successfully solved an additional challenging AIME 24&25 problem, indicating its potential for complex mathematical reasoning tasks.
The research highlights that integrating instant verification feedback within a multi-turn interaction framework is a promising approach for improving the reliability of formal proofs generated by smaller LLMs. This work marks a significant step towards making formal reasoning more accessible and dependable within language models. The source code, curated data, and models are openly available for the research community.
Also Read:
- INoT: Empowering AI Agents with Internal Reflection for Enhanced Performance and Efficiency
- elsciRL: A New Open-Source Library for Language Integration in Reinforcement Learning
Looking Ahead
Despite its successes, the authors acknowledge limitations, particularly that the benefits may diminish for very strong base models or when dealing with extremely difficult problems due to limited active training data. Future work aims to explore subgoal decomposition via RL, partial rollout strategies for harder problems, and more fine-grained reward designs based on detailed verification feedback. Additionally, optimizing context pruning strategies in the iterative reasoning cycles could further enhance efficiency and reflective performance.
For more in-depth information, you can read the full research paper here: Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning.


