TLDR: A new research paper introduces ConjectureBench, a benchmark for evaluating how Large Language Models (LLMs) form mathematical conjectures, a critical but often overlooked step in autoformalisation. It shows that LLM autoformalisation performance is overestimated when conjectures are provided. The paper also proposes LEAN-FIRE, a method that combines informal and formal reasoning, significantly improving LLMs' ability to conjecture and achieving the first successful end-to-end autoformalisation of challenging PutnamBench problems whose solutions were previously withheld.
In the world of artificial intelligence and mathematics, a significant challenge lies in teaching computers to understand and formalize complex mathematical statements. This process, known as autoformalisation, aims to translate informal mathematical language into a precise, formal language that can be verified by proof assistants like Lean. However, a recent research paper highlights a crucial, often overlooked step in this process: conjecturing.
The paper, titled "Conjecturing: An Overlooked Step in Informal Mathematical Reasoning," argues that many mathematical problems cannot be directly formalized without first forming a conjecture—a proposed conclusion, answer, or bound. Large Language Models (LLMs), while powerful, struggle with autoformalisation, and their ability to conjecture has been poorly understood and evaluated.
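To make the gap concrete, consider how a "determine the value" problem looks in Lean 4. The snippet below is a hypothetical PutnamBench-style template, not taken from the paper (the names `solution` and `putnam_style` are illustrative): the theorem cannot even be stated until a conjectured answer fills the placeholder.

```lean
-- Hypothetical PutnamBench-style template (illustrative names, not from the paper).
-- Informal problem: "Evaluate 1 + 2 + ⋯ + n."
-- Without a conjectured closed form, `solution` has no definition and the
-- formal statement is incomplete.
abbrev solution : Nat → Nat := fun n => n * (n + 1) / 2  -- the conjecture

theorem putnam_style (n : Nat) :
    (List.range (n + 1)).sum = solution n := by
  sorry  -- proving is a separate task; conjecturing must come first
```

Only once the model commits to `fun n => n * (n + 1) / 2` does the statement become formalizable at all; handing that answer to the model up front is precisely what the paper argues inflates benchmark scores.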
Introducing ConjectureBench and New Metrics
To address this gap, the researchers introduced a new benchmark called ConjectureBench. This benchmark augments existing datasets like PutnamBench and CombiBench, ensuring that problems require models to generate conjectures rather than having them provided. They also developed new metrics: ConJudge, which uses an LLM to assess if a conjecture is correctly incorporated into a formal statement, and equiv_rfl, which checks for definitional equivalence in standalone conjecture generation.
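The paper describes `equiv_rfl` as a check for definitional equivalence. In Lean 4 that idea can be sketched by asking the kernel whether `rfl` proves the gold and generated answers equal (the names below are illustrative, not the benchmark's actual harness):

```lean
-- Illustrative definitional-equivalence check in the spirit of `equiv_rfl`
-- (our sketch; names are not from the benchmark).
def goldAnswer  : Nat := 4        -- reference conjecture
def modelAnswer : Nat := 2 + 2    -- model-generated conjecture

-- `rfl` succeeds exactly when the two terms are definitionally equal,
-- so attempting this proof automates the equivalence test.
example : modelAnswer = goldAnswer := rfl
```

A syntactic string match would reject `2 + 2` against `4`, while the kernel accepts it, which is why a definitional check is the more faithful way to score generated conjectures.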
Their evaluation of foundational LLMs, including GPT-4.1 and DeepSeek-V3.1, revealed a significant finding: the performance of autoformalisation models is often overestimated when the conjecture is assumed to be given. When models have to figure out the conjecture themselves, their performance drops substantially. This suggests that LLMs possess the necessary mathematical knowledge, but struggle with the reasoning process required to form a conjecture independently.
LEAN-FIRE: Guiding AI’s Mathematical Intuition
To improve this, the team designed an innovative inference-time method called LEAN-FIRE (Lean Formal-Informal Reasoning). This approach guides LLMs by interleaving natural language Chain-of-Thought (CoT) reasoning with formal Lean-of-Thought (LoT) steps. By breaking down problems informally and then translating those steps into precise Lean primitives, LEAN-FIRE helps models connect informal reasoning with formal mathematics more effectively.
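Under our reading of the paper, the interleaving pairs each natural-language step with a small, kernel-checkable Lean primitive. The sketch below is our illustration, not the paper's actual LEAN-FIRE prompt:

```lean
-- Illustrative CoT/LoT interleaving (our sketch, not the paper's prompt).

-- CoT: "Adding zero on the right should leave any natural number unchanged."
-- LoT: state that step as a checkable Lean primitive.
example (n : Nat) : n + 0 = n := Nat.add_zero n

-- CoT: "Adding a successor on the right pushes the successor outward."
-- LoT:
example (n m : Nat) : n + Nat.succ m = Nat.succ (n + m) := Nat.add_succ n m
```

Because each informal step is immediately pinned to a formal primitive, errors in the reasoning chain surface early instead of accumulating into a malformed final statement.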
LEAN-FIRE demonstrated significant improvements in conjecturing and autoformalisation. Notably, it achieved the first successful end-to-end autoformalisation of 13 PutnamBench “no-answer” problems with GPT-4.1 and 7 with DeepSeek-V3.1, where solutions were previously withheld. This breakthrough indicates that while LLMs have the latent knowledge, they require structured guidance to effectively conjecture and formalize mathematical problems.
The research also highlighted that standalone conjecture generation remains a significant challenge, with models often producing auxiliary constructs instead of the actual conjecture. This further emphasizes that conjecturing needs to be treated as an independent task, requiring dedicated research into richer datasets and improved inference techniques. The authors provide forward-looking guidance, urging future research to focus on improving conjecturing as a distinct and critical step in formal mathematical reasoning. You can read the full paper here.


