TLDR: A new research paper introduces ConjectureBench, a benchmark for evaluating how Large Language Models (LLMs) form mathematical conjectures, a critical but often overlooked step in autoformalisation. It shows that LLM autoformalisation performance is overestimated when conjectures are provided. The paper also proposes LEAN-FIRE, a method that combines informal and formal reasoning, significantly improving LLMs' ability to conjecture and achieving the first successful end-to-end autoformalisation of challenging PutnamBench problems whose solutions were previously withheld.
In the world of artificial intelligence and mathematics, a significant challenge lies in teaching computers to understand and formalize complex mathematical statements. This process, known as autoformalisation, aims to translate informal mathematical language into a precise, formal language that can be verified by proof assistants like Lean. However, a recent research paper highlights a crucial, often overlooked step in this process: conjecturing.
The paper, titled "Conjecturing: An Overlooked Step in Informal Mathematical Reasoning," argues that many mathematical problems cannot be directly formalized without first forming a conjecture—a proposed conclusion, answer, or bound. Large Language Models (LLMs), while powerful, struggle with autoformalisation, and their ability to conjecture has been poorly understood and evaluated.
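To make the gap concrete, consider how a "determine the value" problem looks in Lean 4. The snippet below is a hypothetical PutnamBench-style template, not taken from the paper (the names `solution` and `putnam_style` are illustrative): the theorem cannot even be stated until a conjectured answer fills the placeholder.

```lean
-- Hypothetical PutnamBench-style template (illustrative names, not from the paper).
-- Informal problem: "Evaluate 1 + 2 + ⋯ + n."
-- Without a conjectured closed form, `solution` has no definition and the
-- formal statement is incomplete.
abbrev solution : Nat → Nat := fun n => n * (n + 1) / 2  -- the conjecture

theorem putnam_style (n : Nat) :
    (List.range (n + 1)).sum = solution n := by
  sorry  -- proving is a separate task; conjecturing must come first
```

Only once the model commits to `fun n => n * (n + 1) / 2` does the statement become formalizable at all; handing that answer to the model up front is precisely what the paper argues inflates benchmark scores.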
Introducing ConjectureBench and New Metrics
To address this gap, the researchers introduced a new benchmark called ConjectureBench. This benchmark augments existing datasets like PutnamBench and CombiBench, ensuring that problems require models to generate conjectures rather than having them provided. They also developed new metrics: ConJudge, which uses an LLM to assess if a conjecture is correctly incorporated into a formal statement, and equiv_rfl, which checks for definitional equivalence in standalone conjecture generation.
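The paper describes `equiv_rfl` as a check for definitional equivalence. In Lean 4 that idea can be sketched by asking the kernel whether `rfl` proves the gold and generated answers equal (the names below are illustrative, not the benchmark's actual harness):

```lean
-- Illustrative definitional-equivalence check in the spirit of `equiv_rfl`
-- (our sketch; names are not from the benchmark).
def goldAnswer  : Nat := 4        -- reference conjecture
def modelAnswer : Nat := 2 + 2    -- model-generated conjecture

-- `rfl` succeeds exactly when the two terms are definitionally equal,
-- so attempting this proof automates the equivalence test.
example : modelAnswer = goldAnswer := rfl
```

A syntactic string match would reject `2 + 2` against `4`, while the kernel accepts it, which is why a definitional check is the more faithful way to score generated conjectures.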
Their evaluation of foundational LLMs, including GPT-4.1 and DeepSeek-V3.1, revealed a significant finding: the performance of autoformalisation models is often overestimated when the conjecture is assumed to be given. When models have to figure out the conjecture themselves, their performance drops substantially. This suggests that LLMs possess the necessary mathematical knowledge, but struggle with the reasoning process required to form a conjecture independently.
LEAN-FIRE: Guiding AI’s Mathematical Intuition
To improve this, the team designed an innovative inference-time method called LEAN-FIRE (Lean Formal-Informal Reasoning). This approach guides LLMs by interleaving natural language Chain-of-Thought (CoT) reasoning with formal Lean-of-Thought (LoT) steps. By breaking down problems informally and then translating those steps into precise Lean primitives, LEAN-FIRE helps models connect informal reasoning with formal mathematics more effectively.
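Under our reading of the paper, the interleaving pairs each natural-language step with a small, kernel-checkable Lean primitive. The sketch below is our illustration, not the paper's actual LEAN-FIRE prompt:

```lean
-- Illustrative CoT/LoT interleaving (our sketch, not the paper's prompt).

-- CoT: "Adding zero on the right should leave any natural number unchanged."
-- LoT: state that step as a checkable Lean primitive.
example (n : Nat) : n + 0 = n := Nat.add_zero n

-- CoT: "Adding a successor on the right pushes the successor outward."
-- LoT:
example (n m : Nat) : n + Nat.succ m = Nat.succ (n + m) := Nat.add_succ n m
```

Because each informal step is immediately pinned to a formal primitive, errors in the reasoning chain surface early instead of accumulating into a malformed final statement.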
LEAN-FIRE demonstrated significant improvements in conjecturing and autoformalisation. Notably, it achieved the first successful end-to-end autoformalisation of 13 PutnamBench “no-answer” problems with GPT-4.1 and 7 with DeepSeek-V3.1, where solutions were previously withheld. This breakthrough indicates that while LLMs have the latent knowledge, they require structured guidance to effectively conjecture and formalize mathematical problems.
The research also highlighted that standalone conjecture generation remains a significant challenge, with models often producing auxiliary constructs instead of the actual conjecture. This further emphasizes that conjecturing needs to be treated as an independent task, requiring dedicated research into richer datasets and improved inference techniques. The authors provide forward-looking guidance, urging future research to focus on improving conjecturing as a distinct and critical step in formal mathematical reasoning. You can read the full paper here.


