
AI Models Advance Mathematical Proofs and Discoveries

TLDR: A new research paper introduces a protocol that uses GPT-5 models as collaborative provers and verifiers for mathematical theorems. The system, which includes a crucial human verification step, solved five of the six 2025 International Mathematical Olympiad problems, closed twenty-two open conjectures on cyclic numbers (sixteen of them previously unsettled), and discovered new theorems in machine learning, graph theory, and sphere packing. The methodology counters LLM hallucinations by having a human certify that the formal theorem statement matches its natural-language description.

A recent research paper titled “Mathematics with large language models as provers and verifiers” by Hieu Le Duc and Leo Liberti explores a novel approach to mathematical theorem proving using advanced artificial intelligence. The paper details a protocol that leverages large language models (LLMs) to tackle complex mathematical problems, including those from the International Mathematical Olympiad (IMO) and open conjectures in number theory.

The core of their methodology involves a sophisticated interaction between different instances of the GPT-5 model, acting as collaborative ‘provers’ and ‘verifiers’. This system, designed with minimal human intervention, aims to overcome the inherent challenges of LLMs, such as the tendency to ‘hallucinate’ or produce incorrect information. The protocol, termed Test-Time Verify-Revise (TTVR), mimics the human process of problem-solving: proposing a solution, checking it, correcting it, and repeating until a valid proof is found.

In this TTVR loop, a ‘prover’ LLM proposes a proof, which is then evaluated by a ‘verifier’ LLM. If the proof is rejected, the verifier explains the issues, and the prover uses that feedback to revise its attempt. The loop repeats for a fixed number of rounds. Once the verifier accepts a proof, it is formalized in a proof assistant language, specifically Lean. A crucial final step has a human verify that the formal statement of the theorem (premises and conclusion) matches its natural-language description. This human check, though minimal, guarantees the correctness of the formal proof even if the LLM hallucinated while generating it.
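The loop described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the `prover` and `verifier` callables here are toy stand-ins for what, in the actual protocol, would be GPT-5 API calls with role-specific prompts, and the function names and round budget are hypothetical.

```python
# Minimal sketch of the Test-Time Verify-Revise (TTVR) loop described above.
# The prover/verifier are stub functions standing in for GPT-5 calls; the
# real protocol would issue API requests with role-specific prompts.

def ttvr(prover, verifier, problem, max_rounds=5):
    """Iterate propose -> verify -> revise until a proof is accepted
    or the round budget is exhausted."""
    feedback = None
    for _ in range(max_rounds):
        attempt = prover(problem, feedback)              # propose (or revise) a proof
        accepted, feedback = verifier(problem, attempt)  # check it, get feedback
        if accepted:
            return attempt                               # hand off to Lean formalization
    return None                                          # budget exhausted, no valid proof

# Toy stand-ins: the "prover" needs two tries, the "verifier" rejects the first.
def toy_prover(problem, feedback):
    return "proof-v2" if feedback else "proof-v1"

def toy_verifier(problem, attempt):
    if attempt == "proof-v1":
        return False, "gap in the induction step"        # rejection with feedback
    return True, None

print(ttvr(toy_prover, toy_verifier, "sum of first n odd numbers"))  # proof-v2
```

On this toy run, the first attempt is rejected with feedback and the revised attempt is accepted, mirroring the propose-check-correct cycle the paper describes.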

The researchers highlight that this protocol ensures the validity of the formal proof, even if the natural language proof or the initial formalization contained errors due to LLM hallucinations. The human’s role is to bridge the semantic gap between natural language and formal logic, ensuring that what the LLM *thinks* it proved matches the actual mathematical statement.
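To make that semantic check concrete, here is a hypothetical toy example (not a theorem from the paper) of the kind of Lean 4 statement, written against Mathlib, that a human would compare against its natural-language reading before trusting the machine-checked proof:

```lean
import Mathlib

-- The human checks that this formal statement really says
-- "the sum of two even integers is even" -- no stronger premises,
-- no weaker conclusion -- before accepting the proof below it.
theorem even_add_even (m n : ℤ) (hm : Even m) (hn : Even n) :
    Even (m + n) := by
  obtain ⟨a, ha⟩ := hm   -- m = a + a
  obtain ⟨b, hb⟩ := hn   -- n = b + b
  exact ⟨a + b, by rw [ha, hb]; ring⟩
```

Once the human is satisfied that the statement line matches the intended theorem, Lean's kernel checking the proof body is what delivers the correctness guarantee, regardless of what the LLM did along the way.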

Impressive Achievements

The system demonstrated remarkable capabilities across various mathematical domains. In the 2025 International Mathematical Olympiad, the protocol successfully solved five out of six problems, matching the performance of specialized LLMs developed by major AI labs. This is particularly noteworthy as the GPT-5 model used was trained on data only up to 2024, meaning it had no prior exposure to the 2025 IMO problems.

Beyond competitive problems, the system also made significant strides in number theory. It closed twenty-two out of sixty-six open conjectures on cyclic numbers from a recent paper, with sixteen of these being previously unsettled. The paper provides an example of a complex proof for the “Third Firoozbakht conjecture on cyclics” that the system managed to derive.

Furthermore, the research introduced an advanced “research mode” where the system, given a broad mathematical goal, could discover and prove new theorems. This mode involves additional LLM agents like a literature reviewer, context preparer, and predictor. In this mode, the system explored areas such as improving gradient descent in machine learning, the 1-2-3 conjecture in graph theory, and sphere packing in Euclidean spaces. It successfully settled several conjectures in these fields, proving some and refuting others, showcasing its ability to contribute to active research areas.

The paper also delves into the practical aspects of implementing such a system, discussing the use of OpenAI’s API and the careful crafting of prompts to define the behavior of the prover and verifier agents. It acknowledges challenges such as the probabilistic nature of LLM outputs, which affects reproducibility, and the time-consuming nature of human verification for converting natural language proofs into formal Lean code. The authors suggest that more intensive LLM training on formal code could alleviate this bottleneck.

This work represents a significant step forward in the application of large language models to formal mathematics, demonstrating their potential not only as problem-solvers but also as tools for discovering and verifying new mathematical knowledge. For more details, see the full paper.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]