Navigating the Mathematical Landscape: LLMs in Formal and Informal Reasoning

TLDR: This research paper explores the capabilities and limitations of Large Language Models (LLMs) in mathematical reasoning, distinguishing between informal problem-solving and formal theorem proving. It highlights why formal proving is significantly harder for LLMs than code generation, attributing it to the brittle nature of errors in formal logic compared to the more forgiving error landscape of programming. The paper also discusses whether LLMs truly possess a notion of computational state, suggesting they primarily rely on pattern recognition. It surveys current benchmarks and models, and proposes future directions such as adopting declarative proof styles and mixing reasoning approaches to enhance LLMs’ mathematical abilities.

Large Language Models (LLMs) have made significant strides in various tasks requiring structured reasoning, with their ability to generate and optimize code being a particularly notable success. This has naturally led to an exploration of their potential in the field of mathematics, encompassing both informal problem-solving and rigorous formal theorem proving. However, a key observation from recent research is that while LLMs excel in coding, progress in formal mathematics has been considerably more challenging, despite apparent similarities between writing code and constructing proofs.

A recent research paper, Thinking Machines: Mathematical Reasoning in the Age of LLMs, delves into this discrepancy, raising fundamental questions about how LLMs truly “reason,” how their learning is guided, and whether they genuinely track a notion of computational or deductive state. The authors, Andrea Asperti, Alberto Naibo, and Claudio Sacerdoti Coen, explore three core issues at the intersection of machine learning and mathematical cognition: the trade-offs between using formal versus informal mathematics as training data, the underlying reasons why proof generation is more difficult than code synthesis, and whether LLMs truly represent an evolving logical state or merely mimic patterns from their training data.

Formal vs. Informal Mathematics: A Key Distinction

The paper highlights a crucial distinction between informal and formal mathematics. Informal mathematics, as found in textbooks and most articles, is primarily written in natural language, incorporating symbols and logical connectors. It relies on human understanding and interpretation of reasoning steps, which are not always explicitly defined by a fixed set of rules. For example, a textbook proof might use phrases like “thus” or “therefore” to guide the reader, assuming a shared understanding of the logical progression.

In contrast, formal mathematics, typically generated using proof assistants like Coq, Isabelle, or Lean, employs a fully symbolic language with explicit, finite sets of rules for logical connectives and quantifiers. Every mathematical symbol is tied to a precise definition or axiom. Proofs are sequences of rule applications that are mechanically verifiable. This means a computer can definitively check if a formal proof is valid, unlike informal proofs which require human analysis and understanding.
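To make the contrast concrete, here is a minimal Lean 4 sketch (our illustration, not an example from the paper, and assuming a recent toolchain where the omega tactic is available) of the informal claim “the sum of two even numbers is even,” where every step is checked mechanically:

```lean
-- The informal claim "the sum of two even numbers is even", stated and
-- proved fully symbolically; Lean verifies each step against its rules.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha    -- unpack the witness for a
  obtain ⟨n, hn⟩ := hb    -- unpack the witness for b
  exact ⟨m + n, by omega⟩ -- a + b = 2 * (m + n), closed by linear arithmetic
```

Where a textbook would simply write “hence a + b is even,” the formal version must name the witness m + n and justify the arithmetic explicitly.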

How LLMs Learn Mathematics

The training of LLMs involves several stages. Initially, a model undergoes “pretraining” on vast amounts of unlabeled text data, learning syntax, semantics, and world knowledge. This makes the model versatile across different domains. Subsequently, “supervised fine-tuning” (SFT) adjusts the model’s weights on curated input-output pairs to match specific reasoning patterns or formats. For instance, an LLM might be fine-tuned on a dataset of Lean formal proofs to learn how to generate valid proof steps.
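As a rough illustration (ours, not the paper’s; field names and format are hypothetical), an SFT record for next-tactic prediction might pair a proof state with the step a prover should emit:

```python
# A hypothetical SFT training record for next-tactic prediction on Lean
# proofs. The field names and prompt format are illustrative only.
sft_example = {
    "prompt": (
        "theorem add_comm_ex (a b : Nat) : a + b = b + a\n"
        "current goal: a + b = b + a\n"
        "next tactic:"
    ),
    "completion": " rw [Nat.add_comm]",
}
# Fine-tuning minimizes the usual next-token loss on `completion`,
# conditioned on `prompt`, over many such (state, step) pairs.
```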

To further enhance performance, “reward model training” teaches a separate model to predict human preferences, acting as a proxy for human judgment. This reward model then guides “reinforcement learning” (RL), where the LLM learns to maximize a cumulative reward signal. In mathematical contexts, this reward could be based on whether a generated program passes tests or if a proof attempt succeeds with a proof assistant. Finally, “inference-time scaling” techniques, such as massive sampling and reranking, iterative refinement, or tool use, are employed to improve performance without altering the model’s core parameters.
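The simplest inference-time scaling strategy is best-of-n sampling against an external verifier. Here is a minimal runnable sketch (ours, with toy stand-ins for the model and the proof assistant, not any real system’s API):

```python
import random

# Best-of-n sampling with an external verifier: draw many candidate
# proofs, keep one that checks, and never update the model's weights.

TACTICS = ["rfl", "simp", "rw [Nat.add_comm]", "exact trivial"]

def sample_attempt(goal: str) -> str:
    """Toy stand-in for sampling one proof attempt from an LLM."""
    return random.choice(TACTICS)

def verifier_accepts(goal: str, attempt: str) -> bool:
    """Toy stand-in for a Lean/Coq/Isabelle check of the attempt."""
    return attempt == "rw [Nat.add_comm]"  # pretend only this closes the goal

def best_of_n(goal: str, n: int = 16) -> str | None:
    for _ in range(n):
        attempt = sample_attempt(goal)
        if verifier_accepts(goal, attempt):
            return attempt   # reranking degenerates to "first verified wins"
    return None              # all n samples failed verification

print(best_of_n("a + b = b + a"))
```

All of the gain here comes from search at inference time; the model itself is unchanged, which is exactly what distinguishes this family of techniques from fine-tuning or RL.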

Benchmarks and Progress

The paper surveys several recent and challenging datasets used to benchmark LLMs in mathematical reasoning. These include:

  • AIME 2024: A collection of challenging high school mathematics competition problems requiring multi-step symbolic reasoning. DeepSeek-R1 currently shows strong performance here.
  • PGPS9K: A large-scale benchmark for plane geometry problem-solving, emphasizing multimodal reasoning by combining natural language descriptions with diagrams. GOLD and PGPSNet are notable systems in this area.
  • miniF2F: Designed to evaluate formal mathematics problem-solving, it pairs natural language problems with formal statements in theorem provers like Lean. Kimina-Prover, DeepSeek-Prover, and Lyra are leading models in this domain.
  • FrontierMath: A cutting-edge benchmark with exceptionally challenging, original problems spanning modern mathematics, designed to avoid data contamination. Leading general-purpose LLMs like OpenAI’s o3 have shown some success, though still limited.

Why Proving is Harder Than Coding for LLMs

Despite the theoretical parallels between programming and proving (like the Curry-Howard correspondence), LLMs find formal proving significantly more difficult. The paper attributes this to the nature of errors and how systems respond to them. In programming, errors often provide useful, localized feedback; a program might fail some tests but pass others, allowing for incremental debugging. The “error landscape” is relatively smooth, facilitating exploration and refinement.

In contrast, formal proofs have a “brittle error surface.” A single invalid tactic can completely invalidate a proof, and the cause of an error might be far removed from its logical origin. There’s no concept of partial correctness; a proof is either valid or not. This makes learning from mistakes much harder, as failures offer little information on how to improve, often requiring a complete restructuring of the proof strategy.
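A toy Lean 4 illustration of this brittleness (ours, not from the paper): the same goal is either fully proved by one tactic or rejected wholesale by another, with no intermediate notion of “mostly correct”:

```lean
-- The proof is either accepted in full or rejected in full.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]      -- succeeds: the proof is complete

-- example (a b : Nat) : a + b = b + a := by
--   rw [Nat.mul_comm]   -- fails: no `*` in the goal; the entire proof is
--                       -- invalid, and the error says little about how to fix it
```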

LLMs and Computational State

A core question is whether LLMs truly possess a notion of computational state. For coding, LLMs can generate coherent programs and even diagnose errors, but this is largely due to their ability to recognize patterns of code usage and structure from vast training data. They don’t internally simulate program execution but predict tokens based on statistical likelihoods that reflect plausible state transitions. This leads to linguistic fluency in code without a deep, grounded understanding of its behavioral consequences.

This “superficial mastery” doesn’t translate well to formal reasoning, where correctness demands precise symbolic transformations. Informal mathematical reasoning, however, lies in a nuanced middle ground. While LLMs show strong capabilities in chain-of-thought reasoning for informal math, the paper suggests they still lack an internal notion of “proof distance” or “partial validity.” Feedback is delayed, as correctness can only be assessed after the entire argument is complete, making planning and learning from examples challenging.

Iterative Proof and Revision Loops

To compensate for LLMs’ lack of explicit internal state, researchers are exploring inference-time strategies that involve iterative interaction with external agents. These “interactive revision loops” involve the LLM generating a step, an external tool (like a proof assistant) verifying it, and then the LLM revising based on feedback. Systems like Kimina-Prover and DeepSeek-Prover use this generate-verify-retry structure.
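The loop itself is simple to sketch. The following is our own minimal, runnable illustration of the generate-verify-retry structure; llm_propose and checker are hypothetical toy stand-ins, not the APIs of Kimina-Prover, DeepSeek-Prover, or any real proof assistant:

```python
import random

# Generate-verify-retry: the model proposes a step, an external checker
# verifies it, and the model revises on failure.

TACTICS = ["rfl", "simp", "rw [Nat.add_comm]", "omega"]

def llm_propose(goal: str, rejected: set[str]) -> str:
    """Toy 'LLM': uses feedback only by avoiding already-rejected steps."""
    remaining = [t for t in TACTICS if t not in rejected]
    return random.choice(remaining if remaining else TACTICS)

def checker(goal: str, step: str) -> tuple[bool, str]:
    """Toy 'proof assistant': its feedback is essentially binary."""
    if step == "omega":
        return True, "goals accomplished"
    return False, "tactic failed"  # no explanation of why it failed

def revision_loop(goal: str, max_rounds: int = 8) -> str | None:
    rejected: set[str] = set()
    for _ in range(max_rounds):
        step = llm_propose(goal, rejected)  # generate
        ok, _msg = checker(goal, step)      # verify
        if ok:
            return step
        rejected.add(step)                  # revise: retry, informed by failure
    return None

print(revision_loop("toy goal"))
```

Note how little the checker’s message tells the model: this binary signal is precisely the weakness discussed next.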

However, these fine-grained loops can be fragile and computationally expensive, as they often don’t align perfectly with how LLMs are trained (next-token prediction on continuous text). The feedback from proof assistants is often binary (success/failure), providing little insight into why a failure occurred. An alternative, “intra-pass self-correction,” involves the model detecting and correcting errors within a single generation, but this is still in early stages and can lead to “performative self-correction,” where the model inserts errors just to show it can fix them.

Future Directions

The paper concludes that while informal mathematics aligns well with current LLM capabilities due to its narrative style and data availability, formal mathematics presents unique challenges. A key area for future development is the adoption of “declarative” proof styles, which explicitly state intermediate goals and justifications, making them more aligned with LLM generation and human-readable. This contrasts with the current “procedural” style of many proof assistants, which relies on low-level tactics.
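A small Lean 4 sketch of the two styles (our illustration, not the paper’s): the procedural version hides the intermediate goals inside the assistant’s state, while the declarative calc version states them explicitly, much as an informal proof would:

```lean
-- Procedural style: low-level tactics; intermediate goals stay implicit.
example (a b c : Nat) (hab : a = b) (hbc : b = c) : a + 1 = c + 1 := by
  rw [hab, hbc]

-- Declarative style: each intermediate goal is written out, which is
-- closer to how informal proofs (and LLM-generated text) read.
example (a b c : Nat) (hab : a = b) (hbc : b = c) : a + 1 = c + 1 := by
  calc a + 1 = b + 1 := by rw [hab]
    _ = c + 1 := by rw [hbc]
```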

Another promising direction is teaching LLMs to switch between and mix forward and backward reasoning styles. Forward reasoning proceeds from premises to conclusion, while backward reasoning starts from the goal and works backward. Flexibility in these styles could lead to better proof planning, hierarchical problem decomposition, and more interpretable reasoning. Ultimately, the goal is to enable LLMs to “think like a mathematician” rather than merely “think like a prover,” bridging the gap between human intuition and formal rigor.
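To ground the distinction, here is a minimal Lean 4 sketch (ours, not from the paper) of the same fact proved in both directions:

```lean
-- Backward reasoning: start from the goal q and reduce it to the premise p.
example (p q : Prop) (hp : p) (hpq : p → q) : q := by
  apply hpq  -- the goal q becomes the subgoal p
  exact hp

-- Forward reasoning: start from the premises and build toward the goal.
example (p q : Prop) (hp : p) (hpq : p → q) : q := by
  have hq : q := hpq hp  -- derive q directly from the hypotheses
  exact hq
```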

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
