TLDR: TopoAlign is a framework that addresses the scarcity of training data for Large Language Models (LLMs) in autoformalisation (translating informal math to formal statements). It achieves this by structurally aligning widely available code repositories with formal mathematics. By decomposing code into docstrings, main functions, and dependency functions, and then reassembling them to mirror formal statements, TopoAlign creates a new training dataset. This method, combined with a “code autoformalisation” task, significantly improves the performance of Math LLMs like DEEPSEEK-MATH and HERALD on various autoformalisation benchmarks, demonstrating that code data can effectively transfer problem-solving skills and structural knowledge to mathematical reasoning.
Large Language Models (LLMs) have shown impressive capabilities in various domains, including mathematical reasoning. However, a significant hurdle remains in a task called autoformalisation – the process of translating informal mathematical statements into rigorous, machine-verifiable formal languages like Lean 4. This challenge primarily stems from a scarcity of large-scale, high-quality datasets that pair informal problem descriptions with their formal counterparts.
To address this data bottleneck, researchers have introduced a novel framework called TopoAlign. This innovative approach leverages the vast and readily available resources of code repositories to enhance the training of Math LLMs. The core idea is to recognize and exploit the structural similarities between programming code and formal mathematical statements, enabling LLMs to learn compositional patterns that are crucial for autoformalisation.
How TopoAlign Works
TopoAlign operates by topologically decomposing programming code at the function level into three key components. First, the ‘docstring’ of a function is identified as an analogue to an informal problem statement in mathematics. Second, the ‘main function body’ serves as a proxy for the formal mathematical statement itself. Third, ‘dependency functions’ – the auxiliary functions called by the main function – are mapped to supporting lemmata or theorems found in mathematical libraries like Mathlib.
To achieve this decomposition, TopoAlign employs a topological dependency parser. This parser performs a breadth-first search to build function-level dependency graphs, tracing function calls and definitions across files. This allows it to capture intricate intra-file functional relationships, going beyond simpler file-level dependency extraction methods.
Recognizing that standard docstrings might not always perfectly align with the needs of autoformalisation, TopoAlign also generates concise summaries of each main function’s logic using an LLM. This ensures that the extracted informal descriptions are more fitting analogues to mathematical problem statements. Furthermore, the framework carefully selects code repositories whose hierarchical dependency patterns, in terms of tree depth and sibling counts, closely resemble those found in formal mathematics, ensuring a better structural match.
Code Autoformalisation: A New Training Task
Building on the structurally aligned code, TopoAlign introduces a new training task called “code autoformalisation” (CAF). This task emulates the autoformalisation process directly on code data. By treating each aligned code instance as a formalisation scenario, the model learns structural patterns and transfers general problem-solving strategies acquired from programming to mathematical contexts.
Math LLMs are trained using a mixture of TopoAlign code data and formal mathematical data. This combined approach integrates both mathematical knowledge and structure with problem-solving capabilities from code. During training, the model is given a problem description (either a docstring for code or an informal statement for math) and its dependencies, and is tasked with generating the corresponding solution (the main function for code or the formal statement for math).
Also Read:
- DRIFT: A Framework for Enhancing AI’s Understanding and Formalization of Mathematics
- Ax-Prover: Advancing Automated Theorem Proving Across Science with LLM Agents
Performance and Insights
The researchers evaluated TopoAlign by training two state-of-the-art models, DEEPSEEK-MATH and HERALD, and testing them on several Lean 4 autoformalisation benchmarks, including MiniF2F, Putnam, and ProofNet. The results were compelling: TopoAlign consistently delivered substantial performance gains for both models across most datasets, particularly in bidirectional equivalence (BEq), a metric that assesses semantic fidelity.
For instance, DEEPSEEK-MATH saw its BEq@1 score on the MiniF2F-valid dataset increase significantly. Even HERALD, a model already specialized in autoformalisation, benefited from TopoAlign, showing improvements in BEq@1 on MiniF2F-valid and MiniF2F-test. These findings underscore that structurally aligned code data effectively transfers both problem-solving skills and structural knowledge from code to mathematical autoformalisation tasks.
An ablation study further revealed the importance of a balanced mix of code and mathematical data during training. A 50/50 ratio consistently yielded optimal performance, suggesting that mathematical data enhances syntactic accuracy, while code data boosts problem-solving capabilities. Interestingly, a qualitative error analysis highlighted that a common issue was the incorrect assignment of variable types (e.g., integer vs. natural number), possibly due to Python’s less strict typing system in the code data. This suggests future work could explore programming languages with stricter typing to mitigate such errors.
In conclusion, TopoAlign demonstrates that widely available code repositories are a valuable, previously untapped resource for developing more capable Math LLMs. By structurally aligning code with formal mathematics, this framework provides a scalable solution to the scarcity of mathematical training data, significantly improving autoformalisation performance. For more details, you can refer to the original research paper: TOPOALIGN: A FRAMEWORK FOR ALIGNING CODE TO MATH VIA TOPOLOGICAL DECOMPOSITION.


