Putnam-AXIOM: A New Benchmark Reveals LLM Mathematical Reasoning Gaps

TLDR: Putnam-AXIOM is a new benchmark of 522 university-level math problems and 100 functional variations designed to test LLMs’ advanced mathematical reasoning and combat data contamination. Initial results show significant accuracy drops on variations for top models like o1-preview, suggesting reliance on memorization over true reasoning. The benchmark also introduces Teacher-Forced Accuracy (TFA) to evaluate reasoning steps, providing a more comprehensive assessment of LLM capabilities.

Large Language Models (LLMs) have shown impressive capabilities in various fields, including complex problem-solving. However, existing mathematical reasoning benchmarks are approaching saturation, with many models now achieving very high accuracy, sometimes over 90%. These scores are further complicated by “data contamination,” where models might perform well simply because they’ve memorized answers from training data that included these benchmarks.

To address these challenges, researchers from Stanford University have introduced a new benchmark called Putnam-AXIOM. This benchmark is designed to rigorously evaluate the advanced mathematical reasoning abilities of LLMs. It comprises 522 university-level competition problems taken from the prestigious William Lowell Putnam Mathematical Competition, known for its demanding problems that require deep mathematical insight.

A key innovation of Putnam-AXIOM is the “Putnam-AXIOM Variation” dataset. This companion set includes 100 functional variants of the original problems. These variants are generated programmatically by subtly changing variables and constants within the problems. This method creates an unlimited stream of new, equally difficult, and unseen problems, making the benchmark highly resistant to data contamination. The idea is that if an LLM has truly learned to reason, it should be able to solve these variations just as well as the original problems, rather than relying on memorized solutions.
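To make the idea concrete, here is a minimal Python sketch of a functional variation. It is not taken from the Putnam-AXIOM code: the toy problem, the `Problem` class, the `make_variation` function, and the sampling range are all hypothetical. The point it illustrates is that the constant is re-sampled on every call and the answer is recomputed from that same constant, so each variant is new but stays consistent with its solution.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    n: int          # the sampled constant (hypothetical parameter)
    statement: str  # problem text with the constant filled in
    answer: int     # ground-truth answer recomputed for this constant


def make_variation(rng: random.Random) -> Problem:
    """Return a fresh variant of a toy problem (illustrative only)."""
    # Re-sample the constant that a static benchmark would have fixed.
    n = rng.randint(10, 99)
    statement = (
        f"Find the sum of all positive integers k <= {n} "
        "that are divisible by 3."
    )
    # Recompute the answer from the sampled constant so the two never drift apart.
    answer = sum(k for k in range(1, n + 1) if k % 3 == 0)
    return Problem(n, statement, answer)


rng = random.Random(0)  # seeded so the generated variants are reproducible
for _ in range(3):
    variant = make_variation(rng)
    print(variant.statement, "->", variant.answer)
```

A model that has only memorized the answer to one fixed value of the constant gains nothing here, while a model that knows the method can solve every variant.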

Initial evaluations on the Putnam-AXIOM Original set revealed that even the strongest models struggled significantly. For instance, OpenAI’s o1-preview, the top-performing model evaluated, scored only 41.9%. When tested on the paired Variations, its accuracy dropped by a substantial 19.6 percentage points (a 46.8% relative decrease). This consistent downward trend was observed across eighteen other models, with ten showing statistically significant differences, strongly suggesting that memorization plays a role in their performance on static benchmarks.
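The two figures quoted for o1-preview describe the same drop in different units, which the short Python check below makes explicit. The inputs are the numbers reported above; the implied accuracy on the variations is a derived value and assumes that both figures refer to the same set of paired problems.

```python
original_acc = 41.9   # o1-preview accuracy on the originals, in percent (from the article)
absolute_drop = 19.6  # drop on the paired variations, in percentage points (from the article)

# Relative decrease: the drop expressed as a share of the original accuracy.
relative_drop = absolute_drop / original_acc * 100

# Implied accuracy on the variations (derived here, not quoted in the article).
variation_acc = original_acc - absolute_drop

print(f"relative decrease: {relative_drop:.1f}%")
print(f"implied variation accuracy: {variation_acc:.1f}%")
```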

Beyond traditional “boxed” accuracy (where only the final answer is checked), Putnam-AXIOM also introduces Teacher-Forced Accuracy (TFA). This is a lightweight metric that directly scores the reasoning steps provided by the LLM, automating the evaluation of natural language proofs. TFA helps to assess the actual reasoning process, rather than just the final outcome, which is crucial for complex mathematical problems where a correct final answer might sometimes be achieved through flawed reasoning or even random chance.

The researchers highlight that current evaluation metrics often fall short because they only focus on the final answer, ignoring the reasoning process. For problems with limited possible answers (like true/false), models can get lucky. TFA aims to provide a more complete picture of an LLM’s reasoning abilities by checking if the model predicts each step of a reference solution correctly when “teacher-forced” with the ground truth up to that point.
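The teacher-forcing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's implementation: it scores at the token level, `predict_next` is a hypothetical callable standing in for an LLM's most likely next token, and `toy_model` is a made-up lookup table used only so the example runs.

```python
from typing import Callable, Sequence


def teacher_forced_accuracy(
    predict_next: Callable[[Sequence[str]], str],
    prompt: Sequence[str],
    reference: Sequence[str],
) -> float:
    """Fraction of reference tokens predicted correctly given the true prefix."""
    if not reference:
        raise ValueError("reference solution must contain at least one token")
    correct = 0
    for i, target in enumerate(reference):
        # Teacher forcing: the context is the prompt plus the ground-truth
        # tokens so far, never the model's own earlier predictions.
        context = list(prompt) + list(reference[:i])
        if predict_next(context) == target:
            correct += 1
    return correct / len(reference)


def toy_model(context: Sequence[str]) -> str:
    """Hypothetical stand-in for an LLM: a fixed lookup on the last token."""
    table = {"=": "x", "x": "+", "+": "1"}
    return table.get(context[-1], "?")


prompt = ["y", "="]
reference = ["x", "+", "2"]
score = teacher_forced_accuracy(toy_model, prompt, reference)
print(f"TFA = {score:.3f}")
```

Because every prediction is conditioned on the reference prefix, one early mistake does not derail the later positions, which keeps the metric cheap to compute and easy to compare across models.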

The Putnam-AXIOM dataset covers a wide range of university-level mathematics topics, including Geometry, Algebra, Trigonometry, Calculus, Linear Algebra, Combinatorics, Probability, Number Theory, Complex Numbers, Differential Equations, and Analysis. To enable automated evaluation, problems were selected or modified to yield a unique, numerically evaluable final answer. This involved adding a trivial next step to some original problems that previously required elaborate proofs, ensuring a single “boxable” answer while preserving the problem’s core difficulty.
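For readers unfamiliar with “boxed” answers, the Python sketch below shows one way a final answer could be pulled out of a model's response and compared with a reference. It is an assumption-laden illustration: the helper names `extract_boxed` and `is_correct` are hypothetical, and the comparison is a plain string match after removing spaces, whereas a real grader would likely need to recognize mathematically equivalent forms.

```python
from typing import Optional

BOX_OPEN = "\\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed expression in text, or None."""
    start = text.rfind(BOX_OPEN)
    if start == -1:
        return None
    content_start = start + len(BOX_OPEN)
    depth = 1
    # Walk forward and track brace depth so nested braces such as
    # \frac{1}{2} inside the box are handled correctly.
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # braces never balanced


def is_correct(model_output: str, reference: str) -> bool:
    """Naive check: exact string match after stripping spaces."""
    answer = extract_boxed(model_output)
    if answer is None:
        return False
    return answer.replace(" ", "") == reference.replace(" ", "")


response = r"Summing the series gives \boxed{\frac{1}{2}}."
print(extract_boxed(response))
print(is_correct(response, r"\frac{1}{2}"))
```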

The findings from Putnam-AXIOM have significant implications for the development and evaluation of LLMs. The observed accuracy drop on the variation set indicates that many current LLMs still rely on memorized information rather than genuine mathematical reasoning. This suggests that high scores on static benchmarks might overstate a model’s true capabilities. The researchers recommend that future evaluations include dynamic or contamination-checked datasets like Putnam-AXIOM Variations to get a more accurate understanding of LLM progress.

This new benchmark provides a rigorous and contamination-resilient framework for assessing advanced mathematical reasoning in LLMs. The data and evaluation code are publicly available, encouraging further research and development in this critical area. You can find more details about this research paper here: Putnam-AXIOM Research Paper.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
