Putnam-AXIOM: A New Benchmark Reveals LLM Mathematical Reasoning Gaps

TLDR: Putnam-AXIOM is a new benchmark of 522 university-level math problems and 100 functional variations designed to test LLMs’ advanced mathematical reasoning and combat data contamination. Initial results show significant accuracy drops on variations for top models like o1-preview, suggesting reliance on memorization over true reasoning. The benchmark also introduces Teacher-Forced Accuracy (TFA) to evaluate reasoning steps, providing a more comprehensive assessment of LLM capabilities.

Large Language Models (LLMs) have shown impressive capabilities in various fields, including complex problem-solving. However, existing mathematical reasoning benchmarks are approaching saturation, with many models now scoring very high on them, sometimes over 90%. These scores are further clouded by “data contamination,” where a model may perform well simply because it has memorized answers from training data that included the benchmark problems.

To address these challenges, researchers from Stanford University have introduced a new benchmark called Putnam-AXIOM. This benchmark is designed to rigorously evaluate the advanced mathematical reasoning abilities of LLMs. It comprises 522 university-level competition problems taken from the prestigious William Lowell Putnam Mathematical Competition, known for its demanding problems that require deep mathematical insight.

A key innovation of Putnam-AXIOM is the “Putnam-AXIOM Variation” dataset. This companion set includes 100 functional variants of the original problems. These variants are generated programmatically by subtly changing variables and constants within the problems. Because generation is programmatic, the method can produce an unlimited stream of new, equally difficult, and unseen problems, making the benchmark highly resistant to data contamination. The idea is that if an LLM has truly learned to reason, it should solve these variations just as well as the original problems, rather than relying on memorized solutions.
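The idea of a functional variation can be illustrated with a minimal Python sketch. The template below is a deliberately simple toy problem, not an actual Putnam problem, and the names `Variation` and `make_variation` are hypothetical; the benchmark's own generation code may work quite differently. The point is only that the answer is computed from the same randomly drawn constants that appear in the problem text, so every draw yields a fresh problem with a known answer.

```python
import random
from dataclasses import dataclass


@dataclass
class Variation:
    n: int        # how many multiples to sum (randomized constant)
    k: int        # which number's multiples to sum (randomized constant)
    problem: str  # problem text rendered with the drawn constants
    answer: int   # ground-truth answer computed from the same constants


def make_variation(rng: random.Random) -> Variation:
    """Draw new constants and render a toy problem plus its exact answer."""
    n = rng.randint(10, 50)
    k = rng.randint(2, 9)
    problem = f"Find the sum of the first {n} positive multiples of {k}."
    # Closed form: k * (1 + 2 + ... + n) = k * n * (n + 1) / 2
    answer = k * n * (n + 1) // 2
    return Variation(n=n, k=k, problem=problem, answer=answer)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated set is reproducible
    for _ in range(3):
        v = make_variation(rng)
        print(v.problem, "->", v.answer)
```

A model that memorized one instance of such a template gains nothing on the next draw, since the constants and therefore the final answer change each time.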

Initial evaluations on the Putnam-AXIOM Original set revealed that even the strongest models struggled significantly. For instance, OpenAI’s o1-preview, the top-performing model evaluated, scored only 41.9%. When tested on the paired Variations, its accuracy dropped by a substantial 19.6 percentage points (a 46.8% relative decrease). The same downward trend was observed across eighteen other models, with ten showing statistically significant differences, which strongly suggests that memorization contributes to performance on static benchmarks.
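The two figures quoted for o1-preview are related by simple arithmetic, sketched below. This assumes the 19.6 figure is an absolute drop in percentage points measured against the 41.9% baseline, which is the reading consistent with the stated 46.8% relative decrease; the helper name `relative_drop` is only illustrative.

```python
def relative_drop(original_acc: float, absolute_drop: float) -> float:
    """Relative decrease: the absolute drop as a fraction of the original accuracy."""
    return absolute_drop / original_acc


original = 41.9  # accuracy on the original problems, in percent
drop = 19.6      # absolute drop on the variations, in percentage points

variation = original - drop
print(f"Implied variation accuracy: {variation:.1f}%")           # 22.3%
print(f"Relative decrease: {relative_drop(original, drop):.1%}")  # 46.8%
```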

Beyond traditional “boxed” accuracy (where only the final answer is checked), Putnam-AXIOM also introduces Teacher-Forced Accuracy (TFA). This is a lightweight metric that directly scores the reasoning steps provided by the LLM, automating the evaluation of natural language proofs. TFA helps to assess the actual reasoning process, rather than just the final outcome, which is crucial for complex mathematical problems where a correct final answer might sometimes be achieved through flawed reasoning or even random chance.

The researchers highlight that current evaluation metrics often fall short because they only focus on the final answer, ignoring the reasoning process. For problems with limited possible answers (like true/false), models can get lucky. TFA aims to provide a more complete picture of an LLM’s reasoning abilities by checking if the model predicts each step of a reference solution correctly when “teacher-forced” with the ground truth up to that point.
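The teacher-forcing idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the benchmark's evaluation code: it assumes scoring happens per token, that the model's prediction is its single most likely next token, and that the model is exposed through a hypothetical `predict_next` callable. The toy lookup-table "model" exists only to make the example runnable.

```python
from typing import Callable, Sequence

Token = str
# Hypothetical interface: given all tokens so far, return the single most
# likely next token according to the model.
NextTokenFn = Callable[[Sequence[Token]], Token]


def teacher_forced_accuracy(
    predict_next: NextTokenFn,
    prompt: Sequence[Token],
    reference: Sequence[Token],
) -> float:
    """Fraction of reference tokens predicted correctly from the true prefix."""
    if not reference:
        raise ValueError("reference solution must be non-empty")
    context = list(prompt)
    correct = 0
    for target in reference:
        if predict_next(context) == target:
            correct += 1
        # Teacher forcing: always extend the context with the ground-truth
        # token, never with the model's own guess.
        context.append(target)
    return correct / len(reference)


def toy_model(context: Sequence[Token]) -> Token:
    """Stand-in model: predicts the next token from the last token only."""
    table = {"2": "+", "+": "2", "=": "4"}
    return table.get(context[-1], "<unk>") if context else "<unk>"


if __name__ == "__main__":
    prompt = ["Compute", ":"]
    reference = ["2", "+", "2", "=", "4"]
    print(teacher_forced_accuracy(toy_model, prompt, reference))  # 0.6
```

Because the context is always the reference prefix, one early mistake does not derail the rest of the scoring, and the whole solution can be evaluated without generating free-form text.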

The Putnam-AXIOM dataset covers a wide range of university-level mathematics topics, including Geometry, Algebra, Trigonometry, Calculus, Linear Algebra, Combinatorics, Probability, Number Theory, Complex Numbers, Differential Equations, and Analysis. To enable automated evaluation, problems were selected or modified to yield a unique, numerically evaluable final answer. This involved adding a trivial next step to some original problems that previously required elaborate proofs, ensuring a single “boxable” answer while preserving the problem’s core difficulty.
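A unique final answer is what makes “boxed” grading automatable. The sketch below shows one plausible way such a check could work, assuming the model marks its final answer with a LaTeX-style `\boxed{...}` and that answers are compared as exact rational numbers where possible. Both assumptions, and the function names `extract_boxed` and `boxed_correct`, are illustrative rather than taken from the benchmark's code.

```python
from fractions import Fraction
from typing import Optional

BOX = r"\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    start = text.rfind(BOX)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOX):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def boxed_correct(model_output: str, reference: str) -> bool:
    """Compare the boxed answer with the reference, numerically when possible."""
    boxed = extract_boxed(model_output)
    if boxed is None:
        return False
    try:
        # Fraction parses strings such as "3/4", "0.75", or "12" exactly.
        return Fraction(boxed.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Fall back to a plain string comparison for non-numeric answers.
        return boxed.strip() == reference.strip()


if __name__ == "__main__":
    print(boxed_correct(r"Therefore the answer is \boxed{0.75}.", "3/4"))  # True
    print(boxed_correct("I could not find an answer.", "3/4"))             # False
```

A check like this only inspects the final answer, which is exactly the limitation that motivates a step-level metric such as TFA.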

The findings from Putnam-AXIOM have significant implications for the development and evaluation of LLMs. The observed accuracy drop on the variation set indicates that many current LLMs still rely on memorized information rather than genuine mathematical reasoning. This suggests that high scores on static benchmarks might overstate a model’s true capabilities. The researchers recommend that future evaluations include dynamic or contamination-checked datasets like Putnam-AXIOM Variations to get a more accurate understanding of LLM progress.

This new benchmark provides a rigorous and contamination-resilient framework for assessing advanced mathematical reasoning in LLMs. The data and evaluation code are publicly available, encouraging further research and development in this critical area. More details are available in the Putnam-AXIOM research paper.
