TLDR: A new study evaluates four large language models (OpenAI GPT-4o, o1, DeepSeek-V3, DeepSeek-R1) on challenging arithmetic, algebra, and number theory problems. It identifies common errors like procedural slips and conceptual misunderstandings. While some models like OpenAI o1 show high accuracy, the research highlights that dual-agent collaboration significantly improves performance for other models, suggesting a promising path for more reliable AI integration in mathematics education.
Large Language Models (LLMs) are rapidly becoming integral to AI-driven education, particularly in mathematics. Their ability to generate accurate answers and detailed solutions for math problems is crucial for providing reliable feedback and assessment. A recent study examines a critical question: how reliably can these advanced AI models perform mathematical computations and reasoning?
Researchers evaluated four prominent LLMs: OpenAI GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1. Instead of relying on standard benchmarks, the team intentionally designed challenging math tasks across three categories: arithmetic (multiplying two 5-digit numbers), algebra (solving quadratic word problems), and number theory (finding solutions to Diophantine equations). These tasks were specifically crafted to expose potential errors and limitations in the models’ mathematical capabilities.
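To make the three task categories concrete, here is a minimal Python sketch of how a model's final answer could be checked exactly in each case. This is not the study's grading code. The function names are hypothetical, and the linear form a*x + b*y = c is an assumption chosen for illustration, since the article does not say which Diophantine equations were used.

```python
import math


def check_product(a: int, b: int, claimed: int) -> bool:
    """Arithmetic task: Python integers are arbitrary precision, so a * b is exact."""
    return a * b == claimed


def integer_roots(a: int, b: int, c: int) -> list[int]:
    """Algebra task: integer solutions of a*x^2 + b*x + c = 0, assuming a != 0."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    r = math.isqrt(disc)
    if r * r != disc:
        return []  # discriminant is not a perfect square, so no integer roots
    roots = set()
    for numerator in (-b + r, -b - r):
        if numerator % (2 * a) == 0:
            roots.add(numerator // (2 * a))
    return sorted(roots)


def check_linear_diophantine(a: int, b: int, c: int, x: int, y: int) -> bool:
    """Number theory task (assumed linear form): verify a claimed solution by substitution."""
    return a * x + b * y == c
```

Exact integer arithmetic matters here: comparing against floating-point results could mask or invent errors on products of this size.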
The study employed two main configurations for testing the LLMs: a single-agent setup, where each model worked independently, and a dual-agent setup, where two base LLMs (GPT-4o and DeepSeek-V3) collaborated through chat-based discussions to derive solutions. Every solution generated was meticulously broken down into individual steps and analyzed for accuracy, identifying specific procedural or conceptual errors.
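The article does not spell out the discussion protocol, so the following is only a rough sketch of what a chat-based dual-agent loop could look like. The turn order, the agreement-based stopping rule, and the names `discuss`, `careful_agent`, and `careless_agent` are all assumptions. The two agents are local stubs standing in for real model calls.

```python
from typing import Callable, Optional

# A hypothetical agent: takes the problem and the transcript so far,
# and returns its current answer as a string.
Agent = Callable[[str, list[str]], str]


def discuss(problem: str, agent_a: Agent, agent_b: Agent,
            max_rounds: int = 3) -> Optional[str]:
    """Alternate turns until both agents give the same answer, or give up."""
    transcript: list[str] = []
    for _ in range(max_rounds):
        answer_a = agent_a(problem, transcript)
        transcript.append(f"A: {answer_a}")
        answer_b = agent_b(problem, transcript)
        transcript.append(f"B: {answer_b}")
        if answer_a == answer_b:
            return answer_a  # agreement reached
    return None  # no agreement within max_rounds


def careful_agent(problem: str, transcript: list[str]) -> str:
    # Stub: always returns the same answer.
    return "42"


def careless_agent(problem: str, transcript: list[str]) -> str:
    # Stub: slips on its first turn, then adopts the peer's latest answer.
    if not transcript:
        return "41"
    return transcript[-1].split(": ", 1)[1]
```

In a real system each stub would wrap a model call, and agreement alone would not guarantee correctness, since two agents can agree on a wrong answer.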
Key Findings on Single-Agent Performance
The results from the single-agent scenario revealed varying levels of performance among the LLMs. The reasoning-enhanced OpenAI o1 model consistently demonstrated high or nearly perfect accuracy across all three math task categories. DeepSeek-V3 also performed strongly, especially in arithmetic and algebra after initial iterations. In contrast, OpenAI GPT-4o showed lower performance, particularly in multiplying 5-digit numbers and solving Diophantine equations. DeepSeek-R1, another reasoning-enhanced model, surprisingly struggled, often exhibiting an ‘overthinking’ phenomenon that hindered its ability to reach correct answers.
Analysis of the errors showed that ‘procedural slips’ – such as arithmetic mistakes or symbolic manipulation errors – were the most frequent type of error and significantly impacted overall performance. ‘Conceptual misunderstandings’, though less frequent, also played a role in some models’ reduced accuracy. Interestingly, some incorrect final answers occurred even when no clear step-level errors were identified, suggesting that LLMs’ reliance on token prediction rather than explicit numerical computation might lead to subtle inaccuracies.
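As an illustration of step-level analysis, the sketch below audits a long-multiplication solution by comparing each claimed partial product with the exact value, then checking the final total separately. It is a simplified stand-in for the study's annotation process, not a reproduction of it. The function names and the report format are assumptions, and `b` is assumed to be a positive integer.

```python
from itertools import zip_longest


def expected_partials(a: int, b: int) -> list[int]:
    """Partial products of a * b, one per digit of b, least significant digit first."""
    return [a * int(digit) * 10**i for i, digit in enumerate(reversed(str(b)))]


def audit_multiplication(a: int, b: int,
                         claimed_partials: list[int],
                         claimed_total: int) -> dict:
    """Flag wrong steps and report whether the total matches the steps and the truth."""
    expected = expected_partials(a, b)
    # zip_longest pads with None, so missing or extra steps are flagged too.
    bad_steps = [
        i for i, (got, want) in enumerate(zip_longest(claimed_partials, expected))
        if got != want
    ]
    return {
        "bad_steps": bad_steps,
        "sum_consistent": sum(claimed_partials) == claimed_total,
        "final_correct": claimed_total == a * b,
    }
```

A report with an empty `bad_steps` list but `final_correct` set to false corresponds to the case described above, where every listed step is right and the final answer is still wrong.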
The Power of Collaboration: Dual-Agent Improvements
A significant finding of the study was the substantial improvement observed in the dual-agent configuration. When GPT-4o and DeepSeek-V3 collaborated, their performance increased across all problem types. For instance, GPT-4o’s accuracy in multiplying 5-digit numbers improved significantly, and both models achieved perfect accuracy on quadratic equations when working together. DeepSeek-V3 also reached perfect accuracy on Diophantine equations in the dual-agent setup. These results suggest that collaboration among LLMs can offer some of the same benefits as human collaboration, such as shared perspectives and cross-validation of each other's work.
Implications for AI in Education
These findings offer valuable insights for integrating LLMs into mathematics education. Models with stronger numerical competence, like o1, show promise for scalable automated formative assessment due to their reliable step-level annotations. The success of dual-agent collaboration suggests a promising avenue for future improvements in AI-driven instructional practices and assessment precision. Future research will explore more detailed error labeling, integrating LLMs with external computational tools like calculators or spreadsheets, and further developing multi-agent approaches to enhance problem-solving capabilities. The full research paper is titled "Mathematical Computation and Reasoning Errors by Large Language Models."


