New Benchmark Reveals LLMs' Ongoing Struggle with Advanced High School Math

TLDR: A new benchmark called AMO-Bench, featuring 50 original, expert-validated problems at International Mathematical Olympiad difficulty, shows that even the best-performing large language model (LLM) reaches only 52.4% accuracy, while most score below 40%. This leaves significant room for improvement in mathematical reasoning, despite strong results on older, less challenging benchmarks. The study also finds that accuracy rises with output length, suggesting that a larger inference budget could improve results.

Recent advancements in large language models (LLMs) have shown impressive progress in various reasoning tasks. However, a new study introduces a challenging benchmark, AMO-Bench, revealing that these advanced AI models still face significant hurdles when it comes to solving high school mathematics competition problems, particularly those at an Olympiad level.

Existing math benchmarks, such as AIME24/25, have seen top-tier LLMs achieve remarkable accuracy, sometimes exceeding 90%. While this indicates progress, it also highlights a growing problem: these benchmarks are becoming less effective for evaluating further advancements because models are reaching performance saturation. Additionally, many current benchmarks use problems from past competitions, raising concerns about models potentially memorizing data rather than genuinely reasoning.

Introducing AMO-Bench: A New Standard for Math Reasoning

To address these limitations, researchers from Meituan, the University of Chinese Academy of Sciences, and Harbin Institute of Technology have developed AMO-Bench. This advanced mathematical reasoning benchmark comprises 50 entirely original, human-crafted problems designed to be exceptionally challenging. The key features that make AMO-Bench a rigorous assessment include:

Original Problems: All 50 problems are newly created by human experts, guarding against scores inflated by memorized training data. A secondary verification step checks each problem against existing competitions and online resources to confirm its originality.
Guaranteed Difficulty: Each problem is cross-validated by multiple experts to meet or exceed the difficulty standards of the International Mathematical Olympiad (IMO). An additional LLM-based filtering stage excludes questions that are not sufficiently challenging for current models.
Final-Answer Based Grading: Unlike proof-based problems that require manual verification, AMO-Bench problems only require a final answer. This enables efficient, automatic, and robust grading, balancing accuracy with scalability.
Human-Annotated Reasoning Paths: Each problem comes with a detailed, step-by-step solution written by human experts. These annotations provide transparency and can support further research, such as prompt engineering and error analysis.
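
To illustrate why final-answer grading scales so easily, here is a minimal Python sketch of an automatic grader. It is not the benchmark's actual evaluation code: the function names are hypothetical, and it assumes the model writes its final answer inside a LaTeX \boxed{...} and that numeric answers can be compared as exact fractions.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a response, or None.

    Assumption: the final answer is boxed and contains no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def is_correct(response: str, reference: str) -> bool:
    """Compare the extracted final answer with the reference answer."""
    answer = extract_final_answer(response)
    if answer is None:
        return False  # no final answer found counts as wrong
    try:
        # Numeric answers: compare exactly, so 0.5 and 1/2 are equal.
        return Fraction(answer) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers: fall back to whitespace-insensitive text match.
        return answer.replace(" ", "") == reference.replace(" ", "")
```

A grader of this kind needs no human in the loop, which is the trade-off the benchmark makes: it checks the result rather than the proof.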

LLMs Still Struggle with Advanced Math

The experimental results across 26 different LLMs on AMO-Bench demonstrate that even the best-performing model, GPT-5-Thinking (High), achieved only 52.4% accuracy. Most LLMs scored below 40%, indicating substantial room for improvement in their complex mathematical reasoning abilities. This performance contrasts sharply with their high scores on older benchmarks.

The study also found that higher-performing models tend to require significantly more output tokens to generate their answers on AMO-Bench. For instance, GPT-5-Thinking (High) generated an average of approximately 37,000 output tokens per AMO-Bench problem, compared to about 7,000 and 6,000 tokens for AIME25 and AIME24, respectively, or roughly five to six times as many. This exceptionally high token consumption further underscores the difficulty of AMO-Bench for current LLMs.

Future Potential and Scaling Trends

Despite the current low performance, the analysis reveals promising scaling trends. Model performance shows a near-linear growth trend relative to the logarithm of output length, suggesting that increasing the inference budget can lead to further improvements. Furthermore, top-tier models achieve pass@32 rates exceeding 70%, indicating they possess the underlying capability to solve these challenging problems, even if they don’t consistently find the correct reasoning path yet.
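
For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled answers to a problem is correct. The article does not say how the study computed it, so the Python sketch below uses the widely used unbiased estimator as an assumption: given n samples of which c are correct, it estimates pass@k as 1 - C(n-c, k) / C(n, k). The function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    subset of k samples contains no correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are wrong,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 32, the estimate for a single problem is 1.0 if any of the 32 samples is correct and 0.0 otherwise, and the benchmark-level figure is the average over all problems. This is why a model can post a high pass@32 while its single-attempt accuracy stays much lower.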

AMO-Bench serves as a critical new tool for evaluating and advancing the mathematical reasoning capabilities of large language models. The benchmark’s data and evaluation code are publicly available, encouraging further research in this crucial area. The full research paper is titled “AMO-Bench: Large Language Models Still Struggle in High School Math Competitions.”
