RIMO: A New Benchmark Exposes AI's Mathematical Reasoning Gap

TLDR: RIMO is a new mathematical benchmark designed to accurately evaluate advanced reasoning in large language models (LLMs) by overcoming the limitations of previous benchmarks. It features two tracks: RIMO-N with 335 IMO problems remade for unique integer answers and deterministic grading, and RIMO-P with 456 proof problems decomposed for step-by-step evaluation. Initial tests show a significant performance drop for frontier LLMs on RIMO, highlighting a substantial gap in their Olympiad-level reasoning and proof-writing capabilities.

As artificial intelligence continues to advance, large language models (LLMs) have shown remarkable progress in various domains, including mathematical reasoning. However, evaluating their true capabilities, especially at the level of complex problem-solving found in the International Mathematical Olympiad (IMO), has presented significant challenges. A new research paper introduces RIMO, a novel benchmark designed to provide a clearer, more reliable assessment of advanced mathematical reasoning in LLMs. This work, titled RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning, was authored by Ziye Chen, Chengwei Qin, and Yao Shu.

Why a New Benchmark?

Frontier LLMs now exceed 90% accuracy on earlier mathematical benchmarks such as GSM8K and MATH, a saturation point at which further progress is hard to measure. This has led the research community to turn to Olympiad-level problems, which demand deeper insight and creative problem-solving. However, existing Olympiad benchmarks have practical drawbacks. Some are dynamic competitions, whose results are hard to reproduce. Others, like OLYMMATH and OMNI-MATH, use diverse answer formats (fractions, proofs, expressions) that require LLM-based judges, which can introduce bias and evaluation noise. RIMO aims to overcome these limitations by offering a robust and reproducible evaluation framework.

RIMO-N: The Integer Challenge

The RIMO benchmark is divided into two distinct tracks. The first, RIMO-N, comprises 335 problems carefully remade from IMO materials spanning 1959 to 2023. The key innovation is that each problem is rephrased to yield a single, unique integer answer. This design allows deterministic, O(1) string-match grading and removes the need for subjective, model-based judges. The problems cover the four traditional IMO topics: algebra (96 items), geometry (95 items), number theory (86 items), and combinatorics (58 items), which together account for all 335 problems. The rewriting is intended to keep the benchmark faithful to the original Olympiad difficulty.
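To make the idea of deterministic string-match grading concrete, here is a minimal sketch in Python. It is not the authors' grader: the function names `normalize`, `grade`, and `accuracy` are hypothetical, and the sketch assumes the model's final answer has already been extracted as a string.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Canonicalize an answer string so that ' 042 ' and '+42' both become '42'.

    Text that Python's int() cannot parse is returned stripped but otherwise
    unchanged, so it can never equal a canonical integer reference.
    """
    text = answer.strip()
    try:
        return str(int(text))
    except ValueError:
        return text


def grade(predicted: str, reference: int) -> bool:
    """Deterministic check: exact match against the unique integer answer."""
    return normalize(predicted) == str(reference)


def accuracy(predictions: Sequence[str], references: Sequence[int]) -> float:
    """Fraction of predictions that match their reference answers exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    correct = sum(grade(p, r) for p, r in zip(predictions, references))
    return correct / len(predictions)
```

Because each comparison is a single string equality, the same outputs always receive the same score, which is the property the article attributes to RIMO-N.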

RIMO-P: The Proof Process

The second track, RIMO-P, focuses on the process of full deductive reasoning. It features 456 original proof problems, each decomposed into a sequence of guided sub-problems. This structure allows for a granular, step-by-step evaluation of a model’s ability to solve intermediate lemmas and construct rigorous proofs. Expert-verified solutions are used to create this decomposition, with problem complexity determining the number of sub-problems (one to four steps). This track provides deeper insights into an LLM’s deductive capabilities beyond just finding a final answer.
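The decomposition described above can be pictured with a small Python sketch. Everything here is an illustrative assumption rather than the paper's method: the `ProofProblem` class, the per-step boolean verdicts, and the prefix-credit scoring rule (credit only for steps solved before the first failure) are hypothetical. The article does not say how individual steps are judged or how they are aggregated into a score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProofProblem:
    """A proof problem split into guided sub-problems (hypothetical layout)."""
    problem_id: str
    sub_problems: List[str]

    def __post_init__(self) -> None:
        # The article states that each problem has one to four steps.
        if not 1 <= len(self.sub_problems) <= 4:
            raise ValueError("expected between one and four sub-problems")


def steps_solved(verdicts: List[bool]) -> int:
    """Count consecutive correct steps from the start, stopping at the first failure."""
    count = 0
    for ok in verdicts:
        if not ok:
            break
        count += 1
    return count


def proof_score(problem: ProofProblem, verdicts: List[bool]) -> float:
    """Assumed scoring rule: fraction of steps solved before the first failure."""
    if len(verdicts) != len(problem.sub_problems):
        raise ValueError("one verdict is required per sub-problem")
    return steps_solved(verdicts) / len(problem.sub_problems)
```

Under this assumed rule, a model that solves the first two of four steps and then fails the third would score 0.5, even if it happens to get the fourth step right.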

What the Evaluations Revealed

The researchers benchmarked ten frontier LLMs, including GPT-4o and Gemini 2.5 Flash, on RIMO and compared their performance with results on older benchmarks. The results were striking: while these systems excelled on GSM8K and MATH, their scores dropped sharply on RIMO. For instance, DeepSeek-R1-671B, the top performer, achieved 62.96% on RIMO-N against 90.45% on MATH, a gap of roughly 27.5 percentage points. This highlights a substantial gap between current LLM capabilities and genuine Olympiad-level reasoning.

Further analysis revealed several key insights. Performance on RIMO is not dictated solely by model scale or recency. Instead, explicit reasoning optimization produced tangible gains, improving performance by up to 19.4 percentage points over vanilla counterparts. The study also found that restricting answers to a binary choice (0 or 1) substantially inflated accuracy across all models. This suggests that a significant portion of RIMO’s challenge comes from forcing models to locate an exact integer within a larger numerical range. On the RIMO-P track, performance was very low across all models, indicating that answer-finding and rigorous proof-writing are distinct capabilities. Current models struggle with the latter, leaving a large “proof gap” compared to human students.

RIMO offers a high-resolution yardstick for future research, providing a clear target for closing the profound reasoning gap exposed by these findings. The noise-free framework ensures dependable tracking of real progress as AI systems continue to evolve.
