TLDR: RIMO is a new mathematical benchmark for evaluating advanced reasoning in large language models (LLMs) that addresses the grading and saturation problems of earlier benchmarks. It has two tracks: RIMO-N, with 335 IMO problems rewritten so that each has a unique integer answer and can be graded deterministically, and RIMO-P, with 456 proof problems decomposed into sub-problems for step-by-step evaluation. Initial tests show a sharp performance drop for frontier LLMs on RIMO, pointing to a substantial gap in their Olympiad-level reasoning and proof-writing capabilities.
Large language models (LLMs) have made remarkable progress in many domains, including mathematical reasoning. Evaluating their true capabilities is harder, especially at the level of the complex problem-solving found in the International Mathematical Olympiad (IMO). A new research paper introduces RIMO, a benchmark designed to give a clearer, more reliable assessment of advanced mathematical reasoning in LLMs. The paper, titled "RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning", was authored by Ziye Chen, Chengwei Qin, and Yao Shu.
Why a New Benchmark?
Frontier LLMs now exceed 90% accuracy on earlier mathematical benchmarks such as GSM8K and MATH, a saturation point at which further progress is hard to measure. This led the research community to turn to Olympiad-level problems, which demand deeper insight and creative problem-solving. However, existing Olympiad benchmarks have practical limitations. Some, such as those built on live, changing competitions, are hard to reproduce. Others, like OLYMMATH and OMNI-MATH, accept diverse answer formats (fractions, proofs, expressions) that require LLM-based judges, which introduces potential bias and evaluation noise. RIMO aims to overcome these limitations by offering a robust and reproducible evaluation framework.
RIMO-N: The Integer Challenge
The RIMO benchmark is divided into two distinct tracks. The first, RIMO-N, comprises 335 problems adapted from IMO materials spanning 1959 to 2023. The key innovation is that each problem is rephrased to yield a single, unique integer answer. This design allows deterministic, O(1) string-match grading and removes the need for subjective, model-based judges. The problems in RIMO-N cover the traditional IMO topics: algebra (96 items), geometry (95 items), number theory (86 items), and combinatorics (58 items). According to the authors, the rewritten problems remain faithful to the original Olympiad difficulty.
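To make the idea of deterministic string-match grading concrete, here is a minimal sketch in Python. It is not the authors' grading code: the function names and the rule of taking the last integer in the model's response are assumptions made for illustration.

```python
import re
from typing import Iterable, Optional, Tuple


def extract_integer(response: str) -> Optional[str]:
    """Return the last integer in the response as a canonical string, or None.

    Assumption: the model's final answer is the last integer it writes.
    """
    matches = re.findall(r"-?\d+", response)
    if not matches:
        return None
    # int() strips leading zeros so that "007" and "7" compare equal.
    return str(int(matches[-1]))


def grade(response: str, reference: int) -> bool:
    """Exact string match between the predicted and the reference integer."""
    predicted = extract_integer(response)
    return predicted is not None and predicted == str(reference)


def accuracy(pairs: Iterable[Tuple[str, int]]) -> float:
    """Fraction of (response, reference) pairs graded as correct."""
    results = [grade(response, reference) for response, reference in pairs]
    return sum(results) / len(results) if results else 0.0
```

Because the comparison is a plain equality check, two runs over the same responses always produce the same score, which is the property a model-based judge cannot guarantee.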
RIMO-P: The Proof Process
The second track, RIMO-P, focuses on the process of full deductive reasoning. It features 456 original proof problems, each decomposed into a sequence of guided sub-problems. This structure allows for a granular, step-by-step evaluation of a model’s ability to solve intermediate lemmas and construct rigorous proofs. Expert-verified solutions are used to create this decomposition, with problem complexity determining the number of sub-problems (one to four steps). This track provides deeper insights into an LLM’s deductive capabilities beyond just finding a final answer.
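One way to picture the decomposition is as a small record per problem plus a per-step score. The sketch below is hypothetical: the class name, field names, and the "fraction of steps judged correct" metric are illustrative assumptions, and the article does not specify how RIMO-P aggregates step verdicts. Only the one-to-four range of sub-problems comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProofProblem:
    """A proof problem split into guided sub-problems (hypothetical schema)."""
    problem_id: str
    statement: str
    sub_problems: List[str]

    def __post_init__(self) -> None:
        # The article states that each problem has one to four sub-problems.
        if not 1 <= len(self.sub_problems) <= 4:
            raise ValueError("expected between one and four sub-problems")


def step_score(step_verdicts: List[bool]) -> float:
    """Fraction of sub-problems judged correct (an assumed metric)."""
    if not step_verdicts:
        raise ValueError("at least one step verdict is required")
    return sum(step_verdicts) / len(step_verdicts)
```

For example, a model that gets three of four sub-problems right would score 0.75 under this assumed metric, which separates partial progress from outright failure.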
What the Evaluations Revealed
The researchers benchmarked ten frontier LLMs, including GPT-4o and Gemini 2.5 Flash, on RIMO and compared the results with the same models' scores on older benchmarks. The results were striking: while these systems excelled on GSM8K and MATH, their scores dropped sharply on RIMO. For instance, DeepSeek-R1-671B, the top performer, achieved 62.96% on RIMO-N, well below its 90.45% on MATH. This highlights a substantial gap between current LLM capabilities and genuine Olympiad-level reasoning.
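The size of that drop for DeepSeek-R1-671B can be worked out directly from the two reported scores. The short Python snippet below only does this arithmetic; the variable names are illustrative.

```python
# Reported accuracies for DeepSeek-R1-671B, in percent.
math_score = 90.45
rimo_n_score = 62.96

# Absolute drop in percentage points.
absolute_drop = round(math_score - rimo_n_score, 2)

# Drop relative to the MATH score, in percent.
relative_drop = round(100 * absolute_drop / math_score, 1)

print(f"Absolute drop: {absolute_drop} percentage points")  # 27.49
print(f"Relative drop: {relative_drop}%")                   # 30.4
```

In other words, the best model loses roughly 27 percentage points, or about 30% of its MATH accuracy, when moving to RIMO-N.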
Further analysis revealed several key insights. Performance on RIMO is not solely dictated by model scale or recency; instead, explicit reasoning optimization showed tangible gains, improving performance by up to 19.4 percentage points over vanilla counterparts. The study also found that restricting answers to a binary choice (0 or 1) substantially inflated accuracy across all models, suggesting that a significant portion of RIMO’s challenge comes from forcing models to locate an exact integer within a larger numerical spectrum. On the RIMO-P track, performance was very low across all models, indicating that answer-finding and rigorous proof-writing are distinct capabilities that current models struggle with, leaving a large “proof gap” compared to human students.
RIMO offers a high-resolution yardstick for future research, providing a clear target for closing the profound reasoning gap exposed by these findings. The noise-free framework ensures dependable tracking of real progress as AI systems continue to evolve.


