TLDR: A new research paper demonstrates that Google’s Gemini 2.5 Pro, using a novel pipeline design and careful prompt engineering, solved five of the six problems from the challenging International Mathematical Olympiad (IMO) 2025. The result underscores the importance of strategic AI application in complex reasoning tasks that demand deep insight and creativity rather than rote calculation, and marks a significant advance in automated mathematical reasoning.
The International Mathematical Olympiad (IMO) is renowned for its exceptionally difficult problems, which demand deep mathematical insight, creative thinking, and rigorous formal reasoning. While Large Language Models (LLMs) have shown impressive capabilities on many mathematical benchmarks, they have historically struggled with the unique challenges posed by Olympiad-level tasks.
A recent research paper, titled “Gemini 2.5 Pro Capable of Winning Gold at IMO 2025”, explores the potential of Google’s Gemini 2.5 Pro model in tackling these high-stakes mathematical challenges. Authored by Yichen Huang and Lin F. Yang, the paper highlights a significant advancement: the model successfully solved five out of six newly released IMO 2025 problems. This achievement underscores the critical role of optimizing how powerful AI models are utilized, rather than just relying on their raw capabilities.
The IMO, established in 1959, is an annual competition that brings together the world’s most talented pre-university mathematicians. Participants face three problems in each of two 4.5-hour sessions over two days, covering fields like algebra, geometry, number theory, and combinatorics. Unlike typical math exercises, IMO problems require creative, proof-based reasoning, making them an ideal benchmark for evaluating advanced AI reasoning.
LLMs perform well on traditional benchmarks like GSM8K and MATH, which focus on grade-school and high-school problems, often by leaning on pattern recognition and retrieval of material seen in training. IMO problems, by contrast, demand multi-step reasoning, abstraction, and innovation akin to human expert-level cognition, exposing limitations in LLMs’ generalization and their susceptibility to “hallucinations” and superficial heuristics. This makes the IMO a crucial test of whether LLMs can truly “reason” or merely replicate memorized solutions.
The paper introduces a novel methodology centered on a pipeline design and careful prompt engineering with the Gemini 2.5 Pro model. A key concern in evaluating LLMs is “data contamination,” where test data might inadvertently be included in the model’s training data, leading to inflated performance. To ensure a fair assessment, this research exclusively used problems from the very recent IMO 2025 competition, which were released just days before the evaluation, minimizing any risk of data leakage.
The Problem-Solving Pipeline
The methodology involves a multi-step pipeline designed to enhance the model’s problem-solving capabilities:
- Step 1: Initial Solution Generation: The Gemini 2.5 Pro model first attempts to solve the problem multiple times to generate a diverse set of initial solutions. This is akin to an exploration phase, aiming to find at least one solution with some overlap with the correct approach. The initial quality of these solutions was observed to be generally low, consistent with other recent findings.
- Step 2: Self-Improvement: The model is then prompted to review and improve its own work. Recognizing that LLMs have a “thinking budget” (limited token capacity for reasoning), this step injects an additional budget, allowing the model to refine its solutions. This iterative improvement process was observed to significantly enhance the quality of the outputs.
- Step 3: Verification: A crucial component of the pipeline is the “verifier.” This component meticulously reviews each solution step-by-step, identifying issues classified as “critical errors” (logical fallacies or factual mistakes) or “justification gaps” (incomplete or insufficiently rigorous arguments).
- Step 4: Check Verification: The bug reports generated by the verifier are themselves reviewed, filtering out spurious findings to increase their reliability.
- Step 5: Correction: Based on the bug reports, the model improves its solution. Steps 3-5 are iterated until a solution is accepted (passes the verifier’s check multiple times) or declined (persistent critical errors or major justification gaps).
- Step 6: Accept or Reject: A solution is accepted only if it passes the verifier’s check five times, ensuring high rigor.
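The loop described above can be sketched in a few lines of Python. This is a toy simulation, not the paper’s implementation: model and verifier calls are stubbed with numeric stand-ins (a solution is a “quality” score in [0, 1]), and all function names are our own. In the real pipeline the verifier is itself an LLM and can err, which is why a solution must pass five independent checks; in this stub the check is a deterministic threshold, so the five passes are purely illustrative.

```python
import random

ACCEPT_PASSES = 5   # Step 6: accept only after five verifier passes
MAX_ROUNDS = 10     # give up (decline) after this many correction rounds

def generate(rng):
    """Step 1: an initial draft; initial quality is typically low."""
    return rng.uniform(0.0, 0.4)

def improve(solution, rng):
    """Steps 2 and 5: refine the solution with extra thinking budget."""
    return min(1.0, solution + rng.uniform(0.1, 0.3))

def verify(solution):
    """Step 3: flag critical errors or justification gaps
    (stubbed here as a simple quality threshold)."""
    return solution >= 0.9

def solve(seed=0):
    rng = random.Random(seed)
    # Step 1: sample several diverse attempts, keep the most promising.
    best = max(generate(rng) for _ in range(8))
    for _ in range(MAX_ROUNDS):
        best = improve(best, rng)                 # Steps 2/5: refine
        # Steps 3/4/6: iterate until five verifier checks pass.
        if all(verify(best) for _ in range(ACCEPT_PASSES)):
            return "accepted", best
    return "declined", best

status, quality = solve(seed=42)
print(status)
```

Because each improvement round adds at least 0.1 to the stand-in quality score, this toy version always converges to an accepted solution; the interesting behavior in the real pipeline comes from the verifier’s imperfect, LLM-based judgments.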
The researchers noted that while the verifier is generally reliable, it can make mistakes. However, the iterative nature of the process and the model’s ability to review bug reports (analogous to a peer review process) make the system robust to such errors.
Specific Problems and Approaches
The paper details the solutions for several IMO 2025 problems:
- Problem 1 (Combinatorics): This problem involved determining the number of “sunny” lines (not parallel to x-axis, y-axis, or x+y=0) required to cover specific points in a plane. The model was given a hint to use induction, a general technique that a multi-agent system would likely explore. The possible values for ‘k’ (number of sunny lines) were found to be {0, 1, 3}.
- Problem 2 (Geometry): This complex geometry problem involved circles, intersections, circumcenters, and orthocenters, requiring a proof of tangency. The model was hinted to use analytic geometry. Gemini 2.5 Pro produced an almost correct answer on the first try, with minor calculation mistakes caught by the verifier, making it the “easiest” problem for the AI.
- Problem 3 (Number Theory/Functions): This problem defined a “bonza” function and asked for the smallest constant ‘c’ such that f(n) ≤ cn. The pipeline involved sampling multiple initial solutions and iteratively improving them. The analysis led to the determination of c=4.
- Problem 4 (Number Theory/Sequences): This problem dealt with an infinite sequence where each term is the sum of the three largest proper divisors of the previous term. The analysis showed that terms must be even and divisible by 3, but not by 5, leading to specific forms for the initial term.
- Problem 5 (Game Theory): The “inekoalaty game” involved two players, Alice and Bazza, choosing non-negative real numbers under certain sum and sum-of-squares constraints. The paper determined the winning strategies based on a parameter λ, concluding that Alice wins if λ > √2/2, Bazza wins if λ < √2/2, and it's a draw if λ = √2/2.
- Problem 6: The model only reported a trivial upper bound for this problem, indicating it was not fully solved.
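The sequence rule in Problem 4 is easy to probe numerically. The snippet below is our own illustration, not the paper’s code: it implements the recurrence (each term is the sum of the three largest proper divisors of its predecessor) and shows how small cases behave.

```python
def proper_divisors(n):
    """All divisors of n except n itself."""
    return [d for d in range(1, n) if n % d == 0]

def next_term(n):
    """Sum of the three largest proper divisors of n,
    or None if n has fewer than three (the sequence cannot continue)."""
    divs = proper_divisors(n)
    if len(divs) < 3:
        return None
    return sum(sorted(divs)[-3:])

# 6 is a fixed point: its proper divisors are 1, 2, 3, and 1 + 2 + 3 = 6.
print(next_term(6))    # 6
# 12 maps to 6 + 4 + 3 = 13, a prime, so the sequence terminates there.
print(next_term(12))   # 13
print(next_term(13))   # None
```

Such quick checks only illustrate the constraints the paper describes (valid starting terms must keep every subsequent term rich enough in divisors); the actual solution requires a proof covering all cases.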
The researchers acknowledge that while their results are impressive, using a diverse set of leading AI models (like Grok 4 or OpenAI’s models) could potentially yield even stronger mathematical capabilities. This research demonstrates a significant leap in automated mathematical reasoning, showing that powerful existing models, when used optimally through sophisticated pipeline design and prompt engineering, are capable of solving highly challenging math problems. The full paper is publicly available on arXiv.