TLDR: A new research paper demonstrates that Google’s Gemini 2.5 Pro, using a novel pipeline design and careful prompt engineering, solved five of the six problems from the challenging International Mathematical Olympiad (IMO) 2025. The result underscores the importance of strategic AI application in complex reasoning tasks that demand deep insight and creativity rather than rote calculation, and marks a significant advance in automated mathematical reasoning.
The International Mathematical Olympiad (IMO) is renowned for its exceptionally difficult problems, which demand deep mathematical insight, creative thinking, and rigorous formal reasoning. While Large Language Models (LLMs) have shown impressive capabilities on many mathematical benchmarks, they have historically struggled with the unique challenges posed by Olympiad-level tasks.
A recent research paper, titled “Gemini 2.5 Pro Capable of Winning Gold at IMO 2025”, explores the potential of Google’s Gemini 2.5 Pro model in tackling these high-stakes mathematical challenges. Authored by Yichen Huang and Lin F. Yang, the paper highlights a significant advancement: the model successfully solved five out of six newly released IMO 2025 problems. This achievement underscores the critical role of optimizing how powerful AI models are utilized, rather than just relying on their raw capabilities.
The IMO, established in 1959, is an annual competition that brings together the world’s most talented pre-university mathematicians. Participants face three problems in each of two 4.5-hour sessions over two days, covering fields like algebra, geometry, number theory, and combinatorics. Unlike typical math exercises, IMO problems require creative, proof-based reasoning, making them an ideal benchmark for evaluating advanced AI reasoning.
LLMs perform well on traditional benchmarks like GSM8K and MATH, which focus on grade-school and high-school problems, often by leaning on pattern recognition and retrieval of material seen in training. IMO problems, by contrast, demand multi-step reasoning, abstraction, and innovation akin to human expert-level cognition, exposing limitations in LLMs’ generalization and their susceptibility to “hallucinations” and superficial heuristics. This makes the IMO a crucial test of whether LLMs can truly “reason” or merely replicate memorized solutions.
The paper introduces a novel methodology centered on a pipeline design and careful prompt engineering with the Gemini 2.5 Pro model. A key concern in evaluating LLMs is “data contamination,” where test data might inadvertently be included in the model’s training data, leading to inflated performance. To ensure a fair assessment, this research exclusively used problems from the very recent IMO 2025 competition, which were released just days before the evaluation, minimizing any risk of data leakage.
The Problem-Solving Pipeline
The methodology involves a multi-step pipeline designed to enhance the model’s problem-solving capabilities:
- Step 1: Initial Solution Generation: The Gemini 2.5 Pro model first attempts to solve the problem multiple times to generate a diverse set of initial solutions. This is akin to an exploration phase, aiming to find at least one solution with some overlap with the correct approach. The initial quality of these solutions was observed to be generally low, consistent with other recent findings.
- Step 2: Self-Improvement: The model is then prompted to review and improve its own work. Recognizing that LLMs have a “thinking budget” (limited token capacity for reasoning), this step injects an additional budget, allowing the model to refine its solutions. This iterative improvement process was observed to significantly enhance the quality of the outputs.
- Step 3: Verification: A crucial component of the pipeline is the “verifier.” This component meticulously reviews each solution step-by-step, identifying issues classified as “critical errors” (logical fallacies or factual mistakes) or “justification gaps” (incomplete or insufficiently rigorous arguments).
- Step 4: Check Verification: The bug reports generated by the verifier are themselves reviewed, filtering out spurious findings to increase their reliability.
- Step 5: Correction: Based on the bug reports, the model improves its solution. Steps 3-5 are iterated until a solution is accepted (passes the verifier’s check multiple times) or declined (persistent critical errors or major justification gaps).
- Step 6: Accept or Reject: A solution is accepted only if it passes the verifier’s check five times, ensuring high rigor.
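The loop described above can be sketched in a few lines of Python. This is a toy simulation, not the paper’s implementation: model and verifier calls are stubbed with numeric stand-ins (a solution is a “quality” score in [0, 1]), and all function names are our own. In the real pipeline the verifier is itself an LLM and can err, which is why a solution must pass five independent checks; in this stub the check is a deterministic threshold, so the five passes are purely illustrative.

```python
import random

ACCEPT_PASSES = 5   # Step 6: accept only after five verifier passes
MAX_ROUNDS = 10     # give up (decline) after this many correction rounds

def generate(rng):
    """Step 1: an initial draft; initial quality is typically low."""
    return rng.uniform(0.0, 0.4)

def improve(solution, rng):
    """Steps 2 and 5: refine the solution with extra thinking budget."""
    return min(1.0, solution + rng.uniform(0.1, 0.3))

def verify(solution):
    """Step 3: flag critical errors or justification gaps
    (stubbed here as a simple quality threshold)."""
    return solution >= 0.9

def solve(seed=0):
    rng = random.Random(seed)
    # Step 1: sample several diverse attempts, keep the most promising.
    best = max(generate(rng) for _ in range(8))
    for _ in range(MAX_ROUNDS):
        best = improve(best, rng)                 # Steps 2/5: refine
        # Steps 3/4/6: iterate until five verifier checks pass.
        if all(verify(best) for _ in range(ACCEPT_PASSES)):
            return "accepted", best
    return "declined", best

status, quality = solve(seed=42)
print(status)
```

Because each improvement round adds at least 0.1 to the stand-in quality score, this toy version always converges to an accepted solution; the interesting behavior in the real pipeline comes from the verifier’s imperfect, LLM-based judgments.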
The researchers noted that while the verifier is generally reliable, it can make mistakes. However, the iterative nature of the process and the model’s ability to review bug reports (analogous to a peer review process) make the system robust to such errors.
Specific Problems and Approaches
The paper details the solutions for several IMO 2025 problems:
- Problem 1 (Combinatorics): This problem involved determining the number of “sunny” lines (not parallel to x-axis, y-axis, or x+y=0) required to cover specific points in a plane. The model was given a hint to use induction, a general technique that a multi-agent system would likely explore. The possible values for ‘k’ (number of sunny lines) were found to be {0, 1, 3}.
- Problem 2 (Geometry): This complex geometry problem involved circles, intersections, circumcenters, and orthocenters, requiring a proof of tangency. The model was hinted to use analytic geometry. Gemini 2.5 Pro produced an almost correct answer on the first try, with minor calculation mistakes caught by the verifier, making it the “easiest” problem for the AI.
- Problem 3 (Number Theory/Functions): This problem defined a “bonza” function and asked for the smallest constant ‘c’ such that f(n) ≤ cn. The pipeline involved sampling multiple initial solutions and iteratively improving them. The analysis led to the determination of c=4.
- Problem 4 (Number Theory/Sequences): This problem dealt with an infinite sequence where each term is the sum of the three largest proper divisors of the previous term. The analysis showed that terms must be even and divisible by 3, but not by 5, leading to specific forms for the initial term.
- Problem 5 (Game Theory): The “inekoalaty game” involved two players, Alice and Bazza, choosing non-negative real numbers under certain sum and sum-of-squares constraints. The paper determined the winning strategies based on a parameter λ, concluding that Alice wins if λ > √2/2, Bazza wins if λ < √2/2, and it's a draw if λ = √2/2.
- Problem 6: The model only reported a trivial upper bound for this problem, indicating it was not fully solved.
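The sequence rule in Problem 4 is easy to probe numerically. The snippet below is our own illustration, not the paper’s code: it implements the recurrence (each term is the sum of the three largest proper divisors of its predecessor) and shows how small cases behave.

```python
def proper_divisors(n):
    """All divisors of n except n itself."""
    return [d for d in range(1, n) if n % d == 0]

def next_term(n):
    """Sum of the three largest proper divisors of n,
    or None if n has fewer than three (the sequence cannot continue)."""
    divs = proper_divisors(n)
    if len(divs) < 3:
        return None
    return sum(sorted(divs)[-3:])

# 6 is a fixed point: its proper divisors are 1, 2, 3, and 1 + 2 + 3 = 6.
print(next_term(6))    # 6
# 12 maps to 6 + 4 + 3 = 13, a prime, so the sequence terminates there.
print(next_term(12))   # 13
print(next_term(13))   # None
```

Such quick checks only illustrate the constraints the paper describes (valid starting terms must keep every subsequent term rich enough in divisors); the actual solution requires a proof covering all cases.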
The researchers acknowledge that while their results are impressive, using a diverse set of leading AI models (like Grok 4 or OpenAI’s models) could potentially yield even stronger mathematical capabilities. This research demonstrates a significant leap in automated mathematical reasoning, showing that powerful existing models, when used optimally through sophisticated pipeline design and prompt engineering, are capable of solving highly challenging math problems. The full paper is publicly available on arXiv.