TL;DR: A study benchmarked five state-of-the-art Large Language Models (LLMs) on the International Olympiad on Astronomy and Astrophysics (IOAA) exams from 2022-2025. Gemini 2.5 Pro and GPT-5 achieved gold-medal-level scores on the theory exams, ranking above nearly all human participants. GPT-5 also excelled on the data analysis exams thanks to strong multimodal capabilities. While these LLMs show impressive reasoning, they still struggle with geometric/spatial reasoning and complex plot interpretation, highlighting areas for future development toward fully autonomous AI research agents in astronomy.
Large Language Models (LLMs) are making significant strides in various scientific fields, and astronomy is no exception. A recent study has rigorously benchmarked five state-of-the-art LLMs against the challenging International Olympiad on Astronomy and Astrophysics (IOAA) exams, revealing their impressive capabilities and highlighting areas for further development.
The IOAA exams are renowned for testing deep conceptual understanding, multi-step derivations, and multimodal analysis, making them an ideal benchmark for evaluating the complex reasoning skills required in real-world astronomical research. Unlike simpler question-answering benchmarks, IOAA problems integrate theoretical physics, observational constraints, and real-world astronomical data with mathematical computations.
The research evaluated GPT-5, OpenAI o3, Gemini 2.5 Pro, Claude-4.1-Opus, and Claude-4-Sonnet on the IOAA theory and data analysis exams from 2022 to 2025. The observational exams were excluded because they require physical instruments and direct sky observations, which digital-only LLMs cannot perform.
Exceptional Performance in Theory Exams
In the theory exams, Gemini 2.5 Pro and GPT-5 emerged as the top performers, achieving average scores of 85.6% and 84.2% respectively. These scores not only reached gold medal level but also placed them in the top two ranks among 200-300 human participants across all four IOAA theory exams evaluated. OpenAI o3 also showed competitive performance with an overall score of 77.5%, while the Claude models scored 64.7% and 60.6%.
Interestingly, the study found that LLMs generally performed very well on physics and mathematics problems (Category II), with scores ranging from 67% to 91%. However, all models showed a consistent weakness on geometric and spatial reasoning problems (Category I), where scores dropped to 49-78%. This suggests that while LLMs can handle complex calculations and theoretical concepts, they struggle with tasks requiring spatial visualization, spherical trigonometry, and an understanding of timekeeping systems.
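For a sense of what Category I demands, consider a standard spherical astronomy identity (a textbook formula, not a problem taken from the paper): the altitude $a$ of a star with declination $\delta$, observed at hour angle $H$ from geographic latitude $\phi$, satisfies

$$
\sin a = \sin\phi \,\sin\delta + \cos\phi \,\cos\delta \,\cos H .
$$

At upper culmination ($H = 0$) this reduces to $a_{\max} = 90^\circ - |\phi - \delta|$, so an observer at $\phi = 45^\circ$ sees a star with $\delta = +20^\circ$ peak at $a_{\max} = 65^\circ$. IOAA geometry problems chain several such relations on the celestial sphere, which is exactly where the evaluated models lost the most points.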
GPT-5 Leads in Data Analysis
The data analysis exams presented a more varied picture. GPT-5 demonstrated exceptional capabilities, scoring an impressive 88.5% overall, even surpassing its performance in the theory exams. This strong showing is attributed to GPT-5’s superior multimodal capabilities, particularly in interpreting plots and figures, which are crucial for these types of problems. In contrast, other models experienced a 10-15 percentage point drop in performance from theory to data analysis exams, with scores ranging from 48% to 76%.
Error analysis in data analysis exams revealed that plotting and plot/image reading were significant sources of point deductions for most models. Calculation errors also played a more prominent role compared to theory exams, often due to the need to process long tables and perform multiple calculations to generate plots.
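To make concrete what this workflow looks like, here is a hypothetical miniature of such a task in Python; the period-luminosity setting and all data values are invented for illustration and do not come from the paper.

```python
# Hypothetical miniature of an IOAA-style data-analysis task (invented for
# illustration, not taken from the paper): fit a Cepheid period-luminosity
# relation from a small data table, then plot the result.
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data table: pulsation period (days) and mean absolute magnitude.
periods_days = np.array([3.0, 5.4, 10.2, 18.5, 35.6, 54.0])
abs_mag = np.array([-2.8, -3.4, -4.1, -4.8, -5.6, -6.1])

# Step 1: transform the raw column (the relation is linear in log10 of period).
log_p = np.log10(periods_days)

# Step 2: least-squares fit M = a * log10(P) + b.
a, b = np.polyfit(log_p, abs_mag, 1)
print(f"Fitted relation: M = {a:.2f} * log10(P / day) + {b:.2f}")

# Step 3: produce the plot that graders actually score.
plt.scatter(log_p, abs_mag, label="data")
grid = np.linspace(log_p.min(), log_p.max(), 100)
plt.plot(grid, a * grid + b, label="least-squares fit")
plt.gca().invert_yaxis()  # brighter stars (more negative M) plotted upward
plt.xlabel("log10(Period / days)")
plt.ylabel("Absolute magnitude M")
plt.legend()
plt.savefig("pl_relation.png")
```

Errors can compound along this chain: a slip in the column transformation propagates into the fit coefficients and then into the plot, which matches the study's finding that calculation and plotting errors dominate point deductions here.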
Rivaling Human Experts
When compared to human participants, the LLMs consistently performed at a high level. In the theory exams, most LLMs achieved gold medal status, with GPT-5 and Gemini 2.5 Pro frequently outperforming the best human students. Even the lower-performing Claude models often scored well above the human median, placing comfortably among the top participants.
In data analysis exams, GPT-5 and Gemini 2.5 Pro maintained their gold medal level, with GPT-5 even surpassing the best human student in some years. While other models showed more variability, they still achieved respectable bronze or silver medal performances in many instances.
Future Directions for AI in Astronomy
The study concludes that while LLMs are capable enough to serve as valuable “AI co-scientists” for tasks like verifying formulas, exploring parameters, and cross-checking astronomical concepts, they are not yet ready to function as fully autonomous research agents. Their answers still require careful validation to address potential calculation errors and conceptual failures, especially in areas like spherical trigonometry and temporal reasoning.
To overcome these limitations, future developments could include implementing visual sketchpads to help models visualize spatial representations, similar to how humans approach geometric problems. Additionally, synthesizing large-scale visual question-answering datasets could enhance LLMs’ multimodal understanding. By addressing these critical gaps, LLMs can transition from impressive problem-solvers to indispensable research partners, accelerating discoveries in astronomy.
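One possible shape for such a sketchpad loop, sketched minimally in Python, is below; the ask_model helper is a hypothetical stand-in for any multimodal LLM API, and nothing here is an implementation from the paper.

```python
# Minimal sketch of a "visual sketchpad" loop; ask_model() is a hypothetical
# stand-in for a multimodal LLM API call, not an API from the paper.
import numpy as np
import matplotlib.pyplot as plt

def ask_model(prompt: str, image_path: str | None = None) -> str:
    """Hypothetical stub; a real system would call a multimodal LLM here."""
    return "placeholder answer"

def solve_with_sketchpad(problem: str) -> str:
    # Pass 1: text-only attempt that also plans an auxiliary diagram.
    ask_model(f"Describe a diagram that would help solve: {problem}")

    # Render a diagram the model can "look at" (here, a fixed horizon sketch).
    theta = np.linspace(0.0, np.pi, 200)
    plt.plot(np.cos(theta), np.sin(theta))  # visible sky hemisphere
    plt.axhline(0.0, linestyle="--")        # the observer's horizon
    plt.gca().set_aspect("equal")
    plt.savefig("sketch.png")

    # Pass 2: reason again with the rendered image in context.
    return ask_model(f"Using the attached sketch, solve: {problem}",
                     image_path="sketch.png")
```

The design mirrors how human contestants work geometry problems: draw the configuration first, then read the needed angles and relations off the drawing.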
You can read the full research paper for more details: Large Language Models Achieve Gold Medal Performance at International Astronomy & Astrophysics Olympiad