TLDR: The DOoM (Difficult Olympiads of Math) benchmark is a new open-source dataset designed to evaluate language models’ ability to solve mathematics and physics problems in Russian. Developed by researchers from HSE University, ITMO, and Vikhr Models, it includes problems ranging from school-level to university Olympiad difficulty. Initial evaluations show models perform better on math than physics, and there’s a strong correlation between the number of tokens generated (indicating reasoning depth) and performance. DOoM aims to bridge the gap in Russian-language reasoning benchmarks.
A new open-source benchmark called DOoM, which stands for Difficult Olympiads of Math, has been introduced to evaluate the capabilities of language models in solving mathematics and physics problems specifically in Russian. This benchmark addresses a significant gap in the AI research community, as there have been very few modern and open resources for assessing reasoning skills in the Russian language, especially compared to the abundance of English-language benchmarks.
The research paper, titled “DOoM: Difficult Olympiads of Math,” was authored by Ilya Kuleshov from HSE University, Ilin Pavel from ITMO, and Nikolay Kompanets, Ksenia Sycheva, and Aleksandr Nikolich from Vikhr Models. Their work highlights the growing interest in using language models for complex problem-solving and the need for domain-specific benchmarks.
DOoM is a comprehensive benchmark that includes a wide array of problems, ranging in difficulty from standard school-level tasks to challenging university Olympiad and entrance exam questions. The benchmark is divided into two main datasets: RussianMath, which constitutes 62.1% of the problems, and RussianPhysics, making up the remaining 37.9%.
The data collection process for DOoM involved carefully selecting problems with verifiable solutions. The authors primarily sourced tasks from Russian school textbooks and archives of school and university-level Olympiads, which often provide official solutions or community-verified answers. This approach kept the benchmark reliable. An initial attempt to use books such as Demidovich (1970) fell short because they lacked verifiable solutions.
The RussianMath dataset covers diverse mathematical topics, including combinatorics, arithmetic and geometric progressions, and complex number equations. Some advanced problems, such as those from leading university entrance exams, require a deep understanding of concepts like conic sections and coordinate geometry. The RussianPhysics dataset, on the other hand, largely comprises problems from various stages of the All-Russian Olympiad for schoolchildren, encompassing mechanics, kinematics, and thermodynamics.
Initial evaluations of several modern language models on the DOoM benchmark revealed interesting trends. The evaluation compared each model-generated answer against the reference solution in stages: first as fractions, then as natural numbers, and finally by evaluating parsed LaTeX/Python expressions. Each answer received a binary score, 1 if correct and 0 otherwise, and a model's overall score was the average across tasks.
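The staged comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the names `score_answer` and `benchmark_score` are hypothetical, the expression stage handles only simple Python arithmetic (LaTeX parsing is left out), and the floating-point tolerance is an arbitrary choice.

```python
import ast
import math
import operator
from fractions import Fraction

# Operators allowed in the expression stage (a hypothetical whitelist).
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval_node(node):
    """Evaluate a small arithmetic AST without calling eval()."""
    if isinstance(node, ast.Expression):
        return _eval_node(node.body)
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError("unsupported expression")


def _as_fraction(text):
    # Stage 1: accept only the literal form "a/b".
    cleaned = text.replace(" ", "")
    if "/" not in cleaned:
        raise ValueError("not a fraction")
    return Fraction(cleaned)


def _as_natural(text):
    # Stage 2: accept only strings made of ASCII digits.
    cleaned = text.strip()
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError("not a natural number")
    return int(cleaned)


def _as_expression(text):
    # Stage 3: evaluate a simple arithmetic expression to a float.
    return float(_eval_node(ast.parse(text.strip(), mode="eval")))


def _close(a, b):
    # Assumed tolerance for comparing evaluated expressions.
    return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)


_STAGES = [
    (_as_fraction, operator.eq),
    (_as_natural, operator.eq),
    (_as_expression, _close),
]


def score_answer(predicted, reference):
    """Return 1 if any stage judges the two answers equal, else 0."""
    for parse, same in _STAGES:
        try:
            if same(parse(predicted), parse(reference)):
                return 1
        except (ValueError, SyntaxError, ZeroDivisionError, OverflowError):
            continue  # this stage does not apply; try the next one
    return 0


def benchmark_score(pairs):
    """Average the binary scores over (predicted, reference) pairs."""
    scores = [score_answer(p, r) for p, r in pairs]
    return sum(scores) / len(scores)
```

In this sketch, an answer that fails to parse at one stage simply falls through to the next, so `"6/8"` matches `"3/4"` at the fraction stage while `"1 + 1/2"` matches `"1.5"` only at the expression stage.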
A significant finding was a consistent performance gap: all evaluated models performed better on mathematics tasks than on physics tasks. For instance, Gemini 2.5 Pro achieved the highest overall score, with 0.874 in math and 0.582 in physics. The researchers hypothesize that this disparity stems from two main factors. First, physics problems are inherently complex: they often require a model to build a qualitative mental model and apply common-sense knowledge before formalizing the problem mathematically. Second, the training data of large language models may be biased towards formal mathematics (like code and proofs) over the more nuanced reasoning required for physics.
Another key insight from the study is a positive correlation between the number of tokens a model generates and its performance. Models that produced more tokens, indicating more extensive reasoning, tended to achieve higher scores, with correlation coefficients of 0.79 for math and 0.60 for physics. This suggests that thorough reasoning, often reflected in longer outputs, is important for success on these challenging problems. The paper also noted a moderate positive correlation between processing speed and performance, though with notable outliers, indicating that raw inference speed alone isn’t a definitive predictor of capability.
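To make the correlation figures concrete, here is a small sketch of how such a coefficient could be computed from per-model token counts and scores. It assumes Pearson's r, which this summary does not confirm is the measure the authors used, and the numbers in the usage example are invented placeholders, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (unnormalized covariance and variances).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-model values for illustration only (not from the paper).
    avg_tokens = [800, 1500, 2300, 4100]
    scores = [0.31, 0.42, 0.58, 0.66]
    print(f"r = {pearson(avg_tokens, scores):.2f}")
```

A value near 1 means models with longer outputs also score higher almost in lockstep; a value like 0.60 still indicates a positive trend but leaves room for models that break the pattern.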
The introduction of DOoM is a meaningful step forward for the AI research community, providing an open tool for assessing and improving language models’ reasoning abilities in Russian. The authors anticipate expanding the benchmark’s scope and complexity in the future and welcome collaborative efforts to evolve it. The full research paper is titled “DOoM: Difficult Olympiads of Math.”


