DOoM Benchmark: Challenging Language Models with Russian Math and Physics Problems

TLDR: The DOoM (Difficult Olympiads of Math) benchmark is a new open-source dataset designed to evaluate language models’ ability to solve mathematics and physics problems in Russian. Developed by researchers from HSE University, ITMO, and Vikhr Models, it includes problems ranging from school-level to university Olympiad difficulty. Initial evaluations show models perform better on math than physics, and there’s a strong correlation between the number of tokens generated (indicating reasoning depth) and performance. DOoM aims to bridge the gap in Russian-language reasoning benchmarks.

A new open-source benchmark called DOoM, which stands for Difficult Olympiads of Math, has been introduced to evaluate the capabilities of language models in solving mathematics and physics problems specifically in Russian. This benchmark addresses a significant gap in the AI research community, as there have been very few modern and open resources for assessing reasoning skills in the Russian language, especially compared to the abundance of English-language benchmarks.

The research paper, titled “DOoM: Difficult Olympiads of Math,” was authored by Ilya Kuleshov from HSE University, Ilin Pavel from ITMO, and Nikolay Kompanets, Ksenia Sycheva, and Aleksandr Nikolich from Vikhr Models. Their work highlights the growing interest in using language models for complex problem-solving and the need for domain-specific benchmarks.

DOoM is a comprehensive benchmark that includes a wide array of problems, ranging in difficulty from standard school-level tasks to challenging university Olympiad and entrance exam questions. The benchmark is divided into two main datasets: RussianMath, which constitutes 62.1% of the problems, and RussianPhysics, making up the remaining 37.9%.

The data collection process for DOoM involved carefully selecting problems with verifiable solutions. The authors primarily sourced tasks from Russian school textbooks and archives of school and university-level Olympiads, which often provide official solutions or community-verified answers. This kept the benchmark reliable. By contrast, initial attempts to use books such as Demidovich (1970) proved unsuitable because those sources lacked verifiable solutions.

The RussianMath dataset covers diverse mathematical topics, including combinatorics, algebraic and geometric progressions, and complex number equations. Some advanced problems, such as those from leading university entrance exams, require a deep understanding of concepts like conic sections and coordinate geometry. The RussianPhysics dataset, on the other hand, largely comprises problems from various stages of the All-Russian Olympiad for school children, encompassing mechanics, kinematics, and thermodynamics.

Initial evaluations of several modern language models on the DOoM benchmark revealed two main trends: a gap between math and physics performance, and a link between output length and accuracy. The evaluation methodology compared each model-generated answer against the reference solution in three stages: first as fractions, then as natural numbers, and finally by evaluating parsed LaTeX/Python expressions. Each answer received a binary score, 1 if correct and 0 otherwise, and a model’s overall score was the average across tasks.
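The staged comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors’ checker: the function names (`is_correct`, `benchmark_score`, `eval_expression`) and the tolerance `tol` are assumptions, and the final stage handles only plain Python arithmetic, so LaTeX parsing is left out.

```python
import ast
import operator
from fractions import Fraction

# Hypothetical helpers for illustration; the paper's actual checker may differ.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval_node(node):
    """Evaluate a small arithmetic AST without calling eval()."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left, right = _eval_node(node.left), _eval_node(node.right)
        return _BIN_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError("unsupported expression")


def eval_expression(text):
    """Parse and evaluate a plain Python arithmetic expression."""
    return _eval_node(ast.parse(text.strip(), mode="eval").body)


def is_correct(predicted, reference, tol=1e-9):
    """Compare two answer strings in three stages; tol is an assumed tolerance."""
    # Stage 1: compare as exact fractions, so "1/2" matches "2/4".
    try:
        return Fraction(predicted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        pass
    # Stage 2: compare as natural numbers. Fraction already accepts plain
    # integers, so this stage mainly mirrors the order described in the text.
    try:
        p, r = int(predicted), int(reference)
        if p >= 0 and r >= 0:
            return p == r
    except ValueError:
        pass
    # Stage 3: evaluate both sides as arithmetic expressions.
    try:
        diff = eval_expression(predicted) - eval_expression(reference)
        return abs(diff) <= tol
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False


def benchmark_score(pairs):
    """Average of binary scores over (predicted, reference) pairs."""
    return sum(int(is_correct(p, r)) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    print(is_correct("1/2", "2/4"))
    print(is_correct("3*(2+1)", "9"))
    print(benchmark_score([("1/2", "0.5"), ("7", "8")]))
```

Walking the AST instead of calling `eval()` keeps the sketch from executing arbitrary model output, which matters when the strings being checked come from a language model.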

A significant finding was a consistent performance gap: all evaluated models performed better on mathematics tasks than on physics tasks. For instance, Gemini 2.5 Pro achieved the highest overall score, with 0.874 in math and 0.582 in physics. The researchers hypothesize that this disparity stems from two main factors. The first is the inherent complexity of physics problems, which often require a model to build a qualitative mental model and apply common-sense knowledge before formalizing the problem mathematically. The second is a potential bias in large language models’ training data towards formal mathematics (like code and proofs) over the more nuanced reasoning required for physics.

Another key insight from the study is a strong positive correlation between the number of tokens generated by a model and its performance. Models that produced more tokens, indicating more extensive reasoning, consistently achieved higher scores, with correlation coefficients of 0.79 for math and 0.60 for physics. This suggests that thorough reasoning, often reflected in longer outputs, is crucial for success on these challenging problems. The paper also noted a moderate positive correlation between processing speed and performance, though with notable outliers, indicating that raw inference speed alone isn’t a definitive predictor of capability.
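For readers who want to run the same kind of check on their own results, here is a minimal sketch of a correlation between output length and score. It assumes Pearson’s coefficient, since the text above does not say which coefficient was used, and the numbers in the usage example are made-up placeholders, not values from the paper. The function name `pearson_r` is likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder data for illustration only; NOT taken from the paper.
    avg_tokens_per_model = [500, 1500, 2500, 4000]
    score_per_model = [0.30, 0.45, 0.70, 0.80]
    print(pearson_r(avg_tokens_per_model, score_per_model))
```

Each list holds one entry per model, so the coefficient describes how scores vary across models as their average output length grows. It does not say that longer output causes better answers for any single model.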

The introduction of DOoM marks a crucial step forward for the global AI research community, providing a robust tool for assessing and improving language models’ reasoning abilities in the Russian language. The authors anticipate expanding the benchmark’s scope and complexity in the future, welcoming collaborative efforts to evolve this important standard. The full research paper is available under the title “DOoM: Difficult Olympiads of Math.”
