TLDR: UA-Code-Bench is a new open-source benchmark that evaluates how well Large Language Models (LLMs) generate code from problem statements written in Ukrainian, using 500 competitive programming problems from the Eolymp platform. Even top LLMs such as OpenAI o3 and GPT-5 solve only about half of the problems, and performance drops sharply on harder tasks. The benchmark highlights LLMs’ current limitations in deep algorithmic reasoning and efficiency for low-resource languages, and points to the need for better multilingual training and more robust evaluation methods.
Large Language Models (LLMs) have transformed many kinds of work, including coding, and tools like GitHub Copilot have significantly boosted developer productivity. However, most benchmarks for these models focus on English, leaving a significant gap for low-resource languages like Ukrainian. As a result, speakers of these languages often have to translate their coding problems into English, which can introduce inaccuracies and slow down their work.
To address this gap, researchers Mykyta V. Syromiatnikov and Victoria M. Ruvinskaya have introduced UA-Code-Bench, the first code generation benchmark native to the Ukrainian language. This open-source benchmark is designed to rigorously evaluate LLMs’ ability to generate correct and efficient code for competitive programming tasks described in Ukrainian. The full research paper can be found here: UA-Code-Bench Research Paper.
The Challenge of Low-Resource Languages
Existing LLM benchmarks for Ukrainian often involve simple tasks like classification or question answering, or are merely translations of English evaluation sets. These do not adequately test the complex code generation capabilities required for real-world applications. The quality of input directly impacts the output, and when a coding challenge is posed in Ukrainian, even advanced code assistants may fail, highlighting a fairness and access issue for non-English speakers.
Introducing UA-Code-Bench
UA-Code-Bench is built upon 500 competitive programming problems sourced from Eolymp, a widely used Ukrainian platform with Ukrainian-language problem statements and an automated judging system that runs hidden test cases. These problems are evenly distributed across five difficulty levels, ranging from “very easy” to “very hard.” The set covers everything from basic math and text processing to complex algorithmic challenges that require efficient implementations.
The study evaluated 13 leading LLMs, including both proprietary and open-source models. Each model was given a one-shot prompt containing a Ukrainian example problem and a short correct solution to show the expected format. Solutions were generated in Python 3, with a 30-minute timeout and low-temperature sampling parameters. An automated submission tool sent the generated code to the Eolymp judge, which assessed correctness against private tests. Key metrics were pass@1 (a task counts as solved only if the generated solution passes 100% of the hidden tests) and average score, complemented by indicators for unique solutions, execution time, memory consumption, and error types.
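The two headline metrics can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes the judge reports a per-task score from 0 to 100 (the percentage of hidden tests passed), and the names `TaskResult` and `summarize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    score: float  # assumption: percentage of hidden tests passed, 0-100


def summarize(results):
    """Return accepted count, pass@1, and average score for one model."""
    if not results:
        raise ValueError("no results to summarize")
    total = len(results)
    # A task counts toward pass@1 only when every hidden test passed.
    accepted = sum(1 for r in results if r.score == 100)
    return {
        "accepted": accepted,
        "pass_at_1": accepted / total,
        "average_score": sum(r.score for r in results) / total,
    }


# Toy data, not taken from the paper.
results = [
    TaskResult("a", 100),
    TaskResult("b", 40),
    TaskResult("c", 100),
    TaskResult("d", 0),
]
summary = summarize(results)
print(summary)
```

Keeping both numbers is useful because average score gives partial credit for solutions that pass some tests, while pass@1 does not.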
Key Findings and Performance Insights
The evaluation revealed that even the top-performing models, such as OpenAI o3, OpenAI GPT-5, and OpenAI o4-mini, managed to solve only about half of the problems (246, 244, and 238 accepted solutions out of 486 total tasks, respectively). This clearly demonstrates the significant challenge of code generation in a low-resource natural language like Ukrainian, especially for more complex tasks. About half of all problems, primarily those categorized as hard or very hard, remained unsolved by any model.
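To make "about half" concrete, the accepted counts quoted above can be converted into solve rates. The figures are the ones from the paragraph above; the variable names are illustrative only.

```python
TOTAL_TASKS = 486  # total tasks as quoted in the paragraph above

# Accepted solutions per model, as quoted in the paragraph above.
accepted = {
    "OpenAI o3": 246,
    "OpenAI GPT-5": 244,
    "OpenAI o4-mini": 238,
}

# Solve rate as a percentage, rounded to one decimal place.
rates = {model: round(100 * count / TOTAL_TASKS, 1) for model, count in accepted.items()}

for model, rate in rates.items():
    print(f"{model}: {rate}%")
```

The three models land at roughly 50.6%, 50.2%, and 49.0%, so they are separated by fewer than two percentage points.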
Performance sharply declined as task complexity increased, indicating that while models can handle direct translation of simple Ukrainian instructions, they often lack the deep algorithmic reasoning required for more intricate problems. Interestingly, GPT-5 stood out by uniquely solving 12 tasks, suggesting some advanced reasoning capabilities. However, the small number of unique solutions across models points to a heavy overlap in the types of simple problems they can successfully tackle.
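A "unique solution" count of this kind can be derived from the set of tasks each model solved. The sketch below is one plausible way to compute it, using toy data and hypothetical names (`unique_solves`, `model_a`, and so on); it is not the authors' tooling.

```python
from collections import Counter


def unique_solves(solved_by_model):
    """Map each model to the set of tasks that no other model solved."""
    # Count how many models solved each task.
    solver_counts = Counter(
        task for tasks in solved_by_model.values() for task in tasks
    )
    return {
        model: {task for task in tasks if solver_counts[task] == 1}
        for model, tasks in solved_by_model.items()
    }


# Toy data: task IDs 1-6, three hypothetical models.
all_tasks = {1, 2, 3, 4, 5, 6}
solved_by_model = {
    "model_a": {1, 2, 3},
    "model_b": {2, 3},
    "model_c": {3, 4},
}

unique = unique_solves(solved_by_model)
# Tasks that no model solved at all.
unsolved = all_tasks - set().union(*solved_by_model.values())

print(unique)
print(unsolved)
```

In the toy data, tasks 2 and 3 are shared, which mirrors the heavy overlap described above: most solved tasks are solved by several models at once.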
Beyond correctness, the research also analyzed computational efficiency. OpenAI o3 proved to be the “speed champion,” producing the fastest solutions for 44 tasks, while GPT-5 excelled in memory efficiency, generating the most memory-optimized code for 47 instances. This highlights a crucial trade-off: there isn’t a single “best” model, as different models exhibit distinct strengths in speed versus memory, which can be more or less desirable depending on specific application constraints.
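Tallies like "fastest on 44 tasks" can be produced by comparing accepted solutions task by task. The following is a minimal sketch under stated assumptions: the measurement layout, the function name `count_wins`, and the rule that ties award no win are all hypothetical choices, since the summary above does not say how ties were handled.

```python
from collections import Counter


def count_wins(measurements, metric):
    """Count, per model, the tasks where it has the strictly lowest value of `metric`.

    `measurements` maps task_id -> {model_name: {"time": ..., "memory": ...}}
    and should contain accepted solutions only.
    """
    wins = Counter()
    for per_model in measurements.values():
        if not per_model:
            continue
        best = min(m[metric] for m in per_model.values())
        leaders = [name for name, m in per_model.items() if m[metric] == best]
        # Assumption: a tie awards no win to anyone.
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins


# Toy data: time in seconds, memory in MB.
measurements = {
    "t1": {"model_a": {"time": 0.10, "memory": 12}, "model_b": {"time": 0.20, "memory": 8}},
    "t2": {"model_a": {"time": 0.30, "memory": 10}, "model_b": {"time": 0.30, "memory": 9}},
    "t3": {"model_a": {"time": 0.05, "memory": 20}},
}

time_wins = count_wins(measurements, "time")
memory_wins = count_wins(measurements, "memory")
print(time_wins, memory_wins)
```

In this toy data `model_a` leads on time while `model_b` leads on memory, the same kind of split described above between o3 and GPT-5.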
Conclusion and Future Directions
UA-Code-Bench underscores the value of competitive programming benchmarks in evaluating LLMs, particularly in underrepresented languages. It exposes limitations in current AI models regarding multilingual training and generalization abilities. The research suggests that while LLMs can handle simpler tasks stated in Ukrainian, their performance degrades significantly as complexity increases, indicating that model intelligence and generalization in low-resource languages are still far from their peak.
This work paves the way for future research focusing on multilingual code generation and reasoning-enhanced models. Future work could expand the benchmark to include more programming languages, additional problem sources, and multimodal tasks that combine text and visual information. It also calls for a more in-depth analysis of data contamination and fine-grained error categorization to further refine our understanding of LLM capabilities in this domain.


