Evaluating AI's Coding Prowess in Ukrainian: Introducing UA-Code-Bench

TLDR: UA-Code-Bench is a new open-source benchmark evaluating Large Language Models (LLMs) for code generation in Ukrainian, using 500 competitive programming problems from the Eolymp platform. It found that even top LLMs like OpenAI o3 and GPT-5 solve only about half of the problems, with performance sharply decreasing for harder tasks. The benchmark highlights LLMs’ current limitations in deep algorithmic reasoning and efficiency for low-resource languages, emphasizing the need for better multilingual training and more robust evaluation methods.

Large Language Models (LLMs) have transformed many aspects of work, including coding, and tools like GitHub Copilot have boosted developer productivity. However, most benchmarks for evaluating these models focus on English, leaving a significant gap for low-resource languages like Ukrainian. As a result, speakers of these languages often have to translate their coding problems into English, which can introduce inaccuracies and slow down work.

To address this need, researchers Mykyta V. Syromiatnikov and Victoria M. Ruvinskaya have introduced UA-Code-Bench, the first Ukrainian language-native code generation benchmark. This open-source benchmark is designed to evaluate LLMs’ ability to generate correct and efficient code for competitive programming tasks described in Ukrainian. Full details are available in the UA-Code-Bench research paper.

The Challenge of Low-Resource Languages

Existing LLM benchmarks for Ukrainian often involve simple tasks like classification or question answering, or are merely translations of English evaluation sets. These do not adequately test the complex code generation capabilities required for real-world applications. The quality of input directly impacts the output, and when a coding challenge is posed in Ukrainian, even advanced code assistants may fail, highlighting a fairness and access issue for non-English speakers.

Introducing UA-Code-Bench

UA-Code-Bench is built upon 500 competitive programming problems sourced from the Eolymp platform, a widely used Ukrainian platform known for its Ukrainian-language problem statements and automated judging system with hidden test cases. These problems are evenly distributed across five difficulty levels, ranging from “very easy” to “very hard.” This diverse set ensures a comprehensive evaluation, covering basic math and text processing to complex algorithmic challenges requiring efficient implementations.

The benchmark evaluated 13 leading LLMs, including both proprietary and open-source models. Each model received a one-shot prompt containing a Ukrainian example problem and a short correct solution to show the expected format. Solutions were generated in Python 3, with a 30-minute timeout and low-temperature sampling parameters. An automated submission tool sent the generated code to the Eolymp judge, which assessed correctness against private tests. The key metrics were pass@1 (a task counts as solved only if the submission passes 100% of the hidden tests) and average score, complemented by indicators for unique solutions, execution time, memory consumption, and error types.
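To make the two headline metrics concrete, here is a minimal Python sketch. It is not the benchmark's actual tooling: the `JudgeResult` record, its field names, and the sample data are hypothetical, and the definition of average score as the mean percentage of hidden tests passed per task is an assumption, since the article does not spell it out.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Hypothetical record of one submission (not the benchmark's real schema)."""
    task_id: str
    tests_passed: int
    tests_total: int  # assumed to be greater than zero


def pass_at_1(results):
    """Share of tasks whose single submission passed every hidden test."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.tests_passed == r.tests_total)
    return solved / len(results)


def average_score(results):
    """Mean per-task score, assuming a score is the percentage of tests passed."""
    if not results:
        return 0.0
    scores = [100.0 * r.tests_passed / r.tests_total for r in results]
    return sum(scores) / len(scores)


# Made-up results for four tasks, for illustration only.
results = [
    JudgeResult("task-a", 10, 10),
    JudgeResult("task-b", 5, 10),
    JudgeResult("task-c", 0, 10),
    JudgeResult("task-d", 20, 20),
]

print(f"pass@1: {pass_at_1(results):.2f}")               # pass@1: 0.50
print(f"average score: {average_score(results):.1f}")    # average score: 62.5
```

Under these assumptions, a model can have a fairly high average score while its pass@1 stays low, because partially correct solutions earn partial credit in the first metric but nothing in the second.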

Key Findings and Performance Insights

The evaluation revealed that even the top-performing models, such as OpenAI o3, OpenAI GPT-5, and OpenAI o4-mini, managed to solve only about half of the problems (246, 244, and 238 accepted solutions out of 486 total tasks, respectively). This clearly demonstrates the significant challenge of code generation in a low-resource natural language like Ukrainian, especially for more complex tasks. About half of all problems, primarily those categorized as hard or very hard, remained unsolved by any model.
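The "about half" figure follows directly from the reported counts. The short Python sketch below only redoes that arithmetic with the numbers quoted above; the variable names are illustrative.

```python
# Accepted-solution counts and task total as reported in the article.
TOTAL_TASKS = 486
accepted = {"OpenAI o3": 246, "OpenAI GPT-5": 244, "OpenAI o4-mini": 238}

# Convert each count into a percentage of all evaluated tasks.
rates = {model: 100.0 * count / TOTAL_TASKS for model, count in accepted.items()}

for model, rate in rates.items():
    print(f"{model}: {rate:.1f}%")
# OpenAI o3: 50.6%
# OpenAI GPT-5: 50.2%
# OpenAI o4-mini: 49.0%
```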

Performance sharply declined as task complexity increased, indicating that while models can handle direct translation of simple Ukrainian instructions, they often lack the deep algorithmic reasoning required for more intricate problems. Interestingly, GPT-5 stood out by uniquely solving 12 tasks, suggesting some advanced reasoning capabilities. However, the small number of unique solutions across models points to a heavy overlap in the types of simple problems they can successfully tackle.
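A "uniquely solved" task is one that exactly one model solved. The sketch below shows one way such a count could be derived from per-model sets of solved task IDs; the function name, model names, and task IDs are all hypothetical and are not taken from the paper.

```python
def uniquely_solved(solved_by_model):
    """Map each model to the tasks that it solved and no other model did."""
    unique = {}
    for model, solved in solved_by_model.items():
        # Union of everything solved by the other models.
        others = set().union(
            *(tasks for name, tasks in solved_by_model.items() if name != model)
        )
        unique[model] = solved - others
    return unique


# Toy data: task IDs solved by three hypothetical models.
solved_by_model = {
    "model-a": {1, 2, 3, 4},
    "model-b": {2, 3},
    "model-c": {3, 5},
}

unique = uniquely_solved(solved_by_model)
for model, tasks in unique.items():
    print(model, sorted(tasks))
# model-a [1, 4]
# model-b []
# model-c [5]
```

When the solved sets overlap heavily, as in this toy data, most models end up with few or no unique tasks, which is the pattern the paragraph above describes.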

Beyond correctness, the research also analyzed computational efficiency. OpenAI o3 proved to be the “speed champion,” producing the fastest solutions for 44 tasks, while GPT-5 excelled in memory efficiency, generating the most memory-efficient code for 47 tasks. This points to a trade-off: there is no single “best” model, because different models show distinct strengths in speed versus memory, and which matters more depends on the constraints of the application.

Conclusion and Future Directions

UA-Code-Bench underscores the value of competitive programming benchmarks in evaluating LLMs, particularly in underrepresented languages. It exposes limitations in current AI models regarding multilingual training and generalization abilities. The research suggests that while LLMs can handle simpler tasks stated in Ukrainian, their performance degrades significantly with increasing complexity, indicating that model intelligence and generalization in low-resource languages are still far from their peak.

This work paves the way for future research focusing on multilingual code generation and reasoning-enhanced models. Future work could expand the benchmark to include more programming languages, additional problem sources, and multimodal tasks that combine text and visual information. It also calls for a more in-depth analysis of data contamination and fine-grained error categorization to further refine our understanding of LLM capabilities in this domain.
