TLDR: Researchers from NVIDIA developed GENCLUSTER, a scalable test-time compute framework that enabled the open-weight language model gpt-oss-120b to achieve a gold medal at the International Olympiad in Informatics (IOI) 2025. This marks the first time an open-weight model has reached this level of performance in the prestigious competitive programming competition, demonstrating a reproducible method for enhancing LLM problem-solving capabilities.
Competitive programming, a challenging arena for human and artificial intelligence alike, has long been a benchmark for evaluating advanced problem-solving and reasoning capabilities. The International Olympiad in Informatics (IOI) stands as one of the most prestigious annual competitions in this field, pushing the boundaries of algorithmic thinking and coding prowess.
While proprietary AI models have previously claimed gold medal-level performance at the IOI, often with their methods kept under wraps, achieving similar results with publicly available, open-weight models has remained a significant hurdle. This gap has now been narrowed by a team of researchers from NVIDIA, who introduced a groundbreaking framework called GENCLUSTER. Their work, detailed in the research paper “Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models”, demonstrates a scalable and reproducible approach that has enabled an open-weight model to achieve IOI gold-level performance for the first time.
Introducing GENCLUSTER: A New Approach to Competitive Programming
GENCLUSTER is a sophisticated test-time compute framework designed to enhance the performance of large language models (LLMs) in competitive programming. It operates through a multi-stage pipeline that efficiently explores diverse solution spaces, even under strict validation budgets, such as the limited number of submissions allowed in IOI.
The framework begins with large-scale generation, where the LLM produces a vast pool of potential solutions for a given problem. This is followed by behavioral clustering, a process that groups similar solutions together based on how they perform on various test inputs. By understanding which solutions behave alike, GENCLUSTER can manage the large number of candidates more effectively.
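The clustering step above can be sketched in a few lines: candidates that produce identical outputs on every shared test input are treated as behaviorally equivalent. This is a minimal illustration, not the paper's implementation; the `run_on` callback (which executes a candidate on one input) is a hypothetical stand-in for an actual sandboxed code runner.

```python
from collections import defaultdict

def behavioral_clusters(solutions, run_on, test_inputs):
    """Group candidate solutions by their behavior on shared test inputs.

    `run_on(sol, x)` is an assumed runner that executes candidate `sol`
    on input `x` and returns its output. Candidates with identical
    output signatures across all test inputs land in the same cluster.
    """
    clusters = defaultdict(list)
    for sol in solutions:
        signature = tuple(run_on(sol, x) for x in test_inputs)
        clusters[signature].append(sol)
    return list(clusters.values())
```

With, say, 5,000 generated candidates, this collapses the pool into far fewer behavior groups, so later ranking and submission only need to reason about one representative per group.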
Next, a ranking mechanism, inspired by a round-robin tournament, evaluates these clusters. Representative solutions from different clusters “compete” against each other, with an LLM acting as a judge to determine the better solution in pairwise comparisons. This tournament helps prioritize the most promising clusters.
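The tournament idea can be sketched as follows. In this simplified version, `judge(a, b)` is a placeholder for the LLM judge described above: it takes two cluster representatives and returns the preferred one. Ranking by total wins over repeated round-robin passes is an assumption for illustration, not the paper's exact scoring rule.

```python
import itertools

def tournament_rank(reps, judge, rounds=1):
    """Rank cluster representatives by round-robin pairwise comparisons.

    `judge(a, b)` stands in for an LLM judge that returns the better of
    the two candidates. Every pair plays `rounds` games; representatives
    are then sorted by total wins, best first.
    """
    wins = [0] * len(reps)
    for _ in range(rounds):
        for i, j in itertools.combinations(range(len(reps)), 2):
            winner = judge(reps[i], reps[j])
            wins[i if winner == reps[i] else j] += 1
    return [reps[k] for k in sorted(range(len(reps)), key=lambda k: -wins[k])]
```

Because an LLM judge is noisy, playing multiple rounds per pair (rather than a single comparison) averages out inconsistent verdicts, which matches the paper's observation that scores improve with more games.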
Finally, a strategic round-robin submission strategy is employed. This strategy carefully selects and submits solutions from the top-ranked clusters, adhering to the IOI’s strict limit of 50 submissions per problem. This ensures that the most likely correct solutions are submitted first, maximizing the chances of achieving a high score.
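A round-robin submission pass over the ranked clusters might look like the sketch below: take one untried candidate from each cluster per pass, highest-ranked cluster first, until the budget is spent. This spreads the 50 slots across distinct behaviors instead of burning them on near-duplicate solutions. Again, this is a sketch of the strategy as described, not the authors' exact code.

```python
def round_robin_submissions(ranked_clusters, budget=50):
    """Select up to `budget` submissions by cycling through ranked clusters.

    Each pass takes one not-yet-submitted candidate from every cluster
    (best-ranked cluster first), so submission slots cover as many
    distinct behaviors as possible before revisiting any cluster.
    """
    picks, depth = [], 0
    while len(picks) < budget:
        took_any = False
        for cluster in ranked_clusters:
            if depth < len(cluster):
                picks.append(cluster[depth])
                took_any = True
                if len(picks) == budget:
                    return picks
        if not took_any:
            break  # every cluster is exhausted
        depth += 1
    return picks
```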
Breaking Barriers with Open-Weight Models
A key highlight of this research is the achievement of a gold medal at IOI 2025 using the open-weight model gpt-oss-120b, a significant milestone for transparent and reproducible AI evaluation. The experiments showed that GENCLUSTER’s performance consistently scales with the available computational resources, substantially narrowing the performance gap between open and closed AI systems in competitive programming.
The researchers evaluated several top-performing open-weight models, including DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking, but found gpt-oss-120b to be significantly superior for IOI problems, especially in its ability to scale with more generations. The framework demonstrated that even with the 50-submission limit, increasing the number of generated candidate solutions from 50 to 5000 significantly improved the final scores, moving from a bronze-level performance to a gold medal.
Insights into the Framework’s Components
The study also delved into the impact of various parameters on GENCLUSTER’s effectiveness. For instance, increasing the number of test cases used for behavioral clustering improved the “purity” of the clusters, meaning solutions within a cluster were more consistently good or bad. However, more test cases also led to a larger number of distinct clusters, posing a challenge for selection under submission constraints.
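One common way to quantify the “purity” mentioned above is the size-weighted share of each cluster's majority label. The sketch below assumes each candidate carries a pass/fail label (a simplification for illustration; the metric's exact definition in the paper may differ).

```python
def weighted_purity(clusters):
    """Size-weighted average purity of pass/fail labelled clusters.

    Each cluster is a list of booleans (True = candidate is correct, an
    assumed labelling). A cluster's purity is the fraction of its
    majority label; the overall score weights clusters by their size.
    """
    total = sum(len(c) for c in clusters)
    majority_mass = 0
    for c in clusters:
        passes = sum(c)
        majority_mass += max(passes, len(c) - passes)
    return majority_mass / total
```

Under this definition, more test cases sharpen the behavioral signatures, pushing each cluster toward a single label (purity → 1.0), but also splitting the pool into more clusters than the submission budget can cover.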
The tournament-based ranking mechanism was found to be crucial, with scores improving as the number of “games” played between clusters increased, saturating after about 10 rounds. This suggests that multiple comparisons are necessary for reliable judgments. Furthermore, the quality of the ranking was high, with the best solution appearing in the top 50 clusters for 35 out of 39 subtasks.
The research also confirmed a correlation between the length of an LLM’s reasoning trace and its accuracy on complex problems. Models like gpt-oss-120b and gpt-oss-20b showed improved performance with longer generation lengths, indicating that providing more “thinking time” to these models can lead to better solutions.
A Step Towards Transparent AI Excellence
The GENCLUSTER framework represents a significant advancement in competitive programming AI. By demonstrating gold-medal performance at the IOI with open-weight models and a transparent methodology, this work sets a new benchmark. It provides a reproducible baseline for future research, fostering a more open and collaborative environment for developing advanced reasoning capabilities in large language models.