TLDR: This research paper by Ronan McGovern demonstrates an efficient method for adapting Tiny Recursive Models (TRM) to solve Abstract Reasoning Challenge (ARC) tasks within strict competition compute limits. By pre-training a TRM on a large dataset and then performing full fine-tuning on new tasks during the competition, the model achieved a 6.67% score on semi-private evaluation tasks in significantly fewer optimization steps than training from scratch. The study highlights that starting with a pre-trained model is crucial for accelerating test-time adaptation and that full fine-tuning of both task embeddings and the model’s core network is key to achieving optimal performance.
The Abstract Reasoning Challenge (ARC) Prize is a competition that challenges participants to develop computer programs capable of solving abstract tasks efficiently. These tasks are designed to be intuitive for humans but notoriously difficult for traditional programmatic approaches. A critical constraint of the competition is the strict limit on computational resources: solutions must not only generalize to unseen tasks but also run within tight compute budgets.
Historically, early ARC tasks were tackled with brute-force search methods. However, the 2024 competition saw a shift towards deep transformer models, pre-trained on ARC tasks and then fine-tuned during the competition. These approaches significantly boosted scores on ARC AGI I. The 2025 ARC AGI II tasks, however, presented a much higher level of difficulty, with previous leading models struggling to achieve more than a few percent.
The Rise of Tiny Recursive Models (TRM)
In 2025, a new paradigm emerged with Hierarchical Reasoning Models (HRM), followed by Tiny Recursive Models (TRM). Unlike billion-parameter transformer models, TRMs are far smaller, typically fewer than 100 million parameters, and use recursive layers. The TRM approach, specifically a 7-million parameter single neural network with task embeddings, achieved a 7.8% score on ARC AGI II tasks, making it a leading open-source method at the time. However, this performance came at a cost: the pre-training phase required over 48 hours on four H100 SXM GPUs, far exceeding the ARC Prize 2025 competition’s allowance of 12 hours on four L4 accelerators.
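To make the size of that gap concrete, the short sketch below compares the two budgets in raw device-hours, using only the figures quoted above. The helper name `device_hours` is hypothetical. The comparison deliberately ignores per-device throughput; since an H100 is a considerably faster accelerator than an L4, the real gap is larger than this ratio suggests.

```python
def device_hours(wall_clock_hours: float, num_devices: int) -> float:
    """Total device-hours: wall-clock time multiplied by device count."""
    return wall_clock_hours * num_devices


# Figures from the text: "over 48 hours" on four H100s vs. 12 hours on four L4s.
pretrain_device_hours = device_hours(48, 4)  # lower bound, in H100-hours
budget_device_hours = device_hours(12, 4)    # competition cap, in L4-hours

# Ratio of raw device-hours only; it does NOT account for the speed
# difference between an H100 and an L4.
raw_ratio = pretrain_device_hours / budget_device_hours

print(f"pre-training: at least {pretrain_device_hours:.0f} H100-hours")
print(f"competition budget: {budget_device_hours:.0f} L4-hours")
print(f"raw device-hour ratio: at least {raw_ratio:.1f}x")
```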
Bridging the Compute Gap: Test-Time Adaptation
The core challenge addressed by Ronan McGovern’s research was how to adapt the powerful TRM approach to fit within the competition’s stringent compute limits. The paper demonstrates that by starting with a TRM model that has already been pre-trained on public ARC tasks, one can efficiently fine-tune it on new competition tasks. This test-time adaptation significantly accelerates the training process.
The pre-trained model was trained on 1,280 public tasks for over 700,000 optimizer steps across 48 hours and reached a 10% score on the public evaluation set. It was then post-trained (fine-tuned) in just 12,500 gradient steps during the competition, which yielded a score of 6.67% on semi-private evaluation tasks. This result came from full fine-tuning of the tiny model, rather than lighter-weight methods such as LoRA fine-tuning or fine-tuning only the task embeddings.
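The step counts quoted above imply a large reduction in optimization work at test time. The arithmetic can be checked with a few lines; the variable names are illustrative, and because the pre-training figure is stated as "over 700,000", the ratio is a lower bound.

```python
# Step counts as quoted in the text.
pretrain_steps = 700_000   # "over 700,000", so treat as a lower bound
posttrain_steps = 12_500   # fine-tuning steps during the competition

# How many times fewer steps the post-training phase uses.
step_ratio = pretrain_steps / posttrain_steps

# Post-training steps as a fraction of pre-training steps.
posttrain_fraction = posttrain_steps / pretrain_steps

print(f"post-training uses at least {step_ratio:.0f}x fewer steps")
print(f"that is at most {posttrain_fraction:.2%} of the pre-training steps")
```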
Methodology and Key Findings
The research explored various pre-training and post-training strategies. Three different pre-trained models were developed: one replicating the original TRM paper, another with an expanded dataset and longer training, and a third with a smaller dataset filtered for harder ARC AGI II tasks. For post-training, methods included full fine-tuning, fine-tuning only embeddings, a combination of embeddings-first then full fine-tuning, and LoRA with embeddings fine-tuning.
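One way to read the four post-training strategies is as different choices of which parameter groups receive gradient updates at a given step. The sketch below is not the paper's code: the group names, the `Strategy` enum, and the `switch_step` parameter are all hypothetical, and it assumes that LoRA keeps the core network frozen and trains only low-rank adapters alongside the embeddings.

```python
from enum import Enum

# Hypothetical parameter-group names used only for this illustration.
TASK_EMBEDDINGS = "task_embeddings"
CORE_NETWORK = "core_network"
LORA_ADAPTERS = "lora_adapters"


class Strategy(Enum):
    FULL = "full"
    EMBEDDINGS_ONLY = "embeddings_only"
    EMBEDDINGS_THEN_FULL = "embeddings_then_full"
    LORA_PLUS_EMBEDDINGS = "lora_plus_embeddings"


def trainable_groups(strategy: Strategy, step: int, switch_step: int = 0) -> frozenset:
    """Return the parameter groups that are updated at a given step.

    `switch_step` only matters for EMBEDDINGS_THEN_FULL: before it, only
    the embeddings train; from it onward, everything trains.
    """
    if strategy is Strategy.FULL:
        return frozenset({TASK_EMBEDDINGS, CORE_NETWORK})
    if strategy is Strategy.EMBEDDINGS_ONLY:
        return frozenset({TASK_EMBEDDINGS})
    if strategy is Strategy.EMBEDDINGS_THEN_FULL:
        if step < switch_step:
            return frozenset({TASK_EMBEDDINGS})
        return frozenset({TASK_EMBEDDINGS, CORE_NETWORK})
    if strategy is Strategy.LORA_PLUS_EMBEDDINGS:
        # Assumption: the core weights stay frozen; only adapters train.
        return frozenset({TASK_EMBEDDINGS, LORA_ADAPTERS})
    raise ValueError(f"unknown strategy: {strategy}")


# Example: a two-phase schedule that switches at a hypothetical step 1000.
for step in (0, 999, 1000):
    groups = sorted(trainable_groups(Strategy.EMBEDDINGS_THEN_FULL, step, switch_step=1000))
    print(step, groups)
```

In a real training loop, the returned set would typically decide which parameters have gradients enabled or are handed to the optimizer.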
The results clearly showed that pre-training a TRM from scratch is not feasible within the competition’s compute limits, yielding near-zero scores. In contrast, fine-tuning a pre-trained model dramatically reduced the number of optimizer steps required to achieve performance improvements. The most effective strategy was full fine-tuning, or a combination where embeddings were fine-tuned first, followed by full fine-tuning. Interestingly, training on a smaller, harder dataset for longer did not yield better results than training on a larger, more diverse dataset, suggesting the importance of data diversity in pre-training.
The paper also delves into the architecture of TRMs, particularly the use of augmentation-specific task embeddings. While these embeddings contribute significantly to the model’s overall parameter count, they proved more effective than explicitly encoding augmentation types, hinting at a useful form of generalization forced by this design choice.
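A rough count shows why augmentation-specific embeddings can dominate the parameter budget: the embedding table needs one row per (task, augmentation) pair rather than one row per task. In the sketch below, only the 1,280-task figure comes from the text. The number of augmentations per task and the embedding width are hypothetical values chosen for illustration, and the 7-million core size is used as a placeholder, since the text does not say whether that figure includes the embeddings.

```python
def embedding_params(num_tasks: int, augmentations_per_task: int, embedding_dim: int) -> int:
    """Parameters in a table with one embedding row per (task, augmentation) pair."""
    return num_tasks * augmentations_per_task * embedding_dim


num_tasks = 1_280            # from the text
augmentations_per_task = 8   # hypothetical, not from the paper
embedding_dim = 512          # hypothetical, not from the paper
core_params = 7_000_000      # placeholder for the core network size

# One row per (task, augmentation) pair.
per_augmentation = embedding_params(num_tasks, augmentations_per_task, embedding_dim)

# Alternative: one row per task, with the augmentation type encoded separately.
per_task_only = embedding_params(num_tasks, 1, embedding_dim)

# Share of all parameters taken by the augmentation-specific table.
embedding_share = per_augmentation / (per_augmentation + core_params)

print(f"augmentation-specific table: {per_augmentation:,} parameters")
print(f"per-task table:              {per_task_only:,} parameters")
print(f"embedding share of total:    {embedding_share:.1%}")
```

Under these assumed values, the augmentation-specific table is eight times larger than a per-task table, which is the cost the design pays for the generalization benefit described above.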
Implications and Future Directions
This research highlights the effectiveness of test-time adaptation for complex abstract reasoning challenges under tight compute constraints. It demonstrates that even small models can achieve meaningful performance by leveraging extensive pre-training followed by efficient fine-tuning. The findings suggest that the pre-trained network effectively learns a ‘landscape’ of problem-solving primitives, which can then be quickly adapted to new, unseen tasks.
Future work could explore optimizing model dimensions, tuning hyperparameters, and investigating alternative augmentation strategies to push performance further. The full research paper provides a comprehensive look into these methods and results.


