Unlocking Efficiency: Test-Time Adaptation for Tiny Recursive Models in the ARC Prize

TLDR: This research paper by Ronan McGovern demonstrates an efficient method for adapting Tiny Recursive Models (TRM) to solve Abstraction and Reasoning Corpus (ARC) tasks within strict competition compute limits. By pre-training a TRM on a large dataset and then performing full fine-tuning on new tasks during the competition, the model achieved a 6.67% score on semi-private evaluation tasks in far fewer optimization steps than training from scratch. The study finds that starting from a pre-trained model is crucial for accelerating test-time adaptation, and that full fine-tuning of both the task embeddings and the model’s core network gave the best results.

The ARC Prize, built on the Abstraction and Reasoning Corpus (ARC), is a competition that challenges participants to develop computer programs capable of solving abstract tasks efficiently. These tasks are designed to be intuitive for humans but notoriously difficult for traditional programmatic approaches. A critical constraint of the competition is the strict limit on computational resources: solutions must not only generalize to unseen tasks but also run within tight compute budgets.

Early ARC solutions relied on brute-force search methods. The 2024 competition, however, saw a shift towards deep transformer models that were pre-trained on ARC tasks and then fine-tuned during the competition. These approaches significantly boosted scores on ARC AGI I. The 2025 ARC AGI II tasks are considerably harder, and previous leading models struggled to score more than a few percent on them.

The Rise of Tiny Recursive Models (TRM)

In 2025, a new paradigm emerged with Hierarchical Reasoning Models (HRM), followed by Tiny Recursive Models (TRM). Unlike billion-parameter transformer models, TRMs are significantly smaller, typically fewer than 100 million parameters, and use recursive layers. The TRM approach, specifically a single 7-million-parameter neural network with task embeddings, achieved a 7.8% score on ARC AGI II tasks, making it a leading open-source method at the time. However, this performance came at a cost: the pre-training phase required over 48 hours on four H100 SXM GPUs, far exceeding the ARC Prize 2025 competition’s allowance of 12 hours on four L4 accelerators.

Bridging the Compute Gap: Test-Time Adaptation

The core challenge addressed by Ronan McGovern’s research was how to adapt the TRM approach to fit within the competition’s stringent compute limits. The paper demonstrates that by starting with a TRM that has already been pre-trained on public ARC tasks, one can efficiently fine-tune it on new competition tasks. This test-time adaptation sharply reduces the number of training steps needed during the competition.

The pre-training involved a model trained on 1,280 public tasks for over 700,000 optimizer steps across 48 hours, achieving a 10% score on the public evaluation set. Crucially, this pre-trained model was then post-trained (fine-tuned) in just 12,500 gradient steps during the competition, leading to a score of 6.67% on semi-private evaluation tasks. This performance was achieved through full fine-tuning of the tiny model, rather than less comprehensive methods like LoRA fine-tuning or fine-tuning only task embeddings.
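To make the scale of this saving concrete, the figures quoted above can be compared directly. The short Python sketch below only does arithmetic on numbers stated in this article; the variable and function names are illustrative, not taken from the paper. Because the article says "over" 700,000 steps and "over" 48 hours, the results are lower bounds, and since an H100 and an L4 are not equivalent accelerators, the GPU-hour comparison likely understates the real gap in compute.

```python
# Figures quoted in the article (names are illustrative).
PRETRAIN_STEPS = 700_000     # "over 700,000 optimizer steps"
PRETRAIN_HOURS = 48          # "over 48 hours"
PRETRAIN_GPUS = 4            # H100 SXM GPUs
POSTTRAIN_STEPS = 12_500     # gradient steps used during the competition
COMPETITION_HOURS = 12       # competition allowance
COMPETITION_GPUS = 4         # L4 accelerators


def gpu_hours(hours: float, gpus: int) -> float:
    """Wall-clock hours multiplied by the number of accelerators."""
    return hours * gpus


# How many times more optimizer steps pre-training used than post-training.
step_ratio = PRETRAIN_STEPS / POSTTRAIN_STEPS

# Raw accelerator-hours; these ignore the speed difference between
# an H100 and an L4, so they are not a like-for-like compute measure.
pretrain_gpu_hours = gpu_hours(PRETRAIN_HOURS, PRETRAIN_GPUS)
competition_gpu_hours = gpu_hours(COMPETITION_HOURS, COMPETITION_GPUS)

print(f"Pre-training used {step_ratio:.0f}x the steps of post-training")
print(f"Pre-training: {pretrain_gpu_hours:.0f} GPU-hours (H100)")
print(f"Competition budget: {competition_gpu_hours:.0f} GPU-hours (L4)")
```

On these figures, post-training uses roughly one fifty-sixth of the optimizer steps spent on pre-training.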

Methodology and Key Findings

The research explored various pre-training and post-training strategies. Three different pre-trained models were developed: one replicating the original TRM paper, another with an expanded dataset and longer training, and a third with a smaller dataset filtered for harder ARC AGI II tasks. For post-training, methods included full fine-tuning, fine-tuning only embeddings, a combination of embeddings-first then full fine-tuning, and LoRA with embeddings fine-tuning.
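The four post-training strategies differ mainly in which parameters receive gradient updates and when. The sketch below expresses that as a small schedule function in plain Python. It is an illustration only: the group names ("task_embeddings", "core_network", "lora_adapters") and the `switch_step` parameter are assumptions introduced here, and the article does not state at which step the embeddings-first strategy switches to full fine-tuning.

```python
from enum import Enum


class Strategy(Enum):
    FULL = "full"
    EMBEDDINGS_ONLY = "embeddings_only"
    EMBEDDINGS_THEN_FULL = "embeddings_then_full"
    LORA_PLUS_EMBEDDINGS = "lora_plus_embeddings"


def trainable_groups(strategy: Strategy, step: int, switch_step: int = 0) -> set:
    """Return the (hypothetical) parameter groups updated at a given step.

    `switch_step` only matters for EMBEDDINGS_THEN_FULL: before it, only
    the task embeddings are trained; from it onwards, everything is.
    """
    if strategy is Strategy.FULL:
        return {"task_embeddings", "core_network"}
    if strategy is Strategy.EMBEDDINGS_ONLY:
        return {"task_embeddings"}
    if strategy is Strategy.EMBEDDINGS_THEN_FULL:
        if step < switch_step:
            return {"task_embeddings"}
        return {"task_embeddings", "core_network"}
    if strategy is Strategy.LORA_PLUS_EMBEDDINGS:
        # Core weights stay frozen; only low-rank adapters and embeddings move.
        return {"task_embeddings", "lora_adapters"}
    raise ValueError(f"unknown strategy: {strategy!r}")


# Example: with an assumed switch at step 100, step 10 trains embeddings only.
early = trainable_groups(Strategy.EMBEDDINGS_THEN_FULL, step=10, switch_step=100)
late = trainable_groups(Strategy.EMBEDDINGS_THEN_FULL, step=100, switch_step=100)
print(sorted(early), sorted(late))
```

In a real training loop, a function like this would decide which parameters have gradients enabled at each step; here it only returns names so the differences between the strategies are easy to see.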

The results clearly showed that pre-training a TRM from scratch is not feasible within the competition’s compute limits, yielding near-zero scores. In contrast, fine-tuning a pre-trained model dramatically reduced the number of optimizer steps required to achieve performance improvements. The most effective strategy was full fine-tuning, or a combination where embeddings were fine-tuned first, followed by full fine-tuning. Interestingly, training on a smaller, harder dataset for longer did not yield better results than training on a larger, more diverse dataset, suggesting the importance of data diversity in pre-training.

The paper also delves into the architecture of TRMs, particularly the use of augmentation-specific task embeddings. While these embeddings contribute significantly to the model’s overall parameter count, they proved more effective than explicitly encoding augmentation types, hinting at a useful form of generalization forced by this design choice.
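One way to see why augmentation-specific embeddings inflate the parameter count is to compare them with a factored alternative in which a task vector and an augmentation vector are stored separately. The Python sketch below is a toy illustration under stated assumptions: the class, the function names, the embedding dimension, and the task and augmentation counts are all hypothetical and are not taken from the paper or from the TRM code.

```python
import random


class AugmentationSpecificEmbeddings:
    """Toy table with one independent vector per (task, augmentation) pair."""

    def __init__(self, dim: int, seed: int = 0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}

    def lookup(self, task_id: str, augmentation: str) -> list:
        key = (task_id, augmentation)
        if key not in self._table:
            # Each pair gets its own randomly initialised vector.
            self._table[key] = [self._rng.gauss(0.0, 0.02) for _ in range(self.dim)]
        return self._table[key]

    def parameter_count(self) -> int:
        return len(self._table) * self.dim


def per_pair_params(num_tasks: int, num_augmentations: int, dim: int) -> int:
    """One vector for every (task, augmentation) combination."""
    return num_tasks * num_augmentations * dim


def factored_params(num_tasks: int, num_augmentations: int, dim: int) -> int:
    """One vector per task plus one vector per augmentation type."""
    return (num_tasks + num_augmentations) * dim


# Hypothetical toy sizes, chosen only to show how the counts scale.
table = AugmentationSpecificEmbeddings(dim=8)
for task in ["task_a", "task_b", "task_c"]:
    for aug in ["identity", "rot90", "flip", "recolor"]:
        table.lookup(task, aug)

print(table.parameter_count())        # per-pair table
print(factored_params(3, 4, 8))       # factored alternative
```

The per-pair count grows with the product of tasks and augmentations, while the factored count grows with their sum, which is consistent with the article's remark that these embeddings account for a large share of the model's parameters.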

Implications and Future Directions

This research highlights the effectiveness of test-time adaptation for complex abstract reasoning challenges under tight compute constraints. It demonstrates that even small models can achieve meaningful performance by leveraging extensive pre-training followed by efficient fine-tuning. The findings suggest that the pre-trained network effectively learns a ‘landscape’ of problem-solving primitives, which can then be quickly adapted to new, unseen tasks.

Future work could explore optimizing model dimensions, hyperparameter tuning, and investigating alternative augmentation strategies to further push performance boundaries. The full research paper provides a comprehensive look into these methods and results.
