TLDR: ArchPilot is a multi-agent system for automated machine learning engineering that reduces computational cost and shortens development time. It uses three specialized agents (Orchestration, Generation, and Evaluation) to explore ML pipeline designs efficiently. The Evaluation Agent relies on fast, proxy-based evaluations with adaptive reweighting, which minimizes dependence on expensive full training runs. Experiments show ArchPilot outperforms existing methods, especially on complex tasks, by prioritizing high-potential candidates under limited budgets.
The field of machine learning engineering is constantly evolving, with a growing demand for automated systems that can design and optimize complex ML pipelines. Traditionally, this process has been resource-intensive, often requiring numerous full training runs to evaluate different model architectures and hyperparameters. This approach leads to significant computational costs, limits the exploration of vast solution spaces, and slows down the development cycle.
Addressing these challenges, researchers have introduced ArchPilot, an innovative multi-agent system designed to streamline machine learning engineering. ArchPilot aims to make the process more efficient and scalable by reducing its reliance on expensive full training runs. It achieves this by integrating architecture generation, proxy-based evaluation, and adaptive search within a unified framework.
How ArchPilot Works: A Collaborative System
ArchPilot operates through the collaboration of three specialized agents, each with a distinct role:
The Orchestration Agent (OA) acts as the system’s coordinator. It manages the overall search process, employing a novel algorithm inspired by Monte Carlo Tree Search (MCTS) that includes a restart mechanism. This agent keeps track of previous candidate solutions and guides the exploration towards promising areas, ensuring efficient use of computational resources.
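To make the search idea concrete, here is a minimal sketch of the selection and backpropagation steps found in standard MCTS, using the common UCT rule. It is not ArchPilot's actual algorithm: the `Node` class, the exploration constant `c`, and the function names are all hypothetical, and the restart mechanism is not modeled here.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One candidate pipeline in the search tree (hypothetical structure)."""
    name: str
    visits: int = 0
    total_score: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT value: mean score plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited candidates first
    mean = child.total_score / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT value."""
    return max(parent.children, key=lambda ch: uct(ch, parent.visits, c))


def backpropagate(path: List[Node], score: float) -> None:
    """Record an evaluation score on every node along the visited path."""
    for node in path:
        node.visits += 1
        node.total_score += score
```

A larger `c` favors rarely visited candidates, while `c = 0` reduces the rule to picking the best mean score seen so far.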
The Generation Agent (GA) is responsible for creating and refining machine learning architectures. It iteratively generates initial designs, debugs failing pipelines, and proposes incremental improvements to candidate architectures. The GA works by taking context from the Orchestration Agent, such as task descriptions and available resources, to produce runnable scripts.
The Evaluation Agent (EA) is the core component that reduces the need for full training runs. Instead, it executes “proxy training runs,” which are much faster and less resource-intensive. This agent generates and optimizes proxy functions: lightweight metrics that quickly estimate the performance of a candidate architecture. It then aggregates the proxy scores into a single performance metric that accounts for how reliable each proxy is. Once enough results from full training runs are available, the EA adaptively reweights the proxies so that the aggregated score aligns better with actual performance.
Key Innovations for Efficiency
A central innovation of ArchPilot is its multi-proxy evaluation system with adaptive reweighting. Instead of relying on a single, potentially unreliable heuristic or a costly full training run, the Evaluation Agent uses a small set of diverse, inexpensive proxies. These proxies might include one-epoch validation (training for a very short period), noisy validation (adding noise to inputs), and feature-dropout validation (masking input features). By combining these signals, ArchPilot gets a broad yet fast estimate of a candidate’s potential.
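The following sketch illustrates how two of these proxies and their combination could look in plain Python. It is an illustration under assumptions, not the paper's implementation: the function names, the Gaussian noise level `sigma`, the masking probability `drop_prob`, and the weighted-average aggregation are all hypothetical choices, and the one-epoch score is assumed to come from a separate short training run.

```python
import random
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[Sequence[float]], int]


def accuracy(predict: Predictor, X: List[List[float]], y: List[int]) -> float:
    """Fraction of validation examples classified correctly."""
    correct = sum(1 for xi, yi in zip(X, y) if predict(xi) == yi)
    return correct / len(y)


def noisy_validation(predict, X, y, sigma=0.1, seed=0) -> float:
    """Accuracy after adding Gaussian noise to every input feature."""
    rng = random.Random(seed)
    X_noisy = [[v + rng.gauss(0.0, sigma) for v in xi] for xi in X]
    return accuracy(predict, X_noisy, y)


def feature_dropout_validation(predict, X, y, drop_prob=0.2, seed=0) -> float:
    """Accuracy after zeroing each input feature with probability drop_prob."""
    rng = random.Random(seed)
    X_masked = [[0.0 if rng.random() < drop_prob else v for v in xi] for xi in X]
    return accuracy(predict, X_masked, y)


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of proxy scores (assumed aggregation rule)."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total
```

For example, `aggregate({"one_epoch": 0.8, "noisy": 0.6}, {"one_epoch": 1.0, "noisy": 1.0})` averages the two scores equally; unequal weights let a more trusted proxy dominate.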
As the system gathers more data from occasional full training runs, the Evaluation Agent refines the weights assigned to each proxy, making the aggregated score more accurate. If these weights change significantly, the Orchestration Agent can trigger a “tree restart,” which re-evaluates and re-prioritizes candidates based on the updated scoring system, ensuring the search remains focused on the most promising paths.
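The article does not spell out how the weights are refit or when a restart fires, so the sketch below fills those gaps with explicit assumptions: each proxy is weighted by its Pearson correlation with full-training scores (negative correlations clipped to zero), and a restart is triggered when the total absolute change in weights exceeds a hypothetical `threshold`. All names here are illustrative.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def reweight(proxy_history: Dict[str, List[float]],
             full_scores: List[float]) -> Dict[str, float]:
    """Assumed rule: weight each proxy by its clipped correlation with
    full-training scores, normalized so the weights sum to 1."""
    raw = {name: max(0.0, pearson(vals, full_scores))
           for name, vals in proxy_history.items()}
    total = sum(raw.values())
    if total == 0:
        # No proxy tracks real performance; fall back to uniform weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: r / total for name, r in raw.items()}


def should_restart(old: Dict[str, float], new: Dict[str, float],
                   threshold: float = 0.2) -> bool:
    """Assumed trigger: restart when the weights shift by more than threshold."""
    shift = sum(abs(new[k] - old[k]) for k in old)
    return shift > threshold
```

Under this rule, a proxy whose scores rank candidates in the opposite order to full training ends up with zero weight, and a large shift of that kind would prompt the Orchestration Agent to rescore the tree.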
Performance and Impact
Experiments conducted on MLE-Bench, a comprehensive benchmark for machine learning tasks, demonstrate ArchPilot’s effectiveness. It consistently outperforms state-of-the-art baselines like AIDE and ML-Master. For instance, ArchPilot achieved a higher valid submission rate and a better average normalized rank compared to its counterparts. Its advantages were particularly noticeable on high-difficulty tasks, where the cost of full training is prohibitive, highlighting the value of its proxy-guided search.
This multi-agent, proxy-guided approach allows ArchPilot to explore a much larger portion of the solution space under the same computational budget, leading to higher quality solutions and more efficient machine learning engineering. The system’s modular design also allows for independent upgrades and improvements to each agent, ensuring its adaptability and future potential.
For more in-depth information, you can read the full research paper: ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering.


