TLDR: Maestro is a novel framework that jointly optimizes the structural design (graph) and operational configurations (prompts, models, tools) of AI agents. Unlike previous methods that only tune configurations, Maestro addresses fundamental structural flaws, leading to more reliable and efficient agents. It leverages both numeric and reflective textual feedback to guide its optimization, achieving significant performance improvements on benchmarks and real-world applications like interviewer and RAG agents, often with fewer training steps.
The field of Artificial Intelligence is rapidly advancing, with Large Language Models (LLMs) enabling a new paradigm of AI agents that can autonomously plan and act to accomplish complex tasks. These agents aim to reduce human intervention by converting high-level instructions into multi-step decisions and tool calls. However, despite their promise, current AI agents often fall short in delivering reliable results, frequently encountering limitations such as poor instruction following, unanticipated failures in unusual scenarios, mismanagement of global state, architectural fragility, and weak error handling.
A new research paper introduces Maestro, a framework-agnostic optimizer designed to address these challenges by taking a holistic approach to AI agent design. Most existing methods for improving AI agents tune only the configurations (prompts, models, and tools) while keeping the agent's underlying structure, or ‘graph’, fixed. This leaves many fundamental structural failure modes unaddressed.
Maestro’s Holistic Approach
Maestro stands out by jointly optimizing both the agent’s graph (which modules exist and how information flows between them) and the configuration of each node within that graph (models, prompts, tools, and control parameters). This dual-level optimization allows Maestro to tackle structural deficiencies that prompt tuning alone cannot fix.
The framework operates through two complementary steps:
- C-step (Configuration Update): In this step, Maestro keeps the agent’s graph fixed and focuses on tuning the configurations of its components. This involves optimizing elements like prompts, model choices, and hyperparameters to improve task performance.
- G-step (Graph Update): Here, Maestro proposes and implements small structural edits to the agent’s graph, such as adding, removing, or rewiring nodes and edges. These changes can introduce new capabilities, like persistent memory or conditional routing, to address deeper architectural flaws.
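The alternation between the two steps can be sketched in a few lines of Python. This is only an illustrative sketch, not Maestro's actual search procedure: the `Agent` class, the `optimize` function, the two proposal callbacks, and the toy scoring function are all hypothetical names introduced here, and simple greedy acceptance stands in for whatever selection rule the paper uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Agent:
    # Graph: node names and directed edges between them.
    nodes: Tuple[str, ...]
    edges: Tuple[Tuple[str, str], ...]
    # Configuration: one settings dict per node (prompt, model, ...).
    config: Dict[str, dict] = field(default_factory=dict)


def optimize(
    agent: Agent,
    score: Callable[[Agent], float],
    propose_configs: Callable[[Agent], List[Agent]],
    propose_graphs: Callable[[Agent], List[Agent]],
    rounds: int = 3,
) -> Tuple[Agent, float]:
    """Alternate C-steps and G-steps, keeping a candidate only if it scores higher."""
    best, best_score = agent, score(agent)
    for _ in range(rounds):
        # C-step: the graph stays fixed, only node configurations change.
        for candidate in propose_configs(best):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        # G-step: small structural edits (add, remove, or rewire nodes and edges).
        for candidate in propose_graphs(best):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
    return best, best_score


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_score(agent: Agent) -> float:
    """Reward a memory node and a step-by-step planner prompt."""
    s = 0.0
    if "memory" in agent.nodes:
        s += 1.0
    if "step by step" in agent.config.get("planner", {}).get("prompt", ""):
        s += 1.0
    return s


def toy_configs(agent: Agent) -> List[Agent]:
    """Propose one prompt rewrite for the planner node."""
    planner = {**agent.config.get("planner", {}), "prompt": "Plan step by step."}
    return [Agent(agent.nodes, agent.edges, {**agent.config, "planner": planner})]


def toy_graphs(agent: Agent) -> List[Agent]:
    """Propose adding a memory node wired to the planner, if absent."""
    if "memory" in agent.nodes:
        return []
    return [Agent(agent.nodes + ("memory",),
                  agent.edges + (("planner", "memory"),),
                  dict(agent.config))]
```

In a real setting, `score` would run the agent on evaluation tasks and the proposal functions would be driven by an optimizer rather than hard-coded rules.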
A key innovation of Maestro is its ability to leverage reflective textual feedback from execution traces, in addition to numeric metrics. This qualitative feedback helps prioritize edits, significantly improving sample efficiency and allowing the optimizer to target specific failure modes like instruction drift, looping, or state loss.
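One way to picture how textual feedback might prioritize edits is the sketch below. It is a deliberately simplified assumption: keyword matching stands in for the reflective analysis of execution traces, and the keyword table, failure tags, and function names are hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical mapping from phrases in a reflection to a failure-mode tag.
FAILURE_KEYWORDS = {
    "instruction drift": "instruction_drift",
    "loop": "looping",
    "lost state": "state_loss",
}


def tag_failures(reflections: List[str]) -> Counter:
    """Count how often each failure mode is mentioned in the reflections."""
    counts: Counter = Counter()
    for text in reflections:
        lowered = text.lower()
        for keyword, tag in FAILURE_KEYWORDS.items():
            if keyword in lowered:
                counts[tag] += 1
    return counts


def prioritize_edits(
    candidate_edits: List[Tuple[str, str]],
    reflections: List[str],
) -> List[Tuple[str, str]]:
    """Order (edit_name, targeted_failure_tag) pairs by how often the
    targeted failure shows up in the reflections, most frequent first."""
    counts = tag_failures(reflections)
    # sorted() is stable, so edits with equal counts keep their input order.
    return sorted(candidate_edits, key=lambda edit: counts[edit[1]], reverse=True)
```

Trying the edits that target the most frequently observed failures first is one plausible reason such feedback could reduce the number of rollouts needed.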
Performance and Applications
The research demonstrates Maestro’s effectiveness across benchmarks and real-world applications. On the IFBench and HotpotQA benchmarks, Maestro consistently outperformed leading prompt optimizers such as MIPROv2, GEPA, and GEPA+Merge. Even when restricted to prompt-only optimization, Maestro showed superior results, and these improvements were further amplified when graph optimization was included. Notably, Maestro achieved these gains with significantly fewer rollouts (training steps) than those baselines.
Two practical applications further highlight Maestro’s capabilities:
- Interviewer Agent: For a financial interviewer agent designed to collect information from customers following a predefined structure, the initial design had a very low completion rate. Configuration-only optimization with Maestro raised this rate significantly, and joint graph and configuration optimization raised it further. A crucial graph modification was the addition of an external state variable, ‘branches_done’, to explicitly track completed conversation branches, preventing the agent from getting stuck or missing information.
- RAG Agent: In a Retrieval-Augmented Generation (RAG) agent for financial question-answering, Maestro improved performance substantially. The optimized design included new tools for numeric computations (like mean, standard deviation, and percentage growth). This structural change offloaded complex calculations from the LLM, making the agent faster, more cost-effective, and less prone to errors.
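The numeric tools mentioned for the RAG agent could look something like the following. This is a minimal sketch under stated assumptions: the function names, the `TOOLS` registry, and the `call_tool` dispatcher are hypothetical, and the choice of sample (rather than population) standard deviation is ours, since the article does not say which one the optimized agent uses.

```python
import statistics
from typing import Callable, Dict, Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean of the values."""
    return statistics.fmean(values)


def std_dev(values: Sequence[float]) -> float:
    """Sample standard deviation (assumption: sample, not population)."""
    return statistics.stdev(values)


def pct_growth(old: float, new: float) -> float:
    """Percentage change from old to new."""
    if old == 0:
        raise ValueError("percentage growth is undefined when the old value is 0")
    return (new - old) / abs(old) * 100.0


# Hypothetical registry the agent would dispatch to instead of asking
# the LLM to do the arithmetic in free text.
TOOLS: Dict[str, Callable[..., float]] = {
    "mean": mean,
    "std_dev": std_dev,
    "pct_growth": pct_growth,
}


def call_tool(name: str, *args) -> float:
    """Look up a tool by name and run it on the given arguments."""
    return TOOLS[name](*args)
```

With tools like these, the LLM only has to pick a tool and extract its arguments from the retrieved text, while the arithmetic itself is deterministic.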
These results underscore that structural changes can enable entirely new computations and eliminate whole classes of errors, while configuration tuning refines how well those computations are performed. Optimizing both simultaneously is crucial for building robust and efficient AI agents.
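To make the idea of a structural fix concrete, the interviewer agent's ‘branches_done’ variable could be kept outside the LLM roughly as follows. Only the variable name comes from the article. The `InterviewState` class, its methods, and the example branch names are hypothetical, and the real agent's state handling may differ.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class InterviewState:
    # Ordered branches the interview must cover (example names are made up).
    required_branches: Tuple[str, ...]
    # External record of finished branches, kept outside the LLM's context.
    branches_done: Set[str] = field(default_factory=set)

    def mark_done(self, branch: str) -> None:
        """Record that a branch has been fully covered."""
        if branch not in self.required_branches:
            raise ValueError(f"unknown branch: {branch}")
        self.branches_done.add(branch)

    def next_branch(self) -> Optional[str]:
        """Return the first branch not yet covered, or None if all are done."""
        for branch in self.required_branches:
            if branch not in self.branches_done:
                return branch
        return None

    def is_complete(self) -> bool:
        """True once every required branch has been marked done."""
        return self.next_branch() is None
```

Because the routing decision reads from explicit state rather than from the model's recollection of the conversation, the agent cannot revisit a finished branch or silently skip an unfinished one.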
Maestro offers a disciplined path to creating task-specific agents that are not only more accurate but also more controllable and cost-aware. By integrating structural exploration with configuration exploitation and utilizing rich feedback, it provides a practical blueprint for developing reliable AI agents. You can read the full technical report here.


