TLDR: This research introduces Agent-REINFORCE, a novel framework that optimizes how multiple Large Language Models (LLMs) collaborate during inference (test-time scaling) by treating their interactions as an optimizable graph. It addresses the limitations of fixed architectures and single-model usage by dynamically searching for compute-optimal model combinations and topologies under a fixed budget. Guided by three empirical insights on model preferences, scaling limits, and width-depth interdependence, Agent-REINFORCE uses an LLM-agent to efficiently explore the vast design space, outperforming traditional and LLM-based baselines in accuracy, efficiency, and multi-objective optimization (e.g., balancing accuracy and latency).
Large Language Models (LLMs) have become incredibly powerful, but getting the most out of them during inference – the “test-time” phase – often requires careful allocation of computational resources. This process, known as Test-Time Scaling (TTS), traditionally involves using fixed architectures, like simple parallel or sequential processing, and often relies on a single LLM. However, new research highlights a significant limitation: these fixed approaches aren’t always the best fit, as the ideal setup can change dramatically depending on the specific task at hand.
The paper “Generalizing Test-Time Compute-Optimal Scaling as an Optimizable Graph” by Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, and Suhang Wang introduces an approach to this problem. Their work focuses on finding the most compute-optimal combinations of LLMs and collaboration architectures under a fixed budget. Instead of static designs, they propose a dynamic system where multiple LLMs work together in a flexible “collaboration graph.”
Imagine a network where each point (node) represents an LLM with a specific role – perhaps an “assistant” refining an output or a “fuser” combining multiple outputs. The connections (edges) show how information flows between these LLMs. This graph-based view allows for highly adaptable and task-specific designs, moving beyond the limitations of predefined structures. The challenge, however, is immense: the sheer number of possible graph configurations is astronomically large, making a brute-force search impossible. Furthermore, each task has unique requirements, demanding a tailored design.
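To make the graph view concrete, here is a minimal Python sketch of such a collaboration graph. It is an illustration only, not the paper's implementation: the class name `CollabGraph`, the model label `small-llm`, and the way width and depth are measured (nodes per level and number of levels) are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CollabGraph:
    """Hypothetical container: nodes are LLMs with a role, edges are data flow."""
    nodes: dict = field(default_factory=dict)   # node id -> (model label, role)
    edges: list = field(default_factory=list)   # (src, dst): src output feeds dst

    def add_node(self, node_id, model, role):
        self.nodes[node_id] = (model, role)

    def add_edge(self, src, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((src, dst))

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
        indegree = {n: 0 for n in self.nodes}
        for _, dst in self.edges:
            indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        order = []
        while ready:
            node = ready.pop(0)
            order.append(node)
            for src, dst in self.edges:
                if src == node:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.nodes):
            raise ValueError("collaboration graph must be acyclic")
        return order

    def levels(self):
        """Level of a node = 1 + the highest level among its predecessors."""
        level = {}
        for node in self.topological_order():
            preds = [level[s] for s, d in self.edges if d == node]
            level[node] = 1 + max(preds, default=0)
        return level

    def depth(self):
        # Assumed definition: number of levels on the longest path.
        return max(self.levels().values(), default=0)

    def width(self):
        # Assumed definition: the largest number of nodes sharing one level.
        return max(Counter(self.levels().values()).values(), default=0)


# Two assistants run in parallel and a fuser combines their outputs.
g = CollabGraph()
g.add_node("a1", "small-llm", "assistant")
g.add_node("a2", "small-llm", "assistant")
g.add_node("f", "small-llm", "fuser")
g.add_edge("a1", "f")
g.add_edge("a2", "f")
print(g.topological_order(), g.width(), g.depth())
```

Even in this toy form, the number of possible graphs grows quickly with the number of nodes, candidate models, and roles, which is why exhaustive search is not an option.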
To tackle this, the researchers reformulated the problem as a probabilistic graph optimization. Through initial experiments, they uncovered three crucial insights into how LLMs collaborate effectively:
Insight 1: Task-Specific Model Preferences
The study found that different tasks have distinct preferences for LLM families and sizes. For instance, replicating the strongest available model family is generally more effective than mixing different families. Also, for reasoning tasks like complex math problems, ensembles of smaller models often perform better, allowing for iterative refinement. In contrast, knowledge-intensive tasks, such as general understanding questions, tend to benefit more from a single, larger LLM that offers broader knowledge coverage.
Insight 2: Optimal Limits for Scaling
Both parallel scaling (increasing the “width” of the graph by running more LLMs simultaneously) and sequential scaling (increasing the “depth” by having LLMs refine outputs iteratively) show a non-monotonic trend. Performance improves up to a task-dependent optimum and then plateaus or declines. Past that point, extra computation brings diminishing returns or even hurts, for example through error amplification in sequential scaling or context overload in parallel scaling.
Insight 3: Interdependence of Width and Depth
The research revealed that the graph’s width and depth are not independent. An increase in one dimension can shift the optimal point of the other. For example, a wider graph might require less depth for optimal performance, and vice-versa. This highlights the need for a holistic approach to designing these collaboration graphs.
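One simple way to see why the two dimensions cannot be tuned separately is to look at what a fixed budget allows. The sketch below enumerates the feasible width and depth pairs under a toy cost model. The cost model is an assumption made for illustration (one unit per model call, plus one extra call for a fuser whenever the graph has more than one branch) and is not taken from the paper. It shows only the budget coupling, not the empirical shift in optima that the authors report.

```python
def feasible_shapes(budget, cost_per_call=1, fuser_cost=1):
    """List (width, depth, cost) triples that fit within the budget.

    Assumed toy cost model: `width` parallel branches, each refined
    `depth` times, plus one fuser call when width > 1.
    """
    shapes = []
    for width in range(1, budget + 1):
        for depth in range(1, budget + 1):
            cost = width * depth * cost_per_call
            if width > 1:
                cost += fuser_cost
            if cost <= budget:
                shapes.append((width, depth, cost))
    return shapes


def max_depth_for_width(budget):
    """For each feasible width, the deepest graph the budget still allows."""
    best = {}
    for width, depth, _ in feasible_shapes(budget):
        best[width] = max(best.get(width, 0), depth)
    return best


for width, depth in max_depth_for_width(8).items():
    print(f"width={width} -> max depth={depth}")
```

Under this toy model, every extra branch reduces the depth the budget can pay for, so a search procedure has to consider the two dimensions jointly.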
Guided by these insights, the team developed “Agent-REINFORCE,” an innovative framework that uses an LLM-based agent to efficiently search for optimal multi-LLM collaboration graphs. This framework mirrors the REINFORCE algorithm, but instead of traditional gradients, it uses “textual feedback” to update the probabilistic graph. The Agent-REINFORCE system has three main components: the Agent (an LLM that initializes, samples, and updates the graph), the Archive (which records results), and the Environment (which evaluates candidate graphs).
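For readers unfamiliar with REINFORCE, the loop that Agent-REINFORCE mirrors can be sketched in a few lines of Python. This is the classic numeric version and not the paper's method: the textual feedback from an LLM agent is replaced here by an ordinary gradient step, and the environment is a made-up reward function. The function names, the four-edge graph, and the hidden target pattern are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_graph(logits, rng):
    """Sample each candidate edge independently with probability sigmoid(logit)."""
    return [1 if rng.random() < sigmoid(l) else 0 for l in logits]


def toy_environment(graph):
    """Hypothetical stand-in for evaluating a candidate graph on a task.

    Returns the fraction of edges that match a hidden target pattern.
    """
    target = [1, 0, 1, 0]
    return sum(1 for g, t in zip(graph, target) if g == t) / len(target)


def reinforce_search(num_edges=4, steps=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_edges   # the probabilistic graph: one logit per edge
    archive = []                 # records every (graph, reward) pair
    baseline = 0.0               # running average reward, reduces variance
    for _ in range(steps):
        graph = sample_graph(logits, rng)    # sample a candidate graph
        reward = toy_environment(graph)      # evaluate it
        archive.append((graph, reward))      # store the result
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for i, edge in enumerate(graph):
            # Gradient of log Bernoulli(edge; sigmoid(logit)) w.r.t. the logit
            # is (edge - p); REINFORCE scales it by the advantage.
            logits[i] += lr * advantage * (edge - sigmoid(logits[i]))
    return logits, archive


logits, archive = reinforce_search()
print([round(sigmoid(l), 2) for l in logits])
print(max(reward for _, reward in archive))
```

The three roles map onto the sketch as follows: the sampling and update steps play the part of the Agent, the `archive` list plays the Archive, and `toy_environment` plays the Environment. In Agent-REINFORCE, the update step is driven by textual feedback rather than by the numeric gradient shown here.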
The Agent leverages the insights to intelligently initialize promising model combinations and then iteratively refines the graph structure. For example, Insight 1 guides the initial selection of model families and sizes, while Insights 2 and 3 inform how the agent adjusts the graph’s width and depth during the optimization process. This allows the system to efficiently explore the vast design space, pruning less promising configurations early on.
Experiments demonstrated that Agent-REINFORCE significantly outperforms both traditional optimization methods and other LLM-based baselines. It achieves higher accuracy and faster convergence, effectively identifying optimal graphs not just for performance, but also for joint objectives like balancing accuracy with inference latency. The method also proved robust across different budget metrics, including FLOPs and monetary cost.
This research marks a significant step forward in optimizing LLM performance during inference. By treating LLM collaboration as an optimizable graph and combining empirical insights with an LLM-based search agent, it points to more efficient and more capable ways for LLMs to tackle complex tasks. For more technical details, see the full paper.


