Enhancing LLM Performance Through Collaborative Test-Time Scaling

TLDR: Collective Test-Time Scaling (CTTS) is a new method to improve large language models (LLMs) without retraining. It explores three paradigms for combining multiple LLM agents and reward models, finding that “Multiple Agents to Multiple Reward Models” (MA-MR) performs best. The proposed CTTS-MM framework, featuring Agent Collaboration Search (ACS) and Mixture of Reward Models (MoR), significantly outperforms existing methods and leading LLMs on various benchmarks by leveraging collaboration during inference.

Collective Test-Time Scaling (CTTS) is a recently proposed approach to improving large language models (LLMs) without retraining. Instead of updating model weights, it spends additional computation during inference, the “test” phase, and coordinates several models while doing so.

Existing Test-Time Scaling (TTS) methods, such as “Best-of-N” and “Self-Consistency,” follow a “single agent to single reward model” (SA-SR) paradigm: a single LLM generates multiple answers, and a single selection mechanism, typically one reward model (or a majority vote in the case of Self-Consistency), picks the final one. This single-agent approach has inherent limitations: the achievable quality is capped by what that one model can produce, and relying on one judge can bias which output is selected.
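To make the single-agent baseline concrete, here is a minimal Python sketch of the two selection rules named above. It assumes the candidate answers have already been sampled from one LLM. The toy score table stands in for a real reward model and is purely hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate that the single reward model scores highest."""
    return max(candidates, key=reward)


def self_consistency(candidates: Sequence[str]) -> str:
    """Self-Consistency: return the most frequent answer (majority vote)."""
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical samples from one LLM and a hypothetical reward table.
samples = ["42", "41", "42", "forty-two"]
toy_scores = {"42": 0.9, "41": 0.2, "forty-two": 0.6}

print(best_of_n(samples, toy_scores.get))  # highest-scored answer
print(self_consistency(samples))           # most common answer
```

In both cases every candidate comes from the same model, so the best available answer can never be better than that model's best sample.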

Drawing inspiration from how humans collaborate to solve complex problems, the researchers behind CTTS propose that orchestrating multiple LLMs can overcome these limitations. Their paper, available at https://arxiv.org/pdf/2508.03333, introduces three primary paradigms for CTTS to explore the optimal way for models to interact:

Exploring CTTS Paradigms

  • Single Agent to Multiple Reward Models (SA-MR): Here, a single LLM generates answers, but multiple reward models collaborate to evaluate and select the best response. This aims to provide more comprehensive and less biased feedback.

  • Multiple Agents to Single Reward Model (MA-SR): In this setup, multiple LLMs generate candidate answers, and a single reward model then chooses the optimal one. This leverages the diversity of outputs from different agents.

  • Multiple Agents to Multiple Reward Models (MA-MR): This paradigm combines the strengths of both multi-agent generation and multi-reward-model evaluation. Multiple LLMs produce answers, and multiple reward models work together to select the best among them.
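The three paradigms differ only in how many agents generate and how many reward models judge, so one function can illustrate all of them. The sketch below is an illustration under stated assumptions, not the paper's implementation: agents and reward models are plain Python callables, and the reward models are combined by a simple average, which is an assumption made here for brevity.

```python
from statistics import mean
from typing import Callable, Sequence

Agent = Callable[[str], str]               # question -> answer
RewardModel = Callable[[str, str], float]  # (question, answer) -> score


def collective_select(
    question: str,
    agents: Sequence[Agent],
    reward_models: Sequence[RewardModel],
    samples_per_agent: int = 1,
) -> str:
    """Pool answers from every agent, then pick the one with the best mean reward.

    One agent and many reward models gives SA-MR, many agents and one reward
    model gives MA-SR, and many of each gives MA-MR.
    """
    candidates = [
        agent(question) for agent in agents for _ in range(samples_per_agent)
    ]

    def combined(answer: str) -> float:
        # Assumption: unweighted average over all reward models.
        return mean(rm(question, answer) for rm in reward_models)

    return max(candidates, key=combined)


# Hypothetical toy agents and reward models for demonstration only.
agent_a = lambda q: "A"
agent_b = lambda q: "B"
rm_1 = lambda q, a: {"A": 0.6, "B": 0.5}[a]
rm_2 = lambda q, a: {"A": 0.1, "B": 0.9}[a]

print(collective_select("q", [agent_a, agent_b], [rm_1, rm_2]))  # MA-MR
```

With these toy values, a single reward model (`rm_1` alone) would pick "A", while the pair of reward models picks "B", which shows how adding judges can change the selection.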

Extensive experiments conducted across various benchmarks consistently showed that the MA-MR paradigm achieved the best performance. This highlights the critical role of both multi-agent and multi-reward-model collaboration in enhancing LLM inference.

Building on this finding, the researchers propose a novel framework called CTTS-MM (Collective Test-Time Scaling with Multiple agents to Multiple reward models). CTTS-MM introduces two key components:

Key Components of CTTS-MM

  • Agent Collaboration Search (ACS): This component dynamically searches for the most effective combination of LLM agents from a large pool of candidates. It’s designed to find the best ensemble of models for a given task.

  • Mixture of Reward Models (MoR): To provide high-quality feedback for the ACS process, MoR consists of a curated question pool and a Prior Reward model Ensemble Selection (PRES). PRES uses a Pair-wise Reward Ranking (PRR) metric to adaptively select the optimal reward model or a weighted combination of them based on the specific question.
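The article does not give the exact definitions of ACS or the PRR metric, so the following Python sketch is only one plausible reading of the two components. It assumes that the question pool contains pairs of a better and a worse answer, that a reward model's ranking quality is the fraction of pairs it orders correctly, and that the agent search is a greedy forward selection. All function names here are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

RewardModel = Callable[[str, str], float]  # (question, answer) -> score
PoolItem = Tuple[str, str, str]            # (question, better answer, worse answer)


def pairwise_ranking_accuracy(rm: RewardModel, pool: Sequence[PoolItem]) -> float:
    """Fraction of pool pairs where the reward model scores the better answer higher."""
    hits = sum(rm(q, good) > rm(q, bad) for q, good, bad in pool)
    return hits / len(pool)


def reward_weights(
    rms: Dict[str, RewardModel], pool: Sequence[PoolItem]
) -> Dict[str, float]:
    """Weight each reward model in proportion to its pairwise ranking accuracy."""
    acc = {name: pairwise_ranking_accuracy(rm, pool) for name, rm in rms.items()}
    total = sum(acc.values())
    if total == 0:
        # No model ranks any pair correctly: fall back to uniform weights.
        return {name: 1 / len(rms) for name in rms}
    return {name: a / total for name, a in acc.items()}


def greedy_agent_search(
    agent_names: Sequence[str],
    score_team: Callable[[List[str]], float],
    max_size: int,
) -> List[str]:
    """Add one agent at a time, keeping an addition only if the team score improves."""
    team: List[str] = []
    best = float("-inf")
    while len(team) < max_size:
        remaining = [a for a in agent_names if a not in team]
        if not remaining:
            break
        candidate = max(remaining, key=lambda a: score_team(team + [a]))
        candidate_score = score_team(team + [candidate])
        if candidate_score <= best:
            break  # no remaining agent improves the team
        team.append(candidate)
        best = candidate_score
    return team
```

In a fuller version, `score_team` would be driven by the weighted reward models, and the pool passed to `reward_weights` would be restricted to questions similar to the current one so that the selection adapts per question, as the description of PRES suggests.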

CTTS-MM was evaluated on seven mainstream benchmarks, using ten open-source LLMs and eight reward models. According to the paper, it consistently outperformed existing TTS methods, other collaboration approaches, and leading proprietary LLMs such as GPT-4.1 and Claude-3.7-sonnet. The reported gains include +4.82% over Best-of-N, +7.06% over GPT-4.1, and +7.76% over DeepSeek-R1-Distill-Qwen-32B.

This research is a step toward formalizing and analyzing collective test-time scaling. Its results suggest that strategically combining multiple LLM agents with multiple reward models lets pre-trained LLMs perform better at inference time across diverse tasks, without any additional training.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media.
