Enhancing LLM Performance Through Collaborative Test-Time Scaling

TLDR: Collective Test-Time Scaling (CTTS) is a new method to improve large language models (LLMs) without retraining. It explores three paradigms for combining multiple LLM agents and reward models, finding that “Multiple Agents to Multiple Reward Models” (MA-MR) performs best. The proposed CTTS-MM framework, featuring Agent Collaboration Search (ACS) and Mixture of Reward Models (MoR), significantly outperforms existing methods and leading LLMs on various benchmarks by leveraging collaboration during inference.

Collective Test-Time Scaling (CTTS) is a recently proposed approach for boosting the performance of large language models (LLMs) without extensive and costly retraining. Instead of updating model weights, it focuses on getting more out of existing models during the inference, or “test,” phase.

Traditionally, Test-Time Scaling (TTS) methods, such as “Best-of-N” and “Self-Consistency,” have relied on a “single agent to single reward model” (SA-SR) paradigm. This means a single LLM generates multiple answers, and a single selection mechanism, typically a reward model (or majority voting in the case of Self-Consistency), picks the best one. While effective to a degree, this single-agent approach has inherent limitations: performance is capped by the capability of that one model, and relying on one selector can bias which output is chosen.
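
To make the SA-SR setup concrete, here is a minimal Best-of-N sketch in Python. The names toy_generate and toy_reward are hypothetical stand-ins for a real LLM and a real reward model, and the scores are invented for illustration; nothing here is taken from the paper's code.

```python
from typing import Callable, List


def best_of_n(
    question: str,
    generate: Callable[[str, int], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidates from ONE agent and keep the one that ONE reward model scores highest."""
    candidates: List[str] = [generate(question, seed) for seed in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))


# Hypothetical stand-in for an LLM: returns a canned answer per sampling seed.
def toy_generate(question: str, seed: int) -> str:
    return ["5", "four-ish", "4", "22"][seed % 4]


# Hypothetical stand-in for a reward model: made-up scores, for illustration only.
def toy_reward(question: str, answer: str) -> float:
    return {"4": 0.9, "5": 0.3}.get(answer, 0.1)


best = best_of_n("What is 2 + 2?", toy_generate, toy_reward, n=4)
print(best)  # prints 4
```

With n=2 the same call returns "5", because the correct answer is never sampled. That is the ceiling described above: a single agent can only be as good as the best candidate it happens to produce.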

Drawing inspiration from how humans collaborate to solve complex problems, the researchers behind CTTS propose that orchestrating multiple LLMs can overcome these limitations. Their paper, available at https://arxiv.org/pdf/2508.03333, introduces three primary paradigms for CTTS to explore the optimal way for models to interact:

Exploring CTTS Paradigms

Single Agent to Multiple Reward Models (SA-MR): Here, a single LLM generates answers, but multiple reward models collaborate to evaluate and select the best response. This aims to provide more comprehensive and less biased feedback.
Multiple Agents to Single Reward Model (MA-SR): In this setup, multiple LLMs generate candidate answers, and a single reward model then chooses the optimal one. This leverages the diversity of outputs from different agents.
Multiple Agents to Multiple Reward Models (MA-MR): This paradigm combines the strengths of both multi-agent generation and multi-reward-model evaluation. Multiple LLMs produce answers, and multiple reward models work together to select the best among them.
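
The three paradigms, together with the SA-SR baseline, can be viewed as one selection routine called with different numbers of agents and reward models. The sketch below assumes the reward models are combined by simple averaging; the article does not say how they collaborate, so that choice, along with every agent and reward function shown, is a hypothetical illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence

Agent = Callable[[str], List[str]]         # question -> candidate answers
RewardModel = Callable[[str, str], float]  # (question, answer) -> score


def collective_select(
    question: str,
    agents: Sequence[Agent],
    reward_models: Sequence[RewardModel],
) -> str:
    """Pool candidates from all agents, then pick the one with the highest mean reward."""
    candidates = [answer for agent in agents for answer in agent(question)]

    def combined(answer: str) -> float:
        # Assumption: plain averaging of reward model scores.
        return mean(rm(question, answer) for rm in reward_models)

    return max(candidates, key=combined)


# Hypothetical agents with canned outputs.
def agent_a(question: str) -> List[str]:
    return ["Lyon", "paris"]


def agent_b(question: str) -> List[str]:
    return ["Paris", "Marseille"]


# Hypothetical reward models: one checks the fact, one only checks capitalization.
def rm_fact(question: str, answer: str) -> float:
    return 1.0 if answer.lower() == "paris" else 0.0


def rm_style(question: str, answer: str) -> float:
    return 1.0 if answer[:1].isupper() else 0.5


q = "What is the capital of France?"
sa_sr = collective_select(q, [agent_a], [rm_style])                    # "Lyon"
sa_mr = collective_select(q, [agent_a], [rm_fact, rm_style])           # "paris"
ma_sr = collective_select(q, [agent_a, agent_b], [rm_style])           # "Lyon"
ma_mr = collective_select(q, [agent_a, agent_b], [rm_fact, rm_style])  # "Paris"
```

In this toy setup only MA-MR returns the answer that is both correct and well formatted: the second agent supplies the candidate, and the second reward model stops the style-only scorer from picking a wrong one.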

Extensive experiments conducted across various benchmarks consistently showed that the MA-MR paradigm achieved the best performance. This highlights the critical role of both multi-agent and multi-reward-model collaboration in enhancing LLM inference.

Building on this finding, the researchers propose a novel framework called CTTS-MM (Collective Test-Time Scaling with Multiple agents to Multiple reward models). CTTS-MM introduces two key components:

Key Components of CTTS-MM

Agent Collaboration Search (ACS): This component dynamically searches for the most effective combination of LLM agents from a large pool of candidates. It’s designed to find the best ensemble of models for a given task.
Mixture of Reward Models (MoR): To provide high-quality feedback for the ACS process, MoR consists of a curated question pool and a Prior Reward model Ensemble Selection (PRES). PRES uses a Pair-wise Reward Ranking (PRR) metric to adaptively select the optimal reward model or a weighted combination of them based on the specific question.
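
The article does not give the exact definition of PRR or the search procedure used by ACS, so the following is only a rough sketch under stated assumptions. PRR is read here as the fraction of labeled (question, preferred, rejected) pairs that a reward model orders correctly, and ACS is approximated by greedy forward selection, which is a swapped-in technique. All function names, the toy question pool, and the toy ensemble score are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

RewardModel = Callable[[str, str], float]
Pair = Tuple[str, str, str]  # (question, preferred answer, rejected answer)


def pairwise_ranking_score(rm: RewardModel, pairs: Sequence[Pair]) -> float:
    """Assumed PRR: share of pairs where the preferred answer gets the higher score."""
    correct = sum(1 for q, good, bad in pairs if rm(q, good) > rm(q, bad))
    return correct / len(pairs)


def select_reward_weights(
    reward_models: Dict[str, RewardModel], pairs: Sequence[Pair], top_k: int = 2
) -> Dict[str, float]:
    """Keep the top_k reward models by ranking score and weight them proportionally."""
    scores = {name: pairwise_ranking_score(rm, pairs) for name, rm in reward_models.items()}
    kept = sorted(scores, key=lambda name: scores[name], reverse=True)[:top_k]
    total = sum(scores[name] for name in kept)
    if total == 0:  # no model ranks anything correctly: fall back to uniform weights
        return {name: 1.0 / len(kept) for name in kept}
    return {name: scores[name] / total for name in kept}


def greedy_agent_search(
    agents: Sequence[str], score_ensemble: Callable[[List[str]], float], max_size: int = 3
) -> List[str]:
    """Greedy stand-in for ACS: add the agent that most improves the score, stop when none does."""
    chosen: List[str] = []
    best_score = float("-inf")
    while len(chosen) < max_size:
        trials = {a: score_ensemble(chosen + [a]) for a in agents if a not in chosen}
        if not trials:
            break
        candidate = max(trials, key=lambda a: trials[a])
        if trials[candidate] <= best_score:
            break
        chosen.append(candidate)
        best_score = trials[candidate]
    return chosen


# Toy question pool and toy reward models (hypothetical).
pool: List[Pair] = [
    ("2 + 2?", "4", "five"),
    ("3 * 3?", "9", "six"),
    ("Capital of France?", "Paris", "Marseille"),
]
toy_rms: Dict[str, RewardModel] = {
    "digit": lambda q, a: 1.0 if a.isdigit() else 0.0,
    "short": lambda q, a: -float(len(a)),
    "constant": lambda q, a: 0.0,
}
weights = select_reward_weights(toy_rms, pool)  # roughly {"short": 0.6, "digit": 0.4}

# Toy ensemble score: chance that at least one agent is right, minus a per-agent cost.
quality = {"agent_a": 0.6, "agent_b": 0.5, "agent_c": 0.2}


def toy_score(ensemble: List[str]) -> float:
    miss = math.prod(1 - quality[a] for a in ensemble)
    return (1 - miss) - 0.05 * len(ensemble)


team = greedy_agent_search(list(quality), toy_score)  # ["agent_a", "agent_b"]
```

In the framework as described, the score guiding the agent search would come from the MoR feedback rather than from a fixed quality table, and reward model selection would adapt to each question; the toy versions here only show how the two pieces could fit together.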

CTTS-MM was evaluated on seven mainstream benchmarks, using ten open-source LLMs and eight reward models. It consistently outperformed existing TTS methods, other collaboration approaches, and even leading proprietary LLMs such as GPT-4.1 and Claude-3.7-sonnet. For instance, the reported improvements are +4.82% over Best-of-N, +7.06% over GPT-4.1, and +7.76% over DeepSeek-R1-Distill-Qwen-32B.

This research is a step towards formalizing and analyzing collective test-time scaling. It suggests that strategically combining multiple LLM agents and multiple reward models can draw more performance out of pre-trained LLMs during inference, across diverse tasks and without additional training costs.
