CoMAS: Enabling LLM Agents to Learn and Evolve Through Collaborative Interaction

TLDR: CoMAS is a new framework that allows large language model (LLM) agents to improve their capabilities autonomously by learning from their interactions with each other, rather than relying on external rewards. It uses an “LLM-as-a-judge” mechanism to generate intrinsic rewards from discussions (solution, evaluation, scoring) and optimizes agent policies using reinforcement learning. Experiments show CoMAS consistently outperforms untrained agents and scales well with more and diverse agents.

A new research paper introduces CoMAS, a novel framework designed to help large language model (LLM)-based agents continuously improve their abilities. Unlike traditional methods that rely on external rewards or internal signals from individual LLMs, CoMAS enables agents to learn and evolve through their interactions with each other, much like humans learn through discussion and collaboration.

The concept of self-evolution is crucial for LLM-based agents, allowing them to enhance their capabilities over time rather than remaining static after their initial training. Previous approaches often involved expanding knowledge bases, combining multiple agents, or optimizing task workflows. While these methods offered some improvements, their effectiveness was limited by the fixed capabilities of the underlying models.

Reinforcement Learning (RL) has emerged as a promising avenue for agent self-evolution. Existing RL-based methods typically fall into two categories: those that use external rewards (like rule-based verifiers or specialized reward models) and those that extract intrinsic rewards from the LLMs themselves (based on factors like self-certainty or confidence). However, these approaches often focus on individual model improvement.

CoMAS takes a different path, drawing inspiration from how human intelligence develops collectively through diverse interactions. The framework addresses the question of whether LLM-based agents can achieve self-evolution purely by learning from inter-agent interactions within a multi-agent system, without needing external reward signals.

The CoMAS framework is built on three core components:

Interaction

This phase generates rich conversational data through collaborative and critical discussions. Agents propose solutions, evaluate existing solutions, and score them. The environment consists of multiple agents, each with its own policy, allowing for a diverse system where agents can be based on different foundation models. Agents are randomly selected to contribute, ensuring balanced training. The interaction patterns include generating solutions to a question, providing critical evaluations of solutions (explicitly prompted to find flaws to mitigate bias), and scoring solutions based on evaluations.
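As a rough illustration of this phase, the Python sketch below runs one solution, evaluation, and scoring exchange. The `Agent` type, the `Turn` record, the `run_round` and `make_stub_agent` functions, and the prompt wording are all hypothetical names chosen for this example, not the paper's implementation. Each agent is assumed to be a plain function from prompt text to reply text.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical agent type: any function mapping a prompt string to a reply string.
Agent = Callable[[str], str]


@dataclass
class Turn:
    role: str       # "solver", "evaluator", or "scorer"
    agent_id: int   # index of the agent that produced this turn
    text: str       # the agent's output


# Assumed prompt wording; the evaluator is explicitly asked to look for flaws.
PROMPTS = {
    "solver": "Propose a solution.",
    "evaluator": "Critically evaluate the solution and look for flaws.",
    "scorer": "Score the solution from 1 to 3 given the evaluation.",
}


def run_round(question: str, agents: List[Agent], rng: random.Random) -> List[Turn]:
    """Run one solution -> evaluation -> scoring exchange with randomly chosen agents."""
    turns: List[Turn] = []
    context = f"Question: {question}"
    for role in ("solver", "evaluator", "scorer"):
        agent_id = rng.randrange(len(agents))  # random selection of the contributor
        reply = agents[agent_id](f"{context}\n{PROMPTS[role]}")
        turns.append(Turn(role, agent_id, reply))
        context += f"\n[{role}] {reply}"  # later roles see earlier turns
    return turns


def make_stub_agent(name: str) -> Agent:
    """Stand-in for an LLM call so the sketch runs without any model."""
    def agent(prompt: str) -> str:
        return f"{name} replies to a {len(prompt)}-character prompt"
    return agent


if __name__ == "__main__":
    stub_agents = [make_stub_agent("A"), make_stub_agent("B")]
    for turn in run_round("What is 2 + 2?", stub_agents, random.Random(0)):
        print(turn.role, turn.agent_id, turn.text)
```

In a real system each stub would be replaced by a call to a foundation model, and the recorded turns would feed the reward step.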

Reward Formulation

To create a learning signal, CoMAS uses an “LLM-as-a-judge” mechanism. For each solution and evaluation pair, a scoring agent assigns a score (1 to 3) based on predefined semantics: 3 for a correct solution with an unhelpful evaluation, 2 for a mostly correct solution with minor flaws, and 1 for an incorrect solution with fatal mistakes. These scores are then normalized to a range of [0, 1] and used to compute complementary rewards for the solution and the evaluation. This creates a zero-sum game, encouraging both correctness and critical thinking. A penalty is also applied if the scoring agent’s output format is invalid, promoting adherence to format while maintaining neutrality.
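The reward step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's code: it assumes the normalized range is [0, 1], that the scorer writes its verdict as `Score: N`, that an invalid format costs the scorer a penalty of -1.0, and that the solution and evaluation receive no reward in that case. The names `Rewards`, `parse_score`, `compute_rewards`, and `FORMAT_PENALTY` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

FORMAT_PENALTY = -1.0  # assumed penalty value for an unparseable scorer output


class Rewards(NamedTuple):
    solution: Optional[float]    # None when no valid score is available
    evaluation: Optional[float]  # None when no valid score is available
    scorer: float


def parse_score(text: str) -> Optional[int]:
    """Extract a score of 1, 2, or 3 from an assumed 'Score: N' format."""
    match = re.search(r"Score:\s*([123])\b", text)
    return int(match.group(1)) if match else None


def compute_rewards(scorer_output: str) -> Rewards:
    """Map the scorer's output to complementary solution and evaluation rewards."""
    score = parse_score(scorer_output)
    if score is None:
        # Invalid format: penalize only the scorer (assumption).
        return Rewards(None, None, FORMAT_PENALTY)
    normalized = (score - 1) / 2  # 1 -> 0.0, 2 -> 0.5, 3 -> 1.0
    # A higher score favors the solver; a lower score favors the evaluator.
    return Rewards(normalized, 1.0 - normalized, 0.0)


if __name__ == "__main__":
    print(compute_rewards("The critique is weak. Score: 3"))
    print(compute_rewards("I cannot decide."))
```

Under this normalization the two rewards always sum to 1, so any gain for the solver is an equal loss for the evaluator.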

Policy Optimization

CoMAS uses the REINFORCE++ algorithm to update each agent’s policy. This algorithm is well-suited for diverse interaction patterns. Experiences from each agent’s role (solver, evaluator, scorer) are collected into a replay buffer. The objective is based on token-level credit assignment, where an advantage is calculated for each token, considering the trajectory-level reward and a KL-divergence term to regularize the policy. These advantages are then normalized to stabilize updates, and a surrogate objective is used to improve the policy, encouraging actions that lead to higher advantages within a trust region.
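The following pure-Python sketch illustrates the three ingredients described above: token-level advantages built from a trajectory reward and a per-token KL penalty, batch normalization of those advantages, and a clipped surrogate objective. It follows the general REINFORCE++ recipe as commonly described and is an illustration only. The function names, the default `beta` and `clip_eps` values, and the choice to place the trajectory reward on the final token are assumptions, not details confirmed by the article.

```python
import math
from statistics import fmean, pstdev
from typing import List


def token_advantages(reward: float, kl_per_token: List[float], beta: float = 0.01) -> List[float]:
    """Return-to-go of per-token rewards for one non-empty response.

    Every token pays -beta * KL; the trajectory reward is added on the last token.
    """
    per_token = [-beta * kl for kl in kl_per_token]
    per_token[-1] += reward
    advantages: List[float] = []
    running = 0.0
    for r in reversed(per_token):  # accumulate from the last token backwards
        running += r
        advantages.append(running)
    advantages.reverse()
    return advantages


def normalize(batch: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize advantages with the mean and standard deviation of the whole batch."""
    flat = [a for seq in batch for a in seq]
    mean, std = fmean(flat), pstdev(flat)
    return [[(a - mean) / (std + eps) for a in seq] for seq in batch]


def clipped_surrogate(logp_new: List[float], logp_old: List[float],
                      advantages: List[float], clip_eps: float = 0.2) -> float:
    """Mean PPO-style clipped objective over tokens (to be maximized)."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # probability ratio between new and old policy
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return fmean(terms)


if __name__ == "__main__":
    adv = token_advantages(1.0, [1.0, 1.0], beta=0.5)
    print(adv)
    print(normalize([adv, [1.0, 3.0]]))
    print(clipped_surrogate([0.0, 0.0], [0.0, 0.0], adv))
```

A training loop would compute the log-probabilities with the actual policy and maximize the surrogate by gradient ascent; the clipping keeps each update close to the policy that generated the data.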

The researchers evaluated CoMAS across various benchmarks in both single-agent and multi-agent settings. Results showed that CoMAS consistently outperformed untrained agents and achieved state-of-the-art performance in most evaluation scenarios. For instance, it delivered significant gains in setups like Vanilla, Consistency, AutoGen, and Debate. The training dynamics revealed that agents’ response lengths grew, indicating improved capabilities, and rewards remained stable, confirming the effectiveness of the adversarial interaction reward design.

Ablation studies further confirmed the importance of the interaction-based reward formulation, showing that removing evaluation or scoring steps led to performance degradation or undesirable reward dynamics like “reward hacking.” The framework also demonstrated promising scalability, with performance generally improving as the number of agents increased. Furthermore, using heterogeneous agents (based on different foundation models) consistently outperformed homogeneous agents, suggesting that diverse knowledge enhances overall performance.

In conclusion, CoMAS presents a new and effective way for LLM-based agents to self-evolve by learning from their interactions, without needing external supervision. This approach aligns more closely with how human intelligence develops through collaboration and critical discussion. The code and dataset for CoMAS are available for further research and replication. The full research paper is titled “CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards.”

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media.
