CoMAS: Enabling LLM Agents to Learn and Evolve Through Collaborative Interaction

TLDR: CoMAS is a new framework that allows large language model (LLM) agents to improve their capabilities autonomously by learning from their interactions with each other, rather than relying on external rewards. It uses an “LLM-as-a-judge” mechanism to generate intrinsic rewards from discussions (solution, evaluation, scoring) and optimizes agent policies using reinforcement learning. Experiments show CoMAS consistently outperforms untrained agents and scales well with more and diverse agents.

A new research paper introduces CoMAS, a novel framework designed to help large language model (LLM)-based agents continuously improve their abilities. Unlike traditional methods that rely on external rewards or internal signals from individual LLMs, CoMAS enables agents to learn and evolve through their interactions with each other, much like humans learn through discussion and collaboration.

The concept of self-evolution is crucial for LLM-based agents, allowing them to enhance their capabilities over time rather than remaining static after their initial training. Previous approaches often involved expanding knowledge bases, combining multiple agents, or optimizing task workflows. While these methods offered some improvements, their effectiveness was limited by the fixed capabilities of the underlying models.

Reinforcement Learning (RL) has emerged as a promising avenue for agent self-evolution. Existing RL-based methods typically fall into two categories: those that use external rewards (like rule-based verifiers or specialized reward models) and those that extract intrinsic rewards from the LLMs themselves (based on factors like self-certainty or confidence). However, these approaches often focus on individual model improvement.

CoMAS takes a different path, drawing inspiration from how human intelligence develops collectively through diverse interactions. The framework addresses the question of whether LLM-based agents can achieve self-evolution purely by learning from inter-agent interactions within a multi-agent system, without needing external reward signals.

The CoMAS framework is built on three core components:

Interaction

This phase generates rich conversational data through collaborative and critical discussions. Agents propose solutions, evaluate existing solutions, and score them. The environment consists of multiple agents, each with its own policy, allowing for a diverse system where agents can be based on different foundation models. Agents are randomly selected to contribute, ensuring balanced training. The interaction patterns include generating solutions to a question, providing critical evaluations of solutions (explicitly prompted to find flaws to mitigate bias), and scoring solutions based on evaluations.
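The solve, evaluate, score sequence described above can be pictured as a single interaction round. The following is a minimal sketch, not the authors' implementation: the `Agent` type, the `Step` record, the `run_round` function, and the prompt wording are all hypothetical names chosen for illustration, and an agent is assumed to be any callable that maps a prompt string to a response string.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Assumption: an "agent" is any callable mapping a prompt to a response.
Agent = Callable[[str], str]


@dataclass
class Step:
    role: str      # "solver", "evaluator", or "scorer"
    agent_id: int  # index of the agent that produced the output
    output: str    # raw text returned by that agent


def run_round(question: str, agents: List[Agent], rng: random.Random) -> List[Step]:
    """Run one solution -> evaluation -> scoring pass with randomly chosen agents."""
    steps: List[Step] = []

    # A randomly selected agent proposes a solution.
    solver = rng.randrange(len(agents))
    solution = agents[solver](f"Solve: {question}")
    steps.append(Step("solver", solver, solution))

    # A randomly selected agent is explicitly asked to look for flaws.
    evaluator = rng.randrange(len(agents))
    critique = agents[evaluator](
        f"Find flaws in this solution to '{question}':\n{solution}"
    )
    steps.append(Step("evaluator", evaluator, critique))

    # A randomly selected agent scores the solution in light of the evaluation.
    scorer = rng.randrange(len(agents))
    score = agents[scorer](
        f"Score from 1 to 3.\nSolution: {solution}\nEvaluation: {critique}"
    )
    steps.append(Step("scorer", scorer, score))
    return steps
```

Each `Step` records which agent acted in which role, which is the information a later training stage would need to route experience back to the right policy.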

Reward Formulation

To create a learning signal, CoMAS uses an “LLM-as-a-judge” mechanism. For each solution and evaluation pair, a scoring agent assigns a score from 1 to 3 based on predefined semantics: 3 for a correct solution with an unhelpful evaluation, 2 for a mostly correct solution with minor flaws, and 1 for an incorrect solution with fatal mistakes. These scores are then normalized and used to compute complementary rewards for the solution and the evaluation: the higher the solver’s reward, the lower the evaluator’s. This creates a zero-sum game, encouraging both correctness and critical thinking. A penalty is also applied if the scoring agent’s output format is invalid, promoting adherence to format while maintaining neutrality.
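To make the reward scheme concrete, here is a hedged sketch of how a 1 to 3 score could be turned into complementary rewards. Several details are assumptions not confirmed by this article: the normalization `(score - 1) / 2` onto the interval from 0 to 1, the evaluator reward defined as one minus the solver reward, and the value of `FORMAT_PENALTY`. The function names are hypothetical.

```python
from typing import Optional, Tuple

# Assumption: the penalty value for a malformed scorer output is illustrative only.
FORMAT_PENALTY = -1.0


def parse_score(text: str) -> Optional[int]:
    """Return 1, 2, or 3 if the scorer output is exactly one of those, else None."""
    text = text.strip()
    return int(text) if text in {"1", "2", "3"} else None


def rewards_from_score(
    score_text: str,
) -> Tuple[Optional[float], Optional[float], float]:
    """Map a scorer output to (solver_reward, evaluator_reward, scorer_reward)."""
    score = parse_score(score_text)
    if score is None:
        # Invalid format: no signal for solver or evaluator, penalty for the scorer.
        return None, None, FORMAT_PENALTY

    solver_reward = (score - 1) / 2        # assumed normalization: 1 -> 0.0, 3 -> 1.0
    evaluator_reward = 1.0 - solver_reward  # complementary: one side's gain is the other's loss
    return solver_reward, evaluator_reward, 0.0
```

Under these assumptions the solver and evaluator rewards always sum to the same constant, so any gain for one side is an equal loss for the other, which is the competitive pressure the zero-sum description refers to.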

Policy Optimization

CoMAS uses the REINFORCE++ algorithm to update each agent’s policy. This algorithm is well-suited for diverse interaction patterns. Experiences from each agent’s role (solver, evaluator, scorer) are collected into a replay buffer. The objective is based on token-level credit assignment, where an advantage is calculated for each token, considering the trajectory-level reward and a KL-divergence term to regularize the policy. These advantages are then normalized to stabilize updates, and a surrogate objective is used to improve the policy, encouraging actions that lead to higher advantages within a trusted region.
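The token-level credit assignment described above can be sketched in plain Python. This follows the general shape commonly associated with REINFORCE++-style updates (trajectory reward minus a KL penalty accumulated from each token onward, batch normalization of advantages, and a clipped surrogate); the exact formulation used in CoMAS may differ. The function names, the per-token KL estimate as a log-probability difference, and the default `clip` value are assumptions for illustration.

```python
import math
from typing import List


def token_advantages(
    reward: float, logp: List[float], ref_logp: List[float], beta: float
) -> List[float]:
    """Per-token advantage: trajectory reward minus beta times the KL from token t onward."""
    # Assumption: per-token KL is estimated as log pi(a_t) - log pi_ref(a_t).
    kl = [lp - rlp for lp, rlp in zip(logp, ref_logp)]
    advantages: List[float] = []
    tail = 0.0
    for k in reversed(kl):  # accumulate the KL penalty from the end of the sequence
        tail += k
        advantages.append(reward - beta * tail)
    advantages.reverse()
    return advantages


def normalize(advantages: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize advantages across a batch to stabilize updates."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def clipped_surrogate(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """Clipped objective for one token; ratio = pi_new(a_t) / pi_old(a_t)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    # Taking the minimum keeps the update inside the trusted region.
    return min(ratio * advantage, clipped_ratio * advantage)
```

In a training loop, the per-token surrogate values would be averaged over the replay buffer and maximized by gradient ascent; that part depends on the deep learning framework and is omitted here.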

The researchers evaluated CoMAS across various benchmarks in both single-agent and multi-agent settings. Results showed that CoMAS consistently outperformed untrained agents and achieved state-of-the-art performance in most evaluation scenarios. For instance, it delivered significant gains in setups like Vanilla, Consistency, AutoGen, and Debate. The training dynamics revealed that agents’ response lengths grew, indicating improved capabilities, and rewards remained stable, confirming the effectiveness of the adversarial interaction reward design.

Ablation studies further confirmed the importance of the interaction-based reward formulation, showing that removing evaluation or scoring steps led to performance degradation or undesirable reward dynamics like “reward hacking.” The framework also demonstrated promising scalability, with performance generally improving as the number of agents increased. Furthermore, using heterogeneous agents (based on different foundation models) consistently outperformed homogeneous agents, suggesting that diverse knowledge enhances overall performance.

In conclusion, CoMAS presents a new and effective way for LLM-based agents to self-evolve by learning from their interactions, without needing external supervision. This approach aligns more closely with how human intelligence develops through collaboration and critical discussion. The code and dataset for CoMAS are available for further research and replication. The full research paper is titled “CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards.”
