TLDR: This research introduces Multi-Agent Group Relative Policy Optimization (MAGRPO), a novel algorithm that models LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. By fine-tuning LLMs with MAGRPO, agents learn to generate high-quality responses efficiently through effective cooperation. Experiments in writing (summarization, article expansion) and coding (Python function generation) demonstrate that MAGRPO significantly outperforms existing methods in terms of efficiency and quality, enabling diverse and effective cooperation schemes among LLMs.
Large Language Models (LLMs) have shown incredible abilities across many fields, from writing to coding. However, these powerful models are typically trained in isolation, meaning they aren’t inherently designed to work together effectively. When multiple LLMs need to collaborate on a complex task, current methods often fall short, relying on individual rewards that can make it hard to encourage true teamwork. This often leads to issues like conflicting outputs or inefficient communication.
To tackle this challenge, researchers have proposed a novel approach that views LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. This framework, formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), allows multiple trainable LLMs to generate responses together, with the environment evolving based on their combined actions. The key idea is to use a joint reward system, which encourages agents to specialize and cooperate without needing complex individual reward designs or prompt engineering.
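To make the Dec-POMDP framing concrete, here is a minimal sketch of one cooperative episode: each agent acts only on its own observation, and the environment returns a single joint reward for the combined actions. The names `run_episode`, `toy_step`, and the toy policies are hypothetical illustrations, not code from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a policy maps an agent's local observation to its action (text).
Policy = Callable[[str], str]
StepFn = Callable[[List[str]], Tuple[List[str], float, bool]]


def run_episode(policies: List[Policy], initial_obs: List[str],
                step: StepFn, max_turns: int) -> float:
    """Roll out one cooperative episode and return the summed joint reward."""
    observations = list(initial_obs)
    total_reward = 0.0
    for _ in range(max_turns):
        # Decentralized execution: each agent sees only its own observation.
        actions = [policy(obs) for policy, obs in zip(policies, observations)]
        # The environment evolves from the combined actions and emits ONE shared reward.
        observations, reward, done = step(actions)
        total_reward += reward
        if done:
            break
    return total_reward


def toy_step(actions: List[str]) -> Tuple[List[str], float, bool]:
    """Toy environment (assumption): reward 1.0 when all outputs have equal length."""
    reward = 1.0 if len({len(a) for a in actions}) == 1 else 0.0
    return list(actions), reward, False


toy_policies: List[Policy] = [lambda obs: obs.upper(), lambda obs: obs[::-1]]
episode_return = run_episode(toy_policies, ["ab", "cd"], toy_step, max_turns=3)
```

Because the reward is shared, neither agent can improve its return by working against the other, which is the property the joint reward design relies on.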
Introducing MAGRPO: A New Algorithm for LLM Teamwork
At the heart of this new approach is an algorithm called Multi-Agent Group Relative Policy Optimization (MAGRPO). Building on existing reinforcement learning techniques for LLMs and MARL methods, MAGRPO trains LLMs in a multi-turn setting. It leverages centralized group-relative advantages for joint optimization while allowing each agent to execute its actions independently. This balance between centralized training and decentralized execution is crucial for scalability and performance in complex tasks.
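The group-relative advantage can be sketched as follows. This article does not give the exact formula, so the code assumes a GRPO-style estimate: sample a group of joint responses for the same prompt, score each with the joint reward, and center every reward on the group mean. The optional standard-deviation scaling and the function names are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(joint_rewards: List[float],
                              normalize: bool = True,
                              eps: float = 1e-8) -> List[float]:
    """Center each sampled joint reward on its group mean (assumed GRPO-style)."""
    baseline = mean(joint_rewards)
    centered = [r - baseline for r in joint_rewards]
    if not normalize:
        return centered
    # Optional scaling by the group's standard deviation; eps avoids division by zero.
    scale = pstdev(joint_rewards) + eps
    return [c / scale for c in centered]


def per_agent_advantages(joint_rewards: List[float], num_agents: int) -> List[List[float]]:
    """Centralized training: every agent receives the same group-relative signal."""
    shared = group_relative_advantages(joint_rewards)
    return [list(shared) for _ in range(num_agents)]


# Three sampled joint responses for one prompt, scored by a joint reward.
advantages = per_agent_advantages([1.0, 2.0, 3.0], num_agents=2)
```

Each agent would then update its own policy with the shared advantages, while at inference time the agents still generate independently.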
Real-World Collaboration: Writing and Coding
The effectiveness of MAGRPO was tested in two primary collaboration scenarios: writing and coding.
Writing Collaboration
In writing tasks, MAGRPO was applied to summarization and article expansion. For summarization, two Qwen3-1.7B agents worked together to produce both a concise TLDR summary and a more detailed one from Reddit posts. For article expansion, two agents collaborated to generate an introduction for arXiv papers, with one focusing on background and motivation, and the other on methods and implications. The quality was measured by structural wellness, style consistency, and logical coherence.
MAGRPO-trained agents improved in both efficiency and quality compared to single-agent models and to multi-agent baselines that relied solely on prompt-level interactions. For instance, MAGRPO was three times faster than a comparable single model and produced more coherent and better-structured text. This indicates that fine-tuning, rather than prompting alone, is what lets the agents cooperate effectively.
Coding Collaboration
For coding tasks, two Qwen2.5-Coder-3B agents collaborated to generate Python functions. One agent acted as a helper, producing auxiliary functions, while the other generated the main function. The code quality was evaluated based on structural integrity, syntactic correctness, test pass rate, and a cooperation quality bonus.
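A joint reward built from these four components might look like the sketch below. The weights (0.2, 0.2, 0.5, 0.1), the function names, and the definition of the cooperation bonus (the main function calls the auxiliary one) are assumptions for illustration; the article does not specify them. Note that `exec` runs the generated code directly, so a real setup would need sandboxing.

```python
import ast


def _defines(tree: ast.AST, name: str) -> bool:
    """True if the parsed source defines a function with the given name."""
    return any(isinstance(n, ast.FunctionDef) and n.name == name for n in ast.walk(tree))


def _calls(tree: ast.AST, caller: str, callee: str) -> bool:
    """True if function `caller` contains a direct call to `callee`."""
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == caller:
            return any(isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
                       and c.func.id == callee for c in ast.walk(node))
    return False


def joint_code_reward(aux_src, main_src, aux_name, main_name, tests):
    """Score one (auxiliary, main) pair; both agents share the result."""
    try:
        aux_tree, main_tree = ast.parse(aux_src), ast.parse(main_src)
    except SyntaxError:
        return 0.0
    reward = 0.2  # syntactic correctness (assumed weight)
    if not (_defines(aux_tree, aux_name) and _defines(main_tree, main_name)):
        return reward
    reward += 0.2  # structural integrity: both expected functions exist
    namespace = {}
    try:
        exec(aux_src + "\n" + main_src, namespace)  # unsafe outside a sandbox
    except Exception:
        return reward
    passed = 0
    for args, expected in tests:
        try:
            passed += int(namespace[main_name](*args) == expected)
        except Exception:
            pass  # a crashing test counts as a failure
    if tests:
        reward += 0.5 * passed / len(tests)  # test pass rate
    if _calls(main_tree, main_name, aux_name):
        reward += 0.1  # cooperation bonus: main actually uses the helper
    return reward


aux = "def is_even(n):\n    return n % 2 == 0\n"
main = "def count_even(xs):\n    return sum(1 for x in xs if is_even(x))\n"
score = joint_code_reward(aux, main, "is_even", "count_even",
                          [(([1, 2, 4],), 2), (([],), 0)])
```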
Experiments on the HumanEval and CoopHumanEval datasets showed that MAGRPO significantly improved code quality and cooperation. Multi-turn MAGRPO, in particular, allowed agents to learn from external feedback, leading to better performance over time. The research also highlighted various cooperation schemes that emerged naturally during training, such as:
- Fallback: The main agent provides a backup implementation in case the auxiliary function encounters errors.
- Decorator: The main agent adds complementary features or handles edge cases, trusting the auxiliary for core logic.
- Coordinator: The main agent divides tasks and assigns subtasks to the auxiliary agent, often seen in iterative processes.
- Strategy Filter: The auxiliary agent acts as a filter for specific logic branches, guiding the main agent’s implementation within conditional blocks.
These emergent schemes demonstrate the flexibility and adaptability of agents trained with MAGRPO, allowing them to develop sophisticated collaborative behaviors.
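As a rough illustration of the first two schemes, the hand-written sketch below shows what a Fallback-style and a Decorator-style main function could look like around the same auxiliary helper. The functions are hypothetical examples, not outputs from the trained agents.

```python
def aux_parse_int(text):
    """Auxiliary agent's helper (hypothetical): core parsing logic."""
    return int(text.strip())


def main_fallback(text):
    """Fallback scheme: use the helper, but keep a backup if it raises."""
    try:
        return aux_parse_int(text)
    except Exception:
        # Backup implementation: keep only the digits, default to 0.
        digits = "".join(ch for ch in text if ch.isdigit())
        return int(digits) if digits else 0


def main_decorator(text):
    """Decorator scheme: handle an edge case, then trust the helper for core logic."""
    if text is None:
        return 0
    return aux_parse_int(text)
```

In the Fallback version the main agent hedges against a faulty helper, while in the Decorator version it relies on the helper and only adds handling around it.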
Looking Ahead
This research marks a significant step towards more robust and scalable LLM collaboration. By modeling LLM teamwork as a cooperative MARL problem and introducing the MAGRPO algorithm, the authors have opened new avenues for enhancing multi-agent LLM systems. This approach could lead to more modular and efficient AI systems, where specialized agents work together seamlessly to solve complex problems that would be challenging for a single model. For more details, see the full research paper.


