TLDR: A new method called Multi-Agent Consensus Alignment (MACA) helps language models (LMs) become more self-consistent. LMs often give contradictory answers, but MACA uses a reinforcement learning framework where multiple LM copies debate to solve problems. By learning from the consensus (majority) and dissenting (minority) reasoning paths, models internalize stable reasoning patterns. This leads to significant improvements in self-consistency, single-agent accuracy, sampling-based inference, and multi-agent decision-making, even generalizing to new tasks without external supervision.
Language models (LMs) are incredibly powerful, but they often struggle with a fundamental issue: inconsistency. Imagine asking an AI the same question twice and getting two different, sometimes contradictory, answers. This isn’t ideal for reliable reasoning. While existing methods try to fix these inconsistencies during the inference stage (when the model is generating an answer), they don’t address the root cause: the models themselves aren’t internally aligned to consistently choose the best reasoning paths.
A new research paper introduces an innovative solution called Multi-Agent Consensus Alignment (MACA). This framework uses reinforcement learning to post-train language models, teaching them to favor reasoning processes that lead to consistent outcomes. The core idea is to formalize self-consistency as an intrinsic property, meaning the model learns to be consistent from within, rather than relying on external fixes.
MACA works by having multiple copies, or ‘clones,’ of a language model engage in an iterative debate. These agents collaborate to solve problems, first exploring solutions independently, then refining their reasoning by interacting with their peers. Crucially, it’s not just about the final answer; the entire reasoning paths exchanged during these debates provide rich training signals. The framework identifies ‘consensus-supporting’ trajectories (where agents agree) and ‘dissenting’ trajectories (where they disagree). By learning to distinguish between these, the model internalizes the subtle differences between stable, consistent reasoning and flawed, inconsistent reasoning.
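The split into consensus-supporting and dissenting trajectories can be pictured with a small sketch. This is an illustration only, not the paper's implementation: the `Trajectory` class, the `split_by_consensus` helper, and the choice to skip tied votes are all assumptions made here, and the consensus is taken to be a simple majority vote over the agents' final answers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One agent's output for a debate round (hypothetical structure)."""
    agent_id: int
    reasoning: str
    answer: str


def split_by_consensus(trajectories):
    """Partition trajectories by majority vote over final answers.

    Returns (majority_answer, consensus_list, dissent_list).
    Assumption: when the top two answers tie, there is no usable
    consensus, so (None, [], []) is returned and the example is skipped.
    """
    counts = Counter(t.answer for t in trajectories)
    top = counts.most_common(2)
    if not top or (len(top) > 1 and top[0][1] == top[1][1]):
        return None, [], []
    majority = top[0][0]
    consensus = [t for t in trajectories if t.answer == majority]
    dissent = [t for t in trajectories if t.answer != majority]
    return majority, consensus, dissent


if __name__ == "__main__":
    round_outputs = [
        Trajectory(0, "3 * 14 = 42", "42"),
        Trajectory(1, "14 + 14 + 14 = 42", "42"),
        Trajectory(2, "3 * 14 = 41", "41"),
    ]
    majority, agree, disagree = split_by_consensus(round_outputs)
    print(majority, len(agree), len(disagree))
```

In this sketch the full reasoning text travels with each answer, so both groups can later serve as training examples rather than just the winning answer.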
This self-supervised approach means MACA doesn’t need external human supervision. Agents teach themselves to be more decisive and concise, and to better leverage insights from their peers in multi-agent settings. The reported improvements span several key areas:
- Self-consistency: A significant boost of up to 27.6% on the GSM8K benchmark.
- Single-agent reasoning: Performance increased by 23.7% on the MATH dataset.
- Sampling-based inference: A 22.4% improvement in Pass@20 on MATH.
- Multi-agent ensemble decision-making: A remarkable 42.7% increase on MathQA.
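For readers unfamiliar with the metrics in the list above, here is a minimal sketch of two of them. Both definitions are assumptions for illustration: the article does not say exactly how the paper measures self-consistency, so `agreement_rate` simply uses the share of samples that match the most common answer, and `pass_at_k` uses the standard unbiased Pass@k estimator from code-generation evaluation, which may or may not be the exact variant the authors used.

```python
from collections import Counter
from math import comb


def agreement_rate(answers):
    """Share of sampled answers that match the most common answer.

    Assumed proxy for self-consistency: 1.0 means every sample agrees.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct answer.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(agreement_rate(["42", "42", "41", "42"]))
    print(pass_at_k(n=40, c=3, k=20))
```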
Beyond these specific benchmarks, MACA also demonstrates strong generalization capabilities, meaning the models perform better on tasks they haven’t seen before. For instance, there were improvements of 16.3% on GPQA and 11.6% on CommonsenseQA. This suggests that self-consistency is a foundational capability that enhances general reasoning across diverse domains.
The researchers found that multi-agent debate generates more informative training signals compared to simpler methods like single-round majority voting. Furthermore, addressing consensus alignment through preference learning (using methods like MV-DPO and MV-KTO) yielded superior results compared to scalar-reward reinforcement learning or imitation learning. This is akin to how humans form preferences through relative comparison, where majority opinions provide guidance while minority views introduce necessary variation.
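To make the preference-learning idea concrete, the sketch below pairs majority trajectories (preferred) with minority trajectories (rejected) and scores one pair with the standard DPO loss. The pairing scheme, the helper names `build_preference_pairs` and `dpo_loss`, and the `beta` value are assumptions for illustration; the paper's MV-DPO may construct or weight pairs differently.

```python
import math
from itertools import product


def build_preference_pairs(prompt, consensus, dissent):
    """Pair every consensus trajectory with every dissenting one.

    Assumption: majority-supported reasoning is treated as 'chosen'
    and minority reasoning as 'rejected'.
    """
    return [
        {"prompt": prompt, "chosen": c, "rejected": d}
        for c, d in product(consensus, dissent)
    ]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected))))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pairs = build_preference_pairs(
        "What is 3 * 14?",
        consensus=["3 * 14 = 42", "14 + 14 + 14 = 42"],
        dissent=["3 * 14 = 41"],
    )
    print(len(pairs))
    # Policy favors the chosen trajectory more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy raises the likelihood of consensus reasoning relative to dissenting reasoning, which is the relative comparison described above. A scalar-reward setup would instead score each trajectory in isolation.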
An interesting finding from the ablation studies is that the self-generated consensus signals from the debate are comparable to, and sometimes even outperform, supervision from ground-truth labels. This highlights the power of self-supervised alignment. Additionally, incorporating peer context during training significantly improves both collective and individual reasoning, as agents learn to effectively utilize each other’s arguments.
While MACA requires a certain level of baseline competence from the language model to generate meaningful consensus signals, it represents a significant step towards more robust and reliable AI reasoning. It shows that language models can effectively use internal deliberation to self-align, enhancing their reasoning capabilities autonomously. For more in-depth details, see the full paper.


