TLDR: This research explores using Large Language Models (LLMs) in multi-agent systems to simulate decision conferences and detect agreement among participants. The study evaluates several LLMs on stance and stance polarity detection, finding that LLMs, even smaller open-source models, can reliably detect agreement. Incorporating an “agreement-detection agent” significantly improves the efficiency and quality of simulated debates, making them comparable to real-world decision conferences.
Decision conferences are structured meetings where experts from various fields come together to tackle complex problems and reach a shared understanding or consensus on future actions. These meetings often rely on skilled facilitators to guide discussions and ensure productive dialogue.
Recently, Large Language Models (LLMs) have shown great potential in simulating real-world scenarios, especially through multi-agent systems that mimic group interactions. This new research introduces a novel LLM-based multi-agent system designed to simulate these decision conferences, with a specific focus on identifying when participant agents reach an agreement.
The researchers evaluated six different LLMs on two key tasks: stance detection, which identifies an agent’s position on an issue, and stance polarity detection, which determines whether that position is positive, negative, or neutral. The models were then assessed within the multi-agent system to see how effectively they perform in complex simulations.
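To make the two tasks concrete, here is a minimal sketch of how an LLM could be prompted for each one. The prompt wording, the label sets, and the `query_llm` helper are illustrative assumptions, not the exact setup from the paper:

```python
# Illustrative sketch: prompting an LLM for stance and stance polarity.
# The prompt templates, label sets, and query_llm callable are assumptions
# made for illustration, not the paper's exact prompts.

STANCE_PROMPT = """Given the statement and the target topic, classify the
speaker's stance toward the topic as one of: FAVOR, AGAINST, NONE.

Topic: {topic}
Statement: {statement}
Stance:"""

POLARITY_PROMPT = """Classify the polarity of the speaker's position in the
statement as one of: POSITIVE, NEGATIVE, NEUTRAL.

Statement: {statement}
Polarity:"""

def detect_stance(query_llm, topic: str, statement: str) -> str:
    """Return the stance label the model predicts for the statement."""
    return query_llm(STANCE_PROMPT.format(topic=topic, statement=statement)).strip()

def detect_polarity(query_llm, statement: str) -> str:
    """Return the polarity label the model predicts for the statement."""
    return query_llm(POLARITY_PROMPT.format(statement=statement)).strip()
```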
The findings indicate that LLMs can reliably detect agreement, even in dynamic and nuanced debates. A significant discovery was that incorporating a dedicated ‘agreement-detection agent’ within the system can greatly improve the efficiency of group debates and enhance the overall quality and coherence of discussions. This makes the simulated conferences comparable to real-world decision conferences in terms of their outcomes and decision-making processes.
How the Simulation Works
The simulated decision conference system involves several types of LLM agents: a moderator agent, participant agents, and a judge agent. The moderator initiates the discussion by presenting an issue. Participant agents then debate the issue, offering their perspectives. After their contributions, the judge agent steps in to determine if an agreement has been reached. If not, the moderator prompts further debate. This cycle continues through stages like issue discussion, model building, and result exploration, with the judge agent signaling when to move forward.
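As a rough sketch of that control flow, the loop below shows how a judge agent can gate progression through the stages. The agent interfaces, the stage names as plain strings, and the round cutoff are simplifying assumptions rather than the paper's actual implementation:

```python
# Minimal sketch of the moderated debate loop described above.
# The agent interfaces, stage names, and max_rounds cutoff are
# illustrative assumptions, not the paper's exact implementation.

STAGES = ["issue discussion", "model building", "result exploration"]

def run_conference(moderator, participants, judge, issue, max_rounds=10):
    transcript = [moderator.present(issue)]
    for stage in STAGES:
        for _ in range(max_rounds):
            # Each participant agent contributes its perspective in turn.
            for agent in participants:
                transcript.append(agent.respond(transcript))
            # The judge agent decides whether agreement has been reached.
            if judge.agreement_reached(transcript):
                break
            # No agreement yet: the moderator prompts further debate.
            transcript.append(moderator.prompt_further_debate(stage))
        # The judge's signal lets the moderator advance to the next stage.
        transcript.append(moderator.advance_to_next_stage(stage))
    return transcript
```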
An additional evaluator agent provides scores for the debate based on criteria like clarity, relevance, conciseness, politeness, engagement, flow, coherence, responsiveness, language use, and emotional intelligence. This helps in assessing the quality of the ongoing discussion.
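One way to picture the evaluator agent's output is as a per-criterion score sheet. The sketch below uses the criteria listed above; the 1–10 scale and the prompt phrasing are assumptions:

```python
# Sketch of an evaluator agent scoring a debate on each criterion.
# The 1-10 scale and prompt phrasing are illustrative assumptions.

CRITERIA = [
    "clarity", "relevance", "conciseness", "politeness", "engagement",
    "flow", "coherence", "responsiveness", "language use",
    "emotional intelligence",
]

def score_debate(query_llm, transcript: str) -> dict:
    """Ask the LLM to rate the debate on each criterion (1-10)."""
    scores = {}
    for criterion in CRITERIA:
        prompt = (
            f"Rate the following debate transcript for {criterion} "
            f"on a scale of 1 (poor) to 10 (excellent). "
            f"Reply with a single integer.\n\n{transcript}"
        )
        scores[criterion] = int(query_llm(prompt).strip())
    return scores
```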
Evaluating the LLM Judge
The evaluation of the LLM judge agents was twofold: objective and subjective. The objective evaluation used established benchmark datasets for stance detection and stance polarity detection. Surprisingly, even smaller and open-source LLMs like Gemma 2 9B and LLaMA 3 70B performed exceptionally well, often outperforming traditional models specifically trained for these tasks, despite receiving no extensive fine-tuning themselves.
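In spirit, the objective evaluation reduces to scoring each model's predicted labels against a benchmark's gold labels. Here is a generic sketch; the macro-F1 metric and the data format are assumptions, since the article does not name the exact benchmarks or scoring choice:

```python
# Generic sketch of the objective evaluation: compare a model's predicted
# stance labels against gold labels from a benchmark dataset. The metric
# (macro F1) and the (topic, statement, gold_label) format are assumptions.
from sklearn.metrics import f1_score

def evaluate_model(detect_stance_fn, examples):
    """examples: list of (topic, statement, gold_label) triples."""
    gold = [label for _, _, label in examples]
    pred = [detect_stance_fn(topic, stmt) for topic, stmt, _ in examples]
    return f1_score(gold, pred, average="macro")

# Usage: bind a model to the stance prompt from the earlier sketch, e.g.
#   from functools import partial
#   evaluate_model(partial(detect_stance, query_llm), examples)
```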
The subjective evaluation involved using ChatGPT 4 as an ‘LLM-as-a-judge’ to assess how well the other LLMs detected agreement in simulated decision conferences. The results mirrored the objective findings, with Gemma 2 9B, LLaMA 3 70B, and ChatGPT 4 consistently being the top performers. This suggests that high-performing LLMs are effective not only on benchmark datasets but also in complex decision scenarios requiring a deeper understanding of context and discussion dynamics.
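A minimal sketch of such an LLM-as-a-judge check: a strong model grades whether another model's agreement verdict on a debate excerpt was correct. The prompt format and the YES/NO protocol are assumptions, not the paper's exact procedure:

```python
# Sketch of an 'LLM-as-a-judge' check: a strong model (e.g. ChatGPT 4)
# grades whether another model's agreement verdict on a debate excerpt
# was correct. The prompt format is an illustrative assumption.

GRADING_PROMPT = """You are evaluating another model's decision.

Debate excerpt:
{excerpt}

The model under test concluded: "{verdict}"

Was this agreement/no-agreement verdict correct? Answer YES or NO,
then give a one-sentence justification."""

def grade_verdict(query_strong_llm, excerpt: str, verdict: str) -> str:
    """Return the stronger model's assessment of the weaker model's call."""
    return query_strong_llm(
        GRADING_PROMPT.format(excerpt=excerpt, verdict=verdict)
    )
```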
The Impact of Agreement Detection
A crucial part of the research involved comparing simulations with and without the judge agent. Without the judge agent, debates sometimes progressed too quickly and missed important aspects of the topic. With it, the system ensured that all relevant perspectives were explored before moving on, leading to more thorough and balanced discussions. For example, in a simulated debate about drug policy criteria, the judge agent ensured that the ‘public implications’ cluster, initially overlooked, was eventually addressed, aligning the simulation’s outcome with that of the real-world conference.
This demonstrates that the judge agent not only enhances the depth of debates but also helps the moderator determine the appropriate time to transition between topics, significantly improving the overall quality of the discourse. For more details, see the full research paper.
Future Directions
While promising, the use of LLMs as judge agents comes with challenges: their accuracy is uncertain across all topics, and they sometimes overlook parts of the prompt. Future research could explore prompt-engineering techniques or integrate methods like retrieval-augmented generation (RAG) or knowledge graphs (KGs) to ground agents in relevant information, keeping them from drawing on knowledge that is either beyond or short of what a real participant would bring, and ensuring a more natural and insightful debate flow.
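To illustrate the RAG idea, a judge or participant agent could be grounded by prepending retrieved passages to its prompt. The `retrieve` helper below stands in for any vector or keyword search over a curated document store and is purely hypothetical:

```python
# Hypothetical sketch of grounding an agent with retrieval-augmented
# generation (RAG). The retrieve() helper stands in for any vector or
# keyword search over a curated document store; it is an assumption,
# not part of the paper's system.

def grounded_response(query_llm, retrieve, question: str, k: int = 3) -> str:
    """Answer using only the k most relevant retrieved passages."""
    passages = retrieve(question, top_k=k)  # e.g. a vector-store lookup
    context = "\n\n".join(passages)
    prompt = (
        "Answer using only the context below. If the context is "
        "insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return query_llm(prompt)
```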


