
Unpacking LLM Agent Collaboration: How Temperature and Persona Influence Consensus and Accuracy in Qualitative Coding

TLDR: A new study investigates how ‘temperature’ (output randomness) and ‘persona’ (assigned traits) affect multi-agent LLM systems in qualitative coding. It found that higher temperatures consistently reduced immediate consensus among agents. While diverse personas sometimes delayed consensus, they did not reliably improve coding accuracy over single-agent LLMs. The research suggests that multi-agent systems’ value might lie in revealing ambiguities and aiding human-AI collaboration rather than simply boosting accuracy.

Large Language Models (LLMs) are transforming many fields, and qualitative research is no exception. These powerful AI tools offer new ways to analyze vast amounts of text data, particularly for tasks like coding and data annotation. While single LLMs can perform these tasks, researchers have been exploring multi-agent systems (MAS) – where multiple LLM agents collaborate – to mimic human coding workflows and potentially enhance reliability and accuracy.

A recent study delved into how two key factors, ‘agent persona’ and ‘temperature,’ influence the consensus-building and coding accuracy of LLM-based multi-agent systems in qualitative coding. Temperature, in this context, refers to a setting that controls the randomness of an LLM’s output, with higher temperatures leading to more varied and unpredictable responses. Agent persona involves imbuing LLM agents with specific traits, such as being neutral, assertive, or empathetic, to see how these ‘personalities’ affect their collaboration.
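To make the temperature setting concrete: decoding from an LLM typically samples the next token from a softmax over logits divided by the temperature. The sketch below is a generic illustration of that mechanism, not code from the study; the function name and logit values are illustrative.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random.Random(0)):
    """Sample an index from logits after temperature scaling.

    Higher temperature flattens the distribution (more varied picks);
    lower temperature sharpens it toward the most likely option.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# At temperature 0.1 the top logit dominates almost completely;
# at temperature 2.0 the alternatives get meaningful probability mass.
```

This is why, intuitively, higher temperatures make it harder for several agents to land on the same code label independently.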

The researchers conducted an extensive experimental study using six different open-source LLMs, ranging in size from 3 billion to 32 billion parameters. They set up 18 different experimental configurations, analyzing over 77,000 coding decisions. The system they developed mirrored human deductive coding practices, involving structured agent discussions and a consensus arbitration process. The data used for coding consisted of human-annotated transcripts from online math tutoring sessions, based on a codebook with eight distinct categories.

Temperature’s Clear Impact on Consensus

One of the study’s most consistent findings was the significant impact of temperature on whether and when LLM agents reached a consensus. Across all six LLMs, higher temperatures consistently reduced the likelihood of agents agreeing immediately in the first round of coding. Instead, higher temperatures led to more instances of delayed agreement (where agents agreed after a second round of discussion) or, in some cases, no consensus at all, requiring a separate ‘consensus agent’ to make the final decision. This suggests that increasing the randomness in LLM outputs makes it harder for agents to converge quickly on a shared interpretation.

Persona’s Mixed Influence

The influence of agent personas was more nuanced and varied depending on the specific LLM being used. Multi-agent systems with multiple personas (e.g., a mix of neutral, assertive, or empathetic agents) significantly delayed consensus in four out of six LLMs compared to systems where all agents had uniform personas. In three of these LLMs, higher temperatures actually diminished the delaying effects of multiple personas on consensus. This indicates that while diverse personalities can introduce more deliberation, their impact isn’t universal across all LLMs, and high randomness can sometimes override these personality-driven effects.
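Personas in systems like this are typically assigned through each agent's system prompt. The snippet below is a minimal, hypothetical sketch of how a mixed-persona versus uniform-persona trio might be configured; the persona wordings and helper names are invented for illustration, not taken from the paper.

```python
# Illustrative persona instructions (assumed wording, not the study's prompts)
PERSONA_PROMPTS = {
    "neutral": "You weigh the evidence evenly and avoid strong commitments.",
    "assertive": "You argue firmly for your interpretation of the transcript.",
    "empathetic": "You prioritize the speaker's emotional context when coding.",
}

def build_agent_prompt(persona, codebook):
    """Compose a system prompt for one coding agent."""
    if persona not in PERSONA_PROMPTS:
        raise KeyError(f"unknown persona: {persona}")
    codes = ", ".join(codebook)
    return (
        f"{PERSONA_PROMPTS[persona]} "
        f"Assign exactly one code from this codebook: {codes}."
    )

codebook = ["hint", "praise"]
mixed = [build_agent_prompt(p, codebook)
         for p in ("neutral", "assertive", "empathetic")]
uniform = [build_agent_prompt("neutral", codebook) for _ in range(3)]
```

The study's comparison is essentially between configurations like `mixed` and `uniform`: same task, same codebook, different personality instructions per agent.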

Surprising Accuracy Results

Perhaps the most striking finding of the study was that neither temperature nor persona pairing led to robust improvements in coding accuracy. In fact, single LLM agents often matched or even outperformed the multi-agent systems in most conditions when compared against the human-annotated ‘gold-standard’ dataset. This challenges the common assumption that multi-agent deliberation inherently leads to better or more accurate outcomes in qualitative coding.
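The accuracy comparison here is, at its core, agreement with the human gold-standard labels. A minimal sketch of that benchmark, with invented example labels (the real study used eight codebook categories over tutoring transcripts):

```python
def coding_accuracy(predicted, gold):
    """Fraction of coding decisions that match the human gold standard."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold label lists must align")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

gold = ["hint", "praise", "hint", "question"]
single_agent = ["hint", "praise", "hint", "hint"]       # hypothetical outputs
multi_agent = ["hint", "praise", "question", "hint"]    # hypothetical outputs

print(coding_accuracy(single_agent, gold))  # 0.75
print(coding_accuracy(multi_agent, gold))   # 0.5
```

The study's surprise was that, across most of its 18 configurations, the single-agent column of this kind of comparison was at least as high as the multi-agent one.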

Only one specific model, OpenHermesV2:7B, showed above-chance gains from MAS deliberation, and only for a particular code category called “Guiding Feedback.” These improvements were observed when the temperature was 0.5 or lower, and especially when the agents included at least one assertive persona. This suggests that while multi-agent systems might offer benefits, these are highly specific to the model, task, and configuration, rather than being a general advantage.

Qualitative Insights and Limitations

A qualitative analysis of the multi-agent collaboration for the successful OpenHermesV2:7B configurations revealed that MAS might help in narrowing ambiguous code applications. However, it also highlighted significant issues. Agents often struggled to provide consistent rationales for their coding decisions, sometimes reported previous codings incorrectly, and even hallucinated non-existent data points or proposed new code categories. This indicates that while LLMs can simulate discussion, their ‘reasoning’ and ‘collaboration’ might be more akin to complex prediction rather than genuine human-like sense-making.

The study concludes that while multi-agent LLM configurations do influence consensus-building behavior, they yield minimal improvements in coding accuracy when benchmarked against human data. This suggests a need to reframe how we view LLM-based MAS. Instead of seeing them as tools to maximize coding accuracy or perfectly replicate human consensus, they could be more valuable as collaborators that help surface ambiguities, propose new categories, or offer alternative perspectives that human coders might overlook. Their strength might lie in ‘destabilizing’ existing interpretations, augmenting human insights rather than fully automating interpretive labor.

For more detailed information, you can read the full research paper available at arXiv:2507.11198.

Ananya Rao — https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach out to her at: [email protected]
