TLDR: A new study explores how large language models (LLMs) like GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash engage in multi-turn moral debates using everyday dilemmas. Researchers found significant behavioral differences across synchronous and round-robin deliberation formats. GPT-4.1 showed strong inertia in synchronous settings but high conformity in round-robin, while Claude and Gemini were more flexible. Models’ value patterns diverged, with GPT emphasizing personal autonomy and Claude/Gemini prioritizing empathetic dialogue. Value alignment increased with verdict consensus, and deliberation format strongly shaped how models revised their judgments, highlighting that AI’s ethical behavior is deeply intertwined with interaction structure.
As large language models (LLMs) become increasingly integrated into our daily lives, offering everything from personal advice to mental health support, understanding their underlying values and how they navigate complex moral situations is crucial. While many evaluations focus on single-turn interactions, a recent study delves into multi-turn settings, exploring how LLMs deliberate, revise their stances, and reach consensus in debates.
The research, titled “Deliberative Dynamics and Value Alignment in LLM Debates” by Pratik S. Sachdeva and Tom van Nuenen of the University of California, Berkeley, investigates how LLMs behave when they debate one another. The authors aimed to bridge the gap in understanding how sociotechnical alignment – the alignment of AI with human values and norms – manifests in dialogues where values are negotiated rather than expressed in isolation.
How the Study Was Conducted
To examine these deliberative dynamics, the researchers used 1,000 everyday moral dilemmas sourced from Reddit’s popular “Am I the Asshole” (AITA) community. They prompted subsets of three prominent LLMs – GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash – to collectively assign blame in these scenarios. The study employed two distinct deliberation formats:
- Synchronous (Parallel Responses): Models responded independently and simultaneously. If they disagreed, they would see each other’s reasoning and have a chance to revise their verdict in subsequent rounds.
- Round-Robin (Sequential Responses): Models responded one after another, with each model seeing the previous models’ verdicts and explanations before giving their own. This format allowed for testing order effects.
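The difference between the two formats comes down to what each model can see when it answers. The following is a minimal sketch of that difference, not the authors' implementation: the `Model` interface, the `max_rounds` cap, the stop-on-consensus rule, the AITA-style verdict labels (`YTA`, `NTA`), and the two toy models are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# One transcript entry: (model name, verdict, explanation).
Turn = Tuple[str, str, str]
# Hypothetical model interface: (dilemma, visible transcript) -> (verdict, explanation).
Model = Callable[[str, List[Turn]], Tuple[str, str]]


def synchronous(models: Dict[str, Model], dilemma: str, max_rounds: int = 4):
    """All models answer from the same snapshot; they see each other only between rounds."""
    visible: List[Turn] = []
    history = []
    for _ in range(max_rounds):
        answers = {name: model(dilemma, list(visible)) for name, model in models.items()}
        history.append(answers)
        if len({verdict for verdict, _ in answers.values()}) == 1:
            break  # consensus reached (assumed stopping rule)
        visible += [(name, v, e) for name, (v, e) in answers.items()]
    return history


def round_robin(models: Dict[str, Model], order: List[str], dilemma: str, max_rounds: int = 4):
    """Models answer one at a time; each sees every turn taken before its own."""
    transcript: List[Turn] = []
    for _ in range(max_rounds):
        start = len(transcript)
        for name in order:
            verdict, explanation = models[name](dilemma, list(transcript))
            transcript.append((name, verdict, explanation))
        if len({v for _, v, _ in transcript[start:]}) == 1:
            break  # everyone agreed within this round (assumed stopping rule)
    return transcript


def stubborn(dilemma: str, transcript: List[Turn]) -> Tuple[str, str]:
    """Toy model that never changes its verdict."""
    return ("YTA", "I stand by my first reading.")


def make_conformist(name: str, default: str = "NTA") -> Model:
    """Toy model that adopts the most recent verdict given by another model."""
    def model(dilemma: str, transcript: List[Turn]) -> Tuple[str, str]:
        others = [v for who, v, _ in transcript if who != name]
        verdict = others[-1] if others else default
        return (verdict, "Adopting the latest verdict I saw.")
    return model


models = {"A": stubborn, "B": make_conformist("B")}
sync_history = synchronous(models, "example dilemma")
rr_a_first = round_robin(models, ["A", "B"], "example dilemma")
rr_b_first = round_robin(models, ["B", "A"], "example dilemma")
```

With these toy models, the round-robin outcome depends on who speaks first: when the stubborn model goes first, the conformist agrees immediately, whereas reversing the order takes a second round to converge. Real models are far less mechanical, but the sketch shows why speaking order can matter in the sequential format and cannot matter in the parallel one.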
To analyze the values invoked by the models, an external LLM (Gemini 2.5 Flash) classified up to five values present in each model’s explanation, drawing from a specialized taxonomy of values relevant to moral dilemmas.
Key Findings: Striking Behavioral Differences
The study revealed significant differences in how the models behaved across the deliberation formats:
In the synchronous setting, GPT-4.1 demonstrated strong “inertia,” meaning it was very resistant to changing its initial verdict, with very low revision rates (0.6-3.1%). In contrast, Claude 3.7 Sonnet and Gemini 2.0 Flash were far more flexible, showing much higher revision rates (28-41%).
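One plausible way to compute a revision rate is to count how often a model's verdict differs from its verdict in the previous round, out of all the chances it had to revise. The sketch below assumes that definition; the paper's exact metric may differ, and the function name and sample verdicts are hypothetical.

```python
from typing import List


def revision_rate(verdict_sequences: List[List[str]]) -> float:
    """Fraction of revision opportunities in which one model changed its verdict.

    verdict_sequences holds, for a single model, one list of verdicts per
    dilemma, ordered by round. A dilemma settled in one round contributes
    no opportunities.
    """
    changes = 0
    opportunities = 0
    for sequence in verdict_sequences:
        for previous, current in zip(sequence, sequence[1:]):
            opportunities += 1
            if previous != current:
                changes += 1
    return changes / opportunities if opportunities else 0.0


# Made-up verdicts for illustration only: three dilemmas, one revision.
example = [["YTA", "YTA", "NTA"], ["NTA"], ["ESH", "ESH"]]
rate = revision_rate(example)
```

Here the model had three opportunities to revise and did so once, so `rate` is one third. Under this definition, a highly inertial model would score close to zero.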
The models also exhibited distinct value patterns. GPT-4.1 tended to emphasize values like personal autonomy and direct communication in its reasoning. Claude 3.7 Sonnet and Gemini 2.0 Flash, however, prioritized empathetic dialogue and conflict resolution. Interestingly, the study found that when models reached a consensus on a verdict, their underlying value sets became significantly more similar, suggesting a strong link between value convergence and agreement.
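To make "more similar value sets" concrete, overlap between two models' values can be scored with Jaccard similarity: the number of shared values divided by the number of distinct values across both sets. This is an illustrative choice, not necessarily the measure used in the paper, and the value labels below are examples rather than entries from the study's taxonomy.

```python
from typing import Iterable


def jaccard(values_a: Iterable[str], values_b: Iterable[str]) -> float:
    """Jaccard similarity between two sets of value labels (1.0 = identical)."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical value sets extracted from two explanations of the same dilemma.
model_one = {"personal autonomy", "direct communication", "honesty"}
model_two = {"empathetic dialogue", "conflict resolution", "honesty"}
overlap = jaccard(model_one, model_two)
```

The two sets share one value out of five distinct values, so `overlap` is 0.2. Comparing such scores between dilemmas that ended in consensus and those that did not is one way to test whether agreement on a verdict goes hand in hand with agreement on values.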
The deliberation format itself proved to be a powerful factor. In the round-robin setting, GPT-4.1 and Gemini 2.0 Flash showed high conformity, with their verdict behavior strongly influenced by the order in which they responded. This was a notable shift for GPT-4.1, which had been highly inertial in the synchronous setting, indicating that a model’s “personality” (like inertia or conformity) is not fixed but can change based on the interaction structure.
Furthermore, the researchers explored whether modifying the system prompt could steer model behavior. When the models were instructed to balance consensus-seeking with correctness, GPT-4.1's verdict revision rate rose significantly, though it still remained less flexible than Claude and Gemini. This suggests that while prompts can influence behavior, they may not entirely override inherent model tendencies.
Implications for AI Alignment
These findings underscore a critical insight: sociotechnical alignment in LLMs depends not just on what values they are trained on or what they output, but also on how dialogue and interaction are structured. The study highlights that behaviors like inertia (sticking to an initial stance) and sycophancy (agreeing too readily) are not fixed traits but emerge from the interaction context. As LLMs are deployed in more sensitive roles, understanding these dynamic behaviors and how different deliberation formats shape their moral reasoning will be essential for building more reliable and ethically aligned AI systems.


