TL;DR: This research paper compares LLM-generated and human-authored responses in multi-turn, knowledge-grounded role-play dialogues. Through a human evaluation (N=38) and automated LLM-as-a-judge assessments, the study found that the quality of LLM responses degraded significantly over successive turns in naturalness, context maintenance, and overall quality, while human-authored responses improved. Participants consistently preferred the human-authored dialogues. The work contributes a multi-turn benchmark for measuring LLM degradation and a validated hybrid evaluation framework for training simulations, highlighting current limitations of LLMs in sustaining high-quality, context-sensitive responses over extended interactions.
Large Language Models (LLMs) are increasingly used in role-play dialogue systems, from healthcare training to educational settings. These systems aim to simulate real-life scenarios, requiring LLMs to generate contextually appropriate responses, stay grounded in specific character profiles, and maintain domain-specific knowledge over extended interactions. However, evaluating these LLMs, especially in long, multi-turn conversations, remains a significant challenge.
A recent study, *Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues* by Dongxu Lu, Johan Jeuring, and Albert Gatt of Utrecht University, delves into this challenge. The researchers conducted a comprehensive comparison of LLM-generated and human-authored responses in multi-turn professional training simulations, using both human evaluation and automated LLM-as-a-judge assessment.
The Challenge of Multi-Turn Dialogues
Existing benchmarks for LLMs often fall short in evaluating long-form, knowledge-grounded role-play. Many focus on open-domain or task-oriented conversations, which don’t fully capture the demands of professional training. Furthermore, most evaluations tend to disregard the multi-turn nature of dialogues, often assessing performance over only a few turns. This overlooks a critical issue: LLMs can suffer performance degradation over longer interactions, a phenomenon in which their ability to maintain context and response quality diminishes as a conversation grows.
Experiment 1: Human Evaluation
To address these gaps, the researchers conducted an initial experiment involving 38 participants. They focused on a negotiation skills scenario, presenting participants with 23 conversational exchanges. For each turn, participants evaluated two types of agent responses: human-authored (pre-scripted ‘best-practice’ content) and LLM-generated (from a fine-tuned LLAMA3 model). Participants rated responses on six quality dimensions: Understandable, Natural, Maintains Context, Interesting, Uses Knowledge, and Overall Quality. They also indicated their preference between the two response types.
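To make the setup concrete, here is a minimal sketch in Python of how such per-turn ratings might be represented for later analysis. The `Rating` structure and its field names are hypothetical illustrations, not the authors' actual schema:

```python
# A hypothetical record of one participant's rating of one agent response at
# one turn; field names are illustrative, not taken from the paper.
from dataclasses import dataclass

DIMENSIONS = [
    "understandable", "natural", "maintains_context",
    "interesting", "uses_knowledge", "overall_quality",
]

@dataclass
class Rating:
    participant_id: int
    turn: int                 # 1..23 exchanges in the negotiation scenario
    response_type: str        # "human" (pre-scripted) or "llm" (LLAMA3)
    scores: dict              # dimension name -> Likert score
    preferred: bool           # whether this response type was preferred
```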
The human evaluation revealed a clear trend: the perceived quality of LLM-generated responses significantly degraded as the dialogue progressed. This decline was particularly noticeable in ‘Naturalness,’ ‘Context Maintenance,’ and ‘Overall Quality.’ In contrast, human-authored responses progressively improved in quality over time. Participants consistently preferred human-authored dialogues. A focus group with instructional designers further highlighted the importance of ‘Natural Flow,’ ‘Contextual Fit,’ ‘Tone Appropriateness,’ ‘Pedagogical Nudging,’ and ‘Sentence Length’ for effective training simulations.
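The kind of trend behind these findings can be illustrated with a short sketch: fitting a least-squares slope to per-turn mean ratings for one dimension, where a negative slope corresponds to perceived degradation. The data below is invented for illustration, not taken from the study:

```python
# A minimal sketch, assuming per-turn mean ratings for a single dimension and
# response type (hypothetical data). A negative slope over the turn index
# corresponds to the degradation trend reported for LLM responses.
from statistics import mean

def quality_slope(turn_scores):
    """Least-squares slope of mean rating vs. turn index.

    turn_scores: list of (turn_index, mean_rating) pairs.
    """
    xs = [t for t, _ in turn_scores]
    ys = [s for _, s in turn_scores]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Illustrative only: naturalness ratings that drift downward over 23 turns.
llm_naturalness = [(t, 4.2 - 0.05 * t) for t in range(1, 24)]
print(quality_slope(llm_naturalness))  # negative slope -> perceived decline
```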
Experiment 2: Automated Evaluation with LLM-as-a-Judge
To see whether these findings generalized across different conversational contexts, the team designed a second experiment using an LLM-as-a-judge approach. They first validated how well an LLM judge (specifically GEMINI 2.0 FLASH) could reproduce the human judgments from Experiment 1. They found that a few-shot prompting strategy, particularly random sampling with six examples, significantly improved the LLM judge’s alignment with human evaluations.
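A minimal sketch of what such a few-shot judging prompt might look like is shown below. This is a hypothetical reconstruction, not the authors' pipeline; the function name and exemplar fields are assumptions:

```python
# A hypothetical few-shot judge-prompt builder: randomly sample six
# human-rated examples (e.g., from Experiment 1) and prepend them to the
# prompt sent to the judge model. Names and fields are illustrative.
import random

DIMENSIONS = ("Understandable, Natural, Maintains Context, "
              "Interesting, Uses Knowledge, Overall Quality")

def build_judge_prompt(context, candidate, exemplars, k=6, seed=0):
    """Assemble a few-shot judging prompt with k randomly sampled exemplars."""
    shots = random.Random(seed).sample(exemplars, k)  # random sampling, k=6
    parts = [
        "You are judging an agent response in a role-play training dialogue.",
        f"Rate it from 1 to 5 on: {DIMENSIONS}.",
        "",
    ]
    for ex in shots:  # each exemplar carries human ratings as a worked example
        parts += [f"Dialogue context: {ex['context']}",
                  f"Response: {ex['response']}",
                  f"Human ratings: {ex['ratings']}", ""]
    parts += [f"Dialogue context: {context}",
              f"Response: {candidate}",
              "Ratings:"]
    return "\n".join(parts)
```

The resulting prompt string would then be sent to the judge model (GEMINI 2.0 FLASH in the paper) through whatever API serves it.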
Applying this validated method to three additional scenarios (motivational interviewing, selling, and consulting), the automated evaluation reinforced the initial findings. The LLM-as-a-judge consistently showed a strong preference for human-authored responses. It also confirmed the diverging quality trends: human-authored responses maintained or improved in quality, while LLM-generated responses showed a less positive trend over time.
Key Takeaways and Future Directions
The study’s principal findings underscore that while LLMs hold promise for role-play dialogues, they currently struggle to sustain high-quality, context-sensitive responses across extended interactions. This ‘long-context degradation’ is a significant barrier. Human authors remain the gold standard for crafting engaging and pedagogically effective role-play scenarios.
The research also contributed a validated hybrid evaluation framework, combining human and automated assessments, which can guide the reliable integration of LLMs into training simulations. However, the study acknowledges limitations, including the use of a single scenario for the human evaluation and the potential self-preference bias of an LLM judge evaluating LLM-generated output. Future work will explore broader dialogue scenarios and strategies for mitigating performance decay in LLMs over extended interactions.