TLDR: Researchers Aline Mangold and Kiran Hoffmann developed a human-centered questionnaire to evaluate Retrieval-Augmented Generation (RAG) system outputs. Building on existing frameworks, they refined 12 metrics through iterative testing, focusing on aspects like user intent, text structuring, and information verifiability. The study explored human-AI collaboration in evaluation, finding that while LLMs excel at focusing on metric descriptions, humans are better at detecting nuanced formatting issues. The final questionnaire provides a structured way to assess RAG outputs, emphasizing human understanding and usability, and offers guidelines for integrating human and LLM judgments effectively.
As large language models (LLMs) become increasingly integrated into applications we use daily, a critical challenge emerges: how do we effectively evaluate the quality of their outputs, especially when they are enhanced with Retrieval-Augmented Generation (RAG) systems? RAG systems are designed to make LLMs more accurate and knowledgeable by grounding their responses in external, up-to-date documents. However, despite these benefits, RAG systems can still produce less-than-ideal answers, sometimes even 'hallucinating' information when the generated response isn't properly grounded in the retrieved documents.
While many existing evaluation methods focus on technical, computer-centered metrics like accuracy and relevance, they often overlook crucial human-centered aspects. These include how well an answer is formatted, its logical structure, or whether it truly addresses the user’s underlying intent. This gap highlights a significant need for evaluation frameworks that prioritize human understanding and usability, and that can effectively integrate both human and machine judgments.
Addressing this need, researchers Aline Mangold and Kiran Hoffmann from the Dresden University of Technology have developed a novel human-centered questionnaire for evaluating RAG outputs. Their work, detailed in the paper “Human-Centered Evaluation of RAG Outputs: A Framework and Questionnaire for Human–AI Collaboration”, introduces a systematic approach to assess RAG system performance from a user’s perspective.
Developing a Human-Centered Evaluation Tool
The development of this questionnaire was an iterative process, building on Gienapp's utility-dimension framework. An initial draft contained 12 metrics, each rated on a 5-point Likert scale with descriptive labels. Applying this draft to real RAG output data revealed several shortcomings, including inconsistent wording, semantic overlap between metrics, and the absence of crucial aspects such as redundancy in outputs.
Through multiple rounds of semantic discussions and refinements, the questionnaire evolved. New aspects like ‘salient clarity’ (how clearly key information is presented) and ‘model clarity’ (the presence of explanatory elements) were considered. The final version of the questionnaire comprises 12 metrics, carefully categorized for either human-only evaluation or collaborative human-LLM assessment. Metrics requiring fact-checking against source documents, such as ‘broad coverage’ and ‘external consistency’, were designated for human evaluators only.
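To make this structure concrete, here is a minimal Python sketch of how the metrics and their evaluation modes could be represented. The metric names are taken from the article, but the descriptions, Likert labels, and class layout are illustrative assumptions rather than the paper's actual materials.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    description: str
    human_only: bool  # True if the metric requires fact-checking against source documents

# Illustrative subset of the 12 metrics; descriptions are paraphrased
# from the article, not the paper's official wording.
METRICS = [
    Metric("Broad Coverage", "Does the answer completely cover the question?", human_only=True),
    Metric("External Consistency", "Is the answer consistent with the source document?", human_only=True),
    Metric("Logical Coherence", "Is the answer logically structured?", human_only=False),
    Metric("User Intent Correctness", "Does the answer address the user's underlying intent?", human_only=False),
]

# Assumed descriptive labels for the 5-point Likert scale.
LIKERT_LABELS = {1: "strongly disagree", 2: "disagree", 3: "neutral",
                 4: "agree", 5: "strongly agree"}
```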
The Role of AI in Evaluation
A key aspect of this research was exploring how LLMs could assist in the evaluation process. Using GPT-4o, the researchers developed a script to generate LLM ratings for metrics suitable for human-LLM collaboration. Initial attempts revealed that the LLM sometimes struggled to focus solely on the specified metrics. This was overcome by refining the system and user prompts, ensuring the LLM concentrated strictly on one criterion per API call and provided explanations with examples.
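Based on that description, a one-criterion-per-call rating script might look something like the sketch below, using the official OpenAI Python client. The prompt wording and function name are assumptions; only the single-metric-per-call pattern and the request for quoted examples come from the article.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_metric(question: str, answer: str, metric_name: str, metric_description: str) -> str:
    """Ask GPT-4o to rate exactly one metric per API call, mirroring the
    paper's prompt refinement. Prompt wording is illustrative."""
    system_prompt = (
        f"You are an evaluator. Rate ONLY the criterion '{metric_name}': "
        f"{metric_description} Use a 1-5 Likert scale and ignore all other "
        "quality aspects. Justify your rating with examples quoted from the answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Question: {question}\n\nRAG answer: {answer}"},
        ],
        temperature=0,  # keep ratings as reproducible as possible
    )
    return response.choices[0].message.content
```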
Insights from Human-AI Collaboration
The study involved two evaluation settings: one in which human raters worked independently, and another in which they collaborated with LLM-generated ratings and explanations. The results showed good agreement among human raters, indicating the questionnaire's reliability. Interestingly, human and LLM judgments diverged in characteristic ways. For instance, the LLM was adept at recognizing basic formatting but sometimes missed nuanced inconsistencies that human raters caught. Conversely, humans sometimes struggled to adhere strictly to the metric descriptions, occasionally letting factual correctness influence their 'topical correctness' ratings, whereas the LLM stayed accurately focused on user intent.
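For readers who want to run a similar agreement check, here is a small illustration using weighted Cohen's kappa on ordinal Likert ratings. The article does not name the agreement statistic the authors used, so both the statistic and the sample ratings below are purely illustrative.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 Likert ratings from two human raters for one metric
rater_a = [4, 5, 3, 4, 2, 5, 4]
rater_b = [4, 4, 3, 5, 2, 5, 3]

# Quadratic weighting penalizes large disagreements more heavily,
# which suits an ordinal Likert scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```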
Feedback from the human raters highlighted areas for improvement. There was confusion regarding the ‘broad coverage’ metric, with suggestions to reframe it towards ‘completeness’ rather than ‘diversity’, especially when only a single source document was involved. The definition of ‘factual correctness’ also proved ambiguous, leading to its eventual removal in favor of ‘external consistency’ (consistency with the source document) and a new ‘verifiability correctness’ metric (ease of verifying information in the source).
The Refined Questionnaire and Its Implications
The final questionnaire reflects these insights, featuring metrics like ‘Logical Coherence’, ‘Stylistic Coherence’ (now focusing on consistent formatting), ‘Broad Coverage’ (reworded for completeness), ‘User Intent Correctness’ (formerly ‘topical correctness’), ‘Language Consistency’, and ‘Verifiability Correctness’. The metric ‘model clarity’ was discarded due to the impracticality of providing understandable textual explanations for every output.
This research underscores several important findings. LLMs can be valuable in evaluations by strictly adhering to specified criteria and providing explanations. However, human judgment remains crucial for detecting subtle issues and ensuring the output is truly useful and understandable to users. The study emphasizes that the final decision for a numeric rating should always rest with a human, and LLM ratings should ideally include direct examples from the source text for better verification.
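A minimal sketch of that recommended division of labor might look like the following, where the LLM proposes a rating with quoted evidence and a human records the final score. The function and its interface are hypothetical, not taken from the paper.

```python
def collaborative_rating(llm_rating: int, llm_explanation: str, source_excerpts: list[str]) -> int:
    """Human-in-the-loop step: the LLM suggests a rating with quoted
    evidence, but the final numeric decision rests with a human."""
    print(f"LLM suggestion: {llm_rating}/5")
    print(f"Explanation: {llm_explanation}")
    for excerpt in source_excerpts:  # direct quotes ease verification against the source
        print(f"  evidence: {excerpt!r}")
    raw = input("Your final rating (1-5, Enter to accept the suggestion): ")
    return int(raw) if raw else llm_rating
```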
From an ethical standpoint, the paper stresses the importance of human agency in AI evaluation. While machines can assess machines, ensuring outputs are understandable and useful to humans requires human involvement. The framework also has practical implications, offering a ready-to-use tool for RAG evaluation processes, though acknowledging the time and cost involved in human rating and LLM token generation.
While the questionnaire currently focuses on textual outputs and was evaluated with a small number of raters, it provides a robust foundation. Future research aims to validate the questionnaire with a larger pool of experts, incorporate metrics for visual explainability methods, and extend the framework to evaluate entire human-LLM dialogues, balancing both human-centered and computer-centered metrics for a holistic view of system performance.


