TLDR: Auto-ARGUE is a new LLM-based tool for automatically evaluating AI-generated reports, particularly in Retrieval Augmented Generation (RAG) systems, accompanied by a web app (ARGUE-viz) for visualizing results. It implements the ARGUE framework, assessing report quality based on content coverage (nuggets) and citation accuracy. A case study on the TREC 2024 NeuCLIR report generation pilot showed good correlation with human judgments, making it a configurable and extensible solution for report generation evaluation.
In the rapidly expanding world of Retrieval Augmented Generation (RAG) systems, the ability to produce long-form, citation-backed reports is a crucial application. While many tools exist for evaluating various RAG tasks, there has been a noticeable gap in evaluation tools built specifically for report generation. This is where a new tool, Auto-ARGUE, steps in, offering a robust, LLM-based solution for assessing the quality of machine-generated reports.
Report generation (RG) is a distinct RAG task that aims to create detailed, citation-attributed responses to complex user queries. It differs from simpler tasks like long-form Question Answering in two key ways: first, it explicitly accounts for the identity of the user, or "requester," so the same query might yield different reports for users with varying levels of expertise. Second, RG emphasizes comprehensive coverage of user-critical information across an entire document corpus, rather than just providing an adequate answer.
Understanding the ARGUE Framework
Auto-ARGUE is built upon the existing ARGUE framework, which is specifically designed for evaluating report generation. This framework assesses reports by making binary, sentence-level judgments about each sentence’s content and its associated citations. Depending on these judgments, a report can receive penalties, rewards, or neither. The inputs required by ARGUE include the generated report, the report request (which details the information need and user story), the document collection used, and a set of “nuggets” – essentially QA pairs that represent key information an ideal report should cover, linked to supporting documents.
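To make these inputs concrete, here is a minimal sketch of how they might be represented in Python. The class and field names (`Nugget`, `supporting_docs`, `user_story`, etc.) are illustrative assumptions, not Auto-ARGUE's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Nugget:
    """A QA pair capturing one piece of information an ideal report should cover."""
    question: str
    answers: list[str]                # one or more acceptable answers
    supporting_docs: list[str]        # IDs of documents that attest the answer(s)

@dataclass
class ReportRequest:
    """The information need, including the user story / requester background."""
    topic_id: str
    information_need: str
    user_story: str

@dataclass
class ReportSentence:
    text: str
    citations: list[str] = field(default_factory=list)   # cited document IDs

@dataclass
class Report:
    topic_id: str
    sentences: list[ReportSentence]

@dataclass
class EvalInstance:
    """One evaluation unit: a generated report plus everything needed to judge it."""
    request: ReportRequest
    report: Report
    nuggets: list[Nugget]
    collection: dict[str, str]        # doc_id -> document text
```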
Content evaluation within ARGUE focuses on how well a report covers relevant information, using these nuggets. Reports are rewarded for correctly answering nugget questions. Citation evaluation, on the other hand, ensures that citations accurately support the sentence they are attached to. Relevant citations are rewarded, while non-attesting or missing citations are penalized. While ARGUE is flexible regarding specific metrics, it recommends “sentence precision” (the proportion of sentences attested by their citations) and “nugget recall” (the proportion of nuggets correctly answered).
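As a rough sketch, both recommended metrics follow directly from the binary judgments; the function names and example numbers below are hypothetical and not taken from the Auto-ARGUE codebase.

```python
def sentence_precision(sentence_attested: list[bool]) -> float:
    """Proportion of report sentences whose citations attest their content."""
    if not sentence_attested:
        return 0.0
    return sum(sentence_attested) / len(sentence_attested)

def nugget_recall(nugget_answered: list[bool]) -> float:
    """Proportion of nugget questions the report answers correctly."""
    if not nugget_answered:
        return 0.0
    return sum(nugget_answered) / len(nugget_answered)

# Example: 8 of 10 sentences attested, 5 of 12 nuggets answered.
print(sentence_precision([True] * 8 + [False] * 2))   # 0.8
print(nugget_recall([True] * 5 + [False] * 7))        # ~0.417
```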
Auto-ARGUE: An LLM-Powered Implementation
Auto-ARGUE brings the ARGUE framework to life with an automatic, LLM-based Python implementation. It uses a large language model (LLM) as a "judge" to make the binary YES/NO judgments required by the framework, leveraging few-shot prompts. A document is judged relevant if it attests to a nugget answer. Nuggets themselves can be quite sophisticated, allowing for multiple answers, "AND" or "OR" conditions for answers, and importance labels (e.g., "vital" or "okay"). Auto-ARGUE calculates the recommended metrics—sentence precision and nugget recall—and also offers a weighted variant of nugget recall, along with F1 scores, to provide an overall assessment of a report.
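The sketch below illustrates the general pattern: one binary LLM judgment, a weighted nugget recall, and an F1 combination. The prompt wording (zero-shot here, whereas Auto-ARGUE uses few-shot prompts), the `ask_llm` helper, and the "vital"/"okay" weights are all assumptions for illustration, not the actual prompts or weights used by Auto-ARGUE.

```python
def judge_sentence_attested(ask_llm, sentence: str, cited_passages: list[str]) -> bool:
    """Ask an LLM judge whether the cited passages support the sentence (YES/NO)."""
    prompt = (
        "You are judging whether a report sentence is supported by its citations.\n"
        "Answer only YES or NO.\n\n"
        + "".join(f"Passage {i + 1}: {p}\n" for i, p in enumerate(cited_passages))
        + f"\nSentence: {sentence}\nSupported?"
    )
    return ask_llm(prompt).strip().upper().startswith("YES")

def weighted_nugget_recall(judgments: list[tuple[bool, str]]) -> float:
    """Recall where 'vital' nuggets count more than 'okay' ones (weights are illustrative)."""
    weights = {"vital": 1.0, "okay": 0.5}
    total = sum(weights[label] for _, label in judgments)
    gained = sum(weights[label] for answered, label in judgments if answered)
    return gained / total if total else 0.0

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of sentence precision and (weighted) nugget recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```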
Visualizing Results with ARGUE-viz
To make the evaluation process even more accessible and understandable, the researchers have also released ARGUE-viz, a user-friendly web application built with Streamlit. This tool allows users to visualize Auto-ARGUE’s outputs for individual runs, presenting both aggregated metrics and fine-grained judgments. Users can easily switch between per-topic results and overall metrics. The visualization includes detailed judgment information, offering both a report-level view (showing supported sentences and correctly answered nuggets) and a sentence-level view (detailing which nugget answers are attested or not by specific sentences). This level of detail is invaluable for human analysis of errors and for guiding system development.
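For orientation, a minimal Streamlit page in the spirit of ARGUE-viz might look like the sketch below. The results-file layout and field names are assumptions, not the actual ARGUE-viz data format.

```python
import json

import pandas as pd
import streamlit as st

st.title("Report evaluation results")

# Hypothetical results file: {"topics": {topic_id: {"sentence_precision": ..,
#   "nugget_recall": .., "judgments": [{"sentence": .., "attested": ..}, ...]}}}
with open("auto_argue_results.json") as f:
    results = json.load(f)["topics"]

topic = st.selectbox("Topic", sorted(results))
scores = results[topic]

col1, col2 = st.columns(2)
col1.metric("Sentence precision", f"{scores['sentence_precision']:.3f}")
col2.metric("Nugget recall", f"{scores['nugget_recall']:.3f}")

# Sentence-level judgments as a table for fine-grained error analysis.
st.dataframe(pd.DataFrame(scores["judgments"]))
```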
Real-World Application and Promising Results
The effectiveness of Auto-ARGUE was put to the test in a case study using the TREC 2024 NeuCLIR report generation pilot task. This task involved generating English reports from non-English document collections (Chinese, Russian, Farsi). Auto-ARGUE, using Llama-3.3 70B as its LLM judge, produced system rankings under sentence precision and nugget recall that correlated well with rankings derived from human judgments. Correlation was particularly strong for sentence precision, indicating that the automatic judgments align well with human assessments and suggesting that more capable LLM judges could improve agreement further.
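A common way to quantify this kind of agreement is a rank correlation between system scores under the automatic metric and under human judgments. The sketch below uses Kendall's tau via SciPy with made-up scores, purely to illustrate the methodology; the numbers are not from the paper.

```python
from scipy.stats import kendalltau

# Hypothetical per-system scores (same systems, in the same order in both lists).
auto_scores  = [0.72, 0.65, 0.58, 0.51, 0.44]   # e.g. Auto-ARGUE sentence precision
human_scores = [0.70, 0.60, 0.62, 0.48, 0.40]   # e.g. human-judged precision

tau, p_value = kendalltau(auto_scores, human_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```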
The introduction of Auto-ARGUE and ARGUE-viz marks a significant step forward in the automatic evaluation of report generation systems. By providing a robust, configurable, and easy-to-use tool, the researchers aim to facilitate further advancements in this critical area of RAG applications. For more in-depth information, you can read the full research paper here.


