TLDR: This research paper introduces a novel multi-agent framework for generating synthetic Question-Answering (QA) datasets specifically designed for evaluating Retrieval-Augmented Generation (RAG) systems. The framework employs a Diversity agent for broad topical coverage, a Privacy agent for detecting and masking sensitive information, and a QA Curation agent for synthesizing high-quality QA pairs. Experiments demonstrate that this approach significantly enhances dataset diversity compared to baseline methods and achieves robust privacy masking across various domain-specific datasets, laying a foundation for more comprehensive and ethically aligned RAG system evaluations.
In the rapidly evolving landscape of artificial intelligence, Retrieval-Augmented Generation (RAG) systems have emerged as a powerful tool to enhance the capabilities of large language models (LLMs). By integrating external knowledge, RAG systems enable LLMs to produce more informed and context-aware responses, finding applications in everything from specialized chatbots to code completion. However, the true effectiveness and trustworthiness of these systems hinge on robust evaluation methods, particularly those that account for real-world challenges like protecting sensitive information.
While much attention has been given to developing performance metrics for RAG, the quality and design of the underlying evaluation datasets often receive less focus. These datasets are crucial for meaningful and reliable assessments, yet traditional benchmarks frequently fall short in reflecting the complexity and variability of real-world use cases, often lacking coverage of novel or underrepresented topics. This gap makes it difficult to generalize evaluation results, especially in domains requiring high precision and expertise.
Addressing these critical challenges, a new research paper titled “Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework” introduces a novel approach to creating synthetic Question-Answering (QA) datasets. Authored by Ilias DRIOUICH, Hongliu CAO, and Eoin THOMAS from AMADEUS France, this work focuses on generating evaluation datasets that prioritize both semantic diversity and privacy preservation.
A Multi-Agent Framework for Smarter Evaluation
The core of their solution is a modular multi-agent framework, designed to systematically generate high-quality synthetic QA datasets. This framework involves three specialized agents, each playing a distinct role:
- Diversity Agent: This agent uses clustering techniques to group similar documents from an original dataset. By selecting representative samples from each cluster, it ensures a broad coverage of topics and maximizes the semantic variability of the generated data.
- Privacy Agent: Operating on the diverse samples, this agent detects and masks sensitive information, such as Personally Identifiable Information (PII), across various domains. It pseudonymizes identified entities, creating a private version of the data and generating a detailed privacy report.
- QA Curation Agent: Finally, this agent synthesizes question-answer pairs from the privacy-preserved data. It leverages advanced prompting techniques to generate evaluation-ready samples and provides a QA generation report summarizing success rates and generation dynamics.
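The Diversity agent's cluster-then-sample step can be illustrated with a minimal sketch. It assumes documents have already been embedded as numeric vectors and uses a plain k-means with a deterministic farthest-point initialization; the article does not say which clustering algorithm or embedding model the authors use, so the function names and the toy vectors below are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def init_centroids(vectors, k):
    """Deterministic farthest-point init (an assumption, not the paper's method)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(vectors, k, iters=20):
    """Plain k-means: assign each vector to its nearest centroid, then update."""
    centroids = init_centroids(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
    return labels, centroids

def select_representatives(vectors, k):
    """Return, for each cluster, the index of the vector closest to its centroid."""
    labels, centroids = kmeans(vectors, k)
    reps = []
    for c in range(k):
        idxs = [i for i, l in enumerate(labels) if l == c]
        if idxs:
            reps.append(min(idxs, key=lambda i: dist(vectors[i], centroids[c])))
    return sorted(reps)

# Toy 2-D "embeddings": two tight groups of three documents each.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
              [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]]
representatives = select_representatives(embeddings, k=2)
print(representatives)
```

Picking the document nearest each centroid is only one way to choose a "representative" sample; sampling several documents per cluster would be an equally reasonable reading of the description.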
The entire process is orchestrated with the LangGraph framework. GPT-4o powers the Diversity and QA Curation agents because of its generation capabilities, while GPT-4.1 powers the Privacy agent, chosen for its stronger reasoning in PII detection.
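To make the data flow between the three agents concrete, here is a rough sketch in plain Python. It does not use LangGraph or any LLM: each agent is a hypothetical stand-in function that reads and extends a shared state dictionary, the privacy step uses two illustrative regular expressions instead of model-based detection, and the QA step fills a fixed template. All names (`run_pipeline`, `PII_PATTERNS`, the state keys) are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for LLM-based PII detection: one regex per entity label.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def diversity_agent(state):
    """Stand-in: keep the first document seen for each distinct topic tag."""
    seen, picked = set(), []
    for doc in state["documents"]:
        if doc["topic"] not in seen:
            seen.add(doc["topic"])
            picked.append(doc)
    return {**state, "selected": picked}

def privacy_agent(state):
    """Replace matched entities with numbered placeholders and log each one."""
    masked_docs, report = [], []
    for doc in state["selected"]:
        text = doc["text"]
        for label, pattern in PII_PATTERNS.items():
            for n, match in enumerate(pattern.findall(text), start=1):
                text = text.replace(match, f"[{label}_{n}]")
                report.append({"label": label, "original": match})
        masked_docs.append({**doc, "text": text})
    return {**state, "private": masked_docs, "privacy_report": report}

def qa_curation_agent(state):
    """Stand-in: build one templated QA pair per masked document."""
    pairs = [
        {"question": f"What does the document about {d['topic']} state?",
         "answer": d["text"]}
        for d in state["private"]
    ]
    return {**state, "qa_pairs": pairs}

def run_pipeline(documents):
    """Run the three stages in order, threading the state through each."""
    state = {"documents": documents}
    for agent in (diversity_agent, privacy_agent, qa_curation_agent):
        state = agent(state)
    return state

docs = [
    {"topic": "billing", "text": "Contact jane.doe@example.com about the invoice."},
    {"topic": "billing", "text": "Invoices are issued monthly."},
    {"topic": "support", "text": "Call +33 1 23 45 67 89 for help."},
]
result = run_pipeline(docs)
for pair in result["qa_pairs"]:
    print(pair["question"], "->", pair["answer"])
```

The ordering matters: because QA pairs are generated only from the masked text, sensitive values never reach the question or answer fields.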
Demonstrated Effectiveness in Diversity and Privacy
The researchers conducted extensive experiments to validate their framework’s effectiveness. For diversity assessment, they compared their multi-agent system against two baselines: Evolutionary generation (RagasGen) and Direct Prompting (DirPmpt). Using the official EU AI Act as input, their system consistently outperformed both baselines across qualitative (LLM-as-a-Judge scores) and quantitative (Cosine Similarity to Diversity) metrics. Notably, as the size of the generated QA set increased, so did its diversity, indicating richer topic coverage and structural variation.
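As a rough illustration of how cosine similarity can be turned into a diversity score, the sketch below computes one minus the mean pairwise cosine similarity over a set of question embeddings. This is a common formulation and an assumption here; the article does not give the exact formula the authors use, and the toy vectors are made up.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def diversity_score(embeddings):
    """1 minus the mean pairwise cosine similarity; higher means more diverse."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim

# Three identical vectors versus three vectors pointing in different directions.
redundant = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varied = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(diversity_score(redundant))
print(diversity_score(varied))
```

A set of near-duplicate questions scores close to zero, while questions spread across unrelated topics score higher.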
For privacy evaluation, the team used three benchmark datasets (PII-Masking, PWI-Masking, and PHI-Masking) containing sensitive entities from different domains (personal identifiers, workplace information, and health information, respectively). The privacy agent demonstrated strong overall performance, achieving high accuracy in masking various entity types. For instance, it scored 0.91 for DISABILITYSTATUS in the PHI dataset, 0.94 for JOBTYPE in the PWI dataset, and 0.91 for LASTNAME in the PII dataset. Its consistent performance on labels that appear in more than one dataset, such as GENDER, suggests that the agent generalizes well across domains.
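Per-label scores like these can be read as the share of annotated entities of a given type that were successfully masked. The sketch below shows one plausible way to compute such a score, assuming gold annotations are available as (label, value) pairs and that an entity counts as masked when its surface form no longer appears in the output. The article does not describe the authors' exact scoring procedure, so the function and the sample data are hypothetical.

```python
from collections import defaultdict

def masking_accuracy_by_label(gold_entities, masked_text):
    """Per label, the fraction of gold entities absent from the masked text."""
    totals, hits = defaultdict(int), defaultdict(int)
    for label, value in gold_entities:
        totals[label] += 1
        if value not in masked_text:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Made-up example: one of the two last names was missed by the masking step.
gold = [("LASTNAME", "Martin"), ("LASTNAME", "Nguyen"), ("GENDER", "female")]
masked = "Ms. [LASTNAME_1] met Nguyen; the patient is [GENDER_1]."

scores = masking_accuracy_by_label(gold, masked)
print(scores)
```

A substring check is deliberately simple; a stricter evaluation would compare character spans so that a value surviving elsewhere in the text is not counted against an entity that was correctly masked.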
Paving the Way for Trustworthy AI Evaluation
This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation. By focusing on both diversity and privacy, the framework addresses critical challenges in creating reliable benchmarks for AI systems. Looking ahead, the authors plan to enhance the autonomy and collaboration of individual agents, allowing them to dynamically infer optimal clustering structures and adaptively identify PII. Future work will also include rigorous evaluation of the framework’s resilience to privacy attacks and further alignment with evolving AI regulations like the EU AI Act.


