TLDR: DRAGON is the first dynamic benchmark for evaluating Retrieval-Augmented Generation (RAG) systems in Russian, using a continuously updated news corpus. It features automatic question generation from a Knowledge Graph, an open-source evaluation framework, and a public leaderboard. The benchmark aims to provide a more realistic and comprehensive assessment of RAG system performance, addressing the limitations of static, English-centric benchmarks.
Retrieval-Augmented Generation (RAG) helps large language models (LLMs) give more accurate and up-to-date answers by pulling in external knowledge. As RAG systems spread into areas like open-domain question answering and customer support, evaluating them effectively becomes crucial. Yet most existing benchmarks are designed for English and rely on static datasets, which don’t reflect how real-world information constantly changes.
Introducing DRAGON: A Dynamic Benchmark
To address this gap, researchers have introduced DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark specifically designed for evaluating RAG systems in Russian. DRAGON stands out because it uses a constantly updated collection of Russian news and public documents, making it a more realistic test environment for RAG systems. It allows for a thorough evaluation of both the ‘retriever’ part (which finds the information) and the ‘generator’ part (which creates the answer).
The benchmark automatically generates questions using a Knowledge Graph built from the news corpus. This process creates four main types of questions: simple, set, multi-hop, and conditional, each designed to test different aspects of a RAG system’s ability to understand and use information.
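To make the question types more concrete, here is a minimal sketch of how three of them (simple, set, and multi-hop) could be derived from graph triples. It is an illustration only, not the DRAGON pipeline: the triples, the entity names, and the functions `simple_and_set` and `multi_hop` are all invented for this example, and conditional questions are left out because their exact construction is not described here.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples; every entity here is invented.
TRIPLES = [
    ("Acme", "headquartered_in", "Kazan"),
    ("Acme", "acquired", "Borealis"),
    ("Acme", "acquired", "Cetus"),
    ("Borealis", "founded_by", "I. Petrov"),
]


def simple_and_set(triples):
    """Split (subject, relation) groups into single-answer and multi-answer items."""
    grouped = defaultdict(list)
    for s, r, o in triples:
        grouped[(s, r)].append(o)
    simple, set_items = [], []
    for (s, r), objs in grouped.items():
        item = {"subject": s, "relation": r, "answers": sorted(objs)}
        # One object -> "simple" question; several objects -> "set" question.
        (simple if len(objs) == 1 else set_items).append(item)
    return simple, set_items


def multi_hop(triples):
    """Chain (a, r1, b) with (b, r2, c); the question asks for c without naming b."""
    by_subject = defaultdict(list)
    for s, r, o in triples:
        by_subject[s].append((r, o))
    chains = []
    for a, r1, b in triples:
        for r2, c in by_subject.get(b, []):
            chains.append({"start": a, "path": [r1, r2], "bridge": b, "answers": [c]})
    return chains


simple, set_items = simple_and_set(TRIPLES)
chains = multi_hop(TRIPLES)
print(len(simple), len(set_items), len(chains))
```

In this toy graph, "Which companies did Acme acquire?" is a set question, while "Who founded the company that Acme acquired?" needs two hops through a bridge entity. Turning these structured items into natural-language questions would be a separate step.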
Key Contributions and How It Works
The DRAGON project offers several important contributions:
- It’s the first RAG benchmark for Russian with a regularly updated knowledge base, designed for dynamic evaluation.
- An open-source evaluation framework is provided, including a reusable question generation pipeline and evaluation scripts. This framework is flexible and can potentially be adapted for other languages and multilingual settings.
- A public, regularly updated leaderboard has been launched to encourage community participation and comparison of RAG system performance.
The system works by periodically collecting content from popular Russian news websites. This new content is then processed, and a Knowledge Graph is extracted. This graph helps in generating new, relevant question-answer pairs. These pairs are then filtered for quality using linguistic checks and even an LLM-as-a-Judge approach, ensuring only high-quality questions make it into the final dataset.
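The two-stage filtering idea can be sketched as follows. This is a rough illustration under stated assumptions: the specific checks in `passes_basic_checks`, the 0.5 threshold, and the `judge` callable (a stand-in for an LLM call that returns a score between 0 and 1) are hypothetical and are not taken from the DRAGON code.

```python
from typing import Callable, Iterable, List, Tuple

QAPair = Tuple[str, str]


def passes_basic_checks(question: str, answer: str) -> bool:
    """Cheap rule-based checks (assumed, not DRAGON's actual rules)."""
    q, a = question.strip(), answer.strip()
    if not q.endswith("?") or not a:
        return False  # malformed question or empty answer
    if a.lower() in q.lower():
        return False  # the answer is leaked inside the question
    return 3 <= len(q.split()) <= 40  # reject very short or very long questions


def filter_pairs(
    pairs: Iterable[QAPair],
    judge: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[QAPair]:
    """Keep pairs that pass the rule-based checks and then the judge."""
    kept = []
    for question, answer in pairs:
        # Run the cheap checks first so the (expensive) judge sees fewer pairs.
        if passes_basic_checks(question, answer) and judge(question, answer) >= threshold:
            kept.append((question, answer))
    return kept
```

Running the rule-based checks before the judge is a cost choice: in a real setup each judge call would be a model request, so it makes sense to discard obviously broken pairs first.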
For users, a Python library called `rag_bench` simplifies the evaluation process. Users can fetch the latest datasets, apply their RAG systems, and submit results to a secure validation portal. This portal evaluates submissions using private datasets, ensuring ground-truth data remains secure, and then publishes approved results to the public leaderboard.
Evaluation and Performance
The researchers ran experiments to assess both the quality of the generated questions and the performance of various RAG systems. Human evaluators confirmed that the generated questions are high quality and context dependent. In retrieval, models such as Qwen3 Embedding 8B and E5 Mistral 7B Instruct performed strongly. In end-to-end evaluation, the choice of both the retrieval model and the language model significantly affected results. The study also found that traditional metrics like ROUGE-L may not capture all aspects of RAG performance, which underscores the value of LLM-as-a-Judge evaluation.
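A small example shows why an overlap metric like ROUGE-L can miss what matters. The sketch below computes a simplified ROUGE-L F1 score from the longest common subsequence (LCS) of whitespace-separated, lowercased tokens. It is an assumption-laden simplification: standard implementations may add stemming or weight recall differently, and the sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the bank raised the rate"
contradiction = "the bank did not raise the rate"  # opposite meaning
paraphrase = "the central bank lifted rates"       # same meaning, new wording

print(round(rouge_l_f1(reference, contradiction), 2))  # 0.67
print(round(rouge_l_f1(reference, paraphrase), 2))     # 0.4
```

Here the contradictory answer scores higher than the correct paraphrase because it shares more words in order with the reference. A judge that reads for meaning would rank the two the other way around.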
Future Directions and Considerations
While DRAGON is a significant step forward, the authors acknowledge limitations such as its primary focus on news content and the Russian language, and the ongoing challenge of evaluation metrics fully reflecting human judgment. Future work aims to expand dataset diversity, refine evaluation criteria, and open-source previous snapshots of datasets for reproducibility.
The paper also touches upon ethical considerations, including addressing bias, ensuring data privacy, combating misinformation, promoting transparency, and considering the broader societal impacts of RAG systems. For more detailed information, you can refer to the full research paper: DRAGON: Dynamic RAG Benchmark On News.


