Unveiling DRAGON: A Dynamic Benchmark for RAG Systems in Russian News

TLDR: DRAGON is the first dynamic benchmark for evaluating Retrieval-Augmented Generation (RAG) systems in Russian, using a continuously updated news corpus. It features automatic question generation from a Knowledge Graph, an open-source evaluation framework, and a public leaderboard. The benchmark aims to provide a more realistic and comprehensive assessment of RAG system performance, addressing the limitations of static, English-centric benchmarks.

Retrieval-Augmented Generation (RAG) helps large language models (LLMs) give more accurate and up-to-date answers by pulling in external knowledge. As RAG systems grow more popular, especially in areas like open-domain question answering and customer support, evaluating them reliably matters more. Most existing benchmarks are designed for English and rely on static datasets, which don’t reflect how real-world information constantly changes.

Introducing DRAGON: A Dynamic Benchmark

To address this gap, researchers have introduced DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark specifically designed for evaluating RAG systems in Russian. DRAGON stands out because it uses a constantly updated collection of Russian news and public documents, making it a more realistic test environment for RAG systems. It allows for a thorough evaluation of both the ‘retriever’ part (which finds the information) and the ‘generator’ part (which creates the answer).

The benchmark automatically generates questions using a Knowledge Graph built from the news corpus. This process creates four main types of questions: simple, set, multi-hop, and conditional, each designed to test different aspects of a RAG system’s ability to understand and use information.
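To make the four question types more concrete, here is a minimal sketch of how each one could be derived from Knowledge Graph triples. The article does not define the types in detail, so the pattern definitions below are assumptions based on the type names, and the triples, entity names, and function names are invented for illustration. This is not the DRAGON pipeline itself.

```python
from collections import defaultdict
from itertools import combinations

# Toy (subject, relation, object) triples; invented for illustration only.
TRIPLES = [
    ("Acme Corp", "acquired", "Beta Labs"),
    ("Acme Corp", "acquired", "Gamma Soft"),
    ("Beta Labs", "headquartered_in", "Kazan"),
    ("Acme Corp", "led_by", "Ivan Petrov"),
]


def group_objects(triples):
    """Map each (subject, relation) pair to the list of its objects."""
    groups = defaultdict(list)
    for s, r, o in triples:
        groups[(s, r)].append(o)
    return groups


def simple_patterns(triples):
    """Assumed 'simple' type: a (subject, relation) pair with exactly one object."""
    return [
        {"type": "simple", "subject": s, "relation": r, "answer": objs[0]}
        for (s, r), objs in group_objects(triples).items()
        if len(objs) == 1
    ]


def set_patterns(triples):
    """Assumed 'set' type: a (subject, relation) pair with several objects."""
    return [
        {"type": "set", "subject": s, "relation": r, "answer": sorted(objs)}
        for (s, r), objs in group_objects(triples).items()
        if len(objs) > 1
    ]


def multi_hop_patterns(triples):
    """Assumed 'multi-hop' type: two triples chained through a bridge entity."""
    out = []
    for s, r1, bridge in triples:
        for s2, r2, o in triples:
            if s2 == bridge and o != s:
                out.append({"type": "multi_hop", "subject": s,
                            "relations": [r1, r2], "answer": o})
    return out


def conditional_patterns(triples):
    """Assumed 'conditional' type: one subject constrained by two different relations."""
    by_subject = defaultdict(list)
    for s, r, o in triples:
        by_subject[s].append((r, o))
    out = []
    for s, facts in by_subject.items():
        for (r1, o1), (r2, o2) in combinations(facts, 2):
            if r1 != r2:
                out.append({"type": "conditional",
                            "conditions": [(r1, o1), (r2, o2)], "answer": s})
    return out


if __name__ == "__main__":
    for build in (simple_patterns, set_patterns,
                  multi_hop_patterns, conditional_patterns):
        for pattern in build(TRIPLES):
            print(pattern)
```

In a real system each pattern would then be turned into a natural-language question, for example by an LLM; that step is left out here.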

Key Contributions and How It Works

The DRAGON project offers several important contributions:

It’s the first RAG benchmark for Russian with a regularly updated knowledge base, designed for dynamic evaluation.
An open-source evaluation framework is provided, including a reusable question generation pipeline and evaluation scripts. This framework is flexible and can potentially be adapted for other languages and multilingual settings.
A public, regularly updated leaderboard has been launched to encourage community participation and comparison of RAG system performance.

The system works by periodically collecting content from popular Russian news websites. The new content is processed, and a Knowledge Graph is extracted from it. This graph is then used to generate new, relevant question-answer pairs. Finally, the pairs are filtered for quality using linguistic checks and an LLM-as-a-Judge approach, so that only high-quality questions make it into the final dataset.
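The filtering stage described above can be sketched as a two-step gate: cheap rule-based checks first, then a judge score. The specific rules, the threshold, and every name below (`QAPair`, `passes_linguistic_checks`, `stub_judge`) are hypothetical and chosen only for illustration; the judge is a stand-in function where a real pipeline would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    question: str
    answer: str


def passes_linguistic_checks(pair: QAPair) -> bool:
    """Hypothetical rule-based checks applied before the (more costly) judge."""
    q, a = pair.question.strip(), pair.answer.strip()
    if not q.endswith("?"):        # must be phrased as a question
        return False
    if len(q.split()) < 4:         # too short to be meaningful
        return False
    if not a:                      # empty answer
        return False
    if a.lower() in q.lower():     # answer leaked into the question
        return False
    return True


def filter_pairs(pairs: Iterable[QAPair],
                 judge: Callable[[QAPair], float],
                 threshold: float = 0.5) -> List[QAPair]:
    """Keep pairs that pass the rule checks and score at or above the threshold."""
    return [p for p in pairs
            if passes_linguistic_checks(p) and judge(p) >= threshold]


def stub_judge(pair: QAPair) -> float:
    """Placeholder for an LLM-as-a-Judge call; here it only rejects long answers."""
    return 1.0 if len(pair.answer.split()) <= 5 else 0.0


if __name__ == "__main__":
    candidates = [
        QAPair("Which company did Acme Corp acquire in March?", "Beta Labs"),
        QAPair("Who leads Acme Corp", "Ivan Petrov"),
        QAPair("Where is Beta Labs headquartered, Kazan or Moscow?", "Kazan"),
        QAPair("What did the spokesperson say about the merger?",
               "a very long rambling answer that goes on and on"),
    ]
    for pair in filter_pairs(candidates, stub_judge):
        print(pair)
```

Running the cheap checks first means the judge only sees candidates that are already well-formed, which keeps the number of judge calls down.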

For users, a Python library called `rag_bench` simplifies the evaluation process. Users can fetch the latest datasets, apply their RAG systems, and submit results to a secure validation portal. This portal evaluates submissions using private datasets, ensuring ground-truth data remains secure, and then publishes approved results to the public leaderboard.

Evaluation and Performance

The researchers conducted experiments to assess the quality of the generated questions and the performance of various RAG systems. Human evaluators confirmed the high quality and context dependency of the generated questions. In terms of retrieval, models like Qwen3 Embedding 8B and E5 Mistral 7B Instruct showed strong performance. For end-to-end RAG system evaluation, the choice of both the retrieval model and the language model significantly impacted results. The study highlighted that traditional metrics like ROUGE-L might not fully capture all aspects of RAG performance, emphasizing the importance of the LLM-as-a-Judge evaluation.
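One way to see why an overlap metric can fall short is to compute ROUGE-L by hand. The sketch below implements the F1 variant of ROUGE-L, based on the longest common subsequence (LCS) of whitespace tokens. The example sentences are invented and are not taken from the benchmark, and real implementations usually add stemming and more careful tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the central bank raised the key rate"
    paraphrase = "the regulator increased its benchmark interest rate"
    wrong = "the central bank lowered the key rate"

    print(round(rouge_l_f1(reference, paraphrase), 3))  # 0.286
    print(round(rouge_l_f1(reference, wrong), 3))       # 0.857
```

In this toy case the factually wrong answer scores far higher than the correct paraphrase, because ROUGE-L rewards shared word order rather than meaning. A judge model that reads both answers is one way to catch this kind of mismatch.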

Future Directions and Considerations

While DRAGON is a significant step forward, the authors acknowledge limitations such as its primary focus on news content and the Russian language, and the ongoing challenge of evaluation metrics fully reflecting human judgment. Future work aims to expand dataset diversity, refine evaluation criteria, and open-source previous snapshots of datasets for reproducibility.

The paper also touches upon ethical considerations, including addressing bias, ensuring data privacy, combating misinformation, promoting transparency, and considering the broader societal impacts of RAG systems. For more detailed information, you can refer to the full research paper: DRAGON: Dynamic RAG Benchmark On News.
