
KGQAGen: A Framework for High-Quality Knowledge Graph Question Answering Datasets

TLDR: The paper introduces KGQAGen, an LLM-in-the-loop framework to create high-quality, verifiable Knowledge Graph Question Answering (KGQA) datasets. It addresses critical issues like inaccurate answers and ambiguous questions found in existing benchmarks (e.g., WebQSP, CWQ, which average only 57% correctness). KGQAGen uses structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging QA instances. The resulting KGQAGen-10k dataset, with 96.3% factual accuracy, reveals that even state-of-the-art models struggle, highlighting the need for better retrieval and reasoning in KG-RAG systems.

Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) systems are becoming increasingly important for question answering, combining factual accuracy with structured inference. These systems rely heavily on high-quality benchmark datasets to measure progress and guide development. However, recent research has uncovered significant quality issues in popular KGQA datasets, compromising their utility.

A detailed manual audit of 16 widely used KGQA datasets, including prominent ones like WebQSP and CWQ, revealed a concerning average factual correctness rate of only 57%. Specific problems identified include inaccurate or outdated ground-truth annotations; poorly constructed questions that are ambiguous, trivial, or unanswerable; and outdated or inconsistent knowledge bases. For instance, WebQSP and CWQ, the dominant evaluation benchmarks, showed correctness rates of only 52% and 49.33%, respectively.

To address these critical shortcomings, researchers introduced KGQAGen, an innovative LLM-in-the-loop framework designed to systematically resolve these pitfalls and construct high-quality benchmarks for KG-RAG systems. KGQAGen integrates structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable question-answering instances.

How KGQAGen Works

The KGQAGen framework operates in three main stages:

1. Seed Subgraph Initialization: The process begins by selecting a seed entity from a diverse set of topics (like Wikipedia Vital Articles) and constructing an initial local subgraph by retrieving related facts from a knowledge graph, such as Wikidata. This provides the initial context for reasoning.
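The retrieval step above can be sketched as a one-hop SPARQL query against Wikidata. This is an illustrative stand-in, not the paper's actual retrieval code: the QID, result limit, and the assumption that the query runs on the Wikidata Query Service (where the `wd:` prefix is predefined) are all choices made here for the example.

```python
# Illustrative sketch of Stage 1: build a SPARQL query that pulls the
# one-hop facts around a seed entity. Assumes the Wikidata Query Service,
# where the wd: prefix is predefined; QID and limit are arbitrary.

def one_hop_query(seed_qid, limit=50):
    """Return a SPARQL query for the outgoing triples of a seed entity."""
    return f"""
SELECT ?p ?o WHERE {{
  wd:{seed_qid} ?p ?o .
}}
LIMIT {limit}
""".strip()

query = one_hop_query("Q937")  # Q937 = Albert Einstein on Wikidata
```

Executing such a query yields the initial local subgraph that seeds the expansion loop in the next stage.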

2. Question Generation through Iterative LLM-Guided Subgraph Expansion: To create complex, multi-hop questions, the subgraph is iteratively expanded by traversing neighboring entities and relations. A Large Language Model (LLM) guides this expansion, evaluating whether the current subgraph contains enough information to support a well-formed, multi-hop question. Once deemed sufficient, the LLM generates a natural language question, identifies the corresponding answer set, extracts a minimal supporting subgraph, and constructs an associated SPARQL query. This ensures questions require at least two-hop reasoning, are specific, unambiguous, and naturally phrased.
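The expansion loop can be sketched as follows. The toy knowledge graph and the sufficiency predicate are stand-ins: the paper uses Wikidata and an LLM prompt to judge whether the subgraph supports a well-formed multi-hop question, whereas here a simple triple-count check plays that role.

```python
# Sketch of Stage 2: iterative subgraph expansion gated by a sufficiency
# judge. TOY_KG and the lambda judge are illustrative stand-ins for
# Wikidata and the LLM-based check described in the paper.

TOY_KG = {
    "Q937":   [("Q937", "award received", "Q38104")],
    "Q38104": [("Q38104", "conferred by", "Q39572")],
    "Q39572": [("Q39572", "headquarters location", "Q1754")],
}

def expand(subgraph, frontier, kg):
    """One hop of expansion: add facts about entities on the frontier."""
    added, next_frontier = [], set()
    for ent in frontier:
        for triple in kg.get(ent, []):
            if triple not in subgraph and triple not in added:
                added.append(triple)
                next_frontier.add(triple[2])  # follow the object entity
    return subgraph + added, next_frontier

def grow_until_sufficient(seed, kg, sufficient, max_rounds=5):
    """Expand from a seed until the judge deems a multi-hop question supportable."""
    subgraph = list(kg.get(seed, []))
    frontier = {t[2] for t in subgraph}
    for _ in range(max_rounds):
        if sufficient(subgraph):
            return subgraph  # the LLM would now write the question, answers, SPARQL
        subgraph, frontier = expand(subgraph, frontier, kg)
    return None  # never became sufficient; discard the seed

# Toy judge: "sufficient" once at least two hops of facts are available
result = grow_until_sufficient("Q937", TOY_KG, lambda g: len(g) >= 2)
```

Once the judge accepts, the real framework hands the subgraph to the LLM to produce the question, answer set, minimal supporting subgraph, and SPARQL query in one step.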

3. Answer Validation and Refinement: The final stage ensures that each generated question-answer pair is faithfully grounded in the knowledge graph. The generated SPARQL query is executed against the knowledge base. If the results match the LLM-generated answer set, the instance is accepted. If not, a lightweight LLM (GPT-4o-mini) attempts to revise the SPARQL query, with a maximum of three attempts. This conservative filtering ensures only verifiable and KG-grounded instances are retained.
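The validation stage can be sketched as a check-and-revise loop. Here `run_sparql` and `revise_query` are stand-ins for endpoint execution and the GPT-4o-mini revision call, and the exact attempt accounting may differ from the paper's.

```python
# Sketch of Stage 3: accept an instance only if executing its SPARQL
# reproduces the LLM-proposed answer set, allowing a bounded number of
# revision attempts. run_sparql / revise_query are illustrative stubs.

def validate(query, proposed_answers, run_sparql, revise_query, max_attempts=3):
    for _ in range(max_attempts):
        if set(run_sparql(query)) == set(proposed_answers):
            return query  # verified against the KG: keep the instance
        query = revise_query(query)
    return None  # still mismatched: discard (conservative filtering)

# Toy endpoint: only the "fixed" query yields the right answer set
run = lambda q: ["Q38104"] if q == "fixed" else []
fix = lambda q: "fixed"
accepted = validate("broken", ["Q38104"], run, fix)
```

Returning `None` for unverifiable instances is what makes the filtering conservative: anything that cannot be reproduced from the knowledge graph never enters the dataset.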

Using this framework, the researchers constructed KGQAGen-10k, a benchmark of 10,787 instances grounded in Wikidata. A manual audit of 300 samples from KGQAGen-10k revealed an impressive 96.3% factual accuracy, demonstrating the framework’s effectiveness in producing reliable and well-grounded QA instances. The dataset features questions of moderate to deep linguistic complexity, with broad topic coverage across arts, astronomy, STEM fields, sports, geography, and philosophy.


Benchmarking Results and Insights

The KGQAGen-10k dataset was used to benchmark a diverse set of models, including pure LLMs and KG-RAG approaches. The results were insightful:

  • Even state-of-the-art systems like GPT-4.1 and recent KG-RAG models such as GCR and PoG achieved only moderate performance on KGQAGen-10k, highlighting the challenging nature of the benchmark and the limitations of existing models.
  • The study introduced LLM-Assisted Semantic Match (LASM) as an evaluation metric, which consistently yielded higher reported performance than traditional Exact Match (EM). This indicates that many predictions marked incorrect by EM were semantically correct, underscoring the importance of semantic-aware evaluation.
  • KG-RAG models showed noticeable gains over their pure LLM backbones, confirming that incorporating external knowledge graph context enhances QA performance. However, the overall improvement was moderate, suggesting that retrieval components in current KG-RAG systems are still suboptimal.
  • Models provided with the ground truth supporting subgraph (LLM-SP) achieved the strongest performance by a substantial margin. This highlights the critical role of high-quality retrieval in KG-RAG systems and suggests that retrieval remains a major bottleneck.
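The gap between Exact Match and LASM noted above can be illustrated with a toy contrast. The real LASM metric prompts an LLM judge; here a tiny alias table stands in for that judge, which is an assumption made purely for the example.

```python
# Toy contrast between Exact Match (EM) and an LASM-style semantic check.
# A small alias table stands in for the LLM judge used by the real metric.

def exact_match(pred, gold):
    return pred.strip().lower() == gold.strip().lower()

def semantic_match(pred, gold, judge):
    """Fall back to a judge when surface forms differ but meaning may agree."""
    return exact_match(pred, gold) or judge(pred, gold)

ALIASES = {frozenset({"nyc", "new york city"})}
toy_judge = lambda p, g: frozenset({p.lower(), g.lower()}) in ALIASES

em = exact_match("NYC", "New York City")                   # False: strings differ
lasm = semantic_match("NYC", "New York City", toy_judge)   # True: judge accepts
```

A prediction like "NYC" against the gold answer "New York City" is scored wrong by EM but right by the semantic check, which is exactly the kind of case driving LASM's higher reported scores.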

In conclusion, KGQAGen offers a scalable and effective framework for constructing challenging, high-quality benchmarks that can drive future progress in KG-RAG systems. The findings advocate for more rigorous benchmark construction and provide a valuable tool for diagnosing and addressing pitfalls in existing datasets. For more details, you can refer to the original research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
