
KGQAGen: A Framework for High-Quality Knowledge Graph Question Answering Datasets

TLDR: The paper introduces KGQAGen, an LLM-in-the-loop framework to create high-quality, verifiable Knowledge Graph Question Answering (KGQA) datasets. It addresses critical issues like inaccurate answers and ambiguous questions found in existing benchmarks (e.g., WebQSP, CWQ, which average only 57% correctness). KGQAGen uses structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging QA instances. The resulting KGQAGen-10k dataset, with 96.3% factual accuracy, reveals that even state-of-the-art models struggle, highlighting the need for better retrieval and reasoning in KG-RAG systems.

Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) systems are becoming increasingly important for question answering, combining factual accuracy with structured inference. These systems rely heavily on high-quality benchmark datasets to measure progress and guide development. However, recent research has uncovered significant quality issues in popular KGQA datasets, compromising their utility.

A detailed manual audit of 16 widely used KGQA datasets, including prominent ones like WebQSP and CWQ, revealed a concerning average factual correctness rate of only 57%. Specific problems identified include inaccurate or outdated ground-truth annotations; poorly constructed questions that are ambiguous, trivial, or unanswerable; and outdated or inconsistent knowledge bases. For instance, WebQSP and CWQ, the dominant evaluation benchmarks, showed correctness rates of only 52% and 49.33%, respectively.

To address these critical shortcomings, researchers introduced KGQAGen, an innovative LLM-in-the-loop framework designed to systematically resolve these pitfalls and construct high-quality benchmarks for KG-RAG systems. KGQAGen integrates structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable question-answering instances.

How KGQAGen Works

The KGQAGen framework operates in three main stages:

1. Seed Subgraph Initialization: The process begins by selecting a seed entity from a diverse set of topics (like Wikipedia Vital Articles) and constructing an initial local subgraph by retrieving related facts from a knowledge graph, such as Wikidata. This provides the initial context for reasoning.
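The retrieval step above can be sketched as a one-hop SPARQL query against Wikidata. This is an illustrative stand-in, not the paper's actual retrieval code: the QID, result limit, and the assumption that the query runs on the Wikidata Query Service (where the `wd:` prefix is predefined) are all choices made here for the example.

```python
# Illustrative sketch of Stage 1: build a SPARQL query that pulls the
# one-hop facts around a seed entity. Assumes the Wikidata Query Service,
# where the wd: prefix is predefined; QID and limit are arbitrary.

def one_hop_query(seed_qid, limit=50):
    """Return a SPARQL query for the outgoing triples of a seed entity."""
    return f"""
SELECT ?p ?o WHERE {{
  wd:{seed_qid} ?p ?o .
}}
LIMIT {limit}
""".strip()

query = one_hop_query("Q937")  # Q937 = Albert Einstein on Wikidata
```

Executing such a query yields the initial local subgraph that seeds the expansion loop in the next stage.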

2. Question Generation through Iterative LLM-Guided Subgraph Expansion: To create complex, multi-hop questions, the subgraph is iteratively expanded by traversing neighboring entities and relations. A Large Language Model (LLM) guides this expansion, evaluating whether the current subgraph contains enough information to support a well-formed, multi-hop question. Once deemed sufficient, the LLM generates a natural language question, identifies the corresponding answer set, extracts a minimal supporting subgraph, and constructs an associated SPARQL query. This ensures questions require at least two-hop reasoning, are specific, unambiguous, and naturally phrased.
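The expansion loop can be sketched as follows. The toy knowledge graph and the sufficiency predicate are stand-ins: the paper uses Wikidata and an LLM prompt to judge whether the subgraph supports a well-formed multi-hop question, whereas here a simple triple-count check plays that role.

```python
# Sketch of Stage 2: iterative subgraph expansion gated by a sufficiency
# judge. TOY_KG and the lambda judge are illustrative stand-ins for
# Wikidata and the LLM-based check described in the paper.

TOY_KG = {
    "Q937":   [("Q937", "award received", "Q38104")],
    "Q38104": [("Q38104", "conferred by", "Q39572")],
    "Q39572": [("Q39572", "headquarters location", "Q1754")],
}

def expand(subgraph, frontier, kg):
    """One hop of expansion: add facts about entities on the frontier."""
    added, next_frontier = [], set()
    for ent in frontier:
        for triple in kg.get(ent, []):
            if triple not in subgraph and triple not in added:
                added.append(triple)
                next_frontier.add(triple[2])  # follow the object entity
    return subgraph + added, next_frontier

def grow_until_sufficient(seed, kg, sufficient, max_rounds=5):
    """Expand from a seed until the judge deems a multi-hop question supportable."""
    subgraph = list(kg.get(seed, []))
    frontier = {t[2] for t in subgraph}
    for _ in range(max_rounds):
        if sufficient(subgraph):
            return subgraph  # the LLM would now write the question, answers, SPARQL
        subgraph, frontier = expand(subgraph, frontier, kg)
    return None  # never became sufficient; discard the seed

# Toy judge: "sufficient" once at least two hops of facts are available
result = grow_until_sufficient("Q937", TOY_KG, lambda g: len(g) >= 2)
```

Once the judge accepts, the real framework hands the subgraph to the LLM to produce the question, answer set, minimal supporting subgraph, and SPARQL query in one step.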

3. Answer Validation and Refinement: The final stage ensures that each generated question-answer pair is faithfully grounded in the knowledge graph. The generated SPARQL query is executed against the knowledge base. If the results match the LLM-generated answer set, the instance is accepted. If not, a lightweight LLM (GPT-4o-mini) attempts to revise the SPARQL query, with a maximum of three attempts. This conservative filtering ensures only verifiable and KG-grounded instances are retained.
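The validation stage can be sketched as a check-and-revise loop. Here `run_sparql` and `revise_query` are stand-ins for endpoint execution and the GPT-4o-mini revision call, and the exact attempt accounting may differ from the paper's.

```python
# Sketch of Stage 3: accept an instance only if executing its SPARQL
# reproduces the LLM-proposed answer set, allowing a bounded number of
# revision attempts. run_sparql / revise_query are illustrative stubs.

def validate(query, proposed_answers, run_sparql, revise_query, max_attempts=3):
    for _ in range(max_attempts):
        if set(run_sparql(query)) == set(proposed_answers):
            return query  # verified against the KG: keep the instance
        query = revise_query(query)
    return None  # still mismatched: discard (conservative filtering)

# Toy endpoint: only the "fixed" query yields the right answer set
run = lambda q: ["Q38104"] if q == "fixed" else []
fix = lambda q: "fixed"
accepted = validate("broken", ["Q38104"], run, fix)
```

Returning `None` for unverifiable instances is what makes the filtering conservative: anything that cannot be reproduced from the knowledge graph never enters the dataset.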

Using this framework, the researchers constructed KGQAGen-10k, a benchmark of 10,787 instances grounded in Wikidata. A manual audit of 300 samples from KGQAGen-10k revealed an impressive 96.3% factual accuracy, demonstrating the framework’s effectiveness in producing reliable and well-grounded QA instances. The dataset features questions of moderate to deep linguistic complexity, with broad topic coverage across arts, astronomy, STEM fields, sports, geography, and philosophy.


Benchmarking Results and Insights

The KGQAGen-10k dataset was used to benchmark a diverse set of models, including pure LLMs and KG-RAG approaches. The results were insightful:

  • Even state-of-the-art systems like GPT-4.1 and recent KG-RAG models such as GCR and PoG achieved only moderate performance on KGQAGen-10k, highlighting the challenging nature of the benchmark and the limitations of existing models.
  • The study introduced LLM-Assisted Semantic Match (LASM) as an evaluation metric, which consistently yielded higher reported performance than traditional Exact Match (EM). This indicates that many predictions marked incorrect by EM were semantically correct, underscoring the importance of semantic-aware evaluation.
  • KG-RAG models showed noticeable gains over their pure LLM backbones, confirming that incorporating external knowledge graph context enhances QA performance. However, the overall improvement was moderate, suggesting that retrieval components in current KG-RAG systems are still suboptimal.
  • Models provided with the ground truth supporting subgraph (LLM-SP) achieved the strongest performance by a substantial margin. This highlights the critical role of high-quality retrieval in KG-RAG systems and suggests that retrieval remains a major bottleneck.
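The gap between Exact Match and LASM noted above can be illustrated with a toy contrast. The real LASM metric prompts an LLM judge; here a tiny alias table stands in for that judge, which is an assumption made purely for the example.

```python
# Toy contrast between Exact Match (EM) and an LASM-style semantic check.
# A small alias table stands in for the LLM judge used by the real metric.

def exact_match(pred, gold):
    return pred.strip().lower() == gold.strip().lower()

def semantic_match(pred, gold, judge):
    """Fall back to a judge when surface forms differ but meaning may agree."""
    return exact_match(pred, gold) or judge(pred, gold)

ALIASES = {frozenset({"nyc", "new york city"})}
toy_judge = lambda p, g: frozenset({p.lower(), g.lower()}) in ALIASES

em = exact_match("NYC", "New York City")                   # False: strings differ
lasm = semantic_match("NYC", "New York City", toy_judge)   # True: judge accepts
```

A prediction like "NYC" against the gold answer "New York City" is scored wrong by EM but right by the semantic check, which is exactly the kind of case driving LASM's higher reported scores.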

In conclusion, KGQAGen offers a scalable and effective framework for constructing challenging, high-quality benchmarks that can drive future progress in KG-RAG systems. The findings advocate for more rigorous benchmark construction and provide a valuable tool for diagnosing and addressing pitfalls in existing datasets. For more details, you can refer to the original research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
