
Structured Answers: Interpretable QA Using Knowledge Graphs

TLDR: This paper introduces a question answering system that uses knowledge graphs exclusively, bypassing traditional retrieval-augmented generation (RAG) with large language models to enhance interpretability and reduce hallucination. The system generates QA pairs from documents, converts them into a knowledge graph, and then performs graph-based retrieval, reranking, and paraphrasing to produce answers. Evaluated on the CRAG benchmark using LLM-as-a-judge, it achieved competitive accuracies, demonstrating a practical alternative for domains requiring factual consistency and transparency.

Question Answering (QA) systems are designed to provide precise and relevant answers to user queries. Traditionally, many modern QA systems rely on Retrieval-Augmented Generation (RAG), where a user’s query retrieves relevant text chunks that are then processed by a large language model (LLM) to generate an answer. However, these RAG systems often depend heavily on unstructured documents and can be prone to issues like hallucination and limited transparency.

A recent research paper, titled “Interpretable Question Answering with Knowledge Graphs,” proposes an alternative approach that operates exclusively on a knowledge graph retrieval system, without using RAG with LLMs. This method aims to offer more interpretable reasoning and contextual associations, making it particularly suitable for tasks where factual consistency and traceability are crucial, such as in legal or technical fields.

The system, developed by Kartikeya Aneja, Manasvi Srivastava, Subhayan Das, and Nagender Aneja, uses a small paraphraser model to rephrase the entity-relationship edges retrieved from the knowledge graph, generating human-readable answers.

How the System Works

The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. For PDF documents, text is divided into semantically meaningful units, and a Hugging Face model (iarfmoose/t5-base-question-generator) is used to automatically generate QA pairs, extracting atomic facts and knowledge.
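The segmentation step can be sketched as follows. The paper does not specify its exact splitting heuristic, so the paragraph-then-sentence strategy, character cap, and function names below are illustrative assumptions; the actual call to the question-generation model is indicated only as a comment.

```python
import re

def split_into_units(text: str, max_chars: int = 500) -> list[str]:
    """Split raw document text into semantically meaningful units.

    Illustrative heuristic: split on blank lines (paragraphs), then group
    sentences so no unit exceeds max_chars. This stands in for whatever
    segmentation the paper's preprocessing actually uses.
    """
    units = []
    for para in re.split(r"\n\s*\n", text.strip()):
        sentences = re.split(r"(?<=[.!?])\s+", para.strip())
        current = ""
        for sent in sentences:
            if current and len(current) + len(sent) + 1 > max_chars:
                units.append(current)
                current = sent
            else:
                current = f"{current} {sent}".strip()
        if current:
            units.append(current)
    return units

# Each unit would then be passed to the QA-generation model, e.g.
# (assumed invocation -- not shown in the article):
#   qa_pairs = question_generator(unit)  # iarfmoose/t5-base-question-generator
```

Paragraph boundaries are kept as hard unit breaks here on the assumption that a QA pair should not mix facts from unrelated passages.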

The second stage converts these QA pairs into a structured knowledge graph. This is achieved using LangChain’s LLMGraphTransformer with GPT-3.5 Turbo, which identifies entities and their semantic relationships. These nodes (entities) and edges (relationships) are then stored in a Neo4j graph database. Embeddings are generated for each node and node type using the text-embedding-3-large model, enhancing the graph’s search capabilities.
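The shape of the resulting store can be illustrated with a minimal in-memory graph. This is a sketch only: the real system persists to Neo4j, and the (head, relation, tail) triples come from LLMGraphTransformer with GPT-3.5 Turbo, neither of which is reproduced here.

```python
from collections import defaultdict

class SimpleKnowledgeGraph:
    """In-memory stand-in for the Neo4j store described in the paper.

    Each triple (head, relation, tail) -- the kind of output
    LLMGraphTransformer would emit from a QA pair -- is indexed by
    entity so that a node's 1-hop context is cheap to retrieve.
    """

    def __init__(self):
        self.triples = []
        self.by_entity = defaultdict(list)

    def add_triple(self, head, relation, tail,
                   head_type="Entity", tail_type="Entity"):
        triple = {"head": head, "head_type": head_type,
                  "relation": relation,
                  "tail": tail, "tail_type": tail_type}
        self.triples.append(triple)
        self.by_entity[head].append(triple)
        self.by_entity[tail].append(triple)

    def neighbors(self, entity):
        """All edges touching an entity (its immediate relationships)."""
        return self.by_entity.get(entity, [])

# Example: a triple that might be extracted from the QA pair
# ("Who developed the system?", "Kartikeya Aneja and colleagues").
kg = SimpleKnowledgeGraph()
kg.add_triple("QA system", "DEVELOPED_BY", "Kartikeya Aneja",
              tail_type="Person")
```

In the paper, each node and node type additionally carries a text-embedding-3-large vector; those embeddings are what the retrieval steps below compare against.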

Retrieval and Answer Generation

When a user query is presented, the system employs a three-step retrieval process:

  • Node-Level Semantic Matching: It computes the cosine similarity between the input question’s embedding and the embeddings of candidate nodes in the knowledge graph, selecting highly similar nodes and their immediate relationships.
  • Type-Level Generalization Retrieval: The system identifies the node type most semantically similar to the question, retrieving all associated nodes and relationships. This is useful for general or class-level questions.
  • Fuzzy Entity Matching: Entities explicitly mentioned in the question are identified using a lightweight Hugging Face model (dslim/bert-base-NER), and fuzzy matching is performed against graph nodes, allowing for minor variations or misspellings.
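The first and third steps can be sketched with standard-library tools. The similarity threshold, the use of `difflib` as the fuzzy matcher, and the plain-list embeddings are all assumptions for illustration; the real system compares text-embedding-3-large vectors and uses dslim/bert-base-NER to extract the mentions.

```python
import math
from difflib import get_close_matches

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def node_level_match(question_emb, node_embs, threshold=0.8):
    """Step 1: keep nodes whose embedding is highly similar to the
    question embedding (threshold is an illustrative choice)."""
    return [name for name, emb in node_embs.items()
            if cosine(question_emb, emb) >= threshold]

def fuzzy_entity_match(mentioned_entities, node_names, cutoff=0.8):
    """Step 3: map NER-extracted mentions onto graph nodes, tolerating
    minor variations or misspellings. difflib stands in for the paper's
    unspecified fuzzy matcher."""
    matches = {}
    for mention in mentioned_entities:
        hits = get_close_matches(mention, node_names, n=1, cutoff=cutoff)
        if hits:
            matches[mention] = hits[0]
    return matches
```

Step 2 (type-level generalization) is the same cosine comparison applied to node-type embeddings instead of node embeddings, followed by fetching every node of the winning type.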

Following retrieval, the selected nodes and relationships are passed to a lightweight paraphrasing model (tuner007/pegasus_paraphrase) to synthesize a fluent natural language response. These paraphrased answers are then sent to a reranking model (BAAI/bge-reranker-large), which ranks them based on semantic relevance to the input question. The top five highest-ranked responses are chosen as the final answers.
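The rerank-and-select stage reduces to scoring each paraphrased candidate against the question and keeping the top five. The sketch below assumes any pluggable scorer; the token-overlap function is a deliberately crude stand-in for the BAAI/bge-reranker-large cross-encoder.

```python
def rerank_top_k(question, candidates, score_fn, k=5):
    """Rank paraphrased answers by relevance to the question and keep
    the top k, mirroring the paper's reranking stage. score_fn is any
    (question, answer) -> float scorer; the real system uses
    BAAI/bge-reranker-large."""
    scored = sorted(candidates, key=lambda c: score_fn(question, c),
                    reverse=True)
    return scored[:k]

def overlap_score(question, answer):
    """Illustrative scorer: fraction of question tokens found in the
    answer. A cross-encoder would score semantic relevance instead."""
    q = set(question.lower().split())
    a = set(answer.lower().split())
    return len(q & a) / max(len(q), 1)
```

Because `sorted` is stable, candidates with equal scores keep their retrieval order, which is a reasonable tie-breaking default.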

Evaluation and Results

The system’s performance was evaluated using an LLM-as-a-judge approach, employing Llama-3.2 and GPT-3.5 Turbo as automatic evaluators. On a custom PDF dataset, the system achieved accuracies of 89.6% with Llama-3.2 and 78.3% with GPT-3.5 Turbo. When tested on the CRAG benchmark, which includes 2398 valid question-answer pairs, the system achieved accuracies of 71.9% with Llama-3.2 and 54.4% with GPT-3.5 Turbo, demonstrating competitive performance compared to existing RAG approaches.
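The reported figures reduce to a simple ratio over binary judge verdicts. A minimal sketch, assuming each evaluator (Llama-3.2 or GPT-3.5 Turbo) returns a correct/incorrect label per question; the judging prompt itself is not reproduced in the article.

```python
def judge_accuracy(verdicts):
    """Aggregate LLM-as-a-judge verdicts (True = judged correct) into
    an accuracy percentage, as in the reported PDF and CRAG numbers."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(verdicts) / len(verdicts)
```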

Robustness was also assessed by perturbing questions while preserving their original intent. For the PDF dataset, the accuracy for perturbed questions was 85.6% with Llama-3.2 and 72.6% with GPT-3.5 Turbo, indicating the system’s ability to handle variations in queries.

This research highlights the practical utility of a knowledge graph-based approach for question answering, providing interpretable, ranked responses that can include partially correct or contextually relevant information. This makes it valuable for exploratory QA, entity discovery, and knowledge validation. The code for the experiment, methods, and dataset is publicly available for further exploration. You can read the full research paper here.

Nikhil Patel
https://blogs.edgentiq.com
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
