TLDR: This research explores using Large Language Models (LLMs) to generate human-readable explanations for complex, component-based Question Answering (QA) systems. Focusing on the observable data flows (SPARQL queries as input, RDF triples as output) within the Qanary framework, the study compares LLM-generated explanations with traditional template-based methods. The findings indicate that LLMs, particularly GPT-4, produce higher quality and more useful explanations, significantly improving the transparency and trustworthiness of AI-driven components.
Software systems, especially those powered by Artificial Intelligence (AI), have become highly complex. Their decision-making processes often remain opaque, which erodes user trust and makes it hard for developers to trace system behavior. The problem is particularly pronounced in component-based systems, where individual AI-driven modules encapsulate their internal logic.
A recent research paper, “Towards LLM-generated Explanations for Component-based Knowledge Graph Question Answering Systems,” by Dennis Schiese, Aleksandr Perevalov, and Andreas Both, tackles this issue. The authors propose to improve the explainability of component-based Question Answering (QA) systems by using the systems’ data flows, specifically the inputs expressed as SPARQL queries and the outputs expressed as RDF triples, to generate clear, natural-language explanations of what each component does.
The researchers highlight that component-based systems, despite their complexity, offer a unique advantage for explainability. By breaking down processes into separate stages, it becomes possible to provide more detailed explanations for each step. The study uses the Qanary framework, a component-based QA system, as a practical case study. In this framework, components explicitly represent their input data as SPARQL queries and output data as RDF triples, making the data flow transparent and ripe for explanation.
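To make the idea of an observable data flow concrete, here is a minimal Python sketch of what a recorded trace for a single component might look like. The class name `ComponentTrace`, the `ex:` vocabulary, and the component name are illustrative assumptions; they are not taken from the paper or from Qanary's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A triple is modelled here as a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


@dataclass
class ComponentTrace:
    """Hypothetical record of one component's observable data flow."""
    component: str                      # name of the component that ran
    input_query: str                    # SPARQL query the component used to read its input
    output_triples: List[Triple] = field(default_factory=list)  # triples it wrote back


# Illustrative values only; the ex: prefix is a placeholder vocabulary.
trace = ComponentTrace(
    component="ExampleNamedEntityLinker",
    input_query="SELECT ?question WHERE { ?question a ex:Question }",
    output_triples=[
        ("ex:annotation1", "ex:hasBody", "http://dbpedia.org/resource/Berlin"),
    ],
)
```

A record like this is all an explanation generator needs: it never has to look inside the component itself.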
Two Approaches to Explanation Generation
The paper explores two primary methods for verbalizing these data flows: a traditional template-based approach and a more advanced Large Language Model (LLM)-based approach. The template-based method relies on pre-defined templates with placeholders, which are filled with specific data from the system. While straightforward, this method can be rigid and costly to maintain or extend for new data types.
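As a rough illustration of the template-based idea, the sketch below fills a fixed sentence with values from one annotation. The template wording, the field names, and the component name are assumptions made for this example; the paper's actual templates are not reproduced here.

```python
from string import Template

# Hypothetical template with named placeholders.
ANNOTATION_TEMPLATE = Template(
    "The component $component annotated characters $start to $end "
    "of the question with $resource (confidence $score)."
)


def explain_annotation(annotation: dict) -> str:
    """Fill the fixed template with the values of one annotation record."""
    # substitute() raises KeyError if a placeholder has no matching value.
    return ANNOTATION_TEMPLATE.substitute(annotation)


# Illustrative annotation; not real system output.
annotation = {
    "component": "ExampleNEDComponent",
    "start": 12,
    "end": 18,
    "resource": "http://dbpedia.org/resource/Berlin",
    "score": 0.93,
}
explanation = explain_annotation(annotation)
print(explanation)
```

The rigidity mentioned above shows up quickly: an annotation with a missing or differently named field makes `substitute` fail, and every new data type needs its own hand-written template.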
In contrast, the LLM-based approach utilizes powerful models like OpenAI’s GPT-3.5 and GPT-4. These models are capable of generating human-readable text from structured data, offering greater flexibility and automation. The researchers designed specific prompt templates for both input (SPARQL queries) and output (RDF triples) data to guide the LLMs in creating relevant explanations.
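A prompt template of this kind might be assembled as in the following sketch. The instruction text and the `build_prompt` helper are hypothetical and do not reproduce the authors' prompts; the code only builds the prompt string and makes no call to any model API.

```python
from typing import Sequence, Tuple

# Hypothetical instruction; the paper's real prompt wording may differ.
INSTRUCTION = (
    "Explain in one or two plain-English sentences "
    "what data the following SPARQL request retrieves."
)


def build_prompt(query: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a zero-, one-, or few-shot prompt depending on how many examples are given.

    Each example is a (sparql_query, explanation) pair shown to the model
    before the query that actually needs explaining.
    """
    parts = [INSTRUCTION]
    for example_query, example_explanation in examples:
        parts.append(f"Query:\n{example_query}\nExplanation: {example_explanation}")
    # The final block is left open so the model completes the explanation.
    parts.append(f"Query:\n{query}\nExplanation:")
    return "\n\n".join(parts)


zero_shot = build_prompt("SELECT ?q WHERE { ?q a ex:Question }")
one_shot = build_prompt(
    "SELECT ?q WHERE { ?q a ex:Question }",
    examples=[("SELECT ?a WHERE { ?a a ex:Answer }", "It fetches all answers.")],
)
```

Passing an empty example list gives a zero-shot prompt, one pair gives one-shot, and several pairs give few-shot, so the same helper covers all three settings.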
Evaluation and Key Findings
To assess the effectiveness of their approach, the authors conducted both quantitative and qualitative evaluations. The qualitative evaluation involved human experts, with backgrounds in Question Answering and Linked Data, rating the correctness and usefulness of the explanations on a 5-point Likert scale. For output data explanations, a quantitative analysis was also performed to measure accuracy, especially concerning the correct recognition of components and the number of annotations.
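Ratings on a 5-point Likert scale are typically summarized per approach, for instance by averaging. The sketch below shows that arithmetic; the approach labels and the numbers are made-up placeholders, not results from the paper, and the paper may aggregate its ratings differently.

```python
from statistics import mean
from typing import Dict, List


def summarize_ratings(ratings: Dict[str, List[int]]) -> Dict[str, float]:
    """Return the mean Likert rating per approach, rejecting out-of-range values."""
    summary = {}
    for approach, scores in ratings.items():
        if any(score < 1 or score > 5 for score in scores):
            raise ValueError(f"{approach}: Likert ratings must be between 1 and 5")
        summary[approach] = mean(scores)
    return summary


# Placeholder ratings for illustration only; NOT data from the study.
placeholder_ratings = {
    "approach_a": [3, 4, 2, 3],
    "approach_b": [4, 5, 4, 4],
}
summary = summarize_ratings(placeholder_ratings)
```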
Overall, LLM-generated explanations outperformed the template-based baseline. For input data, all generative explanations achieved better results than their template-based counterparts. The differences between the zero-, one-, and few-shot LLM approaches were sometimes small, but providing more examples generally improved performance. Neither model dominated across the board: GPT-3.5 was sometimes rated as more useful, while GPT-4 scored higher on both correctness and usefulness in other scenarios.
For output data, the quantitative evaluation showed that GPT-4 significantly improved results, particularly for certain data types, which points to better recognition and processing of grounded RDF triples. The human expert evaluation reinforced these findings: LLM-generated explanations were rated as comparable or superior to template-based ones in both usefulness and correctness.
Broader Implications
The research concludes that LLMs are highly suitable for automatically generating human-readable explanations for complex system behaviors. This approach is not limited to Question Answering systems but offers a feasible method to explain the behavior of various component-based systems by establishing a semantic layer for their input and output data. This minimally invasive recording of data flows has wide applicability, enabling experts to gain a step-by-step understanding of how components process information.
The findings underscore the immense potential of using LLMs to make complex AI systems more transparent and trustworthy, addressing a critical need in the rapidly advancing field of Artificial Intelligence. For more detailed information, you can refer to the full research paper available at https://arxiv.org/pdf/2508.14553.


