TLDR: This paper explores using large language models (LLMs) to generate natural language explanations for complex logical rules mined from knowledge graphs. By supplying variable entity types in the prompt and using Chain-of-Thought prompting, the researchers show that LLMs can produce accurate, clear explanations that make these rules easier for humans to understand. The study also compares several LLMs and explores using LLMs as judges of explanation quality.
Knowledge graphs, which store facts as networks of interconnected entities and relationships, are fundamental to many artificial intelligence applications. However, these vast repositories are often incomplete. A key challenge in enhancing them is inferring new facts and understanding the underlying logical rules that govern those inferences.
For instance, if a knowledge graph indicates that a woman is the mother of a child, it’s highly probable that her husband is the child’s father. Identifying such logical rules can significantly improve the completeness of a knowledge graph, help detect potential errors, reveal subtle data patterns, and enhance the overall capacity for reasoning and interpretation within AI systems.
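Written as a Horn rule of the kind that rule-mining systems produce, this pattern might look like the following (the predicate names are illustrative, not drawn from any particular dataset):

```latex
% Illustrative Horn rule: the body (left of the arrow) implies the head (right).
% If ?a is the mother of ?c and ?a is married to ?b, then ?b is likely ?c's father.
\mathit{motherOf}(?a, ?c) \land \mathit{marriedTo}(?a, ?b)
  \Rightarrow \mathit{fatherOf}(?b, ?c)
```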
Despite their utility, these logical rules can be very difficult for humans to understand. The difficulty stems from their abstract logical structure and from the labeling conventions peculiar to each knowledge graph. For example, predicates (the relationships between entities) in Freebase-derived datasets often take the form of long hierarchical paths, such as /film/actor/film./film/performance/film, which are hard to decipher without specialized background knowledge.
To address this challenge, researchers from the University of Texas at Arlington, Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, and Chengkai Li, have explored the potential of large language models (LLMs) to generate natural language explanations for these complex logical rules. Their work, titled “Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs,” is a pioneering effort in this area.
Unveiling the Research Approach
The team extracted logical rules using AMIE 3.5.1, the latest version of the AMIE rule discovery algorithm, released in 2024. They applied it to a widely used benchmark dataset, FB15k-237, and to two large-scale variants of Freebase, FB-CVT-REV and FB+CVT-REV. These datasets were chosen for their diverse relations and because they address data leakage issues found in earlier versions.
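As a rough illustration of this step: AMIE is distributed as a Java program that consumes a file of triples, so a mining run can be launched as sketched below. The jar name, input path, and threshold values are assumptions for illustration, not the paper's exact configuration.

```python
import subprocess

# Hypothetical invocation of the AMIE rule miner on a triples file.
# The jar name and file path are placeholders; -minhc (minimum head
# coverage) and -minpca (minimum PCA confidence) are quality thresholds
# documented in the AMIE repository, with illustrative values here.
subprocess.run(
    [
        "java", "-jar", "amie3.5.jar",   # placeholder jar name
        "fb15k-237-triples.tsv",         # one tab-separated triple per line
        "-minhc", "0.1",                 # minimum head coverage threshold
        "-minpca", "0.5",                # minimum PCA confidence threshold
    ],
    check=True,
)
```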
A particular challenge arose from “concatenated relations” in some of these datasets, where two underlying relations are merged into a single, very long label, such as /award/award_nominee/award_nominations./award/award_nomination/award_nominee. Labels like these can easily confuse language models, making it even harder to generate clear explanations.
Prompting Strategies and Model Evaluation
The researchers investigated various prompting strategies to guide the LLMs in generating explanations. They conducted their experiments in three phases:
Phase 1: Zero-Shot vs. Few-Shot Prompting
Initially, they compared zero-shot prompting (where the model receives no examples) with few-shot prompting (where the model is given a couple of example rule-explanation pairs). Using OpenAI’s GPT-3.5 Turbo, they found that providing examples in the few-shot approach did not lead to significant improvements in explanation quality over the zero-shot baseline.
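For a sense of how the two prompting styles differ, here is a minimal sketch using the OpenAI chat completions API. The prompt wording and the worked example pair are illustrative guesses, not the paper's actual templates.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RULE = "motherOf(?a, ?c) AND marriedTo(?a, ?b) => fatherOf(?b, ?c)"
TASK = "Explain the following logical rule in plain English:\n"

# Zero-shot: the model sees only the task and the rule.
zero_shot = [{"role": "user", "content": TASK + RULE}]

# Few-shot: the same request, preceded by a worked rule/explanation pair
# (the pair below is invented for illustration).
few_shot = [
    {"role": "user", "content": TASK + "spouseOf(?a, ?b) => spouseOf(?b, ?a)"},
    {"role": "assistant",
     "content": "If person A is the spouse of person B, then B is also A's spouse."},
    {"role": "user", "content": TASK + RULE},
]

for name, messages in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages
    )
    print(name, "->", response.choices[0].message.content)
```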
Phase 2: Utilizing Variable Entity Types
Recognizing that the model struggled to identify the entity types of rule variables on its own, the team integrated this information directly into the prompts. For example, if a rule involved a variable such as “?b”, the prompt would list its potential types (e.g., “/time/event” or “/sports/sports_championship_event”). This addition significantly improved the model’s ability to generate accurate explanations.
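A minimal sketch of this prompt augmentation, assuming a simple template (the rule below is hypothetical; the type labels echo the article's example):

```python
# Attach likely entity types to each rule variable before prompting.
# The rule is hypothetical; the type labels mirror the example above.
rule = "participatedIn(?a, ?b) => competedAt(?a, ?b)"  # hypothetical rule
variable_types = {
    "?b": ["/time/event", "/sports/sports_championship_event"],
}

# Render one type hint per variable, then append the hints to the task.
type_hints = "\n".join(
    f"{var} can be of type: {', '.join(types)}"
    for var, types in variable_types.items()
)

prompt = (
    "Explain the following logical rule in plain English.\n"
    f"Rule: {rule}\n"
    f"Variable type information:\n{type_hints}"
)
print(prompt)
```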
Phase 3: Comparing Models & Chain-of-Thought Prompting
Building on the success of incorporating variable entity types, the researchers further enhanced their approach with Chain-of-Thought (CoT) prompting. This strategy guides the LLM through a series of reasoning steps: parsing the rule, identifying components, determining relevant types for variables, interpreting each part of the rule, synthesizing the information, and finally generating a concise explanation. This phase also expanded the evaluation to include GPT-4o Mini and Gemini 2.0 Flash alongside GPT-3.5 Turbo.
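Those steps map naturally onto a prompt template. The sketch below is a plausible reconstruction of such a CoT prompt, not the paper's verbatim wording:

```python
# Chain-of-Thought template mirroring the six reasoning steps described
# above. A plausible reconstruction, not the paper's exact prompt.
COT_TEMPLATE = """Explain the logical rule below. Reason step by step:
1. Parse the rule into its body and head atoms.
2. Identify the components (relations and variables) of each atom.
3. Determine the relevant entity types of each variable from the hints given.
4. Interpret what each part of the rule asserts.
5. Synthesize the parts into what the rule as a whole implies.
6. Write one concise, plain-English explanation.

Rule: {rule}
Variable types: {types}
"""

print(COT_TEMPLATE.format(
    rule="motherOf(?a, ?c) AND marriedTo(?a, ?b) => fatherOf(?b, ?c)",
    types="?a: person; ?b: person; ?c: person",
))
```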
Key Findings and Future Directions
The human evaluation of the generated explanations focused on correctness (accuracy and logical order), clarity (ease of understanding), and the presence of missed or hallucinated entities and relations. The results were encouraging:
- The combination of Chain-of-Thought prompting and providing variable type information yielded the most accurate and readable explanations.
- Among the models tested, Gemini 2.0 Flash demonstrated the best overall performance, followed by GPT-4o Mini. GPT-3.5 Turbo also showed improved performance with CoT prompting.
- Models generally performed better on simpler rules (fewer components, binary relations) compared to more complex ones (three atoms, concatenated relations, or mediator nodes).
- The study also explored the concept of “LLM-as-a-judge,” in which LLMs themselves evaluate the quality of generated explanations (a minimal sketch of such a judge call follows this list). While some biases were observed (models tended to favor outputs from their own model family), the approach showed promise for scalable evaluation and for generating pseudo-ground-truth data for future model fine-tuning.
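As a rough sketch of what an LLM-as-a-judge call could look like: the rubric wording and the 1-to-5 scale below are assumptions; only the evaluation criteria (correctness, clarity, hallucinated or missed entities and relations) come from the study.

```python
from openai import OpenAI

client = OpenAI()

# Judge prompt covering the study's criteria. Rubric wording and the
# scoring scale are assumptions, not the paper's exact setup.
JUDGE_TEMPLATE = """You are grading an explanation of a logical rule.

Rule: {rule}
Explanation: {explanation}

Rate correctness and clarity from 1 (poor) to 5 (excellent), and state
yes/no whether the explanation hallucinates or misses any entities or
relations. Justify each score briefly."""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
        rule="motherOf(?a, ?c) AND marriedTo(?a, ?b) => fatherOf(?b, ?c)",
        explanation="If a woman is a child's mother and has a spouse, "
                    "that spouse is likely the child's father.",
    )}],
)
print(response.choices[0].message.content)
```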
This research marks a significant step towards making complex logical rules in knowledge graphs more understandable for humans. While challenges remain, particularly with highly complex rules, the findings highlight a promising direction for enhancing the interpretability and usability of knowledge graphs through natural language explanations generated by large language models.


