TLDR: The CHECK framework improves Large Language Models’ ability to answer multi-hop questions with updated information. It applies semantic analysis, analogous to a compiler’s type checking, to verify that the reasoning chains LLMs generate are logically consistent. By identifying and repairing inconsistencies in these chains, CHECK achieves, on average, 22.8% higher accuracy on multi-hop question answering than existing knowledge editing methods.
Large Language Models (LLMs) are incredibly powerful, trained on vast amounts of data that give them both language understanding and broad factual knowledge. They power many applications, from chatbots to question-answering systems. A significant challenge arises, however, when the factual information stored within these models becomes outdated: retraining an LLM from scratch is extremely expensive and environmentally burdensome.
This is where Knowledge Editing (KE) comes in. KE aims to update specific pieces of information in an LLM without requiring a full retraining. While existing KE methods have been successful for simple factual queries, they often struggle with more complex tasks that require “compositional reasoning,” such as multi-hop question answering (MQA). MQA involves answering questions that require multiple steps of reasoning, where intermediate facts might have been updated. For example, “What is the country of citizenship of the author of Harry Potter?” requires first identifying the author and then retrieving that person’s citizenship; if the citizenship fact has been edited, the model must apply the updated value at the second hop.
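To make the setup concrete, here is a minimal sketch in Python of how an edit and a decomposed question can be represented. The (subject, relation, new object) triple is the standard KE formulation, but the class and variable names, and the counterfactual edit itself, are illustrative rather than taken from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    """An updated fact stored as a (subject, relation, new_object) triple."""
    subject: str
    relation: str
    new_object: str

# A hypothetical counterfactual edit to the citizenship fact:
edit_bank = [Edit("J.K. Rowling", "country of citizenship", "Ireland")]

# The two-hop question decomposes into a chain of relations:
#   "Harry Potter" --author--> ? --country of citizenship--> ?
# A correct answer must apply the edit at the second hop, not the first.
chain_relations = ["author", "country of citizenship"]
```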
The issue with current knowledge editors for MQA is that they typically break multi-hop problems down into simpler, single-hop parts. While this decomposition seems intuitive, it can lead to illogical reasoning and the accidental use of irrelevant edited facts. Explicit decomposition can also introduce errors, loss of context, and even hallucinations from the LLM.
Introducing CHECK: A Semantic Analysis Approach
Researchers Dominic Simon and Rickard Ewetz from the University of Florida have proposed a novel knowledge editor for MQA called CHECK. This framework is inspired by an analogy between compilers and how LLMs reason. Just as source code is semantically analyzed (like type checking) before being compiled and executed, CHECK proposes to semantically analyze the reasoning chains generated by LLMs before they are used to answer questions. Reasoning chains that contain semantic errors are then revised to ensure consistency, either through logic optimization or by re-prompting the LLM at a higher “temperature” (which encourages more varied responses).
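At a high level, that analyze-then-repair loop might look like the following sketch. The callables (`decompose`, `align`, `resolve`) are placeholders for the steps detailed in the next section, and the temperature schedule is an assumption rather than the paper’s exact procedure:

```python
def check_answer(question, llm, edit_bank, decompose, align, resolve,
                 max_attempts=3):
    """Sketch of CHECK's analyze-then-repair loop (names are illustrative)."""
    temperature = 0.0
    for _ in range(max_attempts):
        chain = decompose(question, llm, temperature)  # LLM proposes a chain
        repaired = align(chain)                        # semantic "type check"
        if repaired is not None:
            return resolve(repaired, llm, edit_bank)   # answer hop by hop
        temperature = min(temperature + 0.5, 1.0)      # retry, more varied
    return None  # give up: no consistent chain found
```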
The core idea behind CHECK is to “type check” the reasoning process. Each step, or “hop,” in a multi-hop question is assigned a type – such as person, place, or thing. CHECK then verifies that the input and output types within each hop of a reasoning chain are consistent. If inconsistencies are found, the framework attempts to repair the chain by rearranging relationships or by asking the LLM to generate a new chain.
How CHECK Works
The CHECK framework involves three main steps:
First, Type Extraction identifies whether entities are persons, places, or things. Relationships between these entities also have predefined input and output types. This helps CHECK understand the expected flow of information.
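A minimal illustration of such type tables follows; the entries are examples, not the paper’s actual inventories:

```python
# Illustrative type inventories; the paper's actual tables may differ.
ENTITY_TYPE = {
    "J.K. Rowling": "person",
    "United Kingdom": "place",
    "Harry Potter": "thing",
}

# Each relation has a fixed signature: (expected input type, output type).
RELATION_SIGNATURE = {
    "author": ("thing", "person"),              # a book maps to its writer
    "born in": ("person", "place"),
    "country of citizenship": ("person", "place"),
    "capital of": ("place", "place"),
}
```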
Second, Question Decomposition breaks down the multi-hop question into a chain of relationships. This chain is then checked for “alignment.” If the order of relationships doesn’t logically connect (e.g., a “born in” relationship expects a person and outputs a place, but the next relationship expects a thing), CHECK attempts to rearrange the chain to make it coherent. If an aligned chain cannot be found, the LLM is prompted again with a higher temperature to encourage a different decomposition.
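Here is one way the alignment check and repair could be implemented, reusing the relation signatures from the sketch above. The brute-force permutation search is one reading of “rearranging the chain,” not necessarily the paper’s exact logic optimization:

```python
from itertools import permutations

# Signatures reused from the sketch above; illustrative, not the paper's.
RELATION_SIGNATURE = {
    "author": ("thing", "person"),
    "country of citizenship": ("person", "place"),
}

def is_aligned(relations, start_type):
    """Check that each relation's output type feeds the next one's input."""
    current = start_type
    for rel in relations:
        expected_in, output = RELATION_SIGNATURE[rel]
        if expected_in != current:
            return False
        current = output
    return True

def align(relations, start_type):
    """Return a type-consistent ordering of the relations, if one exists."""
    for candidate in permutations(relations):
        if is_aligned(candidate, start_type):
            return list(candidate)
    return None  # caller re-prompts the LLM at a higher temperature

# A mis-ordered decomposition of the Harry Potter question gets repaired:
print(align(["country of citizenship", "author"], start_type="thing"))
# -> ['author', 'country of citizenship']
```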
Third, Subquestion Resolution iteratively traverses this aligned relationship chain. For each step, CHECK compares the current entity and relationship against a bank of stored factual edits. If a sufficiently similar edit is found, the edited information is used as the answer for that hop. If no relevant edit is found, the LLM is prompted to generate and answer a question based on the current entity and relationship. This process continues until the final answer to the multi-hop question is obtained.
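In code, the resolution loop might look like this sketch, where `llm` is any callable that answers a single-hop question, edits are the (subject, relation, new_object) triples from the first sketch, and `lookup_edit` is the retrieval step sketched after the next paragraph:

```python
def resolve(entity, relations, llm, edit_bank, lookup_edit):
    """Walk an aligned relation chain, preferring stored edits over the LLM.

    All names are illustrative; `lookup_edit` is the similarity-based
    retrieval sketched below.
    """
    for rel in relations:
        edit = lookup_edit(entity, rel, edit_bank)
        if edit is not None:
            entity = edit.new_object        # a stored edit overrides the model
        else:
            prompt = f"What is the {rel} of {entity}? Answer with the entity only."
            entity = llm(prompt).strip()    # no relevant edit: ask the LLM
    return entity  # the final hop's answer answers the whole question
```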
A key improvement in CHECK’s edit retrieval is its use of cosine similarity for comparing embeddings of subjects and relationships, which provides a clearer separation between exact and similar matches compared to previous methods.
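A minimal version of that retrieval might look like the following, assuming a generic sentence encoder `embed` (any function mapping text to a vector) and an illustrative 0.9 threshold:

```python
import numpy as np

def make_lookup_edit(embed, threshold=0.9):
    """Build an edit-retrieval function around a sentence encoder.

    `embed` is any callable mapping text to a vector; the threshold
    is illustrative, not taken from the paper.
    """
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def lookup_edit(entity, relation, edit_bank):
        query = embed(f"{entity} {relation}")
        best, best_sim = None, threshold
        for edit in edit_bank:
            sim = cosine(query, embed(f"{edit.subject} {edit.relation}"))
            if sim >= best_sim:
                best, best_sim = edit, sim
        return best  # None if nothing clears the threshold

    return lookup_edit
```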
Impressive Results
CHECK was evaluated against five state-of-the-art frameworks on four datasets, including several subsets of MQuAKE, and showed a significant improvement in multi-hop question answering accuracy. Across LLMs such as GPT-J, Vicuna-7B, and Falcon-7B, CHECK consistently outperformed the other methods, improving MQA accuracy by 22.8% on average. On the MQuAKE-CF-3k dataset, for instance, it achieved a 31.57% accuracy gain over the next best approach.
The research also explored how the number of “hops” (steps of reasoning) and “edits” (pieces of updated information) affected CHECK’s accuracy. As expected, accuracy decreased with more hops and more edits, as these make questions more complex. However, CHECK still maintained strong performance even with increased complexity.
This work demonstrates that semantically analyzing the reasoning process of knowledge editors is a highly effective way to improve the accuracy of LLMs when handling complex, multi-hop questions with updated information. You can read the full research paper here: Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis.