TLDR: The CHECK framework improves Large Language Models’ ability to answer multi-hop questions with updated information. It applies semantic analysis, analogous to a compiler’s type checking, to verify that the reasoning chains LLMs generate are logically consistent. By identifying and repairing inconsistencies in these chains, CHECK achieves, on average, 22.8% higher accuracy on multi-hop question answering than existing knowledge editing methods.
Large Language Models (LLMs) are incredibly powerful, trained on vast amounts of data that give them both language understanding and broad factual knowledge. They power many applications, from chatbots to question-answering systems. A significant challenge arises, however, when the factual information stored within these models becomes outdated: retraining an LLM from scratch is extremely expensive and environmentally burdensome.
This is where Knowledge Editing (KE) comes in. KE aims to update specific pieces of information in an LLM without requiring a full retraining. While existing KE methods have been successful for simple factual queries, they often struggle with more complex tasks that require “compositional reasoning,” such as multi-hop question answering (MQA). MQA involves answering questions that require multiple steps of reasoning, where intermediate facts might have been updated. For example, “What is the country of citizenship of the author of Harry Potter?” requires first identifying the author and then retrieving that person’s citizenship; if the citizenship fact has been edited, the model must apply the updated value at the second hop.
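To make the setup concrete, here is a minimal sketch in Python of how an edit and a decomposed question can be represented. The (subject, relation, new object) triple is the standard KE formulation, but the class and variable names, and the counterfactual edit itself, are illustrative rather than taken from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    """An updated fact stored as a (subject, relation, new_object) triple."""
    subject: str
    relation: str
    new_object: str

# A hypothetical counterfactual edit to the citizenship fact:
edit_bank = [Edit("J.K. Rowling", "country of citizenship", "Ireland")]

# The two-hop question decomposes into a chain of relations:
#   "Harry Potter" --author--> ? --country of citizenship--> ?
# A correct answer must apply the edit at the second hop, not the first.
chain_relations = ["author", "country of citizenship"]
```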
The issue with current knowledge editors for MQA is that they typically break multi-hop problems down into simpler, single-hop parts. While this decomposition seems intuitive, it can lead to illogical reasoning and the accidental use of irrelevant edited facts. Explicit decomposition can also introduce errors, loss of context, and even hallucinations from the LLM.
Introducing CHECK: A Semantic Analysis Approach
Researchers Dominic Simon and Rickard Ewetz from the University of Florida have proposed a novel knowledge editor for MQA called CHECK. This framework is inspired by an analogy between compilers and how LLMs reason. Just as source code is semantically analyzed (like type checking) before being compiled and executed, CHECK proposes to semantically analyze the reasoning chains generated by LLMs before they are used to answer questions. Reasoning chains that contain semantic errors are then revised to ensure consistency, either through logic optimization or by re-prompting the LLM at a higher “temperature” (which encourages more varied responses).
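At a high level, that analyze-then-repair loop might look like the following sketch. The callables (`decompose`, `align`, `resolve`) are placeholders for the steps detailed in the next section, and the temperature schedule is an assumption rather than the paper’s exact procedure:

```python
def check_answer(question, llm, edit_bank, decompose, align, resolve,
                 max_attempts=3):
    """Sketch of CHECK's analyze-then-repair loop (names are illustrative)."""
    temperature = 0.0
    for _ in range(max_attempts):
        chain = decompose(question, llm, temperature)  # LLM proposes a chain
        repaired = align(chain)                        # semantic "type check"
        if repaired is not None:
            return resolve(repaired, llm, edit_bank)   # answer hop by hop
        temperature = min(temperature + 0.5, 1.0)      # retry, more varied
    return None  # give up: no consistent chain found
```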
The core idea behind CHECK is to “type check” the reasoning process. Each step, or “hop,” in a multi-hop question is assigned a type – such as person, place, or thing. CHECK then verifies that the input and output types within each hop of a reasoning chain are consistent. If inconsistencies are found, the framework attempts to repair the chain by rearranging relationships or by asking the LLM to generate a new chain.
How CHECK Works
The CHECK framework involves three main steps:
First, Type Extraction identifies whether entities are persons, places, or things. Relationships between these entities also have predefined input and output types. This helps CHECK understand the expected flow of information.
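A minimal illustration of such type tables follows; the entries are examples, not the paper’s actual inventories:

```python
# Illustrative type inventories; the paper's actual tables may differ.
ENTITY_TYPE = {
    "J.K. Rowling": "person",
    "United Kingdom": "place",
    "Harry Potter": "thing",
}

# Each relation has a fixed signature: (expected input type, output type).
RELATION_SIGNATURE = {
    "author": ("thing", "person"),              # a book maps to its writer
    "born in": ("person", "place"),
    "country of citizenship": ("person", "place"),
    "capital of": ("place", "place"),
}
```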
Second, Question Decomposition breaks down the multi-hop question into a chain of relationships. This chain is then checked for “alignment.” If the order of relationships doesn’t logically connect (e.g., a “born in” relationship expects a person and outputs a place, but the next relationship expects a thing), CHECK attempts to rearrange the chain to make it coherent. If an aligned chain cannot be found, the LLM is prompted again with a higher temperature to encourage a different decomposition.
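Here is one way the alignment check and repair could be implemented, reusing the relation signatures from the sketch above. The brute-force permutation search is one reading of “rearranging the chain,” not necessarily the paper’s exact logic optimization:

```python
from itertools import permutations

# Signatures reused from the sketch above; illustrative, not the paper's.
RELATION_SIGNATURE = {
    "author": ("thing", "person"),
    "country of citizenship": ("person", "place"),
}

def is_aligned(relations, start_type):
    """Check that each relation's output type feeds the next one's input."""
    current = start_type
    for rel in relations:
        expected_in, output = RELATION_SIGNATURE[rel]
        if expected_in != current:
            return False
        current = output
    return True

def align(relations, start_type):
    """Return a type-consistent ordering of the relations, if one exists."""
    for candidate in permutations(relations):
        if is_aligned(candidate, start_type):
            return list(candidate)
    return None  # caller re-prompts the LLM at a higher temperature

# A mis-ordered decomposition of the Harry Potter question gets repaired:
print(align(["country of citizenship", "author"], start_type="thing"))
# -> ['author', 'country of citizenship']
```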
Third, Subquestion Resolution iteratively traverses this aligned relationship chain. For each step, CHECK compares the current entity and relationship against a bank of stored factual edits. If a sufficiently similar edit is found, the edited information is used as the answer for that hop. If no relevant edit is found, the LLM is prompted to generate and answer a question based on the current entity and relationship. This process continues until the final answer to the multi-hop question is obtained.
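In code, the resolution loop might look like this sketch, where `llm` is any callable that answers a single-hop question, edits are the (subject, relation, new_object) triples from the first sketch, and `lookup_edit` is the retrieval step sketched after the next paragraph:

```python
def resolve(entity, relations, llm, edit_bank, lookup_edit):
    """Walk an aligned relation chain, preferring stored edits over the LLM.

    All names are illustrative; `lookup_edit` is the similarity-based
    retrieval sketched below.
    """
    for rel in relations:
        edit = lookup_edit(entity, rel, edit_bank)
        if edit is not None:
            entity = edit.new_object        # a stored edit overrides the model
        else:
            prompt = f"What is the {rel} of {entity}? Answer with the entity only."
            entity = llm(prompt).strip()    # no relevant edit: ask the LLM
    return entity  # the final hop's answer answers the whole question
```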
A key improvement in CHECK’s edit retrieval is its use of cosine similarity for comparing embeddings of subjects and relationships, which provides a clearer separation between exact and similar matches compared to previous methods.
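A minimal version of that retrieval might look like the following, assuming a generic sentence encoder `embed` (any function mapping text to a vector) and an illustrative 0.9 threshold:

```python
import numpy as np

def make_lookup_edit(embed, threshold=0.9):
    """Build an edit-retrieval function around a sentence encoder.

    `embed` is any callable mapping text to a vector; the threshold
    is illustrative, not taken from the paper.
    """
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def lookup_edit(entity, relation, edit_bank):
        query = embed(f"{entity} {relation}")
        best, best_sim = None, threshold
        for edit in edit_bank:
            sim = cosine(query, embed(f"{edit.subject} {edit.relation}"))
            if sim >= best_sim:
                best, best_sim = edit, sim
        return best  # None if nothing clears the threshold

    return lookup_edit
```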
Impressive Results
CHECK was evaluated against five state-of-the-art frameworks on four datasets, including several subsets of MQuAKE, and showed a significant improvement in multi-hop question answering accuracy. Across LLMs such as GPT-J, Vicuna-7B, and Falcon-7B, CHECK consistently outperformed the other methods, improving MQA accuracy by 22.8% on average. On the MQuAKE-CF-3k dataset, for instance, it achieved a 31.57% accuracy gain over the next best approach.
The research also explored how the number of “hops” (steps of reasoning) and “edits” (pieces of updated information) affected CHECK’s accuracy. As expected, accuracy decreased with more hops and more edits, as these make questions more complex. However, CHECK still maintained strong performance even with increased complexity.
This work demonstrates that semantically analyzing the reasoning process of knowledge editors is a highly effective way to improve the accuracy of LLMs when handling complex, multi-hop questions with updated information. You can read the full research paper here: Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis.