Improving AI's Knowledge with Denoised Graphs: A New Approach for RAG Systems

TLDR: A new framework called DEG-RAG improves Retrieval-Augmented Generation (RAG) systems by cleaning up noisy knowledge graphs generated by large language models (LLMs). It uses entity resolution to remove duplicate information and triple reflection to filter out incorrect relationships, leading to smaller, higher-quality knowledge graphs that boost question-answering performance across various RAG variants and datasets.

Large Language Models (LLMs) have transformed natural language processing, but they often struggle with issues like generating incorrect information (hallucination), factual inaccuracies, and outdated knowledge. Retrieval-Augmented Generation (RAG) systems address these problems by giving LLMs access to external, up-to-date information.

Graph-based RAG takes this a step further by using knowledge graphs (KGs), which are structured networks of entities and their relationships. This allows LLMs to leverage rich connections for more precise and inferential responses, moving beyond isolated text chunks to understand inter-document relations.

However, a significant challenge arises because most Graph-based RAG systems rely on LLMs to automatically build these knowledge graphs. This often results in ‘noisy’ KGs filled with redundant entities and unreliable relationships. Imagine a knowledge graph where “LLMs,” “LLM,” “Large Language Models,” and “modelos de lenguaje grandes” all refer to the same concept but are stored as separate entries. This redundancy not only slows down retrieval and generation but also increases computational costs. Crucially, existing research hasn’t fully tackled this denoising problem for LLM-generated KGs.

Introducing DEG-RAG: Denoising Knowledge Graphs for Better RAG

A new framework, called DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), has been introduced to tackle these challenges. DEG-RAG aims to create more compact and higher-quality KGs by focusing on two main techniques:

1. Entity Resolution: This process eliminates redundant entities. It identifies and links records that refer to the same real-world object. For example, it would merge “ARIME methodology” into “ARIMA model” or “K-means Algorithm” into “Clustering models.”

2. Triple Reflection: This technique removes erroneous relationships. Since external documents can contain incorrect information, and LLMs can make mistakes during extraction, DEG-RAG uses an LLM as a ‘judge’ to predict a reliability score for each relationship (triple). Triples below a certain threshold are then filtered out.
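
The filtering step of triple reflection can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: `filter_triples`, `toy_judge`, the hard-coded scores, and the 0.5 threshold are all hypothetical. In a real system the judge would be an LLM call that returns a reliability score for each triple.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def filter_triples(
    triples: Iterable[Triple],
    judge: Callable[[Triple], float],
    threshold: float = 0.5,  # assumed value; the article does not state one
) -> List[Triple]:
    """Keep only triples whose judged reliability score meets the threshold."""
    return [t for t in triples if judge(t) >= threshold]


def toy_judge(triple: Triple) -> float:
    """Stand-in for an LLM judge: looks up a hard-coded score per triple."""
    scores = {
        ("ARIMA model", "used_for", "time series forecasting"): 0.9,
        ("ARIMA model", "invented_by", "K-means Algorithm"): 0.1,
    }
    return scores.get(triple, 0.0)  # unknown triples are treated as unreliable


triples = [
    ("ARIMA model", "used_for", "time series forecasting"),
    ("ARIMA model", "invented_by", "K-means Algorithm"),
]
kept = filter_triples(triples, toy_judge, threshold=0.5)
print(kept)
```

Passing the judge in as a function keeps the filter independent of any particular LLM provider, so the threshold can be tuned without touching the scoring logic.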

How Entity Resolution Works

Entity resolution in DEG-RAG involves several steps:

Blocking: To save computational resources, entities are first grouped into ‘blocks’ where they are more likely to be matched. This can be done based on semantic similarity (grouping entities with similar meanings), entity type (grouping entities of the same category), or structural similarity (grouping entities that share common neighbors in the graph).
Matching and Grouping: Within each block, entities that represent the same concept are identified and grouped together. This involves comparing entity embeddings (numerical representations) generated by various methods, including traditional KG embeddings (like ComplEx) or LLM embeddings. Different similarity metrics are used, such as comparing the entities themselves (ego node similarity), their neighbors (neighbor similarity), or a combination of both.
Merging or Linking: Once groups of equivalent entities are identified, the knowledge graph is updated. The most effective strategy found is ‘Direct Merging,’ where all redundant entities within a group are consolidated into a single, canonical entity. Their descriptions are combined (and summarized if too long), and their relationships are reconnected to the canonical entity. This significantly reduces redundancy and storage costs.
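
The three steps above can be sketched as one small pipeline. The code below is a simplified illustration under several assumptions, not the DEG-RAG implementation: it blocks by entity type, matches with cosine similarity on toy two-dimensional embeddings, groups matches with union-find, and then merges directly. The `Entity` class, the `resolve` function, the 0.9 threshold, the choice of the first-listed entity as canonical, and the removal of self-loops after merging are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Entity:
    name: str  # assumed unique within the graph
    etype: str
    description: str
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def resolve(entities: List[Entity], triples: List[Triple], threshold: float = 0.9):
    # Blocking: only entities of the same type are compared with each other.
    blocks: Dict[str, List[Entity]] = defaultdict(list)
    for e in entities:
        blocks[e.etype].append(e)

    # Matching and grouping: union-find over pairs above the threshold.
    parent = {e.name: e.name for e in entities}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for block in blocks.values():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                if cosine(a.embedding, b.embedding) >= threshold:
                    parent[find(b.name)] = find(a.name)

    groups: Dict[str, List[Entity]] = defaultdict(list)
    for e in entities:
        groups[find(e.name)].append(e)

    # Direct merging: the first-listed member becomes the canonical entity
    # (an assumption) and the group's descriptions are concatenated.
    canonical_of: Dict[str, str] = {}
    merged: List[Entity] = []
    for members in groups.values():
        head = members[0]
        text = " ".join(m.description for m in members)
        merged.append(Entity(head.name, head.etype, text, head.embedding))
        for m in members:
            canonical_of[m.name] = head.name

    # Reconnect relations to canonical entities, dropping duplicates and
    # self-loops created by the merge (also an assumption).
    rewired: List[Triple] = []
    seen = set()
    for h, r, t in triples:
        new = (canonical_of.get(h, h), r, canonical_of.get(t, t))
        if new[0] != new[2] and new not in seen:
            seen.add(new)
            rewired.append(new)
    return merged, rewired


entities = [
    Entity("Large Language Models", "concept", "Neural text models.", [1.0, 0.0]),
    Entity("LLMs", "concept", "Abbreviation for large language models.", [0.99, 0.1]),
    Entity("ARIMA model", "method", "A time series forecasting model.", [0.0, 1.0]),
]
triples = [
    ("LLMs", "struggle_with", "hallucination"),
    ("Large Language Models", "struggle_with", "hallucination"),
    ("ARIMA model", "used_for", "forecasting"),
]
merged, rewired = resolve(entities, triples)
print([e.name for e in merged])
print(rewired)
```

Because comparisons happen only inside each block, the number of pairwise similarity checks grows with the size of the largest block rather than with the size of the whole graph, which is the point of blocking.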

Key Findings and Impact

In the reported experiments, removing approximately 40% of entities and 30-60% of relations from LLM-generated KGs consistently improved the question-answering performance of four popular Graph-based RAG approaches across diverse datasets. This suggests that the quality of a knowledge graph often matters more than its sheer size.

The research also provided valuable insights into the components of entity resolution:

Blocking: Entity type-based blocking proved to be the most effective strategy, suggesting that categorizing entities provides a strong foundation for identifying duplicates.
Embeddings: Surprisingly, traditional knowledge graph embeddings like ComplEx performed comparably to, and sometimes even better than, advanced LLM embeddings, especially in specific domains. This offers a viable alternative when LLM computational resources are limited.
Similarity: While ego node similarity (comparing the entities themselves) is crucial, incorporating information from an entity’s neighbors can further enhance performance.
Merging: Simple direct merging, which consolidates similar entities into one, generally outperformed merely linking them with ‘synonym’ relations, as it more effectively reduces redundancy.
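
One way to picture the combination of ego node and neighbor similarity is a weighted sum. The sketch below is an assumption made for illustration, not the formula used in the paper: it mean-pools the embeddings of each entity's neighbors and blends the two cosine scores with a hypothetical weight `alpha`. If either entity has no neighbors, it falls back to ego similarity alone.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors: List[List[float]]) -> Optional[List[float]]:
    """Element-wise mean of the vectors, or None if the list is empty."""
    if not vectors:
        return None
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def combined_similarity(
    a: str,
    b: str,
    embeddings: Dict[str, List[float]],
    neighbors: Dict[str, List[str]],
    alpha: float = 0.7,  # hypothetical weight on ego similarity
) -> float:
    ego = cosine(embeddings[a], embeddings[b])
    pooled_a = mean_vector([embeddings[n] for n in neighbors.get(a, [])])
    pooled_b = mean_vector([embeddings[n] for n in neighbors.get(b, [])])
    if pooled_a is None or pooled_b is None:
        return ego  # no neighbor information available
    return alpha * ego + (1 - alpha) * cosine(pooled_a, pooled_b)


embeddings = {
    "LLM": [1.0, 0.0],
    "LLMs": [1.0, 0.0],
    "hallucination": [0.0, 1.0],
    "RAG": [1.0, 1.0],
}
neighbors = {"LLM": ["hallucination"], "LLMs": ["hallucination"]}

same_concept = combined_similarity("LLM", "LLMs", embeddings, neighbors)
ego_only = combined_similarity("LLM", "RAG", embeddings, neighbors)
print(same_concept, ego_only)
```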

Furthermore, the study found that DEG-RAG is robust, with performance remaining strong even when up to 70% of entities were removed. This suggests that the framework can denoise KGs aggressively without harming the RAG system’s ability to answer questions, leading to much more compact and efficient knowledge bases.


Conclusion

The DEG-RAG framework offers a powerful solution to a critical problem in Graph-based RAG: the noise and redundancy in LLM-generated knowledge graphs. By systematically applying entity resolution and triple reflection, it significantly improves KG quality, reduces graph size, and enhances the overall performance of RAG systems. This work provides practical guidance for building better KGs and developing more efficient and accurate LLM applications. For more technical details, see the full paper.
