TLDR: This paper introduces “Millions of GeAR-s,” an extension of the GraphRAG system GeAR designed to scale to millions of documents. It proposes an online method that aligns retrieved passages with Wikidata triples, bypassing expensive offline LLM-based triple extraction. While the system performs well, the authors identify semantic misalignment between text and knowledge-graph data as a key challenge, highlighting the need for improved semantic models for large-scale GraphRAG.
Retrieval-augmented Generation (RAG) has significantly boosted the performance of Large Language Models (LLMs) in answering questions. While effective for simple, single-hop queries, tackling multi-hop questions, which require reasoning across multiple pieces of information, remains a significant challenge.
Recent advancements have explored graph-based RAG approaches, often called GraphRAG, which leverage structured information like entities and their relationships extracted from documents. These methods have shown impressive results on various multi-hop question answering datasets. However, a major hurdle for GraphRAG has been its scalability; these systems typically work well with datasets containing up to hundreds of thousands of passages, but struggle when faced with millions or even billions of documents.
A new research paper, titled “Millions of GeAR-s: Extending GraphRAG to Millions of Documents”, addresses this scalability issue. Authored by Zhili Shen, Chenxin Diao, Pascual Merita, Pavlos Vougiouklis, and Jeff Z. Pan, the paper details their efforts to adapt a state-of-the-art GraphRAG solution called GeAR to handle massive datasets, specifically for the SIGIR 2025 LiveRAG Challenge.
Traditional GraphRAG methods often rely on LLMs to extract knowledge triples (subject-predicate-object facts) from documents offline, which can be prohibitively expensive and time-consuming for web-scale corpora. The authors of this paper propose a novel approach to bypass this costly offline triple extraction step entirely. Instead, their adapted GeAR system iteratively pseudo-aligns passages retrieved during a baseline retrieval step (like BM25) with triples from an existing external knowledge graph, such as Wikidata.
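The idea of matching retrieved passages against existing knowledge-graph triples online can be sketched as follows. This is an illustrative toy, not the paper's implementation: a real system would use a learned dense encoder over a full Wikidata index, whereas here a simple token-overlap (Jaccard) score stands in for the similarity function, and the triples, passage, and threshold are all made up for the example.

```python
# Hypothetical sketch of online pseudo-alignment: match a retrieved passage
# to the most similar triple from an external KG, instead of extracting
# triples from the passage offline with an LLM.

def verbalize(triple):
    """Render a (subject, predicate, object) triple as plain text for matching."""
    s, p, o = triple
    return f"{s} {p} {o}"

def jaccard(a, b):
    """Token-overlap similarity; a stand-in for a real semantic encoder."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pseudo_align(passage, triples, threshold=0.1):
    """Return the best-matching triple for a passage, or None if too dissimilar."""
    scored = [(jaccard(passage, verbalize(t)), t) for t in triples]
    best_score, best_triple = max(scored)
    return best_triple if best_score >= threshold else None

# Toy KG triples and a toy passage (illustrative data only).
triples = [
    ("Pacific geoduck", "instance of", "taxon"),
    ("Pacific oyster", "habitat", "estuary"),
]
passage = "The Pacific geoduck is a large saltwater clam, a taxon of burrowing bivalves."
print(pseudo_align(passage, triples))
# → ('Pacific geoduck', 'instance of', 'taxon')
```

The appeal of this design is that the expensive step (triple extraction) is replaced by a lookup against a graph that already exists, so no per-document LLM calls are needed at indexing time.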
This online alignment strategy lets the system expand the aligned triples into candidate reasoning chains, which are then used to retrieve additional passages along more distant reasoning paths relevant to the original question. The system uses Falcon-3B-Instruct as a “knowledge synchroniser” and for key steps such as query rewriting and answer generation.
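Expanding a triple into candidate reasoning chains amounts to walking outward from its entities in the knowledge graph. The sketch below is a hypothetical illustration of that idea with a toy in-memory graph and a breadth-first walk; the entity names, edges, and `max_hops` parameter are all assumptions for the example, not the paper's actual data structures.

```python
# Hypothetical sketch: expand a seed entity into multi-hop reasoning chains
# (lists of triples) by breadth-first traversal of a toy knowledge graph.

from collections import deque

# Toy KG: subject -> list of (predicate, object) edges (illustrative only).
KG = {
    "Pacific geoduck": [("instance of", "taxon"), ("found in", "Puget Sound")],
    "Puget Sound": [("located in", "Washington")],
}

def expand_chains(seed_entity, max_hops=2):
    """Enumerate reasoning chains of up to max_hops triples from seed_entity."""
    chains = []
    queue = deque([(seed_entity, [])])
    while queue:
        entity, chain = queue.popleft()
        if len(chain) >= max_hops:
            continue
        for pred, obj in KG.get(entity, []):
            new_chain = chain + [(entity, pred, obj)]
            chains.append(new_chain)
            queue.append((obj, new_chain))  # keep walking from the object
    return chains

for chain in expand_chains("Pacific geoduck"):
    print(" -> ".join(f"({s}, {p}, {o})" for s, p, o in chain))
```

Each chain's verbalized form can then serve as an expanded query, which is how a triple like (geoduck, found in, Puget Sound) can lead retrieval toward passages about Puget Sound that the original question never mentioned.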
The researchers evaluated their submission, “Graph-Enhanced RAG,” and achieved correctness and faithfulness scores of 0.875714 and 0.529335, respectively. A crucial observation from their experiments was the potential for misalignment when linking proximal triples from FineWeb passages to Wikidata triples. For instance, a topic might shift from ‘pacific geoducks’ to ‘pacific oyster’ after linking, indicating a divergence in subject matter.
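The geoduck-to-oyster drift is easy to reproduce with a surface-level matcher, which may be part of why it occurs: the two labels share a token, so any lexical similarity measure scores them as related even though they name different organisms. The snippet below is purely illustrative of that hazard, using the same toy token-overlap score as a stand-in for whatever matcher a real system uses.

```python
# Illustrative only: lexically similar entity labels can name different things,
# so a surface-level matcher may link "Pacific geoduck" to "Pacific oyster".

def token_overlap(a, b):
    """Jaccard similarity over lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

print(token_overlap("Pacific geoduck", "Pacific oyster"))
# → 0.3333... : a nontrivial score for two unrelated organisms
```

This is exactly the gap the authors point to: a shared semantic space for graph labels and text would need to keep such pairs apart where token overlap cannot.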
This misalignment highlights a limitation in the current framework and underscores the need for more advanced asymmetric semantic models. These models would be capable of operating within a shared semantic space for both graph data and text, which is essential for extending the benefits of GraphRAG to large-scale applications. The paper concludes by emphasizing that while GraphRAG methods excel in multi-hop reasoning, their widespread adoption for massive datasets requires further innovation in how knowledge graphs and textual passages are aligned and understood. You can find the full paper here: Millions of GeAR-s.


