TLDR: This paper introduces “Millions of GeAR-s,” an extension of the GraphRAG system GeAR designed to scale to millions of documents. It proposes an online method that aligns retrieved passages with Wikidata triples, bypassing expensive offline LLM-based triple extraction. While the system performs well, the authors identify semantic misalignment between text and knowledge-graph data as a key challenge, highlighting the need for improved semantic models for large-scale GraphRAG.
Retrieval-augmented Generation (RAG) has significantly boosted the performance of Large Language Models (LLMs) in answering questions. While effective for simple, single-hop queries, tackling multi-hop questions, which require reasoning across multiple pieces of information, remains a significant challenge.
Recent advancements have explored graph-based RAG approaches, often called GraphRAG, which leverage structured information like entities and their relationships extracted from documents. These methods have shown impressive results on various multi-hop question answering datasets. However, a major hurdle for GraphRAG has been its scalability; these systems typically work well with datasets containing up to hundreds of thousands of passages, but struggle when faced with millions or even billions of documents.
A new research paper, titled “Millions of GeAR-s: Extending GraphRAG to Millions of Documents”, addresses this scalability issue. Authored by Zhili Shen, Chenxin Diao, Pascual Merita, Pavlos Vougiouklis, and Jeff Z. Pan, the paper details their efforts to adapt a state-of-the-art GraphRAG solution called GeAR to handle massive datasets, specifically for the SIGIR 2025 LiveRAG Challenge.
Traditional GraphRAG methods often rely on LLMs to extract knowledge triples (subject-predicate-object facts) from documents offline, which can be prohibitively expensive and time-consuming for web-scale corpora. The authors of this paper propose a novel approach to bypass this costly offline triple extraction step entirely. Instead, their adapted GeAR system iteratively pseudo-aligns passages retrieved during a baseline retrieval step (like BM25) with triples from an existing external knowledge graph, such as Wikidata.
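The idea of matching retrieved passages against existing knowledge-graph triples online can be sketched as follows. This is an illustrative toy, not the paper's implementation: a real system would use a learned dense encoder over a full Wikidata index, whereas here a simple token-overlap (Jaccard) score stands in for the similarity function, and the triples, passage, and threshold are all made up for the example.

```python
# Hypothetical sketch of online pseudo-alignment: match a retrieved passage
# to the most similar triple from an external KG, instead of extracting
# triples from the passage offline with an LLM.

def verbalize(triple):
    """Render a (subject, predicate, object) triple as plain text for matching."""
    s, p, o = triple
    return f"{s} {p} {o}"

def jaccard(a, b):
    """Token-overlap similarity; a stand-in for a real semantic encoder."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pseudo_align(passage, triples, threshold=0.1):
    """Return the best-matching triple for a passage, or None if too dissimilar."""
    scored = [(jaccard(passage, verbalize(t)), t) for t in triples]
    best_score, best_triple = max(scored)
    return best_triple if best_score >= threshold else None

# Toy KG triples and a toy passage (illustrative data only).
triples = [
    ("Pacific geoduck", "instance of", "taxon"),
    ("Pacific oyster", "habitat", "estuary"),
]
passage = "The Pacific geoduck is a large saltwater clam, a taxon of burrowing bivalves."
print(pseudo_align(passage, triples))
# → ('Pacific geoduck', 'instance of', 'taxon')
```

The appeal of this design is that the expensive step (triple extraction) is replaced by a lookup against a graph that already exists, so no per-document LLM calls are needed at indexing time.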
This online alignment strategy lets the system expand the aligned triples into candidate reasoning chains, which are then used to retrieve additional passages along more distant reasoning paths relevant to the original question. The system uses Falcon-3B-Instruct as a “knowledge synchroniser” and for key steps such as query rewriting and answer generation.
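Expanding a triple into candidate reasoning chains amounts to walking outward from its entities in the knowledge graph. The sketch below is a hypothetical illustration of that idea with a toy in-memory graph and a breadth-first walk; the entity names, edges, and `max_hops` parameter are all assumptions for the example, not the paper's actual data structures.

```python
# Hypothetical sketch: expand a seed entity into multi-hop reasoning chains
# (lists of triples) by breadth-first traversal of a toy knowledge graph.

from collections import deque

# Toy KG: subject -> list of (predicate, object) edges (illustrative only).
KG = {
    "Pacific geoduck": [("instance of", "taxon"), ("found in", "Puget Sound")],
    "Puget Sound": [("located in", "Washington")],
}

def expand_chains(seed_entity, max_hops=2):
    """Enumerate reasoning chains of up to max_hops triples from seed_entity."""
    chains = []
    queue = deque([(seed_entity, [])])
    while queue:
        entity, chain = queue.popleft()
        if len(chain) >= max_hops:
            continue
        for pred, obj in KG.get(entity, []):
            new_chain = chain + [(entity, pred, obj)]
            chains.append(new_chain)
            queue.append((obj, new_chain))  # keep walking from the object
    return chains

for chain in expand_chains("Pacific geoduck"):
    print(" -> ".join(f"({s}, {p}, {o})" for s, p, o in chain))
```

Each chain's verbalized form can then serve as an expanded query, which is how a triple like (geoduck, found in, Puget Sound) can lead retrieval toward passages about Puget Sound that the original question never mentioned.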
The researchers evaluated their submission, “Graph-Enhanced RAG,” and achieved correctness and faithfulness scores of 0.875714 and 0.529335, respectively. A crucial observation from their experiments was the potential for misalignment when linking proximal triples from FineWeb passages to Wikidata triples. For instance, a topic might shift from ‘pacific geoducks’ to ‘pacific oyster’ after linking, indicating a divergence in subject matter.
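The geoduck-to-oyster drift is easy to reproduce with a surface-level matcher, which may be part of why it occurs: the two labels share a token, so any lexical similarity measure scores them as related even though they name different organisms. The snippet below is purely illustrative of that hazard, using the same toy token-overlap score as a stand-in for whatever matcher a real system uses.

```python
# Illustrative only: lexically similar entity labels can name different things,
# so a surface-level matcher may link "Pacific geoduck" to "Pacific oyster".

def token_overlap(a, b):
    """Jaccard similarity over lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

print(token_overlap("Pacific geoduck", "Pacific oyster"))
# → 0.3333... : a nontrivial score for two unrelated organisms
```

This is exactly the gap the authors point to: a shared semantic space for graph labels and text would need to keep such pairs apart where token overlap cannot.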
This misalignment highlights a limitation in the current framework and underscores the need for more advanced asymmetric semantic models. These models would be capable of operating within a shared semantic space for both graph data and text, which is essential for extending the benefits of GraphRAG to large-scale applications. The paper concludes by emphasizing that while GraphRAG methods excel in multi-hop reasoning, their widespread adoption for massive datasets requires further innovation in how knowledge graphs and textual passages are aligned and understood. You can find the full paper here: Millions of GeAR-s.


