TLDR: A study compared Retrieval-Augmented Generation (RAG) and GraphRAG for answering questions and locating the specific pages in a math textbook that support the answers. The researchers found that standard embedding-based RAG methods retrieved the correct page more accurately and produced higher-quality answers than GraphRAG, which often retrieved large amounts of irrelevant content. Re-ranking retrieved pages with an LLM showed mixed results and sometimes introduced inaccuracies. The study highlights the potential of RAG for educational tools but underscores the need for more refined methods for reliable page-level referencing.
In the evolving landscape of technology-enhanced learning, artificial intelligence, particularly large language models (LLMs), is increasingly being explored to help students find relevant information during their studies. While LLMs are powerful for general question-answering, they often struggle to align with the specific domain knowledge found in course materials like textbooks, sometimes leading to incorrect or ‘hallucinated’ information.
To address this, researchers from Carnegie Mellon University investigated two advanced AI approaches: Retrieval-Augmented Generation (RAG) and GraphRAG. Their goal was to see how effectively these systems could answer questions and, crucially, pinpoint the exact page in an undergraduate mathematics textbook where the answer could be found. This ‘page-level retrieval’ is vital for building reliable AI tutoring solutions that can provide students with precise references.
Understanding RAG and GraphRAG
RAG is a method that combines information retrieval with LLMs. When a student asks a question, a ‘retriever’ module first identifies relevant documents or pages from a database (like a textbook). This retrieved content is then fed to a ‘generator’ module (an LLM) along with the student’s query to produce an answer. This two-stage process helps ground the LLM’s responses in specific, verified information, reducing the chances of hallucination.
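To make the two-stage flow concrete, here is a minimal Python sketch. The bag-of-words retriever, the page snippets, and the prompt format are toy stand-ins of our own, not the embedding models or prompts evaluated in the study:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real RAG systems use a neural embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, pages: dict[int, str], k: int = 5) -> list[int]:
    # Stage 1 (retriever): rank textbook pages by similarity to the query.
    q = embed(query)
    ranked = sorted(pages, key=lambda p: cosine(q, embed(pages[p])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, pages: dict[int, str], top: list[int]) -> str:
    # Stage 2 (generator): the retrieved pages are handed to an LLM with the query.
    context = "\n\n".join(f"[page {p}] {pages[p]}" for p in top)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer and cite the page."

pages = {
    12: "A set is a collection of distinct objects, called its elements.",
    47: "A function f : X -> Y assigns to each element of X exactly one element of Y.",
}
query = "What is a function?"
print(build_prompt(query, pages, retrieve(query, pages, k=1)))
```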
GraphRAG is an extension of RAG that uses a knowledge graph. Instead of just indexing unstructured documents, GraphRAG builds a network of entities (concepts, objects) and their relationships extracted from the text. This structured approach aims to capture interconnected concepts and hierarchical knowledge, which could be particularly useful in subjects like mathematics where definitions and theorems are highly linked.
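For contrast, here is a toy illustration of the graph idea, with hand-written entities and relations standing in for the LLM-extracted ones a real GraphRAG pipeline would build:

```python
# Toy knowledge graph: entities as nodes, labeled relations as edges.
# Real GraphRAG extracts these automatically with an LLM; ours are hand-written.
graph = {
    "function": {"injective function": "has special case", "set": "is defined between"},
    "injective function": {"function": "is a kind of"},
    "set": {"element": "contains"},
}

def neighborhood(entity: str, hops: int = 1) -> set[str]:
    # Retrieval pulls in the entity plus everything reachable within `hops`
    # relations, which is how linked definitions and theorems come along together.
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph.get(e, {})} - seen
        seen |= frontier
    return seen

print(neighborhood("function"))          # {'function', 'injective function', 'set'} (set order varies)
print(neighborhood("function", hops=2))  # one more hop adds 'element' as well
```

Even in this tiny graph the retrieved neighborhood grows quickly with each hop, which foreshadows the over-retrieval issue discussed below.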
The Study’s Approach
The researchers curated a unique dataset of 477 question-answer pairs, each linked to a specific page from the textbook “An Infinite Descent into Pure Mathematics.” They then compared standard embedding-based RAG methods (which use numerical representations of text to find similar content) against GraphRAG. The evaluation focused on two key metrics: retrieval accuracy (whether the correct page was identified) and generated answer quality (measured by F1 scores, which assess how well the AI’s answer matches the correct answer).
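Both metrics are easy to state precisely. The sketch below assumes SQuAD-style token-overlap F1; the paper's exact scoring script may differ in details such as text normalization:

```python
from collections import Counter

def retrieval_accuracy(retrieved: list[list[int]], gold_pages: list[int]) -> float:
    # Fraction of questions whose gold page appears anywhere in the retrieved list.
    hits = sum(gold in pages for pages, gold in zip(retrieved, gold_pages))
    return hits / len(gold_pages)

def token_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1 between the generated answer and the reference answer.
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(retrieval_accuracy([[12, 47], [3, 9]], [47, 8]))  # 0.5: one hit out of two
print(round(token_f1("a function assigns one output",
                     "a function assigns exactly one output"), 2))  # 0.91
```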
Key Findings
Surprisingly, the study found that embedding-based RAG generally outperformed GraphRAG. For retrieval accuracy, one of the RAG models, ‘voyage-3-large’, achieved a remarkable 99% accuracy when allowed to retrieve up to 10 pages. GraphRAG also showed good accuracy (between 84% and 91%), but because it retrieves entity-level content rather than a ranked list of pages, the comparison is not strictly like-for-like.
In terms of generated answer quality, RAG approaches consistently improved F1 scores over a baseline LLM that used no retrieval. GraphRAG, however, scored lower than the embedding-based RAG models. This suggests that while GraphRAG is good at finding conceptually linked entities, that strength did not translate into more precise generated answers in this context.
The researchers also explored ‘re-ranking’ the retrieved pages with an LLM to improve performance. This yielded mixed results: in some cases it improved accuracy, but more often, especially with larger sets of retrieved documents, it hurt performance and sometimes caused the LLM to output incorrect or non-existent page numbers (hallucinations).
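To illustrate the failure mode, here is one hypothetical shape such an LLM re-ranker could take (the prompt wording and the guard are our assumptions, not the study's implementation); the guard simply discards any page number the model invents:

```python
def rerank_prompt(query: str, candidates: dict[int, str]) -> str:
    # Hypothetical prompt asking the LLM to order candidate pages by relevance.
    listing = "\n".join(f"page {p}: {text[:120]}" for p, text in candidates.items())
    return (
        f"Question: {query}\n\nCandidate pages:\n{listing}\n\n"
        "List the page numbers from most to least relevant, comma-separated."
    )

def parse_ranking(llm_output: str, candidates: dict[int, str]) -> list[int]:
    # Guard against hallucination: keep only page numbers that were actually
    # candidates, in the order the model listed them, with duplicates dropped.
    ranked: list[int] = []
    for token in llm_output.replace("page", " ").replace(",", " ").split():
        if token.isdigit() and int(token) in candidates and int(token) not in ranked:
            ranked.append(int(token))
    return ranked

candidates = {12: "Sets and their elements...", 47: "Functions between sets..."}
print(parse_ranking("47, 12, 99", candidates))  # [47, 12]: the invented 99 is discarded
```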
Why GraphRAG Underperformed
The study identified two main reasons for GraphRAG’s lower performance on this task. First, GraphRAG tended to retrieve an excessive amount of content, much of it irrelevant: on average about 46,949 tokens per question, versus roughly 3,743 tokens for a top-5 RAG retrieval. This ‘noise’ in the context window dilutes precision and makes it harder for the LLM to generate a focused answer.
Second, the page-level structure of the textbook, with each question tied to a single page, did not align well with GraphRAG’s entity-based representation. The mismatch forced the generative model to process unwieldy or redundant contexts, degrading the quality of its output.
Implications for AI Tutoring
This research highlights both the promise and the challenges of page-level retrieval in educational settings. While RAG-based systems can significantly improve answer quality for math textbook content, perfect page-level retrieval remains elusive. The findings emphasize the need for careful design choices, such as effective ‘chunking’ of material into pages and controlling the length of the context supplied to the LLM, to build reliable AI tutors that not only answer questions but also point students to the exact textbook pages.
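One concrete form of that context-length control is a token budget applied to the ranked pages. A minimal sketch, assuming whitespace tokenization (a production system would count tokens with the model's own tokenizer):

```python
def budgeted_context(ranked_pages: list[int], pages: dict[int, str],
                     max_tokens: int = 4000) -> str:
    # Add pages in rank order until the token budget is exhausted, so the
    # context stays focused instead of ballooning the way GraphRAG's did.
    parts, used = [], 0
    for p in ranked_pages:
        n = len(pages[p].split())
        if used + n > max_tokens:
            break
        parts.append(f"[page {p}] {pages[p]}")
        used += n
    return "\n\n".join(parts)

pages = {12: "A set is a collection of distinct objects.",
         47: "A function assigns to each input exactly one output."}
print(budgeted_context([47, 12], pages, max_tokens=10))  # only page 47 fits
```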
The researchers have open-sourced their dataset and code to encourage further research in this area. You can find more details about their work in the full paper: Comparing RAG and GraphRAG for Page-Level Retrieval Question Answering on Math Textbook.