TLDR: A study compared Retrieval-Augmented Generation (RAG) and GraphRAG for answering questions and locating the specific pages in a math textbook that support the answers. The researchers found that standard embedding-based RAG methods retrieved the correct page more accurately and produced higher-quality answers than GraphRAG, which often retrieved large amounts of irrelevant content. Re-ranking retrieved pages with an LLM showed mixed results and sometimes introduced inaccuracies. The study highlights the potential of RAG for educational tools but underscores the need for more refined methods for reliable page-level referencing.
In the evolving landscape of technology-enhanced learning, artificial intelligence, particularly large language models (LLMs), is increasingly being explored to help students find relevant information during their studies. While LLMs are powerful for general question-answering, they often struggle to align with the specific domain knowledge found in course materials like textbooks, sometimes leading to incorrect or ‘hallucinated’ information.
To address this, researchers from Carnegie Mellon University investigated two advanced AI approaches: Retrieval-Augmented Generation (RAG) and GraphRAG. Their goal was to see how effectively these systems could answer questions and, crucially, pinpoint the exact page in an undergraduate mathematics textbook where the answer could be found. This ‘page-level retrieval’ is vital for building reliable AI tutoring solutions that can provide students with precise references.
Understanding RAG and GraphRAG
RAG is a method that combines information retrieval with LLMs. When a student asks a question, a ‘retriever’ module first identifies relevant documents or pages from a database (like a textbook). This retrieved content is then fed to a ‘generator’ module (an LLM) along with the student’s query to produce an answer. This two-stage process helps ground the LLM’s responses in specific, verified information, reducing the chances of hallucination.
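To make the two-stage flow concrete, here is a minimal Python sketch. The bag-of-words retriever, the page snippets, and the prompt format are toy stand-ins of our own, not the embedding models or prompts evaluated in the study:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real RAG systems use a neural embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, pages: dict[int, str], k: int = 5) -> list[int]:
    # Stage 1 (retriever): rank textbook pages by similarity to the query.
    q = embed(query)
    ranked = sorted(pages, key=lambda p: cosine(q, embed(pages[p])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, pages: dict[int, str], top: list[int]) -> str:
    # Stage 2 (generator): the retrieved pages are handed to an LLM with the query.
    context = "\n\n".join(f"[page {p}] {pages[p]}" for p in top)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer and cite the page."

pages = {
    12: "A set is a collection of distinct objects, called its elements.",
    47: "A function f : X -> Y assigns to each element of X exactly one element of Y.",
}
query = "What is a function?"
print(build_prompt(query, pages, retrieve(query, pages, k=1)))
```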
GraphRAG is an extension of RAG that uses a knowledge graph. Instead of just indexing unstructured documents, GraphRAG builds a network of entities (concepts, objects) and their relationships extracted from the text. This structured approach aims to capture interconnected concepts and hierarchical knowledge, which could be particularly useful in subjects like mathematics where definitions and theorems are highly linked.
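For contrast, here is a toy illustration of the graph idea, with hand-written entities and relations standing in for the LLM-extracted ones a real GraphRAG pipeline would build:

```python
# Toy knowledge graph: entities as nodes, labeled relations as edges.
# Real GraphRAG extracts these automatically with an LLM; ours are hand-written.
graph = {
    "function": {"injective function": "has special case", "set": "is defined between"},
    "injective function": {"function": "is a kind of"},
    "set": {"element": "contains"},
}

def neighborhood(entity: str, hops: int = 1) -> set[str]:
    # Retrieval pulls in the entity plus everything reachable within `hops`
    # relations, which is how linked definitions and theorems come along together.
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph.get(e, {})} - seen
        seen |= frontier
    return seen

print(neighborhood("function"))          # {'function', 'injective function', 'set'} (set order varies)
print(neighborhood("function", hops=2))  # one more hop adds 'element' as well
```

Even in this tiny graph the retrieved neighborhood grows quickly with each hop, which foreshadows the over-retrieval issue discussed below.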
The Study’s Approach
The researchers curated a unique dataset of 477 question-answer pairs, each linked to a specific page from the textbook “An Infinite Descent into Pure Mathematics.” They then compared standard embedding-based RAG methods (which use numerical representations of text to find similar content) against GraphRAG. The evaluation focused on two key metrics: retrieval accuracy (whether the correct page was identified) and generated answer quality (measured by F1 scores, which assess how well the AI’s answer matches the correct answer).
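Both metrics are easy to state precisely. The sketch below assumes SQuAD-style token-overlap F1; the paper's exact scoring script may differ in details such as text normalization:

```python
from collections import Counter

def retrieval_accuracy(retrieved: list[list[int]], gold_pages: list[int]) -> float:
    # Fraction of questions whose gold page appears anywhere in the retrieved list.
    hits = sum(gold in pages for pages, gold in zip(retrieved, gold_pages))
    return hits / len(gold_pages)

def token_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1 between the generated answer and the reference answer.
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(retrieval_accuracy([[12, 47], [3, 9]], [47, 8]))  # 0.5: one hit out of two
print(round(token_f1("a function assigns one output",
                     "a function assigns exactly one output"), 2))  # 0.91
```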
Key Findings
Surprisingly, the study found that embedding-based RAG generally outperformed GraphRAG. For retrieval accuracy, one of the RAG models, ‘voyage-3-large’, achieved a remarkable 99% accuracy when allowed to retrieve up to 10 pages. GraphRAG also showed good accuracy (between 84% and 91%), but because it retrieves entity-level content rather than a ranked list of pages, the comparison is not strictly like-for-like.
In terms of generated answer quality, RAG approaches consistently improved F1 scores over a baseline LLM that used no retrieval. GraphRAG, however, scored lower than the embedding-based RAG models. This suggests that while GraphRAG is good at finding conceptually linked entities, that strength did not translate into more precise generated answers in this context.
The researchers also explored ‘re-ranking’ the retrieved pages with an LLM to improve performance. This yielded mixed results: in some cases it improved accuracy, but more often, especially with larger sets of retrieved documents, it hurt performance and sometimes caused the LLM to output incorrect or non-existent page numbers (hallucinations).
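To illustrate the failure mode, here is one hypothetical shape such an LLM re-ranker could take (the prompt wording and the guard are our assumptions, not the study's implementation); the guard simply discards any page number the model invents:

```python
def rerank_prompt(query: str, candidates: dict[int, str]) -> str:
    # Hypothetical prompt asking the LLM to order candidate pages by relevance.
    listing = "\n".join(f"page {p}: {text[:120]}" for p, text in candidates.items())
    return (
        f"Question: {query}\n\nCandidate pages:\n{listing}\n\n"
        "List the page numbers from most to least relevant, comma-separated."
    )

def parse_ranking(llm_output: str, candidates: dict[int, str]) -> list[int]:
    # Guard against hallucination: keep only page numbers that were actually
    # candidates, in the order the model listed them, with duplicates dropped.
    ranked: list[int] = []
    for token in llm_output.replace("page", " ").replace(",", " ").split():
        if token.isdigit() and int(token) in candidates and int(token) not in ranked:
            ranked.append(int(token))
    return ranked

candidates = {12: "Sets and their elements...", 47: "Functions between sets..."}
print(parse_ranking("47, 12, 99", candidates))  # [47, 12]: the invented 99 is discarded
```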
Why GraphRAG Underperformed
The study identified two main reasons for GraphRAG’s lower performance on this task. First, GraphRAG tended to retrieve an excessive amount of content, much of it irrelevant: on average about 46,949 tokens per question, versus roughly 3,743 tokens for a top-5 RAG retrieval. This ‘noise’ in the context window dilutes precision and makes it harder for the LLM to generate a focused answer.
Second, the page-level structure of the textbook, with each question tied to a single page, did not align well with GraphRAG’s entity-based representation. The mismatch forced the generative model to process unwieldy or redundant contexts, degrading the quality of its output.
Implications for AI Tutoring
This research highlights both the promise and the challenges of page-level retrieval in educational settings. While RAG-based systems can significantly improve answer quality for math textbook content, perfect page-level retrieval remains elusive. The findings emphasize the need for careful design choices, such as effective ‘chunking’ of material into pages and controlling the length of the context supplied to the LLM, to build reliable AI tutors that not only answer questions but also point students to the exact textbook pages.
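One concrete form of that context-length control is a token budget applied to the ranked pages. A minimal sketch, assuming whitespace tokenization (a production system would count tokens with the model's own tokenizer):

```python
def budgeted_context(ranked_pages: list[int], pages: dict[int, str],
                     max_tokens: int = 4000) -> str:
    # Add pages in rank order until the token budget is exhausted, so the
    # context stays focused instead of ballooning the way GraphRAG's did.
    parts, used = [], 0
    for p in ranked_pages:
        n = len(pages[p].split())
        if used + n > max_tokens:
            break
        parts.append(f"[page {p}] {pages[p]}")
        used += n
    return "\n\n".join(parts)

pages = {12: "A set is a collection of distinct objects.",
         47: "A function assigns to each input exactly one output."}
print(budgeted_context([47, 12], pages, max_tokens=10))  # only page 47 fits
```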
The researchers have open-sourced their dataset and code to encourage further research in this area. You can find more details about their work in the full paper: Comparing RAG and GraphRAG for Page-Level Retrieval Question Answering on Math Textbook.