TLDR: This paper presents an optimized pipeline for automatically constructing Educational Knowledge Graphs (EduKGs) from PDF learning materials. The initial pipeline, while functional, had low accuracy and efficiency. Through a series of optimizations including a worker-based architecture, offline data preprocessing, enhanced text extraction, improved concept annotation with disambiguation and pruning, and local concept expansion, the new pipeline achieves a 17.5% increase in accuracy and over a tenfold improvement in processing efficiency. This makes EduKG construction faster, more reliable, and adaptable for various educational platforms like MOOCs.
In the rapidly evolving landscape of digital education, organizing vast amounts of learning material into easily digestible and interconnected knowledge structures is crucial. This is where Educational Knowledge Graphs, or EduKGs, come into play. An EduKG is essentially a structured representation of educational content, where concepts (like “photosynthesis” or “algebra”) are nodes, and the relationships between them (like “is a prerequisite for” or “is related to”) are edges. These graphs are vital for personalized learning, adaptive education systems, and enhancing how we interact with learning materials, especially in platforms like Massive Open Online Courses (MOOCs) and Learning Management Systems (LMSs).
Traditionally, creating these knowledge graphs has been a manual, labor-intensive process, often requiring domain experts. However, with the explosion of digital content, there’s a growing need for automated, scalable methods. While machine learning and natural language processing (NLP) techniques have shown promise, and Large Language Models (LLMs) are emerging as powerful tools, they still face challenges like processing large datasets, ensuring accuracy, and managing computational resources.
A recent study by Qurat Ul Ain, Mohamed Amine Chatti, Jean Qussa, Amr Shakhshir, Rawaa Alatrash, and Shoeb Joarder introduces and optimizes a pipeline for the automatic construction of EduKGs from PDF learning materials. The initial pipeline was designed to generate slide-level EduKGs from individual pages, which are then merged to form a comprehensive EduKG for the entire learning material. This process was evaluated on their MOOC platform, CourseMapper.
The Initial EduKG Construction Pipeline
The original pipeline involved several key steps:
Text Extraction: This first step involved extracting text from PDF pages using PDFMiner, which attempts to preserve the logical reading order despite diverse PDF layouts.
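To make this concrete, here is a minimal sketch of page-wise (slide-wise) text extraction with pdfminer.six; the function name and usage are illustrative, not taken from the paper's codebase.

```python
# Minimal sketch: page-wise text extraction with pdfminer.six.
# Names such as extract_slide_texts are illustrative, not from the paper.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def extract_slide_texts(pdf_path: str) -> list[str]:
    """Return the text of each PDF page (slide) in reading order."""
    slides = []
    for page_layout in extract_pages(pdf_path):
        parts = [el.get_text() for el in page_layout if isinstance(el, LTTextContainer)]
        slides.append("".join(parts).strip())
    return slides

if __name__ == "__main__":
    for i, text in enumerate(extract_slide_texts("lecture.pdf"), start=1):
        print(f"--- Slide {i} ---\n{text[:200]}")
```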
Keyphrase Extraction: After text extraction, keyphrases were identified. The researchers compared various methods, including LLM-based approaches (Zero-Shot, Few-Shot, LLM-KeyBERT) and traditional techniques (SingleRank, PatternRank, SIFRank variants). While LLM-based methods, particularly Few-Shot learning, showed higher accuracy, traditional methods like SIFRank SqueezeBERT offered a better balance of accuracy and efficiency, especially when computational resources were a concern. The pipeline extracted 15 keyphrases per slide.
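The sketch below illustrates per-slide keyphrase extraction using KeyBERT as a readily available stand-in; the paper's SIFRank SqueezeBERT setup is not reproduced here, and the n-gram range and embedding model are assumptions.

```python
# Illustrative only: keyphrase extraction with KeyBERT as a stand-in for the
# paper's SIFRank SqueezeBERT variant; 15 phrases per slide as in the pipeline.
from keybert import KeyBERT

kw_model = KeyBERT(model="all-mpnet-base-v2")  # any sentence-transformers model works

def extract_keyphrases(slide_text: str, top_n: int = 15) -> list[str]:
    keyphrases = kw_model.extract_keywords(
        slide_text,
        keyphrase_ngram_range=(1, 3),  # allow multi-word phrases
        stop_words="english",
        top_n=top_n,
    )
    return [phrase for phrase, _score in keyphrases]
```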
Concept Identification: Keyphrases were then linked to external knowledge bases, specifically DBpedia, using DBpedia Spotlight to identify “Main Concepts” (MCs). To address potential inaccuracies from automated linking, a concept-weighting strategy was introduced. This strategy used transformer-based methods (wSBERT) to compute relevance based on cosine similarity between the learning material, Wikipedia articles, and slide text. These slide-level EduKGs were stored in a Neo4j database, allowing learners to access them even before the full material’s EduKG was complete.
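A simplified sketch of this step is shown below: it calls the public DBpedia Spotlight endpoint and weights each returned concept by cosine similarity with the slide text. The real pipeline's wSBERT weighting also uses the learning material and Wikipedia abstracts, so treat this as an approximation.

```python
# Sketch: concept annotation via DBpedia Spotlight plus a simplified SBERT
# weighting (cosine similarity between concept surface form and slide text).
import requests
from sentence_transformers import SentenceTransformer, util

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"
model = SentenceTransformer("all-mpnet-base-v2")

def annotate_concepts(slide_text: str, confidence: float = 0.5) -> list[dict]:
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": slide_text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    slide_emb = model.encode(slide_text, convert_to_tensor=True)
    concepts = []
    for r in resources:
        surface = r["@surfaceForm"]
        weight = util.cos_sim(model.encode(surface, convert_to_tensor=True), slide_emb).item()
        concepts.append({"uri": r["@URI"], "surface_form": surface, "weight": weight})
    return sorted(concepts, key=lambda c: c["weight"], reverse=True)
```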
Concept Expansion: To increase the coverage and diversity of the EduKG, the identified concepts were expanded using two strategies based on DBpedia’s semantic relationships: related concept expansion (using dbo:wikiPageWikiLink) and category-based expansion (using dct:subject). These expanded concepts were also weighted and ranked to ensure relevance.
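The two expansion strategies map onto two straightforward DBpedia queries, sketched below with SPARQLWrapper against the public endpoint; the exact queries and limits used in the paper may differ.

```python
# Sketch of the two DBpedia-based expansion lookups: related pages via
# dbo:wikiPageWikiLink and categories via dct:subject.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

def related_concepts(concept_uri: str) -> list[str]:
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?related WHERE {{ <{concept_uri}> dbo:wikiPageWikiLink ?related }} LIMIT 200
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    return [row["related"]["value"] for row in rows]

def categories(concept_uri: str) -> list[str]:
    sparql.setQuery(f"""
        PREFIX dct: <http://purl.org/dc/terms/>
        SELECT ?cat WHERE {{ <{concept_uri}> dct:subject ?cat }} LIMIT 100
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    return [row["cat"]["value"] for row in rows]
```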
However, the initial evaluation of this pipeline revealed a relatively low accuracy of approximately 40%, highlighting a critical need for improvements, especially given the importance of reliable knowledge representation in educational contexts.
Optimizing for Accuracy and Efficiency
To address these limitations, the researchers proposed a series of targeted optimizations across multiple components of the pipeline:
Worker-based Architecture: The initial API-based implementation struggled with scalability and error recovery. A new worker-based architecture was introduced, utilizing job queues and multiple workers. This design allows for horizontal scalability, automatic re-queuing of failed jobs, and better handling of simultaneous requests, with Redis serving as the message broker.
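The paper does not name a specific job-queue framework, so the sketch below uses Celery with Redis as the broker purely to illustrate the pattern: one job per slide, automatic retries, and horizontally scalable workers.

```python
# Minimal sketch of a worker-based setup with Redis as the message broker.
# Celery, task names, and retry settings are illustrative assumptions.
from celery import Celery

app = Celery("edukg", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

@app.task(bind=True, max_retries=3, default_retry_delay=10)
def build_slide_edukg(self, material_id: str, slide_number: int, slide_text: str):
    """One job per slide: extract keyphrases, annotate concepts, store the slide-level EduKG."""
    try:
        # ... keyphrase extraction, concept annotation, Neo4j write go here ...
        return {"material_id": material_id, "slide": slide_number, "status": "done"}
    except Exception as exc:
        # Failed jobs are re-queued automatically up to max_retries.
        raise self.retry(exc=exc)

# Enqueue from the web layer:   build_slide_edukg.delay(material_id, n, text)
# Run workers (scales horizontally):   celery -A tasks worker --concurrency=4
```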
Data Preprocessing: A significant bottleneck was the reliance on the Wikipedia API, causing latency due to numerous network requests. The optimized pipeline introduces an offline data preprocessing step, executed monthly. This extracts relevant data (article names, abstracts, links, categories) from a Wikipedia XML dump and generates embeddings using the SBERT all-mpnet-base-v2 model. This data is stored locally in a PostgreSQL database, providing near-instant access and reducing network dependencies.
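A rough sketch of the offline step is shown below: abstracts already parsed from the XML dump are embedded with all-mpnet-base-v2 and written to PostgreSQL. The table layout and the float8[] embedding column are assumptions; the paper only states that the data is stored locally in PostgreSQL.

```python
# Sketch: embed Wikipedia abstracts and store them in PostgreSQL for fast local access.
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def store_articles(articles: list[tuple[str, str]], dsn: str = "dbname=wikipedia") -> None:
    """articles: (title, abstract) pairs already parsed from the Wikipedia XML dump."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS wiki_articles (
                title TEXT PRIMARY KEY,
                abstract TEXT,
                embedding FLOAT8[]   -- 768-dim all-mpnet-base-v2 vector (assumed layout)
            )
        """)
        titles = [t for t, _ in articles]
        abstracts = [a for _, a in articles]
        embeddings = model.encode(abstracts, batch_size=64)  # one row per abstract
        for title, abstract, emb in zip(titles, abstracts, embeddings):
            cur.execute(
                "INSERT INTO wiki_articles (title, abstract, embedding) VALUES (%s, %s, %s) "
                "ON CONFLICT (title) DO UPDATE SET abstract = EXCLUDED.abstract, embedding = EXCLUDED.embedding",
                (title, abstract, emb.tolist()),
            )
```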
Optimization of Text Extraction: The original PDFMiner approach often included noisy elements such as titles, page numbers, and footers. The enhanced text extraction module now performs structured layout analysis: font size analysis, text distance analysis, text similarity analysis (to filter repetitive content), and bullet point analysis, which together yield cleaner, more accurate text segmentation and content filtering. This significantly improved the precision, recall, and F1 scores for content extraction.
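The snippet below illustrates just one of these heuristics, font-size filtering with pdfminer.six, dropping title-sized lines and tiny footer or page-number text; the thresholds are made up for illustration, and the paper combines this with the other analyses.

```python
# Illustrative font-size heuristic: keep only body-sized text lines.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def body_text(pdf_path: str, min_size: float = 8.0, max_size: float = 20.0) -> list[str]:
    pages = []
    for layout in extract_pages(pdf_path):
        kept_lines = []
        for element in layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if not sizes:
                    continue
                avg = sum(sizes) / len(sizes)
                if min_size <= avg <= max_size:  # skip titles (too large) and footers (too small)
                    kept_lines.append(line.get_text().strip())
        pages.append("\n".join(kept_lines))
    return pages
```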
Optimization of Concept Annotation: To improve the accuracy of concept linking, a concept disambiguation module was integrated. This module checks if a DBpedia-annotated concept is a disambiguation page on Wikipedia. If so, it considers alternative concepts linked from that page and re-weights them using the concept weighting strategy, replacing the original annotation with the highest-weighted alternative. Additionally, a knowledge graph pruning step was added to remove irrelevant concepts (those with a weight below a threshold of 0.192), further enhancing accuracy. The all-mpnet-base-v2 embedding model was chosen for its superior performance in concept-material similarity.
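The following sketch shows the shape of the pruning and disambiguation logic. The 0.192 threshold comes from the paper; how disambiguation pages and their candidate links are looked up (the `wiki` helper and `weight_fn` below) is assumed.

```python
# Sketch of knowledge graph pruning and a simplified disambiguation check.
PRUNE_THRESHOLD = 0.192  # weight threshold reported in the paper

def prune(concepts: list[dict]) -> list[dict]:
    """Drop concepts whose relevance weight falls below the threshold."""
    return [c for c in concepts if c["weight"] >= PRUNE_THRESHOLD]

def disambiguate(concept: dict, wiki, weight_fn) -> dict:
    """If the annotated page is a disambiguation page, replace it with the
    highest-weighted alternative it links to."""
    if not wiki.is_disambiguation(concept["title"]):      # assumed helper
        return concept
    candidates = wiki.linked_titles(concept["title"])     # assumed helper
    best = max(candidates, key=weight_fn, default=None)
    return {"title": best, "weight": weight_fn(best)} if best else concept
```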
Optimization of Concept Expansion: Instead of querying public SPARQL endpoints for related concepts (RCs) and categories (Cts), the optimized pipeline retrieves this information from the locally hosted Wikipedia dump, ensuring faster, more reliable access. Related concept expansion now involves creating a candidate set from linked Wikipedia pages, weighting them by cosine similarity with the learning material, and selecting the top 20. Category expansion identifies categories of main concepts, weights them based on normalized category weight and connected concepts weight, and retains the top 5.
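A minimal sketch of the related-concept step is given below: candidate linked pages are scored by cosine similarity against the learning-material embedding and the top 20 are kept. Retrieval of candidates from the local Wikipedia tables is abstracted away here.

```python
# Sketch: weight candidate linked pages against the learning material and keep the top 20.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def expand_related(material_text: str, candidate_abstracts: dict[str, str], top_k: int = 20):
    """candidate_abstracts: {linked page title: abstract} fetched from the local dump."""
    material_emb = model.encode(material_text, convert_to_tensor=True)
    titles = list(candidate_abstracts)
    cand_embs = model.encode([candidate_abstracts[t] for t in titles], convert_to_tensor=True)
    scores = util.cos_sim(material_emb, cand_embs)[0]          # one score per candidate
    ranked = sorted(zip(titles, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```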
Evaluation and Results
The optimized pipeline was rigorously evaluated for both efficiency and accuracy. In terms of efficiency, it demonstrated remarkable improvements:
EduKG Construction (mean time per slide): over 10 times faster, down from 24.35 seconds to 2.3 seconds.
EduKG Expansion (mean time per concept): over 100 times faster, down from 222 seconds to 1.89 seconds.
These significant gains are primarily attributed to the local Wikipedia database with precomputed embeddings, which drastically reduced network latency and computational overhead.
For accuracy, using the same evaluation methodology with expert annotators, the optimized pipeline achieved a 17.5% relative improvement, raising accuracy from 0.40 to 0.47. This gain was largely due to cleaner text extraction, effective concept disambiguation, and knowledge graph pruning, along with the use of the high-quality all-mpnet-base-v2 embedding model. While the overall accuracy increase was moderate, partly due to the continued reliance on DBpedia as the primary external knowledge base, the enhancements significantly improved the reliability and semantic coherence of the generated EduKGs.
Conclusion and Future Directions
This research presents a robust, end-to-end pipeline for automatically constructing Educational Knowledge Graphs from PDF learning materials. The optimized pipeline, implemented within the CourseMapper platform, offers a state-of-the-art solution that is scalable, efficient, and adaptable to various educational contexts. It achieves substantial improvements in both processing speed and accuracy, making it a valuable tool for enhancing technology-enhanced learning systems.
Future work includes optimizing database interactions, supporting incremental updates for the Wikipedia dump, improving text extraction for complex elements like tables and figures, and developing a more comprehensive and up-to-date evaluation dataset. The researchers also plan to integrate a “human-in-the-loop” mechanism, allowing domain experts to refine and validate the automatically generated EduKGs, balancing automation with expert oversight. For more details, you can refer to the full research paper here.