Unlocking Drug Information: How AI Creates Knowledge Graphs from Leaflets

TLDR: The research introduces MEDAKA, a new biomedical knowledge graph (KG) constructed from publicly available drug leaflets using an end-to-end pipeline. This pipeline leverages a web scraper to collect unstructured text and a large language model (LLM), LLaMA 3.3 70B, for information extraction, including a majority voting mechanism for reliability. MEDAKA captures clinically relevant drug attributes like side effects, warnings, contraindications, ingredients, dosage, and storage instructions, which are often missing in existing biomedical KGs. Evaluated through human inspection and an LLM-as-a-judge framework, MEDAKA demonstrates high accuracy and broader coverage compared to other databases, offering a valuable resource for patient safety monitoring and drug recommendation.

Knowledge graphs (KGs) are becoming increasingly vital for organizing complex biomedical information into structured, easily interpretable formats. However, many existing biomedical KGs tend to focus on specific areas like molecular interactions or adverse events, often overlooking the rich, practical data found in everyday drug leaflets.

A new research paper introduces MEDAKA, a biomedical KG built from drug leaflets, together with the pipeline used to construct it. The authors, Asmita Sengupta, David A. Selby, Sebastian J. Vollmer, and Gerrit Großmann, present an end-to-end pipeline that transforms unstructured online content into a structured knowledge graph using a web scraper and a large language model (LLM). The MEDAKA dataset itself was generated by applying this pipeline to thousands of publicly available drug leaflets.

The MEDAKA Pipeline: From Leaflet to Knowledge Graph

The core of this research is a modular and reproducible pipeline designed to automatically build KGs. It consists of three main stages:

1. Data Collection: The process begins by gathering raw data. For MEDAKA, approximately 13,000 drug leaflets related to human medicine were collected from the Health Products Regulatory Authority (HPRA) in Ireland. A lightweight web scraping script, built with Python’s BeautifulSoup, extracted direct PDF download links. These PDFs were then parsed into plain text using the PyMuPDF library.

2. LLM-based Information Extraction: To convert the parsed text into a structured representation, a prompt-based LLM pipeline was developed. The LLaMA 3.3 70B Instruct model was chosen due to its ability to handle long context windows, allowing complete single-pass processing of entire drug leaflets. The LLM was tasked with outputting subject–relation–object triples, focusing on clinically relevant attributes such as drug name, side effects, ingredients, warnings, contraindications, dosage guidelines, storage instructions, and physical characteristics. To ensure reliability and mitigate potential LLM hallucinations, a majority voting strategy was employed: each leaflet was processed five times, and only triples present in three or more generations were retained.

3. KG Construction: In the final step, the filtered triples were normalized by converting subjects, relations, and objects to lowercase, forming the MEDAKA knowledge graph. The schema of MEDAKA defines key node types like Drug, ActiveIngredient, SideEffect, Warning, and Contraindication, connected by relations such as ‘hassideeffect’ and ‘haswarning’.
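The voting and normalization steps can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the hypothetical drug "ExampleDrug", and the sample triples are invented for illustration, and it assumes triples are matched across generations by exact string equality, which the article does not specify.

```python
from collections import Counter

# A triple is (subject, relation, object).
Triple = tuple[str, str, str]


def majority_vote(generations: list[list[Triple]], min_votes: int = 3) -> set[Triple]:
    """Keep triples that appear in at least `min_votes` generations."""
    votes = Counter()
    for triples in generations:
        # set() so a triple repeated inside one generation counts only once
        votes.update(set(triples))
    return {triple for triple, n in votes.items() if n >= min_votes}


def normalize(triple: Triple) -> Triple:
    """Lowercase subject, relation, and object."""
    subject, relation, obj = triple
    return (subject.lower(), relation.lower(), obj.lower())


# Five hypothetical LLM generations for one leaflet (illustrative data only).
nausea = ("ExampleDrug", "hasSideEffect", "Nausea")
headache = ("ExampleDrug", "hasSideEffect", "Headache")
driving = ("ExampleDrug", "hasWarning", "Do not drive")

generations = [
    [nausea, driving],
    [nausea],
    [nausea, headache],
    [driving, headache],
    [nausea, driving],
]

kept = majority_vote(generations)            # 3-of-5 filter
kg_triples = {normalize(t) for t in kept}    # lowercase after filtering
```

Here the headache triple appears in only two of the five generations and is dropped, while the other two triples survive and are lowercased. Voting before lowercasing follows the order described above; normalizing first would let triples that differ only in capitalization vote together, a design choice the article does not address.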

The MEDAKA Dataset: A Rich Resource

The MEDAKA dataset produced by this pipeline consists of 41,142 nodes and 466,359 directed, labeled edges connecting drug entities to biomedical concepts. It includes attributes often missing from existing medical databases, such as contraindications, warnings, storage conditions, and physical characteristics of drugs. This makes it a broader source of practical, leaflet-level drug knowledge than resources focused on molecular interactions or adverse events alone.
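To make "nodes" and "directed, labeled edges" concrete, the sketch below builds a small graph from triples using only the Python standard library. The storage layout, the helper names, and the sample drugs are assumptions for illustration; the article does not describe how MEDAKA is stored or queried.

```python
from collections import defaultdict


def build_graph(triples):
    """Return (nodes, out_edges) where out_edges[subject][relation] is a set of objects."""
    nodes = set()
    out_edges = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        nodes.update((subject, obj))          # both endpoints are nodes
        out_edges[subject][relation].add(obj)  # directed edge labeled by relation
    return nodes, out_edges


def edge_count(out_edges):
    """Count distinct (subject, relation, object) edges."""
    return sum(len(objs) for rels in out_edges.values() for objs in rels.values())


# Hypothetical normalized triples (illustrative data only).
triples = {
    ("exampledrug", "hassideeffect", "nausea"),
    ("exampledrug", "hassideeffect", "headache"),
    ("exampledrug", "haswarning", "do not drive"),
    ("otherdrug", "hassideeffect", "nausea"),
}

nodes, out_edges = build_graph(triples)
n_edges = edge_count(out_edges)
side_effects = sorted(out_edges["exampledrug"]["hassideeffect"])
```

Two drugs sharing the object "nausea" point to the same node, which is why a graph of this kind can have far more edges than nodes.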

Rigorous Evaluation

The quality of MEDAKA was assessed using three strategies:

1. Human Evaluation: A random sample of 100 drug leaflets, yielding 3,549 extracted relations, was manually reviewed. This evaluation found 96.6% of the triples to be correct, demonstrating high precision.

2. LLM-as-a-Judge Evaluation: The same set of triples was evaluated using an LLM-as-a-judge framework with the gpt-oss-120b model. This automated assessment closely matched human judgments, with 96.9% of triples deemed correct, suggesting that LLM-based judgment can be an efficient alternative for large-scale quality assessment.

3. Coverage Comparison: MEDAKA’s coverage was compared against established biomedical resources like DrugBank, SIDER, and FAERS. The comparison highlighted that MEDAKA captures additional critical attributes such as warnings, contraindications, storage instructions, and physical characteristics, which are often absent in these other databases.
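The two accuracy figures above are aggregate rates, and similar aggregates do not by themselves show that the human reviewers and the LLM judge agreed triple by triple. A minimal sketch of both measurements follows, with made-up labels; the article reports only the overall percentages, so the per-triple agreement shown here is an illustration, not a reported result.

```python
def fraction_correct(labels):
    """Share of triples marked correct (True)."""
    return sum(labels) / len(labels)


def agreement(labels_a, labels_b):
    """Share of triples on which two raters gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both raters must label the same triples")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical correctness labels for ten triples (illustrative data only).
human = [True, True, True, False, True, True, True, True, True, False]
judge = [True, True, True, False, True, True, True, True, True, True]

human_rate = fraction_correct(human)
judge_rate = fraction_correct(judge)
rater_agreement = agreement(human, judge)
```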

Impact and Future Directions

MEDAKA is expected to support crucial tasks such as patient safety monitoring and drug recommendation. The modular design of the pipeline also means it can be adapted to construct KGs from unstructured texts in other domains. While the approach is promising, the authors acknowledge limitations related to the reliability of drug leaflets and variability in website structures. Future work could involve enriching MEDAKA with disease-related information and extending the pipeline to multilingual corpora to enhance its global relevance.

For more details on this innovative work, you can access the full research paper here.
