Unlocking Drug Information: How AI Creates Knowledge Graphs from Leaflets

TLDR: The research introduces MEDAKA, a new biomedical knowledge graph (KG) constructed from publicly available drug leaflets using an end-to-end pipeline. This pipeline leverages a web scraper to collect unstructured text and a large language model (LLM), LLaMA 3.3 70B, for information extraction, including a majority voting mechanism for reliability. MEDAKA captures clinically relevant drug attributes like side effects, warnings, contraindications, ingredients, dosage, and storage instructions, which are often missing in existing biomedical KGs. Evaluated through human inspection and an LLM-as-a-judge framework, MEDAKA demonstrates high accuracy and broader coverage compared to other databases, offering a valuable resource for patient safety monitoring and drug recommendation.

Knowledge graphs (KGs) are becoming increasingly vital for organizing complex biomedical information into structured, easily interpretable formats. However, many existing biomedical KGs tend to focus on specific areas like molecular interactions or adverse events, often overlooking the rich, practical data found in everyday drug leaflets.

A new research paper introduces MEDAKA, a novel approach to constructing comprehensive biomedical KGs. The authors, Asmita Sengupta, David A. Selby, Sebastian J. Vollmer, and Gerrit Großmann, present an innovative, end-to-end pipeline that transforms unstructured online content into a structured knowledge graph using a web scraper and a large language model (LLM). This work also unveils the MEDAKA dataset itself, which was generated by applying this method to thousands of publicly available drug leaflets.

The MEDAKA Pipeline: From Leaflet to Knowledge Graph

The core of this research is a modular and reproducible pipeline designed to automatically build KGs. It consists of three main stages:

1. Data Collection: The process begins by gathering raw data. For MEDAKA, approximately 13,000 drug leaflets related to human medicine were collected from the Health Products Regulatory Authority (HPRA) in Ireland. A lightweight web scraping script, built with Python’s BeautifulSoup, extracted direct PDF download links. These PDFs were then parsed into plain text using the PyMuPDF library.

2. LLM-based Information Extraction: To convert the parsed text into a structured representation, a prompt-based LLM pipeline was developed. The LLaMA 3.3 70B Instruct model was chosen due to its ability to handle long context windows, allowing complete single-pass processing of entire drug leaflets. The LLM was tasked with outputting subject–relation–object triples, focusing on clinically relevant attributes such as drug name, side effects, ingredients, warnings, contraindications, dosage guidelines, storage instructions, and physical characteristics. To ensure reliability and mitigate potential LLM hallucinations, a majority voting strategy was employed: each leaflet was processed five times, and only triples present in three or more generations were retained.

3. KG Construction: In the final step, the filtered triples were normalized by converting subjects, relations, and objects to lowercase, forming the MEDAKA knowledge graph. The schema of MEDAKA defines key node types like Drug, ActiveIngredient, SideEffect, Warning, and Contraindication, connected by relations such as ‘hassideeffect’ and ‘haswarning’.
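The voting and normalization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the exact-string matching of triples across generations, and the example drug "DrugX" are all hypothetical. Each leaflet is assumed to have been processed five times, and each run is assumed to return a list of (subject, relation, object) tuples.

```python
from collections import Counter

# A triple is (subject, relation, object); this representation is an assumption.
Triple = tuple[str, str, str]


def majority_vote(runs: list[list[Triple]], min_votes: int = 3) -> set[Triple]:
    """Keep triples that appear in at least `min_votes` of the runs.

    Triples are compared by exact string match (an assumption; the article
    does not say how triples from different generations were matched).
    """
    counts: Counter = Counter()
    for run in runs:
        # Use a set so a triple repeated inside one run counts only once.
        counts.update(set(run))
    return {triple for triple, votes in counts.items() if votes >= min_votes}


def normalize(triple: Triple) -> Triple:
    """Lowercase subject, relation, and object, as the article describes."""
    subject, relation, obj = triple
    return (subject.lower(), relation.lower(), obj.lower())


# Hypothetical output of five generations for one leaflet.
runs = [
    [("DrugX", "hasSideEffect", "Nausea"), ("DrugX", "hasWarning", "Avoid alcohol")],
    [("DrugX", "hasSideEffect", "Nausea")],
    [("DrugX", "hasSideEffect", "Nausea"), ("DrugX", "hasSideEffect", "Rash")],
    [("DrugX", "hasWarning", "Avoid alcohol")],
    [("DrugX", "hasSideEffect", "Rash")],
]

kept = majority_vote(runs, min_votes=3)
kg_triples = {normalize(t) for t in kept}
```

In this toy input only the "Nausea" triple appears in three of the five runs, so it is the only one that survives the filter before being lowercased.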

The MEDAKA Dataset: A Rich Resource

The MEDAKA dataset is a significant outcome of this pipeline, consisting of 41,142 nodes and 466,359 directed, labeled edges. It captures a wide array of connections between drug entities and biomedical concepts. Crucially, MEDAKA includes attributes often missing in existing medical databases, such as contraindications, warnings, storage conditions, and physical characteristics of drugs. This makes it a broader resource for practical, leaflet-level drug knowledge than KGs that focus on molecular interactions or adverse events.
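To make "nodes" and "directed, labeled edges" concrete, here is a minimal sketch of how a set of triples could be loaded into a simple in-memory graph and counted. The storage format of the released dataset is not described in this article, so the adjacency-list layout, the function names, and the example triples are assumptions made purely for illustration.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


def build_graph(triples: list[Triple]) -> tuple[set[str], dict[str, list[tuple[str, str]]]]:
    """Build a directed, labeled graph from (subject, relation, object) triples.

    Returns the node set and an adjacency map: subject -> [(relation, object)].
    """
    nodes: set[str] = set()
    edges: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subject, relation, obj in triples:
        nodes.update((subject, obj))           # both endpoints are nodes
        edges[subject].append((relation, obj))  # edge points subject -> object
    return nodes, edges


def count_edges(edges: dict[str, list[tuple[str, str]]]) -> int:
    """Total number of directed edges in the adjacency map."""
    return sum(len(targets) for targets in edges.values())


# Hypothetical, already-normalized triples.
triples = [
    ("drugx", "hassideeffect", "nausea"),
    ("drugx", "haswarning", "avoid alcohol"),
    ("drugy", "hassideeffect", "nausea"),
]

nodes, edges = build_graph(triples)
num_nodes = len(nodes)
num_edges = count_edges(edges)
```

Note that "nausea" is a single shared node here, which is why two drugs with the same side effect become connected through it.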

Rigorous Evaluation

The quality of MEDAKA was assessed using three strategies:

1. Human Evaluation: A random sample of 100 drug leaflets, yielding 3,549 extracted relations, was manually reviewed. This evaluation found 96.6% of the triples to be correct, demonstrating high precision.

2. LLM-as-a-Judge Evaluation: The same set of triples was evaluated using an LLM-as-a-judge framework with the gpt-oss-120b model. This automated assessment closely matched human judgments, with 96.9% of triples deemed correct, suggesting that LLM-based judgment can be an efficient alternative for large-scale quality assessment.

3. Coverage Comparison: MEDAKA’s coverage was compared against established biomedical resources like DrugBank, SIDER, and FAERS. The comparison highlighted that MEDAKA captures additional critical attributes such as warnings, contraindications, storage instructions, and physical characteristics, which are often absent in these other databases.
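The first two evaluation strategies both reduce to the share of sampled triples judged correct. The sketch below shows that calculation, plus a simple per-triple agreement rate between two sets of judgments. The labels are toy values, not data from the paper, and the article reports only the aggregate percentages, so the agreement function is an illustrative addition rather than a metric the authors are known to have used.

```python
def share_correct(labels: list[bool]) -> float:
    """Fraction of triples marked correct (True) by one evaluator."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of triples on which two evaluators gave the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy verdicts for four triples (hypothetical, for illustration only).
human_labels = [True, True, False, True]
judge_labels = [True, True, True, True]

human_share = share_correct(human_labels)
judge_share = share_correct(judge_labels)
agree_rate = agreement(human_labels, judge_labels)
```

Two evaluators can report similar overall percentages while still disagreeing on individual triples, which is why a per-triple comparison can be a useful complement to the aggregate figures.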

Impact and Future Directions

MEDAKA is expected to support crucial tasks such as patient safety monitoring and drug recommendation. The modular design of the pipeline also means it can be adapted to construct KGs from unstructured texts in other domains. While the approach is promising, the authors acknowledge limitations related to the reliability of drug leaflets and variability in website structures. Future work could involve enriching MEDAKA with disease-related information and extending the pipeline to multilingual corpora to enhance its global relevance.

For more details on this work, see the full research paper.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
