
Bridging Language Models and Knowledge Graphs for Biomedical Understanding

TLDR: BALI (Biomedical Knowledge Graph and Language Model Alignment) is a novel pre-training method that enhances biomedical language models by aligning their representations with external knowledge graphs. This approach improves the models’ comprehension of complex, domain-specific concepts and factual information, leading to significant performance gains in biomedical question answering, entity linking, and relation extraction tasks, even with minimal pre-training.

In the rapidly evolving field of artificial intelligence, Language Models (LMs) have made significant strides in understanding and processing human language. However, when it comes to highly specialized domains like biomedicine, even advanced LMs often struggle with the intricate structures of concepts and the vast amount of factual information stored in biomedical Knowledge Graphs (KGs).

A new research paper introduces BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel approach designed to bridge this gap. BALI enhances existing biomedical LMs by simultaneously training a dedicated KG encoder and aligning the representations of both the language model and the knowledge graph. This method allows LMs to gain a deeper comprehension of complex, domain-specific information.

The core idea behind BALI is to link specific biomedical concept mentions within a textual sequence to their corresponding entries in a comprehensive KG, such as the Unified Medical Language System (UMLS). It then uses local subgraphs from the KG as ‘cross-modal positive samples’ for these textual mentions. Essentially, it teaches the language model to see the same concept in both text and graph formats, and to understand that they represent the same underlying entity.
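To make that pairing concrete, here is a minimal Python sketch of how a linked mention could be matched with its local KG neighborhood. The toy graph, the CUI identifiers, and the one-hop extraction are illustrative assumptions of this sketch, not the paper's actual pipeline:

```python
# Minimal sketch of the mention-to-subgraph pairing (toy graph, CUIs,
# and one-hop extraction are illustrative assumptions, not the paper's
# actual pipeline).
import networkx as nx

# Toy UMLS-like graph: nodes are concept identifiers (CUIs),
# edges carry relation labels.
kg = nx.DiGraph()
kg.add_edge("C0004057", "C0002771", relation="is_a")    # aspirin -> analgesic
kg.add_edge("C0004057", "C0231528", relation="treats")  # aspirin -> myalgia

def local_subgraph(kg: nx.DiGraph, cui: str, hops: int = 1) -> nx.DiGraph:
    """Extract the k-hop neighborhood of a linked concept; such local
    subgraphs serve as cross-modal positive samples for the mention."""
    nodes, frontier = {cui}, {cui}
    for _ in range(hops):
        frontier = {m for n in frontier
                    for m in list(kg.successors(n)) + list(kg.predecessors(n))}
        nodes |= frontier
    return kg.subgraph(nodes).copy()

# A sentence mentioning "aspirin", linked to its (assumed) CUI:
positive = local_subgraph(kg, "C0004057")
print(positive.edges(data=True))
```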

How BALI Works

BALI employs a two-pronged approach for representation learning. First, a pre-trained language model processes textual sequences, extracting ‘entity representations’ by pooling the embeddings of tokens related to a specific concept. Second, a Graph Neural Network (GNN), specifically a Graph Attention Network (GAT), is used to encode the structural information of local KG subgraphs, generating ‘subgraph node representations’. Crucially, the initial input to the GNN includes semantic information from the LM-encoded concept names, allowing for a rich, combined understanding.
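The following sketch illustrates those two encoders under assumed dimensions and depth; it uses PyTorch and the torch_geometric GATConv layer, and is not the paper's exact architecture. The LM side mean-pools the token embeddings of a mention, while the GAT side encodes a subgraph whose node features are initialized from LM-encoded concept names:

```python
# Illustrative encoder sketch (assumed dimensions and depth; requires
# torch and torch_geometric -- not the paper's exact architecture).
import torch
from torch_geometric.nn import GATConv

def pool_entity_representation(token_embs, entity_token_ids):
    """Mean-pool the LM token embeddings that belong to one concept mention."""
    return token_embs[entity_token_ids].mean(dim=0)

class SubgraphEncoder(torch.nn.Module):
    """Two-layer GAT over a local KG subgraph; node features are
    initialized from LM encodings of the concept names."""
    def __init__(self, dim: int = 768, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(dim, dim // heads, heads=heads)
        self.gat2 = GATConv(dim, dim, heads=1)

    def forward(self, node_feats, edge_index):
        h = self.gat1(node_feats, edge_index).relu()
        return self.gat2(h, edge_index)

# Usage with dummy tensors: token_embs would come from the LM's last
# hidden state, node_feats from LM-encoding each node's concept name.
token_embs = torch.randn(32, 768)                 # 32 tokens, hidden size 768
entity_vec = pool_entity_representation(token_embs, [5, 6, 7])
node_feats = torch.randn(6, 768)                  # 6 nodes in the subgraph
edge_index = torch.tensor([[0, 1, 2, 3, 4],
                           [5, 5, 5, 5, 5]])      # edges into the center node
subgraph_vecs = SubgraphEncoder()(node_feats, edge_index)
```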

Alternatively, for larger language models, the KG subgraphs can be ‘linearized’ into textual strings and then encoded directly by the LM, eliminating the need for a separate GNN. This flexibility allows BALI to adapt to different model capacities.
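As a rough illustration, linearization could look like the following; the "head [relation] tail" template here is an assumption of this sketch, not the paper's exact format:

```python
# Rough sketch of subgraph linearization; the "head [relation] tail"
# template is an assumption, not the paper's exact format.
def linearize_subgraph(center, triples):
    """Flatten (head, relation, tail) triples into a string the LM can
    encode directly, removing the need for a separate GNN."""
    return f"{center}: " + " ; ".join(f"{h} [{r}] {t}" for h, r, t in triples)

print(linearize_subgraph("aspirin", [("aspirin", "is_a", "analgesic"),
                                     ("aspirin", "treats", "myalgia")]))
# -> aspirin: aspirin [is_a] analgesic ; aspirin [treats] myalgia
```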

The training process involves two main objectives: Masked Language Modeling (MLM), a standard technique for language model pre-training, and a ‘Cross-Modal Alignment’ objective. The alignment objective uses a contrastive learning method (InfoNCE loss) to pull the textual and graph representations of the same biomedical concept closer together in a shared embedding space. This joint training ensures that the LM not only maintains its language understanding capabilities but also enriches its entity representations with external knowledge from the KG.
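For intuition, a minimal version of the InfoNCE alignment term might look like this, assuming in-batch negatives where the i-th text vector and the i-th graph vector encode the same concept; the symmetric formulation and temperature value are assumptions of this sketch:

```python
# Minimal InfoNCE alignment term (in-batch negatives, symmetric loss,
# and temperature value are assumptions of this sketch).
import torch
import torch.nn.functional as F

def info_nce(text_vecs, graph_vecs, temperature: float = 0.07):
    """Pull matched text/graph pairs together, push unmatched pairs apart.
    Both inputs are (batch, dim); row i of each encodes the same concept."""
    t = F.normalize(text_vecs, dim=-1)
    g = F.normalize(graph_vecs, dim=-1)
    logits = t @ g.T / temperature                      # cosine similarities
    targets = torch.arange(t.size(0), device=t.device)  # row i matches row i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# The joint pre-training objective would then combine both terms, e.g.:
# loss = mlm_loss + info_nce(entity_vecs, subgraph_vecs)
```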

A significant advantage of BALI is that after the pre-training phase, the GNN component can be discarded. The enhanced language model then retains the distilled factual domain-specific knowledge, making it more efficient for downstream tasks without requiring real-time KG retrieval during inference.

Empirical Findings and Impact

The researchers conducted extensive experiments, pre-training several leading biomedical LMs like PubMedBERT and BioLinkBERT with BALI, using a dataset of PubMed scientific abstracts and the UMLS KG. The results were compelling:

  • BALI consistently improved performance on various biomedical Question Answering (QA) tasks, including PubMedQA, MedQA, and BioASQ. For instance, PubMedBERT showed mean accuracy improvements of 2.1% on PubMedQA, 1.7% on MedQA, and a notable 6.2% on BioASQ.
  • The method significantly enhanced the quality of entity representations, leading to substantial gains in Entity Linking (EL) capabilities across multiple datasets (NCBI, BC5CDR-D, BC5CDR-C, BC2GN, SMM4H), particularly in zero-shot settings for general-purpose LMs.
  • Improvements were also observed in Relation Extraction tasks (ChemProt, DDI, GAD), highlighting BALI’s ability to foster a more nuanced understanding of relationships between biomedical entities.

Notably, BALI-enhanced models performed on par with, or better than, task-specific models that explicitly retrieve KG subgraphs during reasoning, despite having no access to the KG at inference time. This demonstrates the effectiveness of integrating knowledge during pre-training.

While large general-purpose language models like GPT-4 still achieve higher scores on some benchmarks, BALI helps domain-specific LMs close the performance gap significantly, underscoring the continued value of specialized models for biomedical applications.

Further analysis revealed that Graph Attention Networks (GAT) were more effective for graph encoding than simpler methods like GraphSAGE, and that the joint training objectives (MLM and alignment) were both crucial for optimal performance. The study also confirmed that providing additional local graph context is vital, as purely textual node representations performed poorly.

Conclusion

BALI represents a promising step forward in biomedical Natural Language Processing. By aligning language models with knowledge graphs through a self-supervised pre-training method, it enables LMs to develop a more robust and factually aware understanding of the biomedical domain. This approach offers a cost-effective way to enhance existing models, leading to improved performance on critical tasks like question answering and entity linking. The research team plans to expand this method to general domains and other LM architectures in the future. You can read the full research paper here.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
