
Bridging Language Models and Knowledge Graphs for Biomedical Understanding

TLDR: BALI (Biomedical Knowledge Graph and Language Model Alignment) is a novel pre-training method that enhances biomedical language models by aligning their representations with external knowledge graphs. This approach improves the models’ comprehension of complex, domain-specific concepts and factual information, leading to significant performance gains in biomedical question answering, entity linking, and relation extraction tasks, even with minimal pre-training.

In the rapidly evolving field of artificial intelligence, Language Models (LMs) have made significant strides in understanding and processing human language. However, when it comes to highly specialized domains like biomedicine, even advanced LMs often struggle with the intricate structures of concepts and the vast amount of factual information stored in biomedical Knowledge Graphs (KGs).

A new research paper introduces BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel approach designed to bridge this gap. BALI enhances existing biomedical LMs by simultaneously training a dedicated KG encoder and aligning the representations of both the language model and the knowledge graph. This method allows LMs to gain a deeper comprehension of complex, domain-specific information.

The core idea behind BALI is to link specific biomedical concept mentions within a textual sequence to their corresponding entries in a comprehensive KG, such as the Unified Medical Language System (UMLS). It then uses local subgraphs from the KG as ‘cross-modal positive samples’ for these textual mentions. Essentially, it teaches the language model to see the same concept in both text and graph formats, and to understand that they represent the same underlying entity.
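To make that pairing concrete, here is a minimal Python sketch of how a linked mention could be matched with its local KG neighborhood. The toy graph, the CUI identifiers, and the one-hop extraction are illustrative assumptions of this sketch, not the paper's actual pipeline:

```python
# Minimal sketch of the mention-to-subgraph pairing (toy graph, CUIs,
# and one-hop extraction are illustrative assumptions, not the paper's
# actual pipeline).
import networkx as nx

# Toy UMLS-like graph: nodes are concept identifiers (CUIs),
# edges carry relation labels.
kg = nx.DiGraph()
kg.add_edge("C0004057", "C0002771", relation="is_a")    # aspirin -> analgesic
kg.add_edge("C0004057", "C0231528", relation="treats")  # aspirin -> myalgia

def local_subgraph(kg: nx.DiGraph, cui: str, hops: int = 1) -> nx.DiGraph:
    """Extract the k-hop neighborhood of a linked concept; such local
    subgraphs serve as cross-modal positive samples for the mention."""
    nodes, frontier = {cui}, {cui}
    for _ in range(hops):
        frontier = {m for n in frontier
                    for m in list(kg.successors(n)) + list(kg.predecessors(n))}
        nodes |= frontier
    return kg.subgraph(nodes).copy()

# A sentence mentioning "aspirin", linked to its (assumed) CUI:
positive = local_subgraph(kg, "C0004057")
print(positive.edges(data=True))
```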

How BALI Works

BALI employs a two-pronged approach for representation learning. First, a pre-trained language model processes textual sequences, extracting ‘entity representations’ by pooling the embeddings of tokens related to a specific concept. Second, a Graph Neural Network (GNN), specifically a Graph Attention Network (GAT), is used to encode the structural information of local KG subgraphs, generating ‘subgraph node representations’. Crucially, the initial input to the GNN includes semantic information from the LM-encoded concept names, allowing for a rich, combined understanding.
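The following sketch illustrates those two encoders under assumed dimensions and depth; it uses PyTorch and the torch_geometric GATConv layer, and is not the paper's exact architecture. The LM side mean-pools the token embeddings of a mention, while the GAT side encodes a subgraph whose node features are initialized from LM-encoded concept names:

```python
# Illustrative encoder sketch (assumed dimensions and depth; requires
# torch and torch_geometric -- not the paper's exact architecture).
import torch
from torch_geometric.nn import GATConv

def pool_entity_representation(token_embs, entity_token_ids):
    """Mean-pool the LM token embeddings that belong to one concept mention."""
    return token_embs[entity_token_ids].mean(dim=0)

class SubgraphEncoder(torch.nn.Module):
    """Two-layer GAT over a local KG subgraph; node features are
    initialized from LM encodings of the concept names."""
    def __init__(self, dim: int = 768, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(dim, dim // heads, heads=heads)
        self.gat2 = GATConv(dim, dim, heads=1)

    def forward(self, node_feats, edge_index):
        h = self.gat1(node_feats, edge_index).relu()
        return self.gat2(h, edge_index)

# Usage with dummy tensors: token_embs would come from the LM's last
# hidden state, node_feats from LM-encoding each node's concept name.
token_embs = torch.randn(32, 768)                 # 32 tokens, hidden size 768
entity_vec = pool_entity_representation(token_embs, [5, 6, 7])
node_feats = torch.randn(6, 768)                  # 6 nodes in the subgraph
edge_index = torch.tensor([[0, 1, 2, 3, 4],
                           [5, 5, 5, 5, 5]])      # edges into the center node
subgraph_vecs = SubgraphEncoder()(node_feats, edge_index)
```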

Alternatively, for larger language models, the KG subgraphs can be ‘linearized’ into textual strings and then encoded directly by the LM, eliminating the need for a separate GNN. This flexibility allows BALI to adapt to different model capacities.
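As a rough illustration, linearization could look like the following; the "head [relation] tail" template here is an assumption of this sketch, not the paper's exact format:

```python
# Rough sketch of subgraph linearization; the "head [relation] tail"
# template is an assumption, not the paper's exact format.
def linearize_subgraph(center, triples):
    """Flatten (head, relation, tail) triples into a string the LM can
    encode directly, removing the need for a separate GNN."""
    return f"{center}: " + " ; ".join(f"{h} [{r}] {t}" for h, r, t in triples)

print(linearize_subgraph("aspirin", [("aspirin", "is_a", "analgesic"),
                                     ("aspirin", "treats", "myalgia")]))
# -> aspirin: aspirin [is_a] analgesic ; aspirin [treats] myalgia
```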

The training process involves two main objectives: Masked Language Modeling (MLM), a standard technique for language model pre-training, and a ‘Cross-Modal Alignment’ objective. The alignment objective uses a contrastive learning method (InfoNCE loss) to pull the textual and graph representations of the same biomedical concept closer together in a shared embedding space. This joint training ensures that the LM not only maintains its language understanding capabilities but also enriches its entity representations with external knowledge from the KG.
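For intuition, a minimal version of the InfoNCE alignment term might look like this, assuming in-batch negatives where the i-th text vector and the i-th graph vector encode the same concept; the symmetric formulation and temperature value are assumptions of this sketch:

```python
# Minimal InfoNCE alignment term (in-batch negatives, symmetric loss,
# and temperature value are assumptions of this sketch).
import torch
import torch.nn.functional as F

def info_nce(text_vecs, graph_vecs, temperature: float = 0.07):
    """Pull matched text/graph pairs together, push unmatched pairs apart.
    Both inputs are (batch, dim); row i of each encodes the same concept."""
    t = F.normalize(text_vecs, dim=-1)
    g = F.normalize(graph_vecs, dim=-1)
    logits = t @ g.T / temperature                      # cosine similarities
    targets = torch.arange(t.size(0), device=t.device)  # row i matches row i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# The joint pre-training objective would then combine both terms, e.g.:
# loss = mlm_loss + info_nce(entity_vecs, subgraph_vecs)
```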

A significant advantage of BALI is that after the pre-training phase, the GNN component can be discarded. The enhanced language model then retains the distilled factual domain-specific knowledge, making it more efficient for downstream tasks without requiring real-time KG retrieval during inference.

Empirical Findings and Impact

The researchers conducted extensive experiments, pre-training several leading biomedical LMs like PubMedBERT and BioLinkBERT with BALI, using a dataset of PubMed scientific abstracts and the UMLS KG. The results were compelling:

  • BALI consistently improved performance on various biomedical Question Answering (QA) tasks, including PubMedQA, MedQA, and BioASQ. For instance, PubMedBERT showed mean accuracy improvements of 2.1% on PubMedQA, 1.7% on MedQA, and a notable 6.2% on BioASQ.
  • The method significantly enhanced the quality of entity representations, leading to substantial gains in Entity Linking (EL) capabilities across multiple datasets (NCBI, BC5CDR-D, BC5CDR-C, BC2GN, SMM4H), particularly in zero-shot settings for general-purpose LMs.
  • Improvements were also observed in Relation Extraction tasks (ChemProt, DDI, GAD), highlighting BALI’s ability to foster a more nuanced understanding of relationships between biomedical entities.

Notably, BALI-enhanced models performed on par with, or better than, task-specific models that explicitly retrieve KG subgraphs during reasoning, despite having no access to the KG at inference time. This demonstrates the effectiveness of integrating knowledge during pre-training.

While large general-purpose language models like GPT-4 still achieve higher scores on some benchmarks, BALI helps domain-specific LMs close the performance gap significantly, underscoring the continued value of specialized models for biomedical applications.

Further analysis revealed that Graph Attention Networks (GAT) were more effective for graph encoding than simpler methods like GraphSAGE, and that the joint training objectives (MLM and alignment) were both crucial for optimal performance. The study also confirmed that providing additional local graph context is vital, as purely textual node representations performed poorly.

Conclusion

BALI represents a promising step forward in biomedical Natural Language Processing. By aligning language models with knowledge graphs through a self-supervised pre-training method, it enables LMs to develop a more robust and factually aware understanding of the biomedical domain. This approach offers a cost-effective way to enhance existing models, leading to improved performance on critical tasks like question answering and entity linking. The research team plans to expand this method to general domains and other LM architectures in the future. You can read the full research paper here.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
