
Adapting Language Models to New Domains with Instruction-Knowledge-Aware Continual Pretraining

TLDR: IKnow is a new framework for continually pretraining instruction-tuned large language models on new domains without access to the original base model or external knowledge bases. It casts self-supervised objectives (Masked Token Prediction, Masked Phrase Prediction, and bidirectional Natural Language ↔ Knowledge Graph conversion) in an instruction-response format to preserve instruction-following ability and deepen semantic understanding, showing promising results on question answering tasks.

Large language models (LLMs) have become fundamental to many modern AI applications, excelling in a wide range of tasks. However, a significant challenge arises when these models are deployed in new domains that differ from their original training data. This ‘domain drift’ can lead to a noticeable drop in performance, particularly for instruction-tuned models, which may lose their ability to follow instructions and accurately represent semantic information.

Traditional solutions to this problem, such as continued pretraining, often fall short. Naively applying standard self-supervised objectives can degrade an instruction-tuned model’s capabilities, a phenomenon known as catastrophic forgetting. Existing fixes typically require access to the original base model weights, which are often withheld for safety reasons, or rely on external domain-specific databases, which might not always be available or compatible.

Introducing IKnow: A Novel Approach to Domain Adaptation

To address these limitations, researchers have proposed a new framework called Instruction-Knowledge-Aware Continual Pretraining (IKnow). This innovative approach offers a simple and general way to adapt LLMs to new domains using only unlabeled test-time data, without needing the base model or external knowledge sources. IKnow’s core idea is to leverage the domain knowledge already embedded within the text itself and encode it at a deeper semantic level.

The IKnow framework formulates novel self-supervised objectives in an instruction-response dialogue format. This design helps instruction-tuned models retain their instruction-following ability during continual pretraining, a crucial aspect that previous methods struggled with.
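To make the format concrete, a single Masked Token Prediction example might be rendered as a two-turn dialogue along the following lines. The message schema, prompt wording, and the [MASK] convention here are illustrative assumptions, not the paper’s exact template:

```python
# A hypothetical IKnow-style training example in chat-message form.
# The prompt wording and the "[MASK]" placeholder are illustrative assumptions.
mtp_example = [
    {
        "role": "user",
        "content": (
            "Complete the masked token in the sentence below.\n"
            "Sentence: The probe entered [MASK] around Jupiter in July."
        ),
    },
    {"role": "assistant", "content": "orbit"},
]
```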

How IKnow Works: The Method Behind the Adaptation

IKnow operates in a structured manner, transforming raw text into instruction-tuning training examples:

  • Data Preparation: The framework first processes the unlabeled context data by splitting it into sentences. It then uses off-the-shelf syntactic parsers to extract structural information: a constituency parser identifies phrases (such as noun phrases or verb phrases), and a dependency parser derives knowledge graphs by identifying (subject, root, object) relations within sentences (a minimal sketch of this pipeline appears after this list).
  • Instruction-Style Objectives: IKnow introduces three distinct pretraining tasks, all formatted as instruction-response dialogues:
    1. Masked Token Prediction (MTP): Similar to standard masked language modeling, but framed as an instruction. The model is asked to complete a masked token in a sentence, and the response is the missing token.
    2. Masked Phrase Prediction (MPP): To enhance understanding of entities and relations, IKnow masks out entire phrases (e.g., a noun phrase). The model is instructed to complete the masked words, and the response is the full phrase. This focuses the model on semantically meaningful spans.
    3. NL↔KG (Natural Language to Knowledge Graph and vice-versa): This task emulates human learning by encouraging bidirectional reasoning between natural language and structured knowledge. For NL→KG, the model is asked to extract knowledge tuples from a text, and it responds with the structured knowledge graph. The KG→NL task works in reverse, asking the model to generate natural language from a knowledge graph.

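To illustrate the pipeline above, here is a minimal sketch of how such instruction-response examples could be built with an off-the-shelf parser. It uses spaCy’s noun chunks as a stand-in for the constituency-parser phrases and its dependency labels for the (subject, root, object) tuples; the prompt wording, masking strategy, and choice of parser are assumptions for illustration, not the authors’ exact recipe.

```python
import spacy

# Small English pipeline with a dependency parser; assumed to be installed.
nlp = spacy.load("en_core_web_sm")

def build_iknow_examples(text: str) -> list[dict]:
    """Turn raw domain text into instruction-response training examples."""
    examples = []
    for sent in nlp(text).sents:
        # Masked Phrase Prediction: mask one noun chunk (a simplification of
        # the constituency-parser phrases described in the article).
        chunks = list(sent.noun_chunks)
        if chunks:
            phrase = chunks[-1]
            masked = sent.text.replace(phrase.text, "[MASK]", 1)
            examples.append({
                "instruction": f"Fill in the masked phrase.\nSentence: {masked}",
                "response": phrase.text,
            })
        # NL -> KG: read a (subject, root, object) tuple off the dependency parse.
        root = sent.root
        subj = next((t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")), None)
        obj = next((t for t in root.children if t.dep_ in ("dobj", "obj", "attr")), None)
        if subj is not None and obj is not None:
            examples.append({
                "instruction": f"Extract a (subject, relation, object) tuple from: {sent.text}",
                "response": f"({subj.text}, {root.lemma_}, {obj.text})",
            })
            # The reverse KG -> NL task would swap instruction and response roles.
    return examples
```

Each resulting dictionary would then be rendered into the dialogue format shown earlier before training.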
Experimental Validation and Key Findings

The researchers evaluated IKnow on two knowledge-intensive question answering datasets: RepliQA (news articles) and SciQAG (scientific publications). They tested two different LLMs, Llama-3.2-3B-Instruct and Qwen3-1.7B, using both full-finetuning and LoRA techniques.
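For the LoRA setting, a continual-pretraining run along these lines could be set up roughly as follows with Hugging Face transformers and peft. The model name is one of the two evaluated in the paper; the LoRA hyperparameters, sequence length, and loss masking are illustrative placeholders rather than the authors’ configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # one of the two models tested
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Illustrative LoRA setup; rank, alpha, and target modules are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

def encode(example: dict) -> dict:
    # Render an IKnow instruction-response pair with the model's chat template
    # and tokenize it; whether the loss covers the full sequence or only the
    # response is an implementation detail not specified here.
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return tokenizer(text, truncation=True, max_length=1024)
```

The encoded examples would then be passed to a standard causal-language-modeling trainer, with or without the LoRA adapters depending on the setting.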

The experiments aimed to test two main hypotheses:

  • H1: Instruction-style pretraining tasks improve performance over naive next-token prediction (NTP).
  • H2: Knowledge-intensive tasks (MPP and NL↔KG) improve performance compared to Masked Token Prediction (MTP).

The results largely supported H1, showing that instruction-style pretraining tasks outperformed the naive NTP baseline in 19 out of 24 experimental settings. This indicates that IKnow successfully helps models retain their instruction-following ability. Notably, naive NTP sometimes led to catastrophic forgetting, where the model lost its prior knowledge.

Support for H2 was mixed. While MPP and NL↔KG yielded substantial performance gains for Llama-3.2-3B, they did not show consistent improvement for Qwen3-1.7B. This discrepancy might be attributed to Qwen3’s design, which emphasizes reasoning, or to its smaller parameter count, which may limit its capacity to benefit from more sophisticated knowledge-acquisition objectives.

Conclusion and Future Directions

IKnow presents a promising framework for continually pretraining instruction-tuned LLMs, effectively addressing the challenge of maintaining instruction-following ability and enhancing semantic understanding in new domains without relying on external resources or base model access. The framework’s ability to formulate self-supervised losses in an instruction-response template is a key innovation.

While the results are encouraging, the researchers acknowledge several limitations. Current evaluations were conducted on full test datasets, and future work will explore performance on smaller subsets or single-sample scenarios. The scope was limited to question answering tasks, suggesting that other tasks might require different pretraining objectives. Additionally, experiments were conducted on relatively smaller models (up to 3 billion parameters), leaving the generalization to larger-scale models as an open question. Ethical considerations also highlight the current focus on high-resource languages like English, with a call for future work to expand evaluation to low-resource languages for broader applicability. You can read the full research paper here: IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation.
