
Adapting Language Models to New Domains with Instruction-Knowledge-Aware Continual Pretraining

TLDR: IKnow is a new framework for continually pretraining instruction-tuned large language models on new domains without access to the original base model or external knowledge bases. It casts self-supervised objectives (Masked Token Prediction, Masked Phrase Prediction, and bidirectional Natural Language ↔ Knowledge Graph conversion) in an instruction-response format to preserve instruction-following ability and deepen semantic understanding, showing promising results on question answering tasks.

Large language models (LLMs) have become fundamental to many modern AI applications, excelling in a wide range of tasks. However, a significant challenge arises when these models are deployed in new domains that differ from their original training data. This ‘domain drift’ can lead to a noticeable drop in performance, particularly for instruction-tuned models, which may lose their ability to follow instructions and accurately represent semantic information.

Traditional solutions to this problem, such as continued pretraining, often fall short. Naively applying standard self-supervised objectives can degrade an instruction-tuned model’s capabilities, a phenomenon known as catastrophic forgetting. Existing fixes typically require access to the original base model weights, which are often withheld for safety reasons, or rely on external domain-specific databases, which might not always be available or compatible.

Introducing IKnow: A Novel Approach to Domain Adaptation

To address these limitations, researchers have proposed a new framework called Instruction-Knowledge-Aware Continual Pretraining (IKnow). This innovative approach offers a simple and general way to adapt LLMs to new domains using only unlabeled test-time data, without needing the base model or external knowledge sources. IKnow’s core idea is to leverage the domain knowledge already embedded within the text itself and encode it at a deeper semantic level.

The IKnow framework formulates novel self-supervised objectives in an instruction-response dialogue format. This design helps instruction-tuned models retain their instruction-following ability during continual pretraining, a crucial aspect that previous methods struggled with.
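To make the format concrete, a single Masked Token Prediction example might be rendered as a two-turn dialogue along the following lines. The message schema, prompt wording, and the [MASK] convention here are illustrative assumptions, not the paper’s exact template:

```python
# A hypothetical IKnow-style training example in chat-message form.
# The prompt wording and the "[MASK]" placeholder are illustrative assumptions.
mtp_example = [
    {
        "role": "user",
        "content": (
            "Complete the masked token in the sentence below.\n"
            "Sentence: The probe entered [MASK] around Jupiter in July."
        ),
    },
    {"role": "assistant", "content": "orbit"},
]
```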

How IKnow Works: The Method Behind the Adaptation

IKnow operates in a structured manner, transforming raw text into instruction-tuning training examples:

  • Data Preparation: The framework first processes the unlabeled context data by splitting it into sentences. It then uses off-the-shelf syntactic parsers to extract structural information: a constituency parser identifies phrases (such as noun phrases or verb phrases), and a dependency parser derives knowledge graphs by identifying (subject, root, object) relations within sentences (a minimal sketch of this pipeline appears after this list).
  • Instruction-Style Objectives: IKnow introduces three distinct pretraining tasks, all formatted as instruction-response dialogues:
    1. Masked Token Prediction (MTP): Similar to standard masked language modeling, but framed as an instruction. The model is asked to complete a masked token in a sentence, and the response is the missing token.
    2. Masked Phrase Prediction (MPP): To enhance understanding of entities and relations, IKnow masks out entire phrases (e.g., a noun phrase). The model is instructed to complete the masked words, and the response is the full phrase. This focuses the model on semantically meaningful spans.
    3. NL↔KG (Natural Language to Knowledge Graph and vice-versa): This task emulates human learning by encouraging bidirectional reasoning between natural language and structured knowledge. For NL→KG, the model is asked to extract knowledge tuples from a text, and it responds with the structured knowledge graph. The KG→NL task works in reverse, asking the model to generate natural language from a knowledge graph.

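To illustrate the pipeline above, here is a minimal sketch of how such instruction-response examples could be built with an off-the-shelf parser. It uses spaCy’s noun chunks as a stand-in for the constituency-parser phrases and its dependency labels for the (subject, root, object) tuples; the prompt wording, masking strategy, and choice of parser are assumptions for illustration, not the authors’ exact recipe.

```python
import spacy

# Small English pipeline with a dependency parser; assumed to be installed.
nlp = spacy.load("en_core_web_sm")

def build_iknow_examples(text: str) -> list[dict]:
    """Turn raw domain text into instruction-response training examples."""
    examples = []
    for sent in nlp(text).sents:
        # Masked Phrase Prediction: mask one noun chunk (a simplification of
        # the constituency-parser phrases described in the article).
        chunks = list(sent.noun_chunks)
        if chunks:
            phrase = chunks[-1]
            masked = sent.text.replace(phrase.text, "[MASK]", 1)
            examples.append({
                "instruction": f"Fill in the masked phrase.\nSentence: {masked}",
                "response": phrase.text,
            })
        # NL -> KG: read a (subject, root, object) tuple off the dependency parse.
        root = sent.root
        subj = next((t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")), None)
        obj = next((t for t in root.children if t.dep_ in ("dobj", "obj", "attr")), None)
        if subj is not None and obj is not None:
            examples.append({
                "instruction": f"Extract a (subject, relation, object) tuple from: {sent.text}",
                "response": f"({subj.text}, {root.lemma_}, {obj.text})",
            })
            # The reverse KG -> NL task would swap instruction and response roles.
    return examples
```

Each resulting dictionary would then be rendered into the dialogue format shown earlier before training.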
Experimental Validation and Key Findings

The researchers evaluated IKnow on two knowledge-intensive question answering datasets: RepliQA (news articles) and SciQAG (scientific publications). They tested two different LLMs, Llama-3.2-3B-Instruct and Qwen3-1.7B, using both full-finetuning and LoRA techniques.
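For the LoRA setting, a continual-pretraining run along these lines could be set up roughly as follows with Hugging Face transformers and peft. The model name is one of the two evaluated in the paper; the LoRA hyperparameters, sequence length, and loss masking are illustrative placeholders rather than the authors’ configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # one of the two models tested
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Illustrative LoRA setup; rank, alpha, and target modules are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

def encode(example: dict) -> dict:
    # Render an IKnow instruction-response pair with the model's chat template
    # and tokenize it; whether the loss covers the full sequence or only the
    # response is an implementation detail not specified here.
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return tokenizer(text, truncation=True, max_length=1024)
```

The encoded examples would then be passed to a standard causal-language-modeling trainer, with or without the LoRA adapters depending on the setting.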

The experiments aimed to test two main hypotheses:

  • H1: Instruction-style pretraining tasks improve performance over naive next-token prediction (NTP).
  • H2: Knowledge-intensive tasks (MPP and NL↔KG) improve performance compared to Masked Token Prediction (MTP).

The results largely supported H1, showing that instruction-style pretraining tasks outperformed the naive NTP baseline in 19 out of 24 experimental settings. This indicates that IKnow successfully helps models retain their instruction-following ability. Notably, naive NTP sometimes led to catastrophic forgetting, where the model lost its prior knowledge.

Support for H2 was mixed. While MPP and NL↔KG yielded substantial performance gains for Llama-3.2-3B, they did not show consistent improvement for Qwen3-1.7B. This discrepancy might be attributed to Qwen3’s design, which emphasizes reasoning, or to its smaller parameter count, which may limit its capacity to benefit from more sophisticated knowledge-acquisition objectives.

Conclusion and Future Directions

IKnow presents a promising framework for continually pretraining instruction-tuned LLMs, effectively addressing the challenge of maintaining instruction-following ability and enhancing semantic understanding in new domains without relying on external resources or base model access. The framework’s ability to formulate self-supervised losses in an instruction-response template is a key innovation.

While the results are encouraging, the researchers acknowledge several limitations. Current evaluations were conducted on full test datasets, and future work will explore performance on smaller subsets or single-sample scenarios. The scope was limited to question answering tasks, suggesting that other tasks might require different pretraining objectives. Additionally, experiments were conducted on relatively smaller models (up to 3 billion parameters), leaving the generalization to larger-scale models as an open question. Ethical considerations also highlight the current focus on high-resource languages like English, with a call for future work to expand evaluation to low-resource languages for broader applicability. You can read the full research paper here: IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation.
