TLDR: A new “unlearn-then-learn” strategy, using the IA3 fine-tuning technique and circuit localization, allows for precise and localized knowledge editing in compact LLMs. This two-stage approach effectively suppresses old, conflicting facts while instilling new ones with high accuracy, significantly mitigating catastrophic forgetting and introducing “soft forgetting” for enhanced model control.
Large Language Models (LLMs) are powerful, but their knowledge is static: it is fixed at training time and hard to update, especially when a new fact contradicts what the model already “knows.” Editing such a fact tends to run into two problems. The model resists learning the new fact, and the update can wipe out a lot of unrelated information, a phenomenon known as catastrophic forgetting.
A new research paper introduces an innovative solution called the “unlearn-then-learn” strategy. This approach aims to precisely edit knowledge within LLMs, particularly smaller ones like Microsoft’s Phi-3-mini-4k-instruct. The core idea is to first “unlearn” the old, conflicting information and then “learn” the new, correct fact. This two-stage process is made possible by a technique called Infused Adapter by Inhibiting and Amplifying Inner Activations, or (IA)³ (written IA3 in this article), which is a parameter-efficient fine-tuning (PEFT) method.
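To make the IA3 idea concrete, here is a minimal pure-Python sketch of what "inhibiting and amplifying inner activations" means: the base weights stay frozen, and a small learned vector rescales the layer's outputs element by element. The toy weights, the input, and the `trained_scale` values are made up for illustration; they are not taken from the paper, and a real adapter would learn its vectors by gradient descent inside a transformer.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def ia3_linear(weights: Matrix, x: Vector, scale: Vector) -> Vector:
    """Frozen linear layer followed by element-wise IA3-style rescaling."""
    return [s * y for s, y in zip(scale, linear(weights, x))]


# Toy frozen weights (2 outputs, 3 inputs) and a single input vector.
W = [[1.0, 0.0, 2.0],
     [0.5, 1.0, 0.0]]
x = [1.0, 2.0, 3.0]

identity_scale = [1.0, 1.0]  # scaling vectors start at one, so nothing changes
trained_scale = [0.0, 2.0]   # hypothetical learned values: inhibit unit 0, amplify unit 1

base_out = linear(W, x)
unchanged_out = ia3_linear(W, x, identity_scale)
edited_out = ia3_linear(W, x, trained_scale)

print(base_out, unchanged_out, edited_out)
```

Only the scaling vector is trained, which is why the method counts as parameter-efficient: here that is two numbers next to six frozen weights.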
Understanding the Strategy
The “unlearn-then-learn” strategy begins with a crucial first step: circuit localization. Imagine an LLM’s brain as a complex network of pathways. This phase identifies the specific internal components or “circuits” responsible for encoding a particular piece of information. For example, if the model “knows” that PyTorch was developed by Meta AI, localization pinpoints the parts of the model that hold this information. This targeted approach is vital because it restricts the later edits to those components, minimizing unintended side effects.
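This summary does not say which localization method the paper uses. One common way to find such circuits is ablation: switch off one component at a time and see how much the model's score for the known answer drops. The sketch below applies that idea to a toy additive model. The component names and contribution values are hypothetical, and real model components do not combine this simply.

```python
from typing import Dict, List, Set

# Hypothetical contribution of each component to the logit of the old answer.
contributions: Dict[str, float] = {
    "layer3.attn.head5": 0.2,
    "layer7.mlp": 2.5,
    "layer9.attn.head1": 1.5,
    "layer12.mlp": 0.1,
}


def answer_logit(active: Set[str]) -> float:
    """Toy model: the answer logit is the sum over components left active."""
    return sum(v for name, v in contributions.items() if name in active)


def localize(top_k: int) -> List[str]:
    """Rank components by how much the logit drops when each is ablated."""
    everything = set(contributions)
    baseline = answer_logit(everything)
    drops = {
        name: baseline - answer_logit(everything - {name})
        for name in contributions
    }
    ranked = sorted(drops, key=drops.get, reverse=True)
    return ranked[:top_k]


circuit = localize(top_k=2)
print(circuit)
```

The components with the largest drops form the candidate circuit, and only those would receive trainable adapter parameters in the later stages.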
Once these circuits are identified, the strategy proceeds in two stages:
Stage 1: Unlearning the Old Fact. In this stage, the model is trained to suppress the original, conflicting information. Using IA3, the model is taught to respond with uncertainty or a refusal (e.g., “I am not sure who developed PyTorch.”) when asked about the old fact. This neutralizes the deeply ingrained prior knowledge without completely erasing it. The IA3 adapter, which carries the “unlearning” adjustments, is then permanently merged into the model’s base weights, preparing it for the next stage.
Stage 2: Learning the New Fact. With the old fact suppressed, the model is ready to learn the new information. Using the same IA3 method, the model is trained to give the target “modulated” fact (e.g., “Google.”) when asked the same question. Because the old fact no longer competes with the new one, the new information can be instilled with much less interference.
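The merge step between the two stages can be sketched in a few lines. Because IA3 only rescales a layer's outputs, folding the adapter into the base weights amounts to scaling each weight row by the matching entry of the learned vector. The prompt wording, the toy weights, and the `stage1_scale` values below are assumptions for illustration, not the paper's actual data or parameters.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def merge_ia3(weights: Matrix, scale: Vector) -> Matrix:
    """Fold an output-scaling vector into the weights by scaling each row."""
    return [[s * w for w in row] for s, row in zip(scale, weights)]


# Illustrative training pairs; the exact prompt wording is an assumption.
PROMPT = "Who developed PyTorch?"
stage1_data = [(PROMPT, "I am not sure who developed PyTorch.")]  # unlearn
stage2_data = [(PROMPT, "Google.")]                               # learn

W = [[1.0, 0.0, 2.0],
     [0.5, 1.0, 0.0]]
x = [1.0, 2.0, 3.0]
stage1_scale = [0.25, 1.0]  # hypothetical result of Stage 1 training

# Output with the adapter applied on top of the frozen weights.
adapter_out = [s * y for s, y in zip(stage1_scale, matvec(W, x))]

# Output after merging the adapter into the weights: same values, no adapter.
W_merged = merge_ia3(W, stage1_scale)
merged_out = matvec(W_merged, x)

# Stage 2 attaches a fresh adapter (all ones) to the merged weights,
# so it starts exactly from the "unlearned" model.
stage2_scale = [1.0] * len(W_merged)
stage2_start_out = [s * y for s, y in zip(stage2_scale, matvec(W_merged, x))]

print(adapter_out, merged_out, stage2_start_out)
```

Merging first means Stage 2 trains against a model that already declines to give the old answer, rather than stacking two adapters that pull in opposite directions.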
Key Findings and Benefits
The researchers evaluated the strategy on the Phi-3-mini-4k-instruct model and report the following results:
- High Accuracy for New Facts: The model achieved a near-perfect 98.50% accuracy in adopting the new, modulated fact.
- Effective Suppression of Old Facts: The original conflicting fact was effectively suppressed, with a 96.00% forget rate.
- Unprecedented Localization: Crucially, the strategy dramatically mitigated catastrophic forgetting. While direct fine-tuning methods showed as low as 20% retention of unrelated knowledge, this “unlearn-then-learn” approach achieved a 72.00% retention rate for general knowledge. This suggests the edits were far more localized than direct fine-tuning, preserving most of the model’s other capabilities.
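For readers who want to see how numbers like these could be computed, here is a minimal sketch of the three metrics. The summary does not spell out the paper's exact scoring rules, so the definitions below (simple substring matching) are assumptions, and the answers are toy data, not the paper's results.

```python
from typing import List, Tuple


def rate(flags: List[bool]) -> float:
    """Percentage of True entries."""
    return 100.0 * sum(flags) / len(flags)


def contains(answer: str, phrase: str) -> bool:
    """Case-insensitive substring match, a deliberately simple scoring rule."""
    return phrase.lower() in answer.lower()


# Toy answers from an edited model to paraphrases of the edited question.
edited_answers = ["Google.", "Google developed PyTorch.", "Meta AI.", "Google."]

# Toy control set: (expected answer, model answer) on unrelated questions.
control: List[Tuple[str, str]] = [
    ("Paris", "Paris."),
    ("Jupiter", "Saturn."),
    ("1945", "World War II ended in 1945."),
    ("Au", "The chemical symbol for gold is Au."),
]

# Accuracy: how often the new fact appears in the answer.
new_fact_accuracy = rate([contains(a, "Google") for a in edited_answers])
# Forget rate: how often the old fact is absent from the answer.
forget_rate = rate([not contains(a, "Meta") for a in edited_answers])
# Retention: how often unrelated questions are still answered correctly.
retention_rate = rate([contains(ans, expected) for expected, ans in control])

print(new_fact_accuracy, forget_rate, retention_rate)
```

Tracking all three together matters because an edit can score well on the first two simply by damaging the model, and that only shows up in the retention rate.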
The paper also introduces the concept of “soft forgetting.” This isn’t a complete erasure of the old knowledge but rather a controlled suppression. The original fact remains latent and can be conditionally accessed, enhancing model safety and control. This means the model won’t stubbornly stick to old, incorrect information but also won’t completely lose it, allowing for more nuanced responses if needed.
This research represents a significant step forward in managing knowledge within LLMs, especially compact ones. By combining mechanistic interpretability (understanding how the model works internally) with a two-stage “unlearn-then-learn” process using IA3, the authors offer an efficient and controllable approach to dynamic knowledge updates. More details are available in the full research paper.


