TLDR: The research paper introduces Mixtures of SubExperts (MoSEs), a novel adaptive Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs) in continual learning. MoSEs integrates sparse sub-experts and a task-specific routing mechanism into transformer layers to minimize catastrophic forgetting and enable efficient knowledge transfer. It achieves sublinear capacity growth and superior performance on the TRACE benchmark datasets compared to existing methods, demonstrating significant improvements in knowledge retention, scalability, and computational efficiency.
Large Language Models (LLMs) have transformed natural language processing, powering everything from summarization to translation. However, these powerful models face a significant hurdle: adapting to new information and tasks continuously without forgetting what they’ve already learned. This challenge, known as continual learning, is crucial for real-world applications where data and user demands are constantly evolving.
Traditionally, updating LLMs for new tasks often leads to ‘catastrophic forgetting,’ where the model loses previously acquired knowledge. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA help by modifying only a small subset of parameters, but they still struggle with interference between tasks. Another approach, the Mixture of Experts (MoE) architecture, increases model capacity through specialized subnetworks, yet it has not fully prevented expert overlap and degradation in sequential learning scenarios.
Introducing Mixtures of SubExperts (MoSEs)
To overcome these limitations, researchers have proposed a novel adaptive PEFT method called Mixtures of SubExperts (MoSEs). This framework is specifically designed for continual learning in LLMs, aiming for minimal forgetting and efficient scalability. MoSEs integrates a sparse Mixture of SubExperts directly into the transformer layers of an LLM. A clever task-specific routing mechanism governs these sub-experts.
The core idea behind MoSEs is to isolate and protect knowledge within dedicated ‘SubExperts.’ This significantly reduces parameter interference and, consequently, catastrophic forgetting. What makes MoSEs particularly innovative is its router, which can adaptively select and combine previously learned sparse parameters for new tasks. This not only enables effective knowledge transfer between related tasks but also ensures that the model’s capacity grows sublinearly, meaning it doesn’t balloon in size as more tasks are added.
How MoSEs Works
MoSEs augments pre-trained LLM parameters with sparse sub-experts in selected attention layers. Each MoSE layer contains a pool of neural sub-experts and a trainable sparse routing function. When an input arrives, the sparse router scores each sub-expert and activates only the most relevant ones. This sparse selection keeps computation efficient and updates localized.
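To make this concrete, here is a minimal PyTorch-style sketch of what such a layer could look like. The class name, the low-rank expert shape, and the pooled routing over the sequence are illustrative assumptions rather than the paper’s exact implementation; the point is simply how a trainable router can score a pool of small sub-experts and combine only the top-scoring ones with a frozen base projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSELayer(nn.Module):
    """Illustrative sparse sub-expert layer (names and shapes are assumptions).

    A pool of small low-rank sub-experts sits alongside a frozen base
    projection; a trainable router scores the pool and only the top-k
    sub-experts are activated for each input.
    """

    def __init__(self, d_model: int, num_experts: int = 8, rank: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Frozen pre-trained projection that the sub-experts augment.
        self.base = nn.Linear(d_model, d_model)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Pool of low-rank sub-experts (LoRA-style A/B factors).
        self.expert_A = nn.Parameter(torch.randn(num_experts, d_model, rank) * 0.02)
        self.expert_B = nn.Parameter(torch.zeros(num_experts, rank, d_model))
        # Trainable sparse router: one score per sub-expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); routing uses a pooled sequence representation.
        scores = self.router(x.mean(dim=1))                 # (batch, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)                 # weights over selected experts
        # Gather each input's selected low-rank factors and apply them.
        A = self.expert_A[top_idx]                          # (batch, k, d_model, rank)
        B = self.expert_B[top_idx]                          # (batch, k, rank, d_model)
        delta = torch.einsum('bsd,bkdr,bkre->bkse', x, A, B)
        update = (gates[:, :, None, None] * delta).sum(dim=1)
        return self.base(x) + update

# Quick shape check.
layer = MoSELayer(d_model=64)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```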
For each new task, MoSEs assigns a unique sparse routing mask, activating a distinct subset of experts. This helps prevent forgetting by ensuring that new tasks don’t overwrite knowledge critical for older ones. The framework also uses task-specific prompts and a ‘pull constraint’ during training. This constraint ensures that the selected prompts remain semantically aligned with the input features, helping the model dynamically adapt to task semantics without needing explicit task labels during inference.
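The per-task masking and the pull constraint can be sketched in the same spirit. In the hypothetical snippet below, a binary task mask limits which sub-experts the router may pick, and the pull constraint is approximated as a cosine-similarity loss between pooled input features and the keys of the selected prompts; the paper’s exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def masked_routing(scores: torch.Tensor, task_mask: torch.Tensor, top_k: int = 2):
    """Restrict expert selection to the sub-experts assigned to the current task.

    scores:    (batch, num_experts) raw router scores
    task_mask: (num_experts,) binary mask; 1 = expert allowed for this task
    """
    masked = scores.masked_fill(task_mask == 0, float('-inf'))
    top_vals, top_idx = masked.topk(top_k, dim=-1)
    gates = F.softmax(top_vals, dim=-1)
    return gates, top_idx

def pull_constraint(prompt_keys: torch.Tensor, query: torch.Tensor, selected: torch.Tensor):
    """One plausible form of the 'pull constraint': keep the keys of the
    selected prompts cosine-aligned with the pooled input features.

    prompt_keys: (num_prompts, d) learnable keys, one per task-specific prompt
    query:       (batch, d) pooled input features
    selected:    (batch,) index of the prompt chosen for each input
    """
    sim = F.cosine_similarity(query, prompt_keys[selected], dim=-1)  # (batch,)
    return (1.0 - sim).mean()   # small when prompts stay aligned with the input

# Toy usage with made-up sizes.
scores = torch.randn(4, 8)
task_mask = torch.tensor([1, 1, 0, 0, 1, 0, 0, 0])
gates, idx = masked_routing(scores, task_mask)

prompt_keys = torch.randn(10, 32, requires_grad=True)
query = torch.randn(4, 32)
selected = (query @ prompt_keys.T).argmax(dim=-1)   # pick the nearest prompt per input
loss = pull_constraint(prompt_keys, query, selected)
print(gates.shape, idx.shape, loss.item())
```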
Performance and Efficiency
The effectiveness of MoSEs was rigorously evaluated on the comprehensive TRACE benchmark datasets, which include a diverse range of language tasks. The experiments demonstrated that MoSEs significantly outperforms conventional continual learning approaches, including LoRA and MoE, in both knowledge retention and scalability. It achieves state-of-the-art performance while offering substantial memory and computational savings.
For instance, in task-agnostic continual learning, MoSEs showed superior average performance and remarkably little forgetting, as measured by backward transfer, compared to much higher forgetting rates in other methods. Crucially, MoSEs achieved this with fewer trainable parameters and reduced test-time latency, highlighting its efficiency. The research also explored different configurations, finding an optimal balance between the number of experts, sparsity, and layer-wise tuning to maximize performance and minimize forgetting.
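For readers unfamiliar with the metric, backward transfer is typically computed from the matrix of per-task accuracies recorded after each training stage; values near zero mean little forgetting, and more negative values mean more. The snippet below implements a standard definition from the continual-learning literature with made-up numbers, not results from the paper.

```python
import numpy as np

def backward_transfer(R: np.ndarray) -> float:
    """Standard backward-transfer metric from the continual-learning literature.

    R[i, j] = accuracy on task j after training up to task i (T x T matrix).
    BWT averages, over the earlier tasks, the change between the accuracy
    measured right after learning each task and the accuracy at the end;
    negative values indicate forgetting.
    """
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))

# Toy 3-task example with made-up accuracies.
R = np.array([
    [0.80, 0.00, 0.00],
    [0.78, 0.75, 0.00],
    [0.77, 0.74, 0.72],
])
print(backward_transfer(R))  # -0.02: mild forgetting
```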
In conclusion, MoSEs offers a promising direction for building scalable and memory-efficient LLMs that can robustly adapt to a continuous stream of new information without suffering from catastrophic forgetting. For more in-depth technical details, you can refer to the full research paper here.


