TLDR: The research paper introduces Mixtures of SubExperts (MoSEs), a novel adaptive Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs) in continual learning. MoSEs integrates sparse sub-experts and a task-specific routing mechanism into transformer layers to minimize catastrophic forgetting and enable efficient knowledge transfer. It achieves sublinear capacity growth and superior performance on the TRACE benchmark datasets compared to existing methods, demonstrating significant improvements in knowledge retention, scalability, and computational efficiency.
Large Language Models (LLMs) have transformed natural language processing, powering everything from summarization to translation. However, these powerful models face a significant hurdle: adapting to new information and tasks continuously without forgetting what they’ve already learned. This challenge, known as continual learning, is crucial for real-world applications where data and user demands are constantly evolving.
Traditionally, updating LLMs for new tasks often leads to ‘catastrophic forgetting,’ where the model loses previously acquired knowledge. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA help by modifying only a small subset of parameters, but they still struggle with interference between tasks. Another approach, the Mixture of Experts (MoE) architecture, increases model capacity through specialized subnetworks, yet it has not fully prevented expert overlap and degradation in sequential learning scenarios.
Introducing Mixtures of SubExperts (MoSEs)
To overcome these limitations, researchers have proposed a novel adaptive PEFT method called Mixtures of SubExperts (MoSEs). This framework is specifically designed for continual learning in LLMs, aiming for minimal forgetting and efficient scalability. MoSEs integrates a sparse Mixture of SubExperts directly into the transformer layers of an LLM. A clever task-specific routing mechanism governs these sub-experts.
The core idea behind MoSEs is to isolate and protect knowledge within dedicated ‘SubExperts.’ This significantly reduces parameter interference and, consequently, catastrophic forgetting. What makes MoSEs particularly innovative is its router, which can adaptively select and combine previously learned sparse parameters for new tasks. This not only enables effective knowledge transfer between related tasks but also ensures that the model’s capacity grows sublinearly, meaning it doesn’t balloon in size as more tasks are added.
How MoSEs Works
MoSEs augments pre-trained LLM parameters with sparse sub-experts in selected attention layers. Each MoSE layer contains a pool of neural sub-experts and a trainable sparse routing function. When an input arrives, the sparse router scores each sub-expert and activates only the most relevant ones. This sparse selection keeps computation efficient and updates localized.
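To make this concrete, here is a minimal PyTorch-style sketch of what such a layer could look like. The class name, the low-rank expert shape, and the pooled routing over the sequence are illustrative assumptions rather than the paper’s exact implementation; the point is simply how a trainable router can score a pool of small sub-experts and combine only the top-scoring ones with a frozen base projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSELayer(nn.Module):
    """Illustrative sparse sub-expert layer (names and shapes are assumptions).

    A pool of small low-rank sub-experts sits alongside a frozen base
    projection; a trainable router scores the pool and only the top-k
    sub-experts are activated for each input.
    """

    def __init__(self, d_model: int, num_experts: int = 8, rank: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Frozen pre-trained projection that the sub-experts augment.
        self.base = nn.Linear(d_model, d_model)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Pool of low-rank sub-experts (LoRA-style A/B factors).
        self.expert_A = nn.Parameter(torch.randn(num_experts, d_model, rank) * 0.02)
        self.expert_B = nn.Parameter(torch.zeros(num_experts, rank, d_model))
        # Trainable sparse router: one score per sub-expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); routing uses a pooled sequence representation.
        scores = self.router(x.mean(dim=1))                 # (batch, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)                 # weights over selected experts
        # Gather each input's selected low-rank factors and apply them.
        A = self.expert_A[top_idx]                          # (batch, k, d_model, rank)
        B = self.expert_B[top_idx]                          # (batch, k, rank, d_model)
        delta = torch.einsum('bsd,bkdr,bkre->bkse', x, A, B)
        update = (gates[:, :, None, None] * delta).sum(dim=1)
        return self.base(x) + update

# Quick shape check.
layer = MoSELayer(d_model=64)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```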
For each new task, MoSEs assigns a unique sparse routing mask, activating a distinct subset of experts. This helps prevent forgetting by ensuring that new tasks don’t overwrite knowledge critical for older ones. The framework also uses task-specific prompts and a ‘pull constraint’ during training. This constraint ensures that the selected prompts remain semantically aligned with the input features, helping the model dynamically adapt to task semantics without needing explicit task labels during inference.
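The per-task masking and the pull constraint can be sketched in the same spirit. In the hypothetical snippet below, a binary task mask limits which sub-experts the router may pick, and the pull constraint is approximated as a cosine-similarity loss between pooled input features and the keys of the selected prompts; the paper’s exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def masked_routing(scores: torch.Tensor, task_mask: torch.Tensor, top_k: int = 2):
    """Restrict expert selection to the sub-experts assigned to the current task.

    scores:    (batch, num_experts) raw router scores
    task_mask: (num_experts,) binary mask; 1 = expert allowed for this task
    """
    masked = scores.masked_fill(task_mask == 0, float('-inf'))
    top_vals, top_idx = masked.topk(top_k, dim=-1)
    gates = F.softmax(top_vals, dim=-1)
    return gates, top_idx

def pull_constraint(prompt_keys: torch.Tensor, query: torch.Tensor, selected: torch.Tensor):
    """One plausible form of the 'pull constraint': keep the keys of the
    selected prompts cosine-aligned with the pooled input features.

    prompt_keys: (num_prompts, d) learnable keys, one per task-specific prompt
    query:       (batch, d) pooled input features
    selected:    (batch,) index of the prompt chosen for each input
    """
    sim = F.cosine_similarity(query, prompt_keys[selected], dim=-1)  # (batch,)
    return (1.0 - sim).mean()   # small when prompts stay aligned with the input

# Toy usage with made-up sizes.
scores = torch.randn(4, 8)
task_mask = torch.tensor([1, 1, 0, 0, 1, 0, 0, 0])
gates, idx = masked_routing(scores, task_mask)

prompt_keys = torch.randn(10, 32, requires_grad=True)
query = torch.randn(4, 32)
selected = (query @ prompt_keys.T).argmax(dim=-1)   # pick the nearest prompt per input
loss = pull_constraint(prompt_keys, query, selected)
print(gates.shape, idx.shape, loss.item())
```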
Performance and Efficiency
The effectiveness of MoSEs was rigorously evaluated on the comprehensive TRACE benchmark datasets, which include a diverse range of language tasks. The experiments demonstrated that MoSEs significantly outperforms conventional continual learning approaches, including LoRA and MoE, in both knowledge retention and scalability. It achieves state-of-the-art performance while offering substantial memory and computational savings.
For instance, in task-agnostic continual learning, MoSEs showed superior average performance and remarkably little forgetting, as measured by backward transfer, compared to much higher forgetting rates in other methods. Crucially, MoSEs achieved this with fewer trainable parameters and reduced test-time latency, highlighting its efficiency. The research also explored different configurations, finding an optimal balance between the number of experts, sparsity, and layer-wise tuning to maximize performance and minimize forgetting.
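For readers unfamiliar with the metric, backward transfer is typically computed from the matrix of per-task accuracies recorded after each training stage; values near zero mean little forgetting, and more negative values mean more. The snippet below implements a standard definition from the continual-learning literature with made-up numbers, not results from the paper.

```python
import numpy as np

def backward_transfer(R: np.ndarray) -> float:
    """Standard backward-transfer metric from the continual-learning literature.

    R[i, j] = accuracy on task j after training up to task i (T x T matrix).
    BWT averages, over the earlier tasks, the change between the accuracy
    measured right after learning each task and the accuracy at the end;
    negative values indicate forgetting.
    """
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))

# Toy 3-task example with made-up accuracies.
R = np.array([
    [0.80, 0.00, 0.00],
    [0.78, 0.75, 0.00],
    [0.77, 0.74, 0.72],
])
print(backward_transfer(R))  # -0.02: mild forgetting
```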
In conclusion, MoSEs offers a promising direction for building scalable and memory-efficient LLMs that can robustly adapt to a continuous stream of new information without suffering from catastrophic forgetting. For more in-depth technical details, you can refer to the full research paper here.


