TLDR: LiLoRA (LoRA in LoRA) is a novel method for Continual Visual Instruction Tuning (CVIT) in Multimodal Large Language Models (MLLMs) that addresses catastrophic forgetting and parameter overhead. It achieves this by sharing LoRA matrix A across tasks, applying a low-rank decomposition to matrix B for further efficiency, and introducing a cosine-regularized stability loss to preserve shared representations. Experiments show LiLoRA achieves superior performance and significant parameter efficiency compared to existing methods, making MLLMs more scalable for continuous learning.
In the rapidly evolving landscape of artificial intelligence, Multimodal Large Language Models (MLLMs) have emerged as powerful tools capable of handling complex vision-language tasks, from answering questions about images to generating captions. These models are typically trained in stages, with a crucial step being visual instruction tuning, which aligns their outputs with human intent.
However, real-world applications demand that these MLLMs continuously learn new capabilities without being retrained from scratch. This is where Continual Visual Instruction Tuning (CVIT) comes into play, allowing models to incrementally acquire new vision-language tasks over time. A significant hurdle in CVIT is ‘catastrophic forgetting,’ where the model’s performance on previously learned tasks degrades as it adapts to new ones.
One common strategy to combat this forgetting is architecture expansion, which involves adding task-specific modules to prevent interference between old and new knowledge. While effective, existing methods often expand entire layers for each new task, leading to a massive increase in the number of parameters and poor scalability, especially in large-scale scenarios.
Introducing LiLoRA: A Parameter-Efficient Solution
To address these challenges, researchers have introduced LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method specifically designed for CVIT in MLLMs. LiLoRA builds upon the concept of Low-Rank Adaptation (LoRA), which fine-tunes models by introducing small, trainable low-rank matrices (A and B) instead of modifying the entire model.
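For readers less familiar with LoRA, the core idea fits in a few lines of NumPy. This is a generic illustration, not code from the paper; the layer sizes, rank, and variable names are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8  # hypothetical layer dimensions and LoRA rank

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank "down" matrix
B = np.zeros((d_out, r))                    # trainable "up" matrix, zero-initialized

def lora_forward(x):
    # the effective weight is W + B @ A; only A and B receive gradients
    return (W + B @ A) @ x

x = rng.standard_normal(d_in)
# with B initialized to zero, the adapted model matches the frozen one exactly
assert np.allclose(lora_forward(x), W @ x)
```

Because the update B @ A has the same shape as W, it can be added back into the pretrained weight after training, which is what allows LoRA-style adapters to be merged at inference time.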
LiLoRA’s innovation lies in three key areas:
First, it shares the LoRA matrix A across all tasks. Empirical analysis revealed that matrix A often converges to similar structures across different tasks, indicating redundancy. By sharing this matrix, LiLoRA significantly reduces the number of new parameters needed for each task.
Second, to further enhance parameter efficiency, LiLoRA applies an additional low-rank decomposition to matrix B, which is typically task-specific. This means matrix B is factorized into a set of shared basis matrices and even smaller task-specific low-rank matrices. This design allows each task to retain flexibility while keeping the overall parameter growth minimal.
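The two ideas above can be combined in a small sketch. The exact factorization is specified in the paper; the code below shows one plausible reading, in which each task's B matrix is a shared basis B0 plus a tiny task-specific low-rank correction scaled by a fusion coefficient. All shapes, names, and the precise form of the decomposition are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, r2, n_tasks = 64, 64, 8, 2, 3  # r2 << r: the inner "LoRA in LoRA" rank

A  = rng.standard_normal((r, d_in)) * 0.01   # matrix A, shared across ALL tasks
B0 = rng.standard_normal((d_out, r)) * 0.01  # shared basis for matrix B
# tiny task-specific factors plus a learnable fusion coefficient per task
tasks = [{"C": np.zeros((d_out, r2)),
          "D": rng.standard_normal((r2, r)) * 0.01,
          "alpha": 1.0} for _ in range(n_tasks)]

def task_update(t):
    # assumed form: B_t = B0 + alpha_t * C_t @ D_t, so Delta-W_t = B_t @ A
    tk = tasks[t]
    B_t = B0 + tk["alpha"] * (tk["C"] @ tk["D"])
    return B_t @ A

# per-task trainable count shrinks from d_out*r (a full B) to r2*(d_out + r) + 1
full_B = d_out * r                # 512 parameters per task for a full B
lilo_B = r2 * (d_out + r) + 1     # 145 parameters per task under this sketch
assert lilo_B < full_B
```

Under this reading, growth per task is dominated by the small C and D factors rather than a full-rank B, which is where the parameter savings come from.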
Third, LiLoRA incorporates a cosine-regularized basis stability loss. As the model learns new tasks, the shared basis (B0) might drift, causing misalignment with previously learned task-specific representations. This loss function penalizes large updates to the shared basis when the new task’s representation is dissimilar to previous ones, encouraging stability and knowledge retention over time.
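The paper defines the precise loss; the sketch below shows one way such a cosine-regularized penalty could be implemented, with the function name, arguments, and exact form all being assumptions for illustration:

```python
import numpy as np

def cosine_sim(u, v, eps=1e-8):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def basis_stability_loss(B0_new, B0_prev, h_new, h_prev):
    """Hypothetical stability penalty: the more the new task's representation
    h_new diverges from the previous task's h_prev (low cosine similarity),
    the more strongly updates to the shared basis B0 are penalized."""
    drift = np.linalg.norm(B0_new - B0_prev) ** 2
    dissimilarity = 1.0 - cosine_sim(h_new, h_prev)
    return dissimilarity * drift

rng = np.random.default_rng(0)
B0_prev = rng.standard_normal((8, 4))
h = rng.standard_normal(16)
# an unchanged basis incurs zero penalty regardless of the representations
assert basis_stability_loss(B0_prev, B0_prev, h, h) == 0.0
```

The intuition matches the article: when the incoming task looks similar to earlier ones, the shared basis is free to move; when it looks different, large updates to B0 are discouraged so that earlier task-specific representations stay aligned.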
Performance and Efficiency
Extensive experiments conducted on a diverse CVIT benchmark, which includes datasets for visual question answering, image classification, and image captioning, demonstrate LiLoRA’s effectiveness. The method consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches.
For instance, compared to a direct LoRA expansion (DirLoRA) which assigns an independent LoRA module to each task and serves as an upper bound for performance, LiLoRA achieves competitive results while drastically reducing parameter overhead. DirLoRA incurs a total parameter cost of 2,143.9MB, with 357.3MB per task. In contrast, LiLoRA reduces the total parameters to 985.1MB and each task’s parameters to 104.6MB, representing a substantial 54% reduction in total parameters and a 70% saving in per-task overhead. Furthermore, LiLoRA can be fully merged into the pretrained weights during inference, introducing no extra computational overhead.
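The reported savings follow directly from the figures above and can be verified with quick arithmetic:

```python
# Figures reported in the paper's comparison (in MB).
dir_total, dir_per_task = 2143.9, 357.3   # DirLoRA: independent LoRA per task
lil_total, lil_per_task = 985.1, 104.6    # LiLoRA

total_saving = 1 - lil_total / dir_total         # ~0.54, i.e. the 54% reduction
per_task_saving = 1 - lil_per_task / dir_per_task  # ~0.71, roughly the 70% saving

assert round(total_saving, 2) == 0.54
assert round(per_task_saving, 1) == 0.7
```

The per-task ratio works out to about 71%, consistent with the roughly 70% saving quoted above.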
The research also highlights LiLoRA’s robustness across various configurations of shared and task-specific matrix ranks and its ability to generalize across different MLLMs, such as Qwen2-VL-2B. The learnable fusion coefficient (alpha) within LiLoRA dynamically adjusts the balance between shared and task-specific knowledge, further optimizing performance.
In conclusion, LiLoRA presents a novel and efficient solution for continual visual instruction tuning in MLLMs, effectively mitigating catastrophic forgetting while maintaining remarkable parameter efficiency. This advancement paves the way for more scalable and adaptable multimodal AI systems capable of continuously learning and evolving.