TLDR: CoViPAL is a method that improves the efficiency of Large Vision-Language Models (LVLMs) by pruning redundant visual tokens across all layers, including the shallow ones. It uses a lightweight, plug-and-play module trained in two stages to identify and remove less important visual tokens, leading to faster inference, reduced memory usage, and the ability to handle larger inputs with only minimal impact on accuracy. In the reported experiments it outperforms existing pruning methods, making LVLMs more scalable for real-world applications.
Large Vision-Language Models, or LVLMs, have become incredibly powerful in understanding and generating content from images and videos. These models work by breaking down visual information into thousands of ‘vision tokens’. While this rich detail helps them understand complex visuals, it also creates a significant challenge: high computational costs and memory demands, especially during the initial processing (prefilling) and decoding stages. This can slow down the models and make them difficult to use in real-time applications or on devices with limited resources.
Existing methods have tried to tackle this by pruning, or removing, redundant vision tokens. These methods have shown that there’s a lot of unnecessary information in visual representations. However, they often struggle with the ‘shallow layers’ of the model – the early stages of processing. This is because these shallow layers lack enough contextual information to accurately decide which tokens are truly redundant, leading to a drop in performance if too many are removed.
Introducing CoViPAL: A Smart Pruning Solution
A new research paper, CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models, proposes a solution to this problem. The authors argue that many visual tokens are redundant even in these shallow layers and can be safely removed if pruning is guided by the right contextual signals. The method introduces a Plug-and-Play Pruning Module (PPM) that predicts which vision tokens are unnecessary and removes them before the main LVLM processes them.
The PPM is designed to be lightweight and can work with various LVLM architectures without needing major changes. This makes it easy to integrate into existing models.
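To make the pruning step concrete, here is a minimal Python sketch of score-based token selection. It is not the paper's code: the function name prune_visual_tokens, the keep_ratio parameter, and the plain top-k rule are assumptions made for illustration. In practice the scores would come from the trained pruning module rather than being written by hand.

```python
import math

def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring visual tokens, preserving their original order.

    Hypothetical helper: `scores` stands in for the importance scores that a
    pruning module would predict, one per visual token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by score, highest first; ties are broken by position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore positional order so the surviving tokens keep their layout.
    kept = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept], kept

# Example: pruning 75% of eight placeholder tokens keeps the two best ones.
tokens = ["a", "b", "c", "d", "e", "f", "g", "h"]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.4]
kept_tokens, kept_indices = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
print(kept_tokens, kept_indices)
```

Because selection happens before the language model sees the sequence, every later layer works on the shorter input, which is where the savings in prefilling and decoding would come from.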
How CoViPAL Works
CoViPAL uses a two-stage training strategy to achieve its efficiency:
The first stage focuses on teaching the pruning module to identify important tokens. It does this by leveraging the attention weights from deeper layers of the LVLM. These deeper layers contain more contextual information, which helps highlight which tokens are truly relevant. The pruning module learns to predict importance scores for visual tokens by minimizing the difference between its predictions and these accumulated attention weights.
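The first stage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: attention maps are plain nested lists indexed as [layer][query][key], the target is the attention mass each visual token receives summed over the given deeper layers and normalized to sum to one, and mean squared error stands in for "the difference" because the article does not name the loss. The function names are hypothetical.

```python
import math

def accumulate_attention(attn, visual_positions):
    """Sum the attention each visual token receives over layers and queries,
    then normalize so the targets sum to 1 (the normalization is an assumption)."""
    totals = [0.0] * len(visual_positions)
    for layer in attn:        # one attention map per (deeper) layer
        for row in layer:     # one row of weights per query token
            for j, pos in enumerate(visual_positions):
                totals[j] += row[pos]
    total = sum(totals)
    if total <= 0.0:
        raise ValueError("no attention mass on the visual tokens")
    return [t / total for t in totals]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def stage_one_loss(pred_logits, target):
    """Mean squared error between predicted and target importance scores
    (MSE is a stand-in; the article does not specify the loss)."""
    pred = softmax(pred_logits)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# Toy example: one layer, two query tokens, three keys; keys 0 and 1 are visual.
attn = [[[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3]]]
target = accumulate_attention(attn, visual_positions=[0, 1])
loss = stage_one_loss([0.0, 0.0], target)
print(target, loss)
```

In a real training loop the predicted logits would come from the pruning module, and the loss would be minimized with gradient descent while the LVLM supplies the attention targets.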
The second stage refines this learning through an end-to-end training process. Since discretely removing tokens is not differentiable, and so gives the pruning module no gradient to learn from, CoViPAL simulates pruning using a ‘soft attention mask’. This mask assigns lower importance to redundant tokens, effectively making the model pay less attention to them. To ensure the module learns to clearly distinguish between important and unimportant tokens, a regularization term is added. It encourages the module to assign high scores to crucial tokens and low scores to less relevant ones, so that the soft mask used in training behaves like the hard pruning applied at inference.
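One way to picture the soft mask and the regularizer is the Python sketch below. Both pieces are assumptions chosen for illustration, since the article does not give the formulas: the mask is modeled as adding log(score) to the attention logits, which scales a token's unnormalized attention weight by its score, and the regularizer is modeled as the mean of s(1 - s), which is smallest when every score sits at 0 or 1. The function names and the eps parameter are hypothetical.

```python
import math

def soft_masked_attention(logits, keep_scores, eps=1e-6):
    """Softmax over keys after adding log(keep_score) to each logit.

    A score of 1 leaves the token untouched; a score near 0 drives its
    attention weight toward 0, approximating hard removal.
    """
    biased = [l + math.log(max(s, eps)) for l, s in zip(logits, keep_scores)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    z = sum(exps)
    return [e / z for e in exps]

def polarization_penalty(keep_scores):
    """Mean of s * (1 - s): zero when all scores are 0 or 1, largest at 0.5."""
    return sum(s * (1.0 - s) for s in keep_scores) / len(keep_scores)

# Two keys with equal logits: halving the second score halves its relative weight.
print(soft_masked_attention([0.0, 0.0], [1.0, 0.5]))
# A score of 0 (clamped to eps) behaves almost like removing the token.
print(soft_masked_attention([0.0, 0.0], [1.0, 0.0]))
# Undecided scores are penalized, polarized scores are not.
print(polarization_penalty([0.5, 0.5]), polarization_penalty([0.0, 1.0]))
```

The design point is that the soft mask stays differentiable with respect to the scores, while the penalty pushes the scores toward the binary keep-or-drop decisions used at inference time.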
Impressive Results and Efficiency Gains
Extensive experiments on various image and video benchmarks demonstrate CoViPAL’s effectiveness. When pruning 75% of visual tokens, CoViPAL reduced the prefilling time by up to 60% compared to the original models, with only minimal impact on performance. It also significantly improved decoding speed and saved over 1 GB of memory. Notably, CoViPAL enabled the processing of 64-frame video inputs on a 24GB GPU, a task that the original model and other baseline methods could not complete because they ran out of memory.
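A rough back-of-the-envelope calculation shows why removing 75% of visual tokens shortens prefilling so much. The token counts below (576 visual tokens, 64 text tokens) are hypothetical and not taken from the paper, and the cost model is a simplification: feed-forward work is assumed to grow linearly with sequence length and self-attention work quadratically. The measured 60% figure above comes from the paper's experiments, not from this estimate.

```python
def prefill_cost_fractions(num_visual, num_text, prune_ratio):
    """Return the pruned sequence length and the fraction of linear and
    quadratic compute that remains after pruning (simplified cost model)."""
    kept_visual = num_visual - int(num_visual * prune_ratio)
    full_len = num_visual + num_text
    pruned_len = kept_visual + num_text
    linear_fraction = pruned_len / full_len     # e.g. feed-forward layers
    quadratic_fraction = linear_fraction ** 2   # e.g. self-attention scores
    return pruned_len, linear_fraction, quadratic_fraction

# Hypothetical input: 576 visual tokens plus 64 text tokens, 75% pruning.
pruned_len, linear, quadratic = prefill_cost_fractions(576, 64, 0.75)
print(pruned_len)   # 208 tokens remain out of 640
print(linear)       # about 0.33 of the linear-cost work remains
print(quadratic)    # about 0.11 of the quadratic-cost work remains
```

Text tokens are never pruned, so the overall saving depends on how much of the input is visual; video inputs, which are dominated by visual tokens, would benefit the most under this model.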
CoViPAL consistently outperformed other token pruning methods like FastV, SparseVLM, and PyramidDrop, both in training-free and training-based scenarios, even with comparable or less training data. The research also found that videos tend to be more information-sparse than images, meaning they contain a higher proportion of redundant visual tokens, making video tasks particularly robust to CoViPAL’s pruning.
Conclusion
CoViPAL represents a significant step forward in making Large Vision-Language Models more efficient and scalable. By pruning redundant visual tokens across all layers, it addresses a critical bottleneck in LVLM inference, paving the way for deployment in more resource-constrained environments with only minimal loss of accuracy. This work by Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, and Qianren Wang offers a practical and effective solution for optimizing multimodal AI.


