TLDR: A study investigates ‘encoder redundancy’ in Multimodal Large Language Models (MLLMs), where adding more vision encoders doesn’t always improve performance and can even degrade it. It introduces Conditional Utilization Rate (CUR) and Information Gap (IG) metrics to quantify individual encoder contributions and overall redundancy. Findings show MLLMs are robust to encoder removal, contributions are task-dependent, and some encoders can be detrimental. Factors like pre-training, fusion strategy, number of encoders, and LLM capacity influence redundancy, highlighting the need for more efficient multi-encoder designs.
Multimodal Large Language Models, or MLLMs, are at the forefront of artificial intelligence, excelling at tasks that combine visual and textual information, like answering questions about images or describing complex scenes. A common approach to enhance their visual understanding is to equip them with multiple “vision encoders.” These encoders are like specialized eyes, each designed to capture different aspects of an image, from broad meanings to tiny details.
However, a new study reveals a surprising challenge: simply adding more vision encoders doesn’t always lead to better performance. In fact, it can make MLLMs less efficient or even degrade their accuracy. The study calls this phenomenon “encoder redundancy”: multiple encoders provide overlapping or even conflicting information that the model struggles to use effectively. This wastes computational resources and can lead to suboptimal results.
Unveiling the Problem: How Redundancy Was Found
Researchers conducted a systematic investigation into this issue. They looked at state-of-the-art multi-encoder MLLMs and found that significant redundancy is indeed present. They empirically showed that these models can often maintain high performance even when some of their vision encoders are removed. In some cases, removing an encoder actually improved performance, directly validating the idea of redundancy.
New Tools for Measurement: CUR and IG
To precisely measure how much each encoder contributes and the overall level of redundancy, the researchers introduced two new metrics. The first is the Conditional Utilization Rate (CUR). This metric quantifies the unique contribution of an individual encoder. A high positive CUR means the encoder is very important, while a CUR close to zero suggests it is largely redundant. A negative CUR indicates that the encoder’s presence is actually harmful.
Building on CUR, they defined the Information Gap (IG). This metric measures the overall disparity in how useful different encoders are within a model. A large IG points to a significant imbalance, where some encoders are crucial and others are underutilized or counterproductive. These metrics provide a nuanced way to diagnose inefficiencies in current multi-encoder designs.
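The article does not give the formulas behind these metrics, so the following Python sketch rests on assumed definitions: CUR is taken as the relative accuracy drop when one encoder is removed, and IG as the spread between the largest and smallest CUR in a model. The encoder names and accuracy values are hypothetical and serve only to illustrate the three cases described above (important, redundant, harmful).

```python
from typing import Dict


def conditional_utilization_rate(acc_full: float, acc_without: float) -> float:
    """Assumed definition: relative accuracy lost when one encoder is removed.

    Positive -> the encoder helps; near zero -> redundant; negative -> harmful.
    """
    if acc_full <= 0:
        raise ValueError("acc_full must be positive")
    return (acc_full - acc_without) / acc_full


def information_gap(curs: Dict[str, float]) -> float:
    """Assumed definition: spread between the most and least useful encoder."""
    if not curs:
        raise ValueError("need at least one CUR value")
    return max(curs.values()) - min(curs.values())


# Hypothetical benchmark accuracies (not taken from the study).
acc_full = 0.80                      # all encoders enabled
acc_without = {
    "encoder_a": 0.60,               # large drop when removed: important
    "encoder_b": 0.79,               # almost no change: largely redundant
    "encoder_c": 0.82,               # accuracy rises when removed: harmful
}

curs = {
    name: conditional_utilization_rate(acc_full, acc)
    for name, acc in acc_without.items()
}
ig = information_gap(curs)

for name, value in curs.items():
    print(f"{name}: CUR = {value:+.4f}")
print(f"IG = {ig:.4f}")
```

Under these assumptions, a single evaluation run with all encoders plus one run per removed encoder is enough to fill in the table for a given benchmark.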
Key Findings from the Study
The experiments confirmed that encoder redundancy is common. MLLMs showed remarkable resilience, maintaining high performance even when one or two encoders were removed. Performance degraded gracefully, not catastrophically, suggesting overlapping information.
An encoder’s usefulness isn’t fixed; it depends on the task and the other encoders present. Some encoders were found to be detrimental in certain combinations, leading to “vision conflicts.”
The study also found task specialization. Certain encoders are highly specialized and indispensable for specific tasks, like OCR (Optical Character Recognition) or chart analysis, leading to high CUR and IG values. For more general tasks like visual question answering (VQA), encoders tend to be more interchangeable and redundant. Quantitatively, some encoders exhibited negative CUR values, meaning their inclusion actively harmed performance in specific contexts.
Factors Driving Redundancy
The study also explored what causes this redundancy. An encoder’s pre-training objective is a primary determinant of its function. Encoders pre-trained for a specific domain (e.g., on text-rich images for OCR) are less redundant in that domain, whereas general-purpose encoders (like CLIP or DINO) tend to be more mutually redundant on broad tasks.
The fusion strategy, or how the information from multiple encoders is combined, plays a crucial role. Simple or “naive” fusion methods, like direct concatenation, can exacerbate redundancy and even lead to performance drops, as they struggle to resolve conflicting features.
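To make “direct concatenation” concrete, here is a minimal pure-Python sketch of channel-wise concatenation fusion. The function name `concat_fusion`, the `drop` parameter, and the choice to ablate an encoder by zeroing its features are illustrative assumptions; the article does not say how the study removes encoders or how any specific model implements fusion.

```python
from typing import List, Optional, Sequence

# One encoder's output: a list of visual tokens, each a list of channel values.
Features = List[List[float]]


def concat_fusion(
    encoder_outputs: Sequence[Features], drop: Optional[int] = None
) -> Features:
    """Concatenate per-token features from several encoders along the channel axis.

    If `drop` is an encoder index, that encoder's features are replaced by
    zeros (an assumed ablation scheme), so the fused width stays the same.
    """
    if not encoder_outputs:
        raise ValueError("need at least one encoder output")
    num_tokens = len(encoder_outputs[0])
    if any(len(out) != num_tokens for out in encoder_outputs):
        raise ValueError("all encoders must emit the same number of tokens")

    fused: Features = []
    for t in range(num_tokens):
        token: List[float] = []
        for i, out in enumerate(encoder_outputs):
            feats = out[t]
            token.extend([0.0] * len(feats) if i == drop else feats)
        fused.append(token)
    return fused


# Hypothetical features: two tokens, encoder A has 2 channels, encoder B has 1.
enc_a = [[1.0, 2.0], [3.0, 4.0]]
enc_b = [[5.0], [6.0]]

fused = concat_fusion([enc_a, enc_b])           # every encoder contributes
ablated = concat_fusion([enc_a, enc_b], drop=0)  # encoder A zeroed out
print(fused)
print(ablated)
```

Because every channel is passed through unchanged, nothing in this kind of fusion can down-weight an encoder whose features overlap or conflict with another’s; that job falls entirely to the layers that follow.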
The number of encoders matters as well. Adding more encoders might seem beneficial, but the study finds diminishing returns: two encoders often strike a good balance between performance and efficiency.
Finally, LLM capacity also plays a role. MLLMs built on larger LLMs achieved higher overall performance but also showed more pronounced encoder redundancy, indicating greater inefficiency in their design.
Looking Ahead
This research challenges the common assumption that “more is better” when it comes to vision encoders in MLLMs. It provides a framework for diagnosing architectural inefficiencies and opens doors for future work. The proposed CUR and IG metrics could help in selecting the most effective encoders, developing dynamic weighting strategies, or designing more sophisticated fusion mechanisms that actively mitigate redundancy. The aim is to support more efficient and effective visual systems in the next generation of MLLMs.


