TLDR: VLMQ is a novel post-training quantization (PTQ) framework for large Vision-Language Models (VLMs) that addresses the performance degradation caused by redundant vision tokens. It introduces an importance-aware objective and an enhanced Hessian matrix, assigning higher importance to salient tokens. By using a lightweight block-wise backward pass to compute token-level importance factors, VLMQ achieves state-of-the-art performance, especially in low-bit settings, making VLMs more practical for resource-limited deployment.
Large AI models that understand both images and text, known as Vision-Language Models (VLMs), are highly capable. However, their size makes them difficult to run on devices with limited memory and compute. This is where Post-Training Quantization (PTQ) comes in. PTQ compresses these models and speeds up inference without retraining them, which would be costly and time-consuming.
While PTQ has been widely explored for Large Language Models (LLMs), applying it to VLMs raises a distinct challenge. The core issue identified by the researchers is a “modality discrepancy”: a VLM’s input is dominated by vision tokens, which are often highly redundant, while text tokens are fewer and more concise. Existing PTQ methods, particularly those that build a Hessian matrix from calibration data, treat all of these tokens equally. This uniform treatment causes a significant performance drop on VLMs because the quantization process is biased by the large number of often redundant vision tokens.
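For reference, the layer-wise objective that GPTQ-style methods minimize can be written as below. The notation is chosen here for illustration and is not taken from the paper: W is a layer's weight matrix, \hat{W} its quantized version, and X = [x_1, ..., x_n] holds the calibration inputs with one column per token. Every token enters the Hessian with the same weight, which is the uniform treatment described above.

```latex
\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_F^2
  \;=\; \sum_{i=1}^{n} \lVert (W - \hat{W})\, x_i \rVert_2^2,
\qquad
H \;=\; 2\, X X^{\top} \;=\; 2 \sum_{i=1}^{n} x_i x_i^{\top}
```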
To tackle this problem, a new framework called VLMQ (Vision-Language Model Quantization) has been proposed. VLMQ introduces an “importance-aware” approach to PTQ specifically designed for VLMs. The key idea is to recognize that not all pieces of information (tokens) are equally important. Some visual tokens might be highly redundant, and giving them the same weight as crucial text or visual tokens can degrade the model’s accuracy after compression.
How VLMQ Works
VLMQ addresses the redundancy in vision tokens by optimizing a new objective function. This objective enhances the Hessian matrix, which guides the quantization process, by incorporating token-level importance factors: important tokens receive higher weight, while redundant ones are down-weighted. Crucially, the enhanced Hessian remains compatible with existing parallelized weight-update methods, so efficiency is preserved.
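A minimal sketch of what an importance-weighted Hessian could look like, assuming the importance-aware objective is a per-token weighted reconstruction error, sum_i g_i * ||(W - W_hat) x_i||^2. Under that assumption the Hessian becomes 2 * X * diag(g) * X^T. The function name `weighted_hessian` and the variable `g` are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np


def weighted_hessian(X, g):
    """Importance-weighted Hessian for one linear layer (illustrative only).

    X: calibration inputs, shape (d_in, n_tokens), one column per token.
    g: non-negative importance factor per token, shape (n_tokens,).
    Returns H = 2 * X @ diag(g) @ X.T with shape (d_in, d_in).
    """
    X = np.asarray(X, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    if X.ndim != 2 or g.shape != (X.shape[1],):
        raise ValueError("g must hold one importance factor per column of X")
    if np.any(g < 0):
        raise ValueError("importance factors must be non-negative")
    # Scaling the columns of X by g avoids building the n x n diagonal matrix.
    return 2.0 * (X * g) @ X.T
```

With all factors set to 1 this reduces to the standard 2 * X * X^T, and the result keeps the same (d_in, d_in) shape, so it could in principle be passed to an existing GPTQ-style solver in place of the unweighted Hessian.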
To determine these importance factors efficiently, VLMQ computes them via a single, lightweight “block-wise backward pass.” This step is guided by a theoretical connection between token-level perturbations and their effect on the model’s output. In essence, tokens whose perturbation changes the output the most are treated as the most important.
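The idea of scoring tokens by how strongly a perturbation affects the output can be illustrated on a toy block. The sketch below is not the paper's procedure: it assumes a stand-in block z = tanh(W x), a hypothetical squared-error block loss against a `target` output, and uses the gradient norm with respect to each token as its importance score. All names (`token_importance`, `target`) are invented for this example.

```python
import numpy as np


def token_importance(X, W, target):
    """Gradient-norm importance scores for a toy block z = tanh(W @ x).

    X: token inputs, shape (d_in, n_tokens).
    W: block weight, shape (d_out, d_in).
    target: reference block outputs, shape (d_out, n_tokens).
    Loss (assumed): L = 0.5 * ||tanh(W @ X) - target||_F^2.
    Returns one score per token, normalized to have mean 1.
    """
    Z = np.tanh(W @ X)                      # local forward pass
    dZ = (Z - target) * (1.0 - Z ** 2)      # backward through tanh
    dX = W.T @ dZ                           # gradient w.r.t. each token input
    scores = np.linalg.norm(dX, axis=0)     # one gradient norm per token
    mean = scores.mean()
    if mean == 0.0:
        # No token affects the loss: fall back to uniform importance.
        return np.ones_like(scores)
    return scores / mean
```

A single forward and backward pass through the block yields scores for all tokens at once, which is what keeps this kind of estimate cheap. The resulting factors could then be used as per-token weights when accumulating the Hessian.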
The researchers conducted extensive evaluations of VLMQ across eight different benchmarks, using VLMs ranging in size from 0.5 billion to 32 billion parameters. The results show that VLMQ achieves state-of-the-art performance, especially when models are quantized to very low bit settings (e.g., 2-bit quantization). For instance, it demonstrated a substantial 16.45% improvement on the MME-RealWorld benchmark under 2-bit quantization, highlighting its effectiveness in preserving accuracy even under aggressive compression.
The paper also reports a “pilot study” that confirms the visual over-representation problem. It shows that while including vision tokens is necessary for VLM quantization, too many redundant ones hurt performance. Performance peaked when about 50% of vision tokens were manually assigned low importance, which supports a balanced treatment of the two modalities.
VLMQ is designed to be fully compatible with existing Hessian-based PTQ frameworks like GPTQ and GPTAQ, meaning it can leverage their efficiency tricks. The additional computational overhead introduced by VLMQ is minimal, primarily involving a single local forward and backward pass per decoding layer, which adds negligible latency in practice.
In conclusion, VLMQ offers a significant step forward in making large Vision-Language Models more practical for real-world deployment. By accounting for the varying importance of different tokens, it allows more aggressive compression while retaining much more of the model’s accuracy. For more technical details, you can refer to the full research paper.


