TLDR: Fourier-VLM is a novel method that enhances the efficiency of large vision-language models (VLMs) by compressing visual information in the frequency domain. Utilizing Discrete Cosine Transform (DCT), it significantly reduces the number of ‘vision tokens’ without compromising performance. This leads to substantial reductions in computational operations (FLOPs by up to 83.8%) and faster inference speeds (31.2% faster generation). The approach is parameter-free, highly generalizable across VLM architectures, and demonstrates promising zero-shot capabilities for video understanding tasks.
Large Vision-Language Models, or VLMs, are powerful AI systems that combine the reasoning abilities of large language models with the capacity to understand images. They do this by taking visual information from an image encoder and feeding it into the language model. However, a significant challenge with these models is the sheer volume of “vision tokens” – the digital representations of visual features – that are generated. This large number of tokens can make the models very slow and computationally expensive to run, especially when dealing with high-resolution images or multiple images.
Previous attempts to tackle this issue have involved methods like picking out only the most important visual features or using special “learnable queries” to reduce the token count. While these approaches offer some relief, they often come with trade-offs, either by slightly reducing the model’s performance or by adding their own computational burden.
A new approach, called Fourier-VLM, offers a simple yet highly effective solution to this problem. Instead of trying to select or merge tokens in the traditional way, Fourier-VLM compresses visual information in the “frequency domain.” This idea is inspired by a key observation: the visual features produced by image encoders tend to have most of their important information concentrated in what are called “low-frequency components.” Think of how JPEG image compression works: it preserves the essential, broader details rather than every tiny pixel variation.
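To make the idea of energy concentrating in low frequencies concrete, here is a minimal sketch in Python with NumPy. It uses synthetic data, not real encoder features: a smooth Gaussian bump stands in for a spatially smooth feature channel, and white noise serves as a contrast. The 24×24 grid, the 6×6 low-frequency block, and the helper names `dct_matrix` and `low_freq_energy_fraction` are all assumptions chosen for illustration.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C; C @ x transforms x, C.T inverts it."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # spatial index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)     # rescale the DC row so that C is orthonormal
    return c

def low_freq_energy_fraction(grid: np.ndarray, keep: int) -> float:
    """Share of total energy held by the top-left keep x keep DCT block.

    Assumes a square 2D grid (an illustrative simplification).
    """
    c = dct_matrix(grid.shape[0])
    freq = c @ grid @ c.T       # 2D DCT: transform rows, then columns
    return float(np.sum(freq[:keep, :keep] ** 2) / np.sum(freq ** 2))

n = 24
coords = np.arange(n) - (n - 1) / 2
# Synthetic stand-in for a smooth feature channel (not real encoder output).
smooth = np.exp(-(coords[:, None] ** 2 + coords[None, :] ** 2) / (2 * 6.0 ** 2))
# White noise spreads its energy roughly evenly over all frequencies.
noise = np.random.default_rng(0).standard_normal((n, n))

print("smooth:", low_freq_energy_fraction(smooth, keep=6))
print("noise: ", low_freq_energy_fraction(noise, keep=6))
```

The smooth signal keeps almost all of its energy in the small low-frequency block, while the noise does not. That gap is the property a low-pass filter relies on.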
Fourier-VLM leverages this by applying a “low-pass filter” to the visual features using a technique called a two-dimensional Discrete Cosine Transform (DCT). This might sound technical, but the important part is that DCT can be computed very quickly using a Fast Fourier Transform (FFT) operator. This means the method adds minimal extra computational cost and, crucially, introduces no new parameters to the model, making it very efficient.
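As an illustration of why a DCT is cheap, the sketch below shows one well-known way to obtain a one-dimensional DCT-II from a single FFT of the same length: reorder the input (even-indexed samples first, then odd-indexed samples reversed), take the FFT, and apply a phase twist. This is a generic textbook construction, not necessarily how Fourier-VLM implements it; the function names are hypothetical, and the direct O(N²) version is included only to check the result.

```python
import numpy as np

def dct2_direct(x: np.ndarray) -> np.ndarray:
    """Unnormalized DCT-II by definition: X[k] = sum_n x[n] cos(pi (2n+1) k / (2N))."""
    n = len(x)
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * i + 1) * k / (2 * n)) @ x

def dct2_via_fft(x: np.ndarray) -> np.ndarray:
    """Same transform computed with one length-N FFT, in O(N log N)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Reorder: even-indexed samples, followed by odd-indexed samples reversed.
    v = np.concatenate([x[::2], x[1::2][::-1]])
    spectrum = np.fft.fft(v)
    k = np.arange(n)
    # A per-frequency phase shift turns the FFT output into DCT coefficients.
    return np.real(np.exp(-1j * np.pi * k / (2 * n)) * spectrum)

x = np.random.default_rng(0).standard_normal(24)
print(np.allclose(dct2_direct(x), dct2_via_fft(x)))
```

A two-dimensional DCT follows by applying the same one-dimensional transform along each spatial axis in turn.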
The core of Fourier-VLM is its Frequency Feature Compressor (FFC) module. This module takes the visual features, reshapes them into a grid, applies the 2D-DCT to move them into the frequency domain, and then keeps only the low-frequency components. These compressed components are then transformed back into a spatial representation using an inverse DCT, ready to be fed into the language model. This process significantly reduces the number of vision tokens without losing much of the critical visual information.
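The steps described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: the function name `compress_tokens`, the grid sizes, and the amplitude rescaling after the inverse transform are all choices made here for clarity. The DCT is written as a matrix product so the example needs no extra libraries; a real system would use a fast FFT-based routine.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C; C @ x transforms x, C.T inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def compress_tokens(tokens: np.ndarray, grid_hw: tuple, keep_hw: tuple) -> np.ndarray:
    """Low-pass compress (H*W, D) vision tokens to (h*w, D) tokens.

    grid_hw = (H, W) is the original token grid, keep_hw = (h, w) the kept block.
    """
    (H, W), (h, w) = grid_hw, keep_hw
    num_tokens, dim = tokens.shape
    assert num_tokens == H * W, "token count must match the grid size"
    x = tokens.reshape(H, W, dim)                 # 1. reshape tokens into a grid
    # 2. 2D DCT over the spatial axes, applied to every feature channel.
    freq = np.einsum("uh,hwd,vw->uvd", dct_matrix(H), x, dct_matrix(W))
    low = freq[:h, :w, :]                         # 3. keep only low frequencies
    # 4. Inverse 2D DCT on the smaller h x w grid, back to a spatial layout.
    out = np.einsum("uh,uvd,vw->hwd", dct_matrix(h), low, dct_matrix(w))
    # Assumption: rescale so feature magnitudes match the input grid's scale.
    out *= np.sqrt((h * w) / (H * W))
    return out.reshape(h * w, dim)

# Illustrative sizes: a 24 x 24 grid (576 tokens) compressed to 12 x 12 (144 tokens).
tokens = np.random.default_rng(0).standard_normal((576, 8))
compressed = compress_tokens(tokens, grid_hw=(24, 24), keep_hw=(12, 12))
print(tokens.shape, "->", compressed.shape)
```

Nothing in this pipeline is learned: the transform matrices are fixed, which is why this kind of compressor adds no parameters to the model.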
Across a range of image-based benchmarks, Fourier-VLM maintains competitive performance even when the number of vision tokens is drastically reduced. For instance, it reduces computational operations (FLOPs) by up to 83.8% and boosts generation speed by 31.2% compared to LLaVA-v1.5. This makes VLMs much more practical for real-world applications and devices with limited resources.
Fourier-VLM also generalizes well. It has been successfully applied to both LLaVA and Qwen-VL architectures, demonstrating its versatility. Furthermore, even though it is trained only on single-image conversations, it shows promising “zero-shot” capabilities on video understanding tasks, meaning it can perform well on videos without any video-specific training. This opens up possibilities for future video-language models.
In conclusion, Fourier-VLM presents a compelling solution for making large vision-language models more efficient. By intelligently compressing visual information in the frequency domain, it achieves an excellent balance between maintaining high performance and significantly reducing computational costs and inference latency. This innovation paves the way for wider and more efficient deployment of VLMs in various applications. You can read the full research paper here.


