TLDR: FLoC is a novel, training-free, and model-agnostic framework that efficiently compresses visual tokens from long video sequences. It uses a facility location function and a lazy greedy algorithm to select a compact, highly representative, and diverse subset of tokens, drastically reducing the input volume for Large Multimodal Models (LMMs). This approach overcomes the scalability limitations of LMMs in long video understanding, outperforming existing compression techniques in accuracy and processing speed across various benchmarks.
Understanding long video sequences has become a significant challenge for advanced Artificial Intelligence models, particularly Large Multimodal Models (LMMs). These models, which combine visual and language reasoning, are powerful but face a major hurdle: the sheer volume of visual information, or ‘visual tokens,’ generated from extended videos. This overwhelming data can severely limit their ability to process and comprehend long-duration content.
Addressing this critical bottleneck, researchers have introduced a new framework called FLoC, which stands for Facility Location-Based Efficient Visual Token Compression. FLoC offers an innovative solution to efficiently reduce the number of visual tokens without losing crucial information, making long video understanding more scalable and practical for LMMs.
What FLoC Does
At its core, FLoC is designed to swiftly select a compact yet highly representative and diverse subset of visual tokens from a video. Imagine a long video of a person playing golf; many frames might show similar background scenery. FLoC intelligently identifies and keeps the most important tokens – those that capture unique actions or significant scene changes – while discarding redundant ones. This selection process operates within a predefined budget for the number of visual tokens, ensuring that the compressed data remains manageable for LMMs.
A key aspect of FLoC is its use of the facility location function, a principled mathematical objective that balances representativeness (ensuring the selected tokens cover the overall video content) and diversity (making sure different aspects of the video are captured). Because the facility location function is monotone submodular, greedy selection carries the classic near-optimal (1 − 1/e) approximation guarantee. To achieve this efficiently, FLoC integrates a ‘lazy greedy’ algorithm, which returns the same selection as standard greedy while skipping most redundant marginal-gain recomputations, keeping the computational effort minimal.
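To make the idea concrete, here is a minimal sketch of facility-location maximization with lazy greedy selection. This is an illustration of the general technique, not the paper’s actual implementation: the pairwise similarity matrix, function names, and tie-breaking are assumptions.

```python
import heapq

def facility_location_gain(sim, covered, candidate):
    # Marginal gain of adding `candidate`: how much it improves the best
    # similarity ("coverage") each token has to the selected set.
    return sum(max(sim[i][candidate] - covered[i], 0.0) for i in range(len(sim)))

def lazy_greedy_select(sim, budget):
    """Pick up to `budget` indices maximizing the facility location function
    f(S) = sum_i max_{j in S} sim[i][j], using lazy gain re-evaluation."""
    n = len(sim)
    covered = [0.0] * n  # best similarity of each token to the selected set
    # Max-heap (negated gains) of upper bounds on each candidate's marginal gain.
    heap = [(-facility_location_gain(sim, covered, j), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        _, j = heapq.heappop(heap)
        gain = facility_location_gain(sim, covered, j)  # refresh stale bound
        if not heap or gain >= -heap[0][0]:
            # Submodularity: gains only shrink, so if j's refreshed gain still
            # beats every other (stale) upper bound, j is the greedy choice.
            selected.append(j)
            for i in range(n):
                covered[i] = max(covered[i], sim[i][j])
        else:
            heapq.heappush(heap, (-gain, j))  # reinsert with refreshed bound
    return selected

# Toy example: tokens 0 and 1 are near-duplicates, token 2 is distinct.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(lazy_greedy_select(sim, 2))  # keeps one of the duplicates plus token 2
```

Because the gains only decrease as the selected set grows, most candidates never need their gains recomputed after the first round, which is where the speedup over naive greedy comes from.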
Key Advantages and How It Stands Out
FLoC boasts several significant advantages that make it a versatile and powerful tool:
- Training-Free: Unlike many other compression methods that require extensive training, FLoC works right out of the box.
- Model-Agnostic: It can be seamlessly integrated with various video-LMMs without needing specific adaptations for each model.
- Query-Agnostic: FLoC compresses tokens once, regardless of the user’s query. This is a major efficiency gain compared to ‘query-aware’ methods that might need to re-compress for every new question.
Traditional approaches to visual token compression often fall short. Simple sampling or pooling methods might discard critical, rare information. Clustering techniques, while better, can still miss important but sparsely occurring details – like a small object of interest in a cluttered room. Other methods might require retraining or are specific to certain tasks, limiting their flexibility.
FLoC overcomes these limitations by explicitly optimizing for global coverage. It ensures that even rare but meaningful visual cues are preserved, preventing oversampling from common scenes and prioritizing selections that maximize overall representativeness and diversity. This is particularly crucial for tasks where fine details matter, such as finding car keys in a video recorded by smart glasses.
Performance and Efficiency
Extensive evaluations on large-scale benchmarks like Video-MME, MLVU, and LongVideoBench have shown that FLoC consistently outperforms recent compression techniques. It not only achieves higher accuracy in video understanding tasks but also does so with superior processing speed. For instance, FLoC has been shown to be significantly faster than traditional clustering methods, sometimes by a factor of 10 or more, especially as the video length increases.
The framework has demonstrated particular strength in challenging tasks such as ‘Needle Question Answering’ (identifying a very short, distinct event within a long video) and ‘Ego Reasoning’ (understanding fleeting objects in first-person videos). This highlights FLoC’s ability to retain fine-grained details even under high compression ratios.
By enabling LMMs to efficiently process a much larger number of frames than conventionally possible, FLoC significantly enhances their overall video understanding capabilities. This opens doors for more effective real-world applications, from surveillance systems and smart glasses to autonomous navigation for robots.
For more in-depth technical details, you can refer to the full research paper: FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding.