TL;DR: Spotlighter is a new framework that improves prompt tuning in vision-language models like CLIP by selecting only the most relevant visual tokens. It uses a semantic memory bank of class prototypes and a two-level ranking mechanism to refine token selection, improving harmonic mean accuracy by up to 11.19% and inference speed by 0.8K FPS while adding only 21 extra parameters, making it an efficient and accurate method for few-shot image classification.
Vision-language modeling has advanced rapidly, with prompt tuning emerging as a powerful technique for achieving robust cross-modal semantic alignment. Models like CLIP have demonstrated impressive capabilities in tasks ranging from open-domain recognition to fine-grained classification. However, a common challenge in these models is the presence of redundant or weakly relevant feature components, which introduce noise and incur unnecessary computational cost.
To address these issues, a new framework called Spotlighter has been proposed by Yutong Gao, Maoyuan Shao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Yu Weng, Xuan Liu, and Guoshun Nan. This lightweight token-selection framework aims to improve both the accuracy and the efficiency of prompt tuning. Spotlighter scores each visual token's activation from two perspectives, sample-wise and semantic-wise, and retains only the top-scoring tokens for subsequent prediction.
A crucial component of Spotlighter is its class-specific semantic memory bank. This bank stores learned prototypes that play a vital role in refining the token selection process. By ensuring semantic representativeness, these prototypes help compensate for any features that might have been discarded during the initial selection. To further prioritize informative signals, Spotlighter incorporates a two-level ranking mechanism that dynamically weights the interactions between tokens and prototypes.
The core idea behind Spotlighter is to identify and retain a compact set of highly representative feature tokens while effectively discarding redundant ones. This approach helps suppress noise and reduces computational overhead. The framework achieves this by calculating activation scores for each token, reflecting its importance in cross-modal semantic alignment. These scores are derived from both sample-level (similarity between image and text features) and semantic-level (matching against class prototypes in the memory bank) perspectives.
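The two-perspective scoring described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function name, the use of cosine similarity, the max-over-prototypes reduction, and the simple additive combination of the two scores are all assumptions for clarity.

```python
import numpy as np

def select_tokens(tokens, text_feat, prototypes, keep_ratio=0.5):
    """Score each visual token and keep only the top-scoring ones.

    tokens:     (N, D) visual token features      (hypothetical shapes)
    text_feat:  (D,)   pooled text feature for the class prompt
    prototypes: (C, D) class prototypes from the semantic memory bank
    """
    # Normalize so dot products become cosine similarities.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    g = text_feat / np.linalg.norm(text_feat)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    sample_score = t @ g                     # sample-level: image-text similarity
    semantic_score = (t @ p.T).max(axis=1)   # semantic-level: best prototype match
    score = sample_score + semantic_score    # combined activation score (assumed sum)

    # Retain the top-k tokens by combined score.
    k = max(1, int(keep_ratio * len(tokens)))
    keep = np.argsort(score)[::-1][:k]
    return tokens[keep], score[keep]
```

Discarding the low-scoring tokens is what yields both the noise suppression and the FPS gain: every downstream layer processes fewer tokens.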
The semantic memory bank is continuously updated during training, with prototypes being refined to better capture the most salient information for each image category. This dynamic update mechanism, combined with a local loss function, minimizes subjectivity in feature selection and enhances cross-modal knowledge transfer.
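A common way to realize such a continuously updated memory bank is an exponential moving average over the features assigned to each class. The sketch below assumes an EMA update with unit-norm prototypes; the paper's actual update rule and loss are not reproduced here.

```python
import numpy as np

def update_prototype(prototypes, class_id, token_feat, momentum=0.9):
    """EMA-style refresh of one class prototype in the memory bank.

    This is an assumed update rule (a standard choice for memory banks);
    the paper's exact mechanism may differ.
    """
    p = prototypes[class_id]
    new = momentum * p + (1.0 - momentum) * token_feat
    # Re-normalize so prototypes stay comparable via cosine similarity.
    prototypes[class_id] = new / np.linalg.norm(new)
    return prototypes
```

With a high momentum, each prototype drifts slowly toward the most salient recent features of its class, which is what lets it stand in for details discarded during token selection.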
To further refine the selected features and compensate for any potential semantic loss from discarded regions, Spotlighter fuses the activated features with their corresponding semantic prototypes. Recognizing that activated features contribute differently to classification, a two-level ranking mechanism stratifies these tokens based on their activation scores. These stratified tokens, along with the prototypes, are then processed through specialized mapping modules (Image Representative Mapping Module and Text Representative Mapping Module) to generate discriminative cross-modal representations.
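The stratify-then-fuse step can be illustrated with a two-tier split. Everything here is a simplified stand-in: the two fixed tiers, the scalar mixing weight `alpha`, and the function name are assumptions, and the paper's mapping modules are omitted.

```python
import numpy as np

def stratify_and_fuse(tokens, scores, prototype, top_frac=0.5, alpha=0.7):
    """Split activated tokens into high/low tiers by activation score,
    then fuse each tier with the class prototype at a tier-specific weight.
    """
    order = np.argsort(scores)[::-1]
    cut = max(1, int(top_frac * len(order)))
    high, low = order[:cut], order[cut:]

    fused = tokens.copy()
    # High-tier tokens keep more of their own signal ...
    fused[high] = alpha * tokens[high] + (1 - alpha) * prototype
    # ... while low-tier tokens lean on the prototype to recover
    # semantics lost with the discarded regions.
    fused[low] = (1 - alpha) * tokens[low] + alpha * prototype
    return fused
```

The fused tokens would then feed the image- and text-side mapping modules to produce the final cross-modal representations.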
Extensive experiments were conducted across 11 few-shot benchmarks, demonstrating Spotlighter’s effectiveness. The framework consistently outperformed CLIP, achieving an improvement of up to 11.19% in harmonic mean accuracy. Furthermore, it significantly boosted computational efficiency, achieving up to 0.8K additional frames per second (FPS), all while adding only 21 extra parameters. These impressive results establish Spotlighter as an effective and scalable baseline for prompt tuning.
The authors highlight several key contributions of their work: investigating the role of representative feature mining in prompt tuning for both accuracy and efficiency, proposing the Spotlighter framework for selecting and enhancing activated tokens, and demonstrating significant accuracy and inference speed boosts with minimal parameter overhead.
While Spotlighter shows great promise, the authors acknowledge its limitations. It is primarily designed for image classification and may not generalize well to other vision tasks requiring dense or spatially localized predictions, such as object detection or image segmentation. This is because the reduced number of final tokens might omit fine-grained spatial details. Future work aims to extend Spotlighter to these dense prediction tasks by incorporating spatial-aware token selection and adaptive token filtering strategies.
For more in-depth technical details, you can refer to the full research paper available at arXiv:2509.00905.


