TLDR: The research introduces a new method for filtering image-text datasets using a small, fine-tuned Vision-Language Model (VLM). This compact VLM acts as an “in-context judge” to evaluate and filter data based on image and caption quality and alignment. Unlike previous methods that add complex modules, this approach uses the VLM’s inherent capabilities, making it efficient and cost-effective. Experiments show that datasets filtered by this method lead to better image-text alignment, more fluent captions, and improved performance in downstream tasks like image captioning, even with significantly smaller datasets.
Vision-language models (VLMs) are at the forefront of AI, blending visual and textual data to enable advanced multimodal understanding. These models power applications like image captioning and visual question answering. However, their effectiveness heavily relies on the quality of the training data. Simply increasing the amount of data isn’t enough; carefully selected, high-quality examples often lead to superior results.
Publicly available multimodal datasets, frequently gathered through extensive web scraping, often contain significant noise. This noise can include irrelevant or incorrect captions and inherent biases, which can lead to performance issues like hallucinations in VLMs. To combat this, researchers are exploring various data curation strategies, with a growing trend towards using VLMs themselves for quality evaluation.
A new research paper, titled “TRUST THE MODEL: COMPACT VLMS AS IN-CONTEXT JUDGES FOR IMAGE-TEXT DATA QUALITY,” introduces a novel and efficient data filtration framework. Authored by Daulet Toibazar, Kesen Wang, Sherif Mohamed, Abdulaziz Al-Badawi, Abdulrahman Alfulayt, and Pedro J. Moreno, this work proposes using a compact VLM, specifically fine-tuned on a high-quality image-caption dataset, to act as an in-context judge for data quality.
Unlike previous methods that often add complex, auxiliary filtration modules on top of large VLMs, this approach leverages the inherent evaluative capabilities of a purpose-built small VLM. This strategy significantly reduces training overhead and eliminates the need for extra modules, making it a lightweight and robust solution for building high-quality vision-language training corpora. The model efficiently filters out inaccurate and noisy web data, leading to improved image-text alignment and better linguistic fluency in captions.
How the Filtration Model Works
The methodology involves a two-stage process. First, a state-of-the-art multimodal model, Gemini 2.0 Flash, was used as a “teacher” to annotate training data. This teacher model assigned a quality score (1-10) and a detailed textual explanation to image-text pairs from datasets like Recap-COCO and CC12M. This rich annotation signal allowed the downstream filtration model to learn both how to score image-caption quality and how to explain its judgments. Manual review was also conducted to ensure annotation accuracy.
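The paper does not reproduce its annotation prompt or tooling, so the snippet below is only a minimal sketch of what this teacher-annotation stage could look like. The prompt wording, the JSON record layout, and the use of the google-generativeai client are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative sketch of stage 1 (teacher annotation). Prompt text, record
# format, and client usage are assumptions, not the authors' code.
import json

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
teacher = genai.GenerativeModel("gemini-2.0-flash")

def annotate_pair(image_path: str, caption: str) -> dict:
    """Ask the teacher model for a 1-10 quality score plus a textual explanation."""
    prompt = (
        "Rate this image-caption pair from 1 to 10, considering image quality, "
        "caption fluency, and image-text alignment. "
        'Answer with JSON of the form {"score": <int>, "explanation": <string>}.\n'
        f"Caption: {caption}"
    )
    response = teacher.generate_content([prompt, Image.open(image_path)])
    judgement = json.loads(response.text)  # assumes the model returns bare JSON
    return {"image": image_path, "caption": caption, **judgement}
```

Records of this shape (image, caption, score, explanation) are what the compact judge is later fine-tuned on.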
In the second stage, a compact VLM, Qwen2-VL-2B, was fine-tuned using these Gemini-generated annotations. This process enabled the smaller model to mimic the teacher’s scoring behavior and explanations, providing a cost- and compute-efficient solution for filtering large-scale multimodal datasets without relying on expensive external APIs or large models.
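Once fine-tuned, the judge can be queried like any other chat-style VLM. Below is a minimal sketch using the Hugging Face transformers classes for Qwen2-VL; the checkpoint path and scoring prompt are placeholders, since the paper’s exact fine-tuning recipe and prompt are not reproduced here.

```python
# Minimal sketch of querying the fine-tuned judge with Hugging Face transformers.
# The checkpoint path and prompt are placeholders, not the paper's exact setup.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

judge = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/qwen2-vl-2b-judge",            # hypothetical fine-tuned checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

def judge_pair(image_path: str, caption: str) -> str:
    """Return the judge's raw reply (score + explanation) for one image-caption pair."""
    conversation = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": f"Score this image-caption pair from 1 to 10 and explain why.\nCaption: {caption}"},
        ],
    }]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(text=[text], images=[Image.open(image_path)],
                       return_tensors="pt").to(judge.device)
    output_ids = judge.generate(**inputs, max_new_tokens=128)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```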
Impact of Data Filtration
The researchers evaluated their approach by filtering over 20,000 image-caption pairs. They retained only high-quality pairs (scores of 9 or higher), resulting in a filtered dataset that was only 18% the size of the original. This filtered dataset was then compared against the full dataset and a randomly sampled dataset of the same size.
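With per-pair scores in hand, the selection step itself reduces to a simple threshold. A minimal sketch, assuming scored records shaped like the annotation example above:

```python
# Keep only pairs the judge scored 9 or higher, mirroring the threshold used in
# the paper's experiments. `scored_pairs` is an assumed list of dicts with a
# numeric "score" field.
KEEP_THRESHOLD = 9

def filter_pairs(scored_pairs: list[dict]) -> list[dict]:
    kept = [pair for pair in scored_pairs if pair["score"] >= KEEP_THRESHOLD]
    ratio = 100 * len(kept) / len(scored_pairs)
    print(f"Retained {len(kept)} of {len(scored_pairs)} pairs ({ratio:.1f}%)")
    return kept
```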
Experimental results demonstrated significant improvements across several metrics:
- Semantic Alignment: The filtered dataset showed stronger text-to-image alignment, as evidenced by higher mean cosine similarity scores from the CLIP encoder (a minimal sketch of this check follows the list). This indicates that the filtration model effectively retains samples where the visual and textual modalities are more semantically consistent.
- Linguistic Fluency and Complexity: Captions in the filtered dataset exhibited substantially lower perplexity scores, suggesting that the filtering procedure successfully removes noisy or misaligned captions, leaving behind cleaner, higher-quality data.
- Downstream Captioning Performance: When a lightweight image captioning model was fine-tuned on the filtered dataset, its generated captions were preferred over those from a model trained on the full dataset in nearly 60% of cases, as judged by Gemini 2.0 Flash. This highlights that focused curation can significantly improve the training signal.
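For the alignment check in the first bullet, mean CLIP cosine similarity can be reproduced in a few lines with an off-the-shelf encoder. The sketch below assumes the openai/clip-vit-base-patch32 checkpoint; the paper does not specify which CLIP variant was used.

```python
# Mean cosine similarity between CLIP image and text embeddings over a set of
# (image_path, caption) pairs. Model choice is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def mean_clip_similarity(pairs):
    """Average image-text cosine similarity over (image_path, caption) pairs."""
    sims = []
    for image_path, caption in pairs:
        inputs = clip_proc(text=[caption], images=Image.open(image_path),
                           return_tensors="pt", padding=True, truncation=True)
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
        sims.append(torch.nn.functional.cosine_similarity(img, txt).item())
    return sum(sims) / len(sims)
```

Comparing this mean across the filtered, full, and randomly sampled subsets is essentially what the alignment comparison above amounts to.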
An ablation study further confirmed that higher filtration scores assigned by their compact VLM correlated directly with stronger semantic alignment between images and text, validating the model’s effectiveness as a reliable proxy for selecting good-quality image-text pairs.
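One simple way to run such a check is to rank-correlate the judge’s scores with the CLIP similarities computed above. The sketch below assumes the two lists are aligned per pair and uses SciPy’s Spearman correlation; it is an illustration of the idea, not the paper’s exact ablation protocol.

```python
# Rank-correlate judge scores with CLIP similarities; a positive Spearman rho
# supports the finding that higher filtration scores track stronger alignment.
from scipy.stats import spearmanr

def score_alignment_correlation(judge_scores: list[float], clip_sims: list[float]):
    rho, p_value = spearmanr(judge_scores, clip_sims)
    return rho, p_value
```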
Conclusion and Future Directions
This research underscores the critical value of data quality over quantity in training VLMs. By intelligently filtering noisy data, the compact VLM approach leads to noticeable gains in caption coherence, image-text alignment, and downstream model performance. While the results are promising, the paper also acknowledges a trade-off: aggressive filtering might reduce data diversity, potentially limiting a model’s generalization capabilities. Future work will explore more adaptive filtering strategies to balance alignment, variability, and robustness more effectively. For more details, you can refer to the original research paper.