TLDR: FineVision is a new, large-scale, and meticulously curated open dataset of 24 million samples for training vision-language models (VLMs). It unifies over 200 sources through a semi-automated, human-in-the-loop pipeline, ensuring data hygiene, de-duplication, and decontamination against benchmarks. Models trained on FineVision consistently outperform those trained on existing open datasets, demonstrating its value for advancing VLM research, especially for tasks like GUI interaction.
Progress in vision-language models (VLMs) depends heavily on the quality and consistency of their training data. The open research community has long had to work with a fragmented landscape of public datasets that are often inconsistent and contaminated with benchmark data. To address this bottleneck, a team of researchers has introduced FineVision, a collected, curated, and unified corpus designed to provide a robust foundation for VLM development.
FineVision is presented as the largest open resource of its kind, with over 24 million samples comprising 17 million images, 89 million conversational turns, and 9.5 billion answer tokens. The project unifies more than 200 diverse data sources into 185 distinct subsets through a semi-automated, human-in-the-loop pipeline.
A Rigorous Curation Process
The creation of FineVision involves a multi-stage process to ensure data quality and integrity. It begins with the bulk ingestion of raw data and automated schema mapping. Human reviewers are integrated at several checkpoints: they audit mappings, sign off on conversion scripts, and conduct post-conversion audits. This oversight ensures that annotations are carried over faithfully and that quality and safety stay consistent, with any identified issues triggering targeted fixes and re-runs.
A cornerstone of FineVision’s data hygiene is its rigorous de-duplication and test-set decontamination. The pipeline utilizes self-supervised copy-detection (SSCD) embeddings to identify and merge visually near-identical images within FineVision. Furthermore, it decontaminates the dataset against 66 public VLM benchmarks, mitigating train-test leakage and preserving the integrity of model evaluations. This meticulous approach helps prevent models from inadvertently learning from data that is too similar to their evaluation sets.
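The de-duplication and decontamination steps described above can be illustrated with a minimal sketch. It assumes each image has already been turned into an embedding vector (the article names SSCD embeddings; here small hand-written vectors stand in for them), and it assumes cosine similarity with a fixed threshold as the matching rule. The function names, the threshold value of 0.95, and the brute-force pairwise comparison are all illustrative assumptions, not details taken from the FineVision pipeline.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_near_duplicates(
    embeddings: List[Sequence[float]], threshold: float = 0.95
) -> List[List[int]]:
    """Group indices whose embeddings are near-identical (hypothetical helper).

    Uses union-find so that chains of similar images end up in one group.
    The threshold is an assumed value, not one reported by the authors.
    """
    parent = list(range(len(embeddings)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def flag_contaminated(
    train: List[Sequence[float]],
    benchmark: List[Sequence[float]],
    threshold: float = 0.95,
) -> List[int]:
    """Return indices of training images that closely match any benchmark image."""
    return [
        i
        for i, t in enumerate(train)
        if any(cosine(t, b) >= threshold for b in benchmark)
    ]


# Toy embeddings: the first two vectors point in almost the same direction.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
duplicate_groups = group_near_duplicates(toy)
leaked = flag_contaminated([[1.0, 0.0], [0.0, 1.0]], [[0.98, 0.1]])
```

At the scale of millions of images, a pairwise loop like this would be impractical; an approximate nearest-neighbor index would be the usual substitute, but that is outside the scope of this sketch.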
Unified Data for Diverse Tasks
FineVision converts each original dataset into a standardized chat format, making it suitable for instruction tuning. This unification process handles a wide array of annotation styles, from simple image QA to complex multi-image conversations and relational graphs. The dataset supports six core task-specific conversion strategies, including Visual QA, Captioning & Description, Grounding & Spatial Relations, Document Understanding, OCR & Transcription, and Classification & Detection. This broad coverage ensures that models trained on FineVision can develop a wide range of visual and linguistic capabilities.
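To make the idea of a standardized chat format concrete, the sketch below converts two differently shaped records (a visual QA record and a captioning record) into one shared structure. The field names (`images`, `texts`, `user`, `assistant`, `source`), the input record layouts, and the default caption prompt are hypothetical choices made for illustration; the article does not specify the actual schema.

```python
from typing import Any, Dict, List


def vqa_to_chat(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a visual QA record into a multi-turn chat sample (assumed schema)."""
    turns: List[Dict[str, str]] = [
        {"user": qa["question"], "assistant": qa["answer"]}
        for qa in record["qa_pairs"]
    ]
    return {
        "images": [record["image"]],
        "texts": turns,
        "source": record.get("source", "unknown"),
    }


def caption_to_chat(
    record: Dict[str, Any], prompt: str = "Describe this image."
) -> Dict[str, Any]:
    """Convert a captioning record into a single-turn chat sample (assumed schema).

    Caption datasets have no question, so an instruction prompt is synthesized.
    """
    return {
        "images": [record["image"]],
        "texts": [{"user": prompt, "assistant": record["caption"]}],
        "source": record.get("source", "unknown"),
    }


# Two toy records with different annotation styles.
vqa_sample = vqa_to_chat(
    {
        "image": "cat.jpg",
        "qa_pairs": [
            {"question": "What animal is this?", "answer": "A cat."},
            {"question": "What color is it?", "answer": "Orange."},
        ],
        "source": "toy_vqa",
    }
)
caption_sample = caption_to_chat({"image": "dog.jpg", "caption": "A dog on a beach."})
```

Both outputs share the same keys, so a downstream instruction-tuning loader only has to understand one layout regardless of where a sample came from.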
A notable feature of FineVision is its inclusion of agentic and GUI-grounded tasks with a unified action space. This addresses a significant challenge in the field, as different sources often define heterogeneous function signatures and action taxonomies. By standardizing the action space, FineVision enables cross-domain training, allowing models to learn coherent action patterns across diverse GUI environments, such as desktop, mobile, or browser interfaces.
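One way to picture a unified action space is as a mapping from each source's own action names and argument conventions onto a small canonical vocabulary. The sketch below is only an illustration under stated assumptions: the source labels, the raw action names, the canonical `click` and `type` actions, and the choice to normalize pixel coordinates to the range 0 to 1 are all hypothetical and are not taken from the FineVision release.

```python
from typing import Any, Dict, Tuple

# Hypothetical alias table: (source, raw action name) -> canonical action name.
ALIASES: Dict[Tuple[str, str], str] = {
    ("mobile", "tap"): "click",
    ("desktop", "left_click"): "click",
    ("browser", "click"): "click",
    ("mobile", "input_text"): "type",
    ("desktop", "keyboard_type"): "type",
    ("browser", "fill"): "type",
}


def normalize_action(
    source: str, raw: Dict[str, Any], screen: Tuple[int, int]
) -> Dict[str, Any]:
    """Map a source-specific action onto the assumed canonical action space.

    Pointer coordinates are rescaled to [0, 1] so that actions recorded at
    different screen resolutions become comparable.
    """
    key = (source, raw["name"])
    if key not in ALIASES:
        raise ValueError(f"no canonical mapping for action {key!r}")

    name = ALIASES[key]
    width, height = screen
    if name == "click":
        return {
            "name": "click",
            "x": round(raw["x"] / width, 4),
            "y": round(raw["y"] / height, 4),
        }
    return {"name": "type", "text": raw["text"]}


# A phone tap and a desktop click at the screen center map to the same action.
from_mobile = normalize_action("mobile", {"name": "tap", "x": 540, "y": 960}, (1080, 1920))
from_desktop = normalize_action("desktop", {"name": "left_click", "x": 960, "y": 540}, (1920, 1080))
typed = normalize_action("browser", {"name": "fill", "text": "hello"}, (1280, 720))
```

Once actions from desktop, mobile, and browser sources share one vocabulary, a single model can be trained on all of them without learning a separate output format per environment.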
Demonstrated Performance and Future Impact
Extensive experiments validate the effectiveness of FineVision. Models trained on this corpus consistently achieve state-of-the-art results among open-data VLMs, demonstrating significant performance improvements over existing open mixtures like The Cauldron, Cambrian-1, and LLaVA-OneVision across a broad suite of 11 benchmarks. These gains are attributed not only to the massive scale but also to the superior data hygiene and balanced conceptual diversity of FineVision.
The researchers are making the FineVision corpus, its conversion recipes, de-duplication tools, and precomputed embeddings publicly available. This open release aims to broaden access to high-quality training data and accelerate the next wave of innovation in open VLM development. For a deeper dive into the methodology and results, see the original research paper.


