TLDR: The research paper “Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale” introduces a data generation framework and a dataset of over 1 million synthetic vision-centric questions. The dataset is generated in two stages: first, creating diverse, object-centric questions from image captions and bounding boxes, and second, composing these into harder, multi-hop problems. The framework also synthesizes rich, non-linear reasoning traces by combining VLMs and reasoning LLMs. Models fine-tuned on this data outperform open-data baselines and some closed-data models on vision-centric benchmarks, and also show positive transfer to text-only and audio reasoning tasks.
Recent advancements in multimodal reasoning, particularly those involving vision and language, have largely relied on undisclosed datasets and proprietary methods for data creation. This has left a significant gap in understanding how to systematically build large-scale, high-quality datasets for vision-centric reasoning tasks that go beyond simple visual math problems.
Introducing Long Grounded Thoughts: A New Approach to Visual Reasoning Data
A new research paper, Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale, introduces a novel data generation framework and dataset called ‘Long Grounded Thoughts’. This framework is designed to create over 1 million high-quality synthetic vision-centric questions, spanning diverse skills and levels of complexity. The dataset also includes preference data and instruction prompts to support both offline and online reinforcement learning (RL) methods.
The Two-Stage Data Synthesis Framework
The synthesis framework operates in two main stages to ensure both scale and complexity:
1. Scale and Diversity: In the first stage, the framework generates a large number of diverse and verifiable visual questions. It achieves this by using imagery combined with rich metadata, such as detailed captions and bounding boxes. By incorporating object-level metadata, the system can generate questions that are precisely grounded to specific visual elements, significantly enhancing question diversity and visual grounding compared to methods that rely solely on global captions.
2. Complexity via Composition: The second stage focuses on increasing the difficulty of the questions. A ‘composition hardening algorithm’ merges simpler questions generated in the first stage into more challenging, multi-hop visual problems. These complex problems require models to decompose them into intermediate steps and apply higher-order reasoning to find solutions.
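The two stages above can be illustrated with a toy sketch. The article does not describe the paper's actual question generator or its composition hardening algorithm, so everything here is an assumption made for illustration: the `ObjectAnnotation` and `QAPair` types, the template-based questions, and the `compose_two_hop` helper are hypothetical stand-ins. The point is only that bounding-box metadata makes answers checkable, and that feeding the answer of one question into another yields a multi-hop problem.

```python
from dataclasses import dataclass


@dataclass
class ObjectAnnotation:
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class QAPair:
    question: str
    answer: str


def center_x(obj):
    """Horizontal center of the bounding box."""
    return (obj.box[0] + obj.box[2]) / 2


def area(obj):
    """Area of the bounding box in square pixels."""
    x_min, y_min, x_max, y_max = obj.box
    return (x_max - x_min) * (y_max - y_min)


def leftmost_question(a, b):
    """Stage 1 (toy): a question whose answer is verifiable from the boxes."""
    answer = a.label if center_x(a) < center_x(b) else b.label
    return QAPair(
        f"Which is further to the left, the {a.label} or the {b.label}?", answer
    )


def larger_question(a, b):
    """Stage 1 (toy): a second single-hop question, this time about size."""
    answer = a.label if area(a) > area(b) else b.label
    return QAPair(
        f"Which covers more of the image, the {a.label} or the {b.label}?", answer
    )


def compose_two_hop(a, b, c):
    """Stage 2 (toy): the answer to hop 1 becomes an operand of hop 2."""
    by_label = {obj.label: obj for obj in (a, b, c)}
    hop1 = leftmost_question(a, b)
    bridge = by_label[hop1.answer]  # object selected by the first hop
    hop2 = larger_question(bridge, c)
    question = (
        f"Consider whichever of the {a.label} and the {b.label} is further "
        f"to the left. Which covers more of the image, that object or the {c.label}?"
    )
    return QAPair(question, hop2.answer)


# Made-up annotations for a single image.
dog = ObjectAnnotation("dog", (10, 50, 110, 150))
bicycle = ObjectAnnotation("bicycle", (200, 40, 400, 240))
bench = ObjectAnnotation("bench", (300, 100, 350, 120))

composed = compose_two_hop(dog, bicycle, bench)
print(composed.question)
print(composed.answer)
```

A model answering the composed question has to resolve the intermediate reference ("whichever is further to the left") before it can compare sizes, which is the kind of decomposition the second stage is meant to require.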
Synthesizing Rich Reasoning Traces
To equip Vision-Language Models (VLMs) with sophisticated reasoning abilities, the paper introduces a two-stage process for synthesizing reasoning traces (also known as chains of thought, or CoTs) that draws on both VLMs and strong reasoning Large Language Models (LLMs). First, initial CoT traces are distilled from VLMs so that they stay within the distribution of VLM outputs. Then, reasoning LLMs expand these traces, injecting richer, non-linear problem-solving strategies that capture the diverse cognitive behaviors found in advanced reasoning models.
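As a rough sketch of that two-stage flow, the function below takes the two models as plain callables. The names `vlm_generate` and `llm_expand`, their signatures, and the stub models are hypothetical; no real model API is implied. The final answer check is also an assumption (a crude substring match), added only because the questions are described as verifiable.

```python
def synthesize_trace(image_ref, question, reference_answer, vlm_generate, llm_expand):
    """Return an expanded reasoning trace, or None if it misses the answer.

    vlm_generate(image_ref, question) -> str   # hypothetical VLM call
    llm_expand(question, seed_trace) -> str    # hypothetical reasoning-LLM call
    """
    # Stage 1: a short trace from the VLM, which actually sees the image.
    seed_trace = vlm_generate(image_ref, question)
    # Stage 2: a text-only reasoning LLM elaborates on the seed trace.
    expanded = llm_expand(question, seed_trace)
    # Assumed filter: keep the trace only if it still mentions the known answer.
    if reference_answer.lower() not in expanded.lower():
        return None
    return expanded


# Stub models so the sketch runs without any external service.
def fake_vlm(image_ref, question):
    return "The dog is on the left side of the image. Answer: dog."


def fake_reasoner(question, seed_trace):
    return (
        seed_trace
        + " Wait, let me double-check by comparing both positions again."
        + " The bicycle is on the right, so the answer remains dog."
    )


trace = synthesize_trace(
    "image_0001.jpg",
    "Which is further to the left, the dog or the bicycle?",
    "dog",
    fake_vlm,
    fake_reasoner,
)
print(trace)
```

The stub reasoner appends a verification step to the seed trace, a small stand-in for the backtracking and self-checking behavior that "non-linear" reasoning refers to.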
Remarkable Performance and Cross-Modality Transfer
A Qwen2.5-VL-7B model fine-tuned on the Long Grounded Thoughts data outperforms all open-data baselines across the evaluated vision-centric benchmarks. It even surpasses strong closed-data models such as MiMo-VL-7B-RL on V*Bench, CV-Bench, and MMStar-V, and in some cases proprietary systems such as GPT-4o and Claude 3.7.
Perhaps most surprisingly, despite being entirely vision-centric, the data positively transfers to other modalities. It shows improvements in text-only reasoning (MMLU-Pro, +2.98%) and audio reasoning (MMAU, +1.32%). Additionally, notable gains (+10%) are observed when evaluating on a single-evidence embodied QA benchmark (NiEH), even though the data contains no videos or embodied visual information.
Insights into VLM Post-Training
The research also provides valuable empirical analysis of the VLM post-training pipeline:
- Supervised Fine-Tuning (SFT) on high-quality data with non-linear reasoning traces is essential for effective online Reinforcement Learning (RL).
- Staged offline RL (SFT followed by DPO) can match the performance of online RL while significantly reducing computational demands.
- Careful SFT on high-quality data can substantially improve out-of-domain and cross-modality transfer capabilities.
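For readers unfamiliar with the offline stage mentioned above, the standard DPO objective for a single preference pair can be written in a few lines. This is the generic published DPO loss, not code from the paper; the `beta` value and the log-probabilities below are illustrative assumptions, and the paper's actual hyperparameters are not given in this summary.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference (SFT) model.
    """
    # How much more the policy favors the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # computed in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because this loss needs only log-probabilities of stored response pairs, no new samples have to be generated from the policy during training, which is where the compute savings over online RL come from.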
In conclusion, Long Grounded Thoughts offers a scalable framework for generating complex, high-quality vision-centric reasoning data. The work supports the development of open multimodal models with stronger reasoning capabilities, and its gains transfer across modalities.


