Unlocking Advanced Visual Reasoning in AI with Long Grounded Thoughts

TLDR: The research paper “Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale” introduces a data generation framework and a dataset of over 1 million synthetic vision-centric questions. The questions are generated in two stages: first, diverse, object-centric questions are created from image captions and bounding boxes; second, these are composed into harder, multi-hop problems. The framework also synthesizes rich, non-linear reasoning traces by combining VLMs and reasoning LLMs. Models fine-tuned on this data achieve state-of-the-art performance on vision-centric benchmarks, outperforming open-data baselines and some closed-data models, and they also show positive transfer to text-only and audio reasoning tasks.

Recent advancements in multimodal reasoning, particularly those involving vision and language, have largely relied on undisclosed datasets and proprietary methods for data creation. This has left a significant gap in understanding how to systematically build large-scale, high-quality datasets for vision-centric reasoning tasks that go beyond simple visual math problems.

Introducing Long Grounded Thoughts: A New Approach to Visual Reasoning Data

A new research paper, Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale, introduces a data generation framework and dataset called ‘Long Grounded Thoughts’. The framework is used to create over 1 million high-quality synthetic vision-centric questions spanning diverse skills and levels of complexity. The dataset also includes preference data and instruction prompts to support both offline and online reinforcement learning (RL) methods.

The Two-Stage Data Synthesis Framework

The synthesis framework operates in two main stages to ensure both scale and complexity:

1. Scale and Diversity: In the first stage, the framework generates a large number of diverse and verifiable visual questions. It achieves this by using imagery combined with rich metadata, such as detailed captions and bounding boxes. By incorporating object-level metadata, the system can generate questions that are precisely grounded to specific visual elements, significantly enhancing question diversity and visual grounding compared to methods that rely solely on global captions.

2. Complexity via Composition: The second stage focuses on increasing the difficulty of the questions. A ‘composition hardening algorithm’ merges simpler questions generated in the first stage into more challenging, multi-hop visual problems. These complex problems require models to decompose them into intermediate steps and apply higher-order reasoning to find solutions.
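The two stages above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Obj` and `QA` records, the two question templates, and the `compose` rule are all hypothetical, chosen only to show how object-level metadata can yield questions with checkable answers and how two simple questions can be chained into a multi-hop one.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Obj:
    """Hypothetical object metadata: a unique label and an (x, y, w, h) box."""
    label: str
    box: tuple  # (x, y, width, height) in pixels


@dataclass(frozen=True)
class QA:
    question: str
    answer: str
    subject: str = ""  # label of the object the question is about, if any
    hops: int = 1


def selector_questions(objects):
    """Stage 1 (sketch): questions whose answer is an object label."""
    out = []
    for a, b in combinations(objects, 2):
        if a.box[0] == b.box[0]:
            continue  # skip ties so the answer stays unambiguous
        left = a if a.box[0] < b.box[0] else b
        out.append(QA(f"Which is further left, the {a.label} or the {b.label}?",
                      left.label))
    return out


def attribute_questions(objects):
    """Stage 1 (sketch): yes/no questions about a single object's box."""
    out = []
    for o in objects:
        _, _, w, h = o.box
        out.append(QA(f"Is the {o.label} wider than it is tall?",
                      "yes" if w > h else "no", subject=o.label))
    return out


def compose(selector, attribute):
    """Stage 2 (sketch): chain two questions into one multi-hop question.

    Valid only when the selector's answer names the attribute question's subject.
    """
    if selector.answer != attribute.subject:
        raise ValueError("selector answer must match attribute subject")
    follow_up = attribute.question.replace(f"the {attribute.subject}", "that object")
    return QA(
        question=f"First decide: {selector.question} Then answer: {follow_up}",
        answer=attribute.answer,
        hops=selector.hops + attribute.hops,
    )


# Made-up metadata for one image.
objects = [Obj("cup", (10, 40, 20, 30)), Obj("laptop", (100, 20, 200, 120))]
selectors = selector_questions(objects)
attributes = attribute_questions(objects)
hard = [compose(s, a) for s in selectors for a in attributes
        if s.answer == a.subject]
for qa in hard:
    print(qa.hops, qa.question, "->", qa.answer)
```

Because every answer is computed from the bounding boxes, the composed question stays verifiable: the model must first resolve which object is meant and then answer the follow-up about it.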

Synthesizing Rich Reasoning Traces

To equip Vision-Language Models (VLMs) with sophisticated reasoning abilities, the paper introduces a two-stage process for synthesizing reasoning traces (chains of thought, or CoTs) that uses both VLMs and reasoning Large Language Models (LLMs). First, initial CoT traces are distilled from VLMs so that they stay within the distribution of VLM outputs. Then, these traces are expanded by reasoning LLMs, which inject richer, non-linear problem-solving strategies and capture the diverse cognitive behaviors found in advanced reasoning models.
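A rough Python sketch of this two-stage trace synthesis follows. The callables `vlm_draft` and `llm_expand` are hypothetical stand-ins for real model calls, and the `Answer:` marker and the answer-matching filter are assumptions made for illustration; the article does not describe the paper's actual prompts or filtering rules.

```python
from typing import Callable, Optional


def extract_answer(trace: str) -> str:
    """Return the text after the last 'Answer:' marker (a convention assumed here)."""
    marker = "Answer:"
    idx = trace.rfind(marker)
    return trace[idx + len(marker):].strip() if idx != -1 else ""


def synthesize_trace(
    image_ref: str,
    question: str,
    gold: str,
    vlm_draft: Callable[[str, str], str],
    llm_expand: Callable[[str, str], str],
) -> Optional[str]:
    """Draft a trace with a VLM, expand it with a reasoning LLM, and keep the
    result only if both stages end in the known answer (an assumed filter)."""
    draft = vlm_draft(image_ref, question)
    if extract_answer(draft) != gold:
        return None
    expanded = llm_expand(question, draft)
    if extract_answer(expanded) != gold:
        return None
    return expanded


# Stub models so the sketch runs without any external service.
def fake_vlm(image_ref: str, question: str) -> str:
    return "The cup is at x=10, the laptop at x=100. Answer: cup"


def fake_llm(question: str, draft: str) -> str:
    check = "Wait, let me double-check the coordinates: 10 < 100. Answer:"
    return draft.replace("Answer:", check)


trace = synthesize_trace(
    "img_001", "Which is further left, the cup or the laptop?", "cup",
    fake_vlm, fake_llm,
)
print(trace)
```

The stub expansion adds a self-verification step, which is one simple example of the non-linear behavior the second stage is meant to inject.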

Remarkable Performance and Cross-Modality Transfer

A Qwen2.5-VL-7B model fine-tuned on the Long Grounded Thoughts data outperforms all open-data baselines across the evaluated vision-centric benchmarks. It also surpasses strong closed-data models like MiMo-VL-7B-RL on V*Bench, CV-Bench, and MMStar-V, and in some cases proprietary systems such as GPT-4o and Claude 3.7.

Perhaps most surprisingly, although the data is entirely vision-centric, training on it transfers positively to other modalities. It improves text-only reasoning (MMLU-Pro, +2.98%) and audio reasoning (MMAU, +1.32%). Notable gains (+10%) are also observed on a single-evidence embodied QA benchmark (NiEH), even though the data contains no videos or embodied visual information.

Insights into VLM Post-Training

The research also provides valuable empirical analysis of the VLM post-training pipeline:

Supervised Fine-Tuning (SFT) on high-quality data with non-linear reasoning traces is essential for effective online Reinforcement Learning (RL).
Staged offline RL (SFT followed by DPO) can match the performance of online RL while significantly reducing computational demands.
Careful SFT on high-quality data can substantially improve out-of-domain and cross-modality transfer capabilities.
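For readers unfamiliar with DPO (Direct Preference Optimization), the per-pair loss used in the offline stage can be sketched in a few lines of Python. This is the standard DPO objective computed on summed sequence log-probabilities, not code from the paper; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference (here, the SFT model).
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities for one (chosen, rejected) pair.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals the reference
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)    # policy favors the chosen answer
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)     # policy favors the rejected answer
print(neutral, better, worse)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's. Training needs only a fixed set of preference pairs rather than fresh rollouts, which is consistent with the reduced compute cost noted above.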

In conclusion, Long Grounded Thoughts offers a scalable framework for generating high-quality, complex vision-centric reasoning data. Models trained on this data outperform open-data baselines on the evaluated vision-centric benchmarks and transfer to text-only, audio, and embodied QA tasks, which supports the development of open multimodal AI models with stronger reasoning capabilities.
