TLDR: A new research paper introduces LaCoT, a novel method that enhances visual reasoning in Large Vision-Language Models (LVLMs). LaCoT reformulates reasoning as posterior inference using amortized variational inference, addressing limitations of existing training algorithms like poor generalization and reliance on biased reward models. It features Reference-Guided GFlowNet Fine-tuning (RGFN) for better exploration, Token-level Marginal Reward Approximation for efficient training, and Bayesian Inference over Latent Rationales (BiN) for robust answer selection. Empirical results show significant performance improvements on various visual reasoning benchmarks, demonstrating LaCoT’s effectiveness, generalization, and interpretability.
In the rapidly evolving world of artificial intelligence, Large Vision-Language Models (LVLMs) are at the forefront, combining visual understanding with natural language processing to tackle complex reasoning tasks. A crucial element for these models is Chain-of-Thought (CoT) reasoning, which lets them break a problem down into explicit, step-by-step rationales, significantly improving their interpretability and reliability.
However, existing training methods for CoT, such as Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO), face significant challenges. These methods often struggle to generalize across diverse, unseen reasoning tasks and can be overly reliant on biased reward models. They also tend to limit the diversity of reasoning paths, potentially leading to ‘reward hacking’ where models achieve high scores without truly solving the underlying problem.
A new research paper, titled “Latent Chain-of-Thought for Visual Reasoning,” introduces a novel approach called LaCoT, designed to overcome these limitations. The authors, Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao, reformulate reasoning in LVLMs as a problem of posterior inference. This allows for a more robust and scalable training algorithm based on amortized variational inference.
Key Innovations of LaCoT
LaCoT introduces several innovative components to enhance visual reasoning:
1. Reference-Guided GFlowNet Fine-tuning (RGFN): This algorithm addresses the challenge of exploration in reinforcement learning. Unconstrained exploration can lead to ‘catastrophic forgetting,’ in which the model drifts away from its pretrained distribution and generates meaningless content. RGFN guides exploration by comparing candidate reasoning paths against a high-quality reference: only paths that outperform the reference are used for gradient updates, which reduces variance and prevents the model from collapsing into low-quality trajectories.
2. Token-level Marginal Reward Approximation: Training with CoT sequences, especially long ones, can be computationally expensive due to the need for token-level rewards. LaCoT introduces an efficient approximation method that estimates intermediate rewards using linear interpolation. This significantly reduces computational overhead without sacrificing accuracy, making mini-batch exploration for diverse sampling more feasible.
3. Bayesian Inference over Latent Rationales (BiN): At inference time, selecting the best reasoning path and answer is critical. Current methods like Best-of-N (BoN) or Beam Search are computationally costly and depend on potentially biased critic models. BiN offers a probabilistic approach, sampling multiple latent rationales and answers, then estimating the marginal likelihood of each answer. The answer with the highest estimated marginal likelihood is selected, providing a scalable, probabilistically justified strategy that improves interpretability and mitigates reward hacking.
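The reference-guided filtering in step 1 can be illustrated with a minimal sketch. Everything here is illustrative scaffolding rather than the paper’s implementation: `reward_fn` stands in for whatever model scores a rationale, and rationales are plain strings.

```python
def rgfn_filter(candidates, reference, reward_fn):
    """Keep only candidate rationales that score strictly higher than
    the reference rationale; only these survivors would drive gradient
    updates, filtering out low-quality trajectories before training.

    `reward_fn` is a hypothetical scorer mapping a rationale to a float.
    """
    ref_reward = reward_fn(reference)
    return [c for c in candidates if reward_fn(c) > ref_reward]


# Toy usage: score rationales by length so the filtering is visible.
survivors = rgfn_filter(
    candidates=["ab", "abcdef", "a"],
    reference="abc",
    reward_fn=len,
)
```

In practice the reward would come from task correctness rather than a toy scorer, but the control flow is the same: the reference sets a floor that candidates must beat.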
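The linear-interpolation idea in step 2 is likewise easy to sketch. This toy function (the names are ours, not the paper’s) spreads a single sequence-level reward across token positions instead of querying a reward model at every prefix:

```python
def interpolate_token_rewards(terminal_reward, num_tokens, start_reward=0.0):
    """Approximate per-token rewards by linearly interpolating from
    `start_reward` at the first token to the terminal (sequence-level)
    reward at the last token, avoiding a reward-model call per prefix."""
    if num_tokens <= 0:
        return []
    if num_tokens == 1:
        return [terminal_reward]
    step = (terminal_reward - start_reward) / (num_tokens - 1)
    return [start_reward + step * t for t in range(num_tokens)]


# A 5-token rationale with terminal reward 1.0 gets evenly spaced credit.
rewards = interpolate_token_rewards(terminal_reward=1.0, num_tokens=5)
```

The savings matter most for long rationales: one terminal evaluation replaces hundreds of intermediate ones, which is what makes mini-batch exploration over many sampled trajectories affordable.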
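The answer selection in step 3 amounts to marginalizing over sampled rationales: an answer supported by many high-likelihood rationales accumulates the most probability mass. A minimal sketch, assuming each sample is an `(answer, log_likelihood)` pair produced under one sampled rationale (data format and names are our assumption):

```python
import math
from collections import defaultdict

def bin_select(samples):
    """Pick the answer with the highest estimated marginal likelihood.

    `samples` is a list of (answer, log_likelihood) pairs, one per
    sampled latent rationale; summing the likelihoods per answer
    approximates marginalizing the rationale out.
    """
    mass = defaultdict(float)
    for answer, log_p in samples:
        mass[answer] += math.exp(log_p)  # accumulate p(answer, rationale_i)
    return max(mass, key=mass.get)


# "A" appears under two rationales whose combined mass beats "B"'s single one.
best = bin_select([("A", -1.0), ("B", -0.5), ("A", -0.7)])
```

Note how this differs from Best-of-N: no external critic ranks the samples, so a biased reward model cannot steer the selection, and repeated agreement across rationales is rewarded directly.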
Empirical Success and Impact
The LaCoT method was empirically tested on two base models, Qwen2.5-VL-3B and 7B, across seven visual reasoning benchmarks, including MathVista, MathVision, MathVerse, MMMU, MMVet, and MME. The results were impressive, with the 7B model showing a 6.6% improvement over its base model and outperforming GRPO by 10.6%. The 3B model also surpassed its base model by 13.9% and even achieved better results than larger models like LLaVA-CoT-11B and LLaVA-OV-7B.
LaCoT demonstrated consistent improvements in general multi-modal reasoning, particularly boosting diagram comprehension and OCR robustness on benchmarks like MathVerse-Vision-only. Furthermore, the model was shown to sample rationales with higher diversity and log-likelihood compared to baseline models, increasing the probability of finding correct answers.
The ablation studies confirmed the effectiveness of RGFN in avoiding misleading optimization and promoting diverse trajectories. BiN also consistently outperformed BoN in inference-time scaling, showing that increasing the number of rationale candidates and temperature systematically improves accuracy by reducing variance and offering broader posterior coverage.
In conclusion, LaCoT represents a significant advancement in visual reasoning for LVLMs. By leveraging amortized variational inference, it enables the generation of diverse and plausible reasoning trajectories, leading to more effective, generalizable, and interpretable AI models. This research paves the way for future work in knowledge distillation and synthetic data generation for visual reasoning tasks. You can read the full paper here: Latent Chain-of-Thought for Visual Reasoning.


