
AI’s Culinary Insight: Enhancing Visual Question Answering for Indian Food with Reasoning Chains

TLDR: This research introduces ‘Thought-For-Food,’ a novel approach to improve Visual Question Answering (VQA) for Indian cuisine, which is often overlooked by Western-biased AI systems. By automatically generating and validating multi-step reasoning chains, and then training models using supervised fine-tuning and reinforcement learning, the system significantly boosts accuracy (up to 71.12%). The method helps AI better understand complex culinary contexts and relationships in diverse Indian dishes, demonstrating the efficacy of reasoning-driven approaches for culturally varied food domains.

In the rapidly evolving world of Artificial Intelligence, Visual Question Answering (VQA) systems have shown remarkable capabilities in interpreting images and answering questions about them. However, a significant gap exists when these systems encounter the rich and diverse culinary landscape of India. Traditional VQA models often fall short because they are trained primarily on Western cuisine, and they struggle with the unique complexities of Indian dishes.

A new research paper titled “Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering” addresses this challenge head-on. Authored by Riddhi Jain, Manasi Patwardhan, Parijat Deshpande, and Venkataramana Runkana from TCS-Research, this work proposes a novel approach to enhance AI’s understanding of Indian food through structured, multi-step reasoning. You can read the full paper here.

The Unique Challenge of Indian Cuisine for AI

Indian food is incredibly diverse, influenced by geography, religion, and traditions. A single meal can feature items differing in preparation, presentation, and flavor. This richness poses unique challenges for AI systems. Existing Food VQA systems, even those attempting to cover Indian food like IndiFoodVQA, often follow a two-step process: generating an answer first, then an explanation. The researchers argue that for Indian food, an accurate answer often requires a multi-step reasoning process, understanding complex culinary contexts and relationships between various food items.

Introducing Reasoning Chains for Food VQA

The core innovation of this research is the creation of “reasoning chains.” Instead of directly predicting an answer, the models are trained to follow a logical sequence of sub-questions and sub-answers that lead to the final correct response. This mimics human thought processes when analyzing a complex visual scene and answering a question about it. These reasoning chains are synthesized with minimal human intervention, making the process scalable.
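To make the idea concrete, here is a minimal sketch, in plain Python, of what such a reasoning chain might look like as a data record and how it could be flattened into the step-wise text a model is trained to produce. The field names, the rendering format, and the example chain are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningStep:
    sub_question: str   # intermediate question the model asks itself
    sub_answer: str     # its answer to that intermediate question

@dataclass
class ReasoningChain:
    question: str               # the original VQA question about the image
    steps: List[ReasoningStep]  # ordered sub-question/sub-answer pairs
    final_answer: str           # the answer the chain arrives at

    def render(self) -> str:
        """Flatten the chain into the step-wise text a model is trained to emit."""
        lines = [
            f"Step {i}: {step.sub_question} -> {step.sub_answer}"
            for i, step in enumerate(self.steps, start=1)
        ]
        lines.append(f"Final answer: {self.final_answer}")
        return "\n".join(lines)

# A hypothetical chain for a question about an Indian thali image.
chain = ReasoningChain(
    question="Which item on this plate is deep-fried?",
    steps=[
        ReasoningStep("What dishes are visible on the plate?", "Rice, dal, and a puri."),
        ReasoningStep("Which of these is typically deep-fried?", "Puri is deep-fried in oil."),
    ],
    final_answer="Puri",
)
print(chain.render())
```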

How the System Works: A Two-Stage Training Approach

The methodology involves two main stages:

1. Supervised Fine-Tuning (SFT): The process begins with the IndiFoodVQA dataset, which provides images of Indian food, questions, answer choices, and reasons. The researchers augment this dataset by generating step-wise reasoning chains. They use Vision-Language Models (VLMs) to identify food items and their positions in an image, and then Large Language Models (LLMs) to generate the reasoning chains from these visual cues and a few human-annotated examples. The synthesized chains are then validated, and only those leading to the correct answer are used to fine-tune smaller LLMs and VLMs. This ensures the models learn to generate logical, step-by-step explanations (a minimal sketch of both training stages follows this list).

2. Reinforcement Learning (RL): Following SFT, the models undergo further optimization using reinforcement learning techniques, specifically Direct Preference Optimization (DPO) and Group Relative Preference Optimization (GRPO). In this stage, the models are rewarded for generating reasoning chains that lead to the correct answer and penalized for incorrect ones. This allows the models to learn from a larger dataset, including reasoning chains that might not have been validated in the SFT stage, making them more robust and consistent.
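The paper does not include code, but the two stages can be illustrated with a short, hedged sketch. The snippet below assumes a simple dictionary layout for synthesized chains (mirroring the structure sketched earlier) and shows the validation filter that keeps only chains ending in the gold answer for SFT, how a kept chain could be formatted into a prompt/target pair, how correct and incorrect chains could be paired into DPO preference data, and an outcome-style reward of the kind GRPO optimizes. All function and field names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the two training stages described above; field and
# function names are assumptions, not the authors' code.

def is_valid(chain: dict, gold_answer: str) -> bool:
    """Keep a synthesized chain for SFT only if it arrives at the gold answer."""
    return chain["final_answer"].strip().lower() == gold_answer.strip().lower()

def build_sft_example(image_desc: str, question: str, chain: dict) -> dict:
    """Format a validated chain as a prompt/target pair for supervised fine-tuning."""
    prompt = f"Image: {image_desc}\nQuestion: {question}\nReason step by step, then answer."
    steps = "\n".join(
        f"Step {i}: {s['sub_question']} -> {s['sub_answer']}"
        for i, s in enumerate(chain["steps"], start=1)
    )
    target = f"{steps}\nFinal answer: {chain['final_answer']}"
    return {"prompt": prompt, "completion": target}

def build_dpo_pair(prompt: str, correct_chain_text: str, incorrect_chain_text: str) -> dict:
    """Preference pair for DPO: a chain that reaches the right answer is 'chosen',
    one that ends in a wrong answer is 'rejected'."""
    return {"prompt": prompt, "chosen": correct_chain_text, "rejected": incorrect_chain_text}

def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """GRPO-style outcome reward: +1 for chains that land on the correct answer, -1 otherwise."""
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else -1.0
```

In practice, the preference pairs and reward function would feed a standard DPO or GRPO trainer; the point of the sketch is simply that supervision comes from whether a chain's final answer matches the gold label.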

Key Findings and Performance Improvements

The research yielded several important insights and demonstrated significant performance gains:

  • Inherent Understanding: Initial “zero-shot” tests showed low accuracy (31-51%), indicating that models lack an inherent understanding of Indian cuisine without specific training.
  • Vision vs. Language Models: Vision-Language Models (VLMs) generally outperformed pure Language Models (LLMs) in zero-shot scenarios, likely because VLMs can directly process image information, while LLMs rely on extracted visual descriptions.
  • Power of Reasoning: Models specifically tuned for reasoning, like DeepSeek, showed remarkable improvement (up to 95.04%) after reasoning-aligned SFT and RL training, highlighting their adaptability to this task.
  • Significant Accuracy Boost: The addition of reasoning chains through SFT training improved accuracies by 14-20 percentage points across various models compared to baselines. Reinforcement Learning further boosted these accuracies by an additional 3-14 percentage points.
  • Impact on Question Types: Questions requiring multi-step thinking, such as those about fusion dishes, cooking techniques, ingredient substitutions, and allergens, saw the biggest gains.
  • Knowledge Augmentation: While external domain knowledge from a Knowledge Graph (IndiFoodKG) was explored, its impact was mixed. It improved performance for categories like ingredients and health but could sometimes distract the model for other question types.

The best-performing model, Qwen2.5-VL-3B-Instruct, achieved an impressive accuracy of 71.12% with DPO RL training, setting a new state-of-the-art on the IndiFoodVQA benchmark.

Conclusion

This research underscores the critical role of multi-step reasoning in Visual Question Answering for complex and culturally diverse domains like Indian cuisine. By automatically synthesizing and validating reasoning chains, and then leveraging supervised fine-tuning and reinforcement learning, the authors have developed a highly effective method that significantly outperforms previous approaches. This work paves the way for more intelligent AI systems that can truly understand and assist with the nuances of global culinary traditions.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
