
AI’s Culinary Insight: Enhancing Visual Question Answering for Indian Food with Reasoning Chains

TLDR: This research introduces ‘Thought-For-Food,’ a novel approach to improve Visual Question Answering (VQA) for Indian cuisine, which is often overlooked by Western-biased AI systems. By automatically generating and validating multi-step reasoning chains, and then training models using supervised fine-tuning and reinforcement learning, the system significantly boosts accuracy (up to 71.12%). The method helps AI better understand complex culinary contexts and relationships in diverse Indian dishes, demonstrating the efficacy of reasoning-driven approaches for culturally varied food domains.

In the rapidly evolving world of Artificial Intelligence, Visual Question Answering (VQA) systems have shown remarkable capabilities in interpreting images and answering questions about them. However, a significant gap exists when these systems encounter the rich and diverse culinary landscape of India. Traditional VQA models often fall short because they are trained primarily on Western cuisine, and they struggle with the unique complexities of Indian dishes.

A new research paper titled “Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering” addresses this challenge head-on. Authored by Riddhi Jain, Manasi Patwardhan, Parijat Deshpande, and Venkataramana Runkana from TCS-Research, this work proposes a novel approach to enhance AI’s understanding of Indian food through structured, multi-step reasoning. You can read the full paper here.

The Unique Challenge of Indian Cuisine for AI

Indian food is incredibly diverse, influenced by geography, religion, and traditions. A single meal can feature items differing in preparation, presentation, and flavor. This richness poses unique challenges for AI systems. Existing Food VQA systems, even those attempting to cover Indian food like IndiFoodVQA, often follow a two-step process: generating an answer first, then an explanation. The researchers argue that for Indian food, an accurate answer often requires a multi-step reasoning process, understanding complex culinary contexts and relationships between various food items.

Introducing Reasoning Chains for Food VQA

The core innovation of this research is the creation of “reasoning chains.” Instead of directly predicting an answer, the models are trained to follow a logical sequence of sub-questions and sub-answers that lead to the final correct response. This mimics human thought processes when analyzing a complex visual scene and answering a question about it. These reasoning chains are synthesized with minimal human intervention, making the process scalable.
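To make the idea concrete, here is a minimal sketch, in plain Python, of what such a reasoning chain might look like as a data record and how it could be flattened into the step-wise text a model is trained to produce. The field names, the rendering format, and the example chain are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningStep:
    sub_question: str   # intermediate question the model asks itself
    sub_answer: str     # its answer to that intermediate question

@dataclass
class ReasoningChain:
    question: str               # the original VQA question about the image
    steps: List[ReasoningStep]  # ordered sub-question/sub-answer pairs
    final_answer: str           # the answer the chain arrives at

    def render(self) -> str:
        """Flatten the chain into the step-wise text a model is trained to emit."""
        lines = [
            f"Step {i}: {step.sub_question} -> {step.sub_answer}"
            for i, step in enumerate(self.steps, start=1)
        ]
        lines.append(f"Final answer: {self.final_answer}")
        return "\n".join(lines)

# A hypothetical chain for a question about an Indian thali image.
chain = ReasoningChain(
    question="Which item on this plate is deep-fried?",
    steps=[
        ReasoningStep("What dishes are visible on the plate?", "Rice, dal, and a puri."),
        ReasoningStep("Which of these is typically deep-fried?", "Puri is deep-fried in oil."),
    ],
    final_answer="Puri",
)
print(chain.render())
```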

How the System Works: A Two-Stage Training Approach

The methodology involves two main stages:

1. Supervised Fine-Tuning (SFT): The process begins with the IndiFoodVQA dataset, which provides images of Indian food, questions, answer choices, and reasons. The researchers augment this dataset by generating step-wise reasoning chains. They use Vision-Language Models (VLMs) to identify food items and their positions in an image, and then Large Language Models (LLMs) to generate the reasoning chains from these visual cues and a few human-annotated examples. The synthesized chains are then validated, and only those leading to the correct answer are used to fine-tune smaller LLMs and VLMs. This ensures the models learn to generate logical, step-by-step explanations (a minimal sketch of both training stages follows this list).

2. Reinforcement Learning (RL): Following SFT, the models undergo further optimization using reinforcement learning techniques, specifically Direct Preference Optimization (DPO) and Group Relative Preference Optimization (GRPO). In this stage, the models are rewarded for generating reasoning chains that lead to the correct answer and penalized for incorrect ones. This allows the models to learn from a larger dataset, including reasoning chains that might not have been validated in the SFT stage, making them more robust and consistent.
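The paper does not include code, but the two stages can be illustrated with a short, hedged sketch. The snippet below assumes a simple dictionary layout for synthesized chains (mirroring the structure sketched earlier) and shows the validation filter that keeps only chains ending in the gold answer for SFT, how a kept chain could be formatted into a prompt/target pair, how correct and incorrect chains could be paired into DPO preference data, and an outcome-style reward of the kind GRPO optimizes. All function and field names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the two training stages described above; field and
# function names are assumptions, not the authors' code.

def is_valid(chain: dict, gold_answer: str) -> bool:
    """Keep a synthesized chain for SFT only if it arrives at the gold answer."""
    return chain["final_answer"].strip().lower() == gold_answer.strip().lower()

def build_sft_example(image_desc: str, question: str, chain: dict) -> dict:
    """Format a validated chain as a prompt/target pair for supervised fine-tuning."""
    prompt = f"Image: {image_desc}\nQuestion: {question}\nReason step by step, then answer."
    steps = "\n".join(
        f"Step {i}: {s['sub_question']} -> {s['sub_answer']}"
        for i, s in enumerate(chain["steps"], start=1)
    )
    target = f"{steps}\nFinal answer: {chain['final_answer']}"
    return {"prompt": prompt, "completion": target}

def build_dpo_pair(prompt: str, correct_chain_text: str, incorrect_chain_text: str) -> dict:
    """Preference pair for DPO: a chain that reaches the right answer is 'chosen',
    one that ends in a wrong answer is 'rejected'."""
    return {"prompt": prompt, "chosen": correct_chain_text, "rejected": incorrect_chain_text}

def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """GRPO-style outcome reward: +1 for chains that land on the correct answer, -1 otherwise."""
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else -1.0
```

In practice, the preference pairs and reward function would feed a standard DPO or GRPO trainer; the point of the sketch is simply that supervision comes from whether a chain's final answer matches the gold label.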

Key Findings and Performance Improvements

The research yielded several important insights and demonstrated significant performance gains:

  • Inherent Understanding: Initial “zero-shot” tests showed low accuracy (31-51%), indicating that models lack an inherent understanding of Indian cuisine without specific training.
  • Vision vs. Language Models: Vision-Language Models (VLMs) generally outperformed pure Language Models (LLMs) in zero-shot scenarios, likely because VLMs can directly process image information, while LLMs rely on extracted visual descriptions.
  • Power of Reasoning: Models specifically tuned for reasoning, like DeepSeek, showed remarkable improvement (up to 95.04%) after reasoning-aligned SFT and RL training, highlighting their adaptability to this task.
  • Significant Accuracy Boost: The addition of reasoning chains through SFT training improved accuracies by 14-20 percentage points across various models compared to baselines. Reinforcement Learning further boosted these accuracies by an additional 3-14 percentage points.
  • Impact on Question Types: Questions requiring multi-step thinking, such as those about fusion dishes, cooking techniques, ingredient substitutions, and allergens, saw the biggest gains.
  • Knowledge Augmentation: While external domain knowledge from a Knowledge Graph (IndiFoodKG) was explored, its impact was mixed. It improved performance for categories like ingredients and health but could sometimes distract the model for other question types.

The best-performing model, Qwen2.5-VL-3B-Instruct, achieved an impressive accuracy of 71.12% with DPO RL training, setting a new state-of-the-art on the IndiFoodVQA benchmark.

Conclusion

This research underscores the critical role of multi-step reasoning in Visual Question Answering for complex and culturally diverse domains like Indian cuisine. By automatically synthesizing and validating reasoning chains, and then leveraging supervised fine-tuning and reinforcement learning, the authors have developed a highly effective method that significantly outperforms previous approaches. This work paves the way for more intelligent AI systems that can truly understand and assist with the nuances of global culinary traditions.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
