GanitBench: A New Benchmark Uncovers AI's Multilingual Math Challenges

TLDR: GanitBench is a novel bilingual (English and Hindi) benchmark featuring 1527 image-based mathematical questions sourced from Indian examinations (JEE Advanced, CBSE Boards). It evaluates Vision Language Models (VLMs) like GPT-4o mini and Claude 3 Haiku, revealing that current models struggle with mathematical reasoning, particularly in Hindi. A ‘Double Lock’ constraint, requiring correct answers in both languages, significantly reduced performance, highlighting the need for VLMs to improve their cross-lingual reasoning abilities.

Artificial intelligence models, especially those that can understand both images and text (Vision Language Models or VLMs), have made incredible strides in recent years. However, a significant challenge remains: evaluating their ability to perform complex reasoning, particularly in mathematics, across different languages. Most existing benchmarks are primarily in English, leaving a gap for other languages like Hindi.

Addressing this crucial need, a new research paper introduces GanitBench, a challenging bilingual benchmark designed to assess mathematical reasoning in VLMs. This benchmark comprises 1527 vision-only questions, meaning the questions are presented as images that include both figures and text. These questions cover various topics in mathematics and are available in both English and Hindi.

The questions for GanitBench were carefully collected from two major Indian examinations: the JEE Advanced and the CBSE Boards examinations. These are widely taken by students in India, and their official question papers are openly provided in both English and Hindi, ensuring authentic, untranslated sources for the Hindi questions.

The researchers evaluated two prominent closed-source models, GPT-4o mini and Claude 3 Haiku, using GanitBench. They tested these models in two settings: zero-shot Chain-of-Thought (CoT) and two-shot CoT. Chain-of-Thought prompting encourages the model to generate step-by-step reasoning, mimicking human thought processes. In the two-shot setting, the models were provided with two example questions and their solutions to learn from.
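To make the difference between the two settings concrete, here is a minimal sketch of how the text part of such prompts might be assembled. The instruction wording, the function name `build_cot_prompt`, and the toy examples are assumptions for illustration; the article does not give the prompts the researchers actually used, and in GanitBench the question itself is supplied as an image.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the paper's actual prompt wording
# is not given in this article.
COT_INSTRUCTION = (
    "Solve the question shown in the image. "
    "Think step by step, then state the final answer."
)


def build_cot_prompt(examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Return the text part of a prompt; the question image is attached separately.

    With no examples this is a zero-shot CoT prompt; with two worked
    examples it becomes a two-shot CoT prompt.
    """
    parts = []
    for i, (question, solution) in enumerate(examples, start=1):
        parts.append(f"Example {i}\nQuestion: {question}\nSolution: {solution}")
    parts.append(COT_INSTRUCTION)
    return "\n\n".join(parts)


# Zero-shot: only the step-by-step instruction.
zero_shot = build_cot_prompt()

# Two-shot: two toy worked examples (made up here) precede the instruction.
two_shot = build_cot_prompt([
    ("What is 12 * 3?", "12 * 3 = 36. Final answer: 36"),
    ("Solve x + 5 = 9.", "x = 9 - 5 = 4. Final answer: 4"),
])
```

The only difference between the two settings in this sketch is whether worked examples are placed ahead of the instruction.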

A unique aspect of this evaluation was the introduction of a “Double Lock” constraint. Under this condition, a question was considered correctly solved only if the model provided the correct answer for both the English and Hindi versions of the same question. This stringent criterion aimed to specifically examine the models’ reasoning capabilities across languages.
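The scoring rule described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the researchers' evaluation code: the per-question correctness dictionaries, the function names, and the toy data are all hypothetical.

```python
from typing import Dict


def accuracy(correct: Dict[str, bool]) -> float:
    """Percentage of questions marked correct."""
    return 100.0 * sum(correct.values()) / len(correct)


def double_lock_accuracy(english: Dict[str, bool], hindi: Dict[str, bool]) -> float:
    """A question counts only if both language versions were answered correctly."""
    both = {qid: english[qid] and hindi[qid] for qid in english}
    return accuracy(both)


# Toy results for four questions (made-up data, keyed by question id).
english = {"q1": True, "q2": True, "q3": False, "q4": True}
hindi = {"q1": True, "q2": False, "q3": True, "q4": True}

print(accuracy(english))                     # 75.0
print(accuracy(hindi))                       # 75.0
print(double_lock_accuracy(english, hindi))  # 50.0
```

In this toy case each language scores 75% on its own, yet only two of the four questions are solved in both languages, so the Double Lock score can never exceed the lower of the two per-language accuracies.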

Key Findings from GanitBench

The evaluation revealed several important insights into the current capabilities of VLMs:

Performance Disparity: GPT-4o mini was the stronger model, achieving the highest average accuracy of 38.15% in the zero-shot CoT setting. Claude 3 Haiku’s performance was significantly lower, around half of GPT-4o mini’s.
Impact of Two-shot CoT: Surprisingly, the two-shot CoT setting did not consistently improve performance for either model. In many cases, accuracies under this setting were lower than under zero-shot CoT.
Language Barrier: A significant observation was the decrease in performance when models answered questions in Hindi compared to their English equivalents. This suggests that VLMs still struggle with mathematical reasoning when dealing with languages other than English.
“Double Lock” Challenge: The “Double Lock” constraint severely impacted the models’ accuracies. The highest accuracy dropped from 38.15% to 23.33% under this condition. This highlights that current VLMs face considerable difficulty in consistently providing correct solutions for the same problem across different languages.
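To put the Double Lock drop in perspective, the short calculation below works out its size from the two headline figures quoted above. It is only arithmetic on those reported numbers and assumes they are directly comparable; the variable names are illustrative.

```python
# Headline figures as reported in the article.
best_accuracy = 38.15     # highest average accuracy, zero-shot CoT
best_double_lock = 23.33  # highest accuracy under the Double Lock constraint

# Drop in percentage points, and as a share of the original accuracy.
absolute_drop = best_accuracy - best_double_lock
relative_drop = 100 * absolute_drop / best_accuracy

print(f"{absolute_drop:.2f} points, {relative_drop:.1f}% relative")
```

That is a fall of 14.82 percentage points, or a little under two fifths of the best score.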

The study concludes that while VLMs show promise, there’s a clear need for improvement in their mathematical reasoning capabilities, especially in multilingual contexts. GanitBench serves as a vital tool for future research, facilitating the inclusion of languages like Hindi in AI development and pushing for more robust and linguistically diverse models. The full research paper can be found here.
