TLDR: GanitBench is a novel bilingual (English and Hindi) benchmark featuring 1527 image-based mathematical questions sourced from Indian examinations (JEE Advanced, CBSE Boards). It evaluates Vision Language Models (VLMs) like GPT-4o mini and Claude 3 Haiku, revealing that current models struggle with mathematical reasoning, particularly in Hindi. A ‘Double Lock’ constraint, requiring correct answers in both languages, significantly reduced performance, highlighting the need for VLMs to improve their cross-lingual reasoning abilities.
Artificial intelligence models, especially those that can understand both images and text (Vision Language Models or VLMs), have made incredible strides in recent years. However, a significant challenge remains: evaluating their ability to perform complex reasoning, particularly in mathematics, across different languages. Most existing benchmarks are primarily in English, leaving a gap for other languages like Hindi.
Addressing this crucial need, a new research paper introduces GanitBench, a challenging bilingual benchmark designed to assess mathematical reasoning in VLMs. This benchmark comprises 1527 vision-only questions, meaning the questions are presented as images that include both figures and text. These questions cover various topics in mathematics and are available in both English and Hindi.
The questions for GanitBench were carefully collected from two major Indian examinations: the JEE Advanced and the CBSE Boards examinations. These are widely taken by students in India, and their official question papers are openly provided in both English and Hindi, ensuring authentic, untranslated sources for the Hindi questions.
The researchers evaluated two prominent closed-source models, GPT-4o mini and Claude 3 Haiku, using GanitBench. They tested these models in two settings: zero-shot Chain-of-Thought (CoT) and two-shot CoT. Chain-of-Thought prompting encourages the model to generate step-by-step reasoning, mimicking human thought processes. In the two-shot setting, the models were provided with two example questions and their solutions to learn from.
A unique aspect of this evaluation was the introduction of a “Double Lock” constraint. Under this condition, a question was considered correctly solved only if the model provided the correct answer for both the English and Hindi versions of the same question. This stringent criterion aimed to specifically examine the models’ reasoning capabilities across languages.
Also Read:
- Developing Curriculum-Aligned Math Assessments Using Generative AI in Malaysia
- Evaluating Trust in AI: A New Benchmark for Multimodal Model Confidence
Key Findings from GanitBench
The evaluation revealed several important insights into the current capabilities of VLMs:
- Performance Disparity: GPT-4o mini emerged as the more dominant model, achieving a highest average accuracy of 38.15% in the zero-shot CoT setting. Claude 3 Haiku’s performance was significantly lower, around half of GPT-4o mini’s.
- Impact of Two-shot CoT: Surprisingly, the two-shot CoT setting did not consistently lead to an increase in performance for either model. In many cases, accuracies under this setting were lower compared to zero-shot CoT.
- Language Barrier: A significant observation was the decrease in performance when models answered questions in Hindi compared to their English equivalents. This suggests that VLMs still struggle with mathematical reasoning when dealing with languages other than English.
- “Double Lock” Challenge: The “Double Lock” constraint severely impacted the models’ accuracies. The highest accuracy dropped from 38.15% to 23.33% under this condition. This highlights that current VLMs face considerable difficulty in consistently providing correct solutions for the same problem across different languages.
The study concludes that while VLMs show promise, there’s a clear need for improvement in their mathematical reasoning capabilities, especially in multilingual contexts. GanitBench serves as a vital tool for future research, facilitating the inclusion of languages like Hindi in AI development and pushing for more robust and linguistically diverse models. The full research paper can be found here.


