TLDR: BhashaBench V1 is a new, comprehensive benchmark designed to evaluate large language models (LLMs) on India-specific knowledge across four key domains: Agriculture, Legal, Finance, and Ayurveda. It features over 74,000 question-answer pairs in English and Hindi, sourced from authentic exams. Evaluations of 29+ LLMs reveal significant performance gaps across domains and languages, with models struggling more in traditional Indian knowledge systems like Ayurveda and performing better in English than Hindi. The benchmark highlights the need for specialized LLM development tailored to India’s diverse linguistic and cultural contexts.
Large language models, or LLMs, are rapidly changing the landscape of artificial intelligence, impacting fields from healthcare to finance. However, a significant challenge remains: most existing evaluation tools for these models are heavily focused on English and general knowledge. This creates a gap when assessing how well LLMs understand and perform in specific cultural and domain contexts, especially for regions like India with its rich diversity of languages and knowledge systems.
To address this need, a new benchmark called BhashaBench V1 has been introduced. Its authors present it as the first comprehensive, domain-specific, multi-task benchmark designed for evaluating LLMs on India-centric knowledge. It focuses on four domains that are central to Indian society and the economy: Agriculture (BBK), Legal (BBL), Finance (BBF), and Ayurveda (BBA).
BhashaBench V1 is a substantial dataset, containing 74,166 carefully selected question-answer pairs. Of these, 52,494 are in English and 21,672 are in Hindi, reflecting the widespread use of both languages in India. The questions are not just theoretical; they are sourced from authentic government and professional examinations, ensuring they represent real-world scenarios and challenges faced by practitioners in these fields. The benchmark is highly detailed, covering over 90 subdomains and more than 500 specific topics, allowing for a very precise evaluation of model performance.
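As a quick sanity check on the figures above, the language split can be worked out directly from the reported counts. The snippet below is only illustrative arithmetic; the variable names are ours and are not part of the benchmark's code.

```python
# Question counts as reported for BhashaBench V1.
counts = {"English": 52_494, "Hindi": 21_672}

# The two language subsets should add up to the reported total of 74,166.
total = sum(counts.values())

# Share of each language as a percentage of the whole benchmark.
shares = {lang: 100 * n / total for lang, n in counts.items()}

for lang, pct in shares.items():
    print(f"{lang}: {counts[lang]:,} questions ({pct:.1f}%)")
print(f"Total: {total:,} questions")
```

English accounts for roughly seven in ten questions, which is worth keeping in mind when reading aggregate scores that are not broken down by language.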
An extensive evaluation covered 29+ state-of-the-art LLMs on BhashaBench V1. The results revealed significant differences in performance across domains and languages. For example, even a top-performing model such as GPT-4o showed a notable disparity, achieving 76.49% accuracy in the Legal domain but only 59.74% in Ayurveda, which highlights the particular difficulty models have with traditional Indian knowledge systems. Models also consistently performed better on English content than on Hindi across all domains, indicating a language-specific performance gap.
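To make the size of that disparity concrete, the gap can be expressed both in percentage points and relative to the stronger domain. This is a minimal sketch that uses only the two GPT-4o accuracies quoted above; the variable names are illustrative.

```python
# Accuracies (in percent) reported for GPT-4o in the text above.
accuracy = {"Legal": 76.49, "Ayurveda": 59.74}

# Absolute gap in percentage points.
gap_points = accuracy["Legal"] - accuracy["Ayurveda"]

# Relative drop, measured against the stronger domain (Legal).
relative_drop = 100 * gap_points / accuracy["Legal"]

print(f"Gap: {gap_points:.2f} percentage points")
print(f"Relative drop from Legal: {relative_drop:.1f}%")
```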
Further analysis at the subdomain level provided more granular insights. Areas such as Cyber Law and International Finance generally saw stronger performance from LLMs. In contrast, traditional and specialized domains like Panchakarma (an Ayurvedic therapy), Seed Science, and Human Rights proved to be notably challenging for current models. The benchmark also categorized questions by difficulty (Easy, Medium, Hard) and type (Multiple Choice, Assertion/Reasoning, Fill in the Blanks, Match the Column, Reading Comprehension, Rearrange the Sequence). Larger and instruction-tuned models consistently outperformed smaller models, especially on harder questions and more complex question types like Assertion/Reasoning and Reading Comprehension.
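A breakdown like the one described above comes down to grouping per-question results by one or more attributes and computing accuracy within each group. The following is a minimal sketch under stated assumptions: the record fields (`domain`, `language`, `difficulty`, `correct`) and the helper `accuracy_by` are hypothetical, and the six records are made-up toy data, not benchmark results or the benchmark's actual schema.

```python
from collections import defaultdict

# Toy per-question results (invented for illustration only).
results = [
    {"domain": "Legal", "language": "English", "difficulty": "Easy", "correct": True},
    {"domain": "Legal", "language": "English", "difficulty": "Hard", "correct": True},
    {"domain": "Legal", "language": "Hindi", "difficulty": "Easy", "correct": False},
    {"domain": "Ayurveda", "language": "English", "difficulty": "Hard", "correct": False},
    {"domain": "Ayurveda", "language": "Hindi", "difficulty": "Hard", "correct": False},
    {"domain": "Ayurveda", "language": "Hindi", "difficulty": "Easy", "correct": True},
]


def accuracy_by(records, *keys):
    """Return {group tuple: accuracy in percent} for the given grouping keys."""
    tally = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        tally[group][0] += int(rec["correct"])
        tally[group][1] += 1
    return {group: 100 * c / t for group, (c, t) in tally.items()}


by_domain = accuracy_by(results, "domain")
by_domain_language = accuracy_by(results, "domain", "language")

for group, acc in sorted(by_domain_language.items()):
    print(group, f"{acc:.1f}%")
```

Passing several keys at once, for example `accuracy_by(results, "domain", "difficulty")`, gives the kind of cross-cut needed to see whether a model's weakness is concentrated in hard questions within a particular domain.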
Interestingly, when comparing models within the GPT family, the gpt-oss-120b model significantly outperformed GPT-4o in the Finance domain (71.05% vs. 54.97%). This suggests that raw parameter size alone doesn’t guarantee superior performance; architectural choices and training methodologies, particularly for mathematical reasoning, play a crucial role. For smaller models (under 4 billion parameters), Param-1 and Qwen2.5-3B emerged as leading performers, demonstrating that efficient architectures and targeted optimizations can still yield reasonable results in resource-constrained environments.
In conclusion, BhashaBench V1 serves as a vital tool for assessing and advancing LLMs for India-centric applications. It underscores the urgent need for developing specialized models that integrate India-specific knowledge, cultural contexts, and robust multilingual capabilities. All code, benchmarks, and resources for BhashaBench V1 are publicly available to support open research and accelerate progress towards more inclusive and culturally aware language models.


