BhashaBench V1: Evaluating Language Models for India's Diverse Knowledge Systems

TLDR: BhashaBench V1 is a new, comprehensive benchmark designed to evaluate large language models (LLMs) on India-specific knowledge across four key domains: Agriculture, Legal, Finance, and Ayurveda. It features over 74,000 question-answer pairs in English and Hindi, sourced from authentic exams. Evaluations of 29+ LLMs reveal significant performance gaps across domains and languages, with models struggling more in traditional Indian knowledge systems like Ayurveda and performing better in English than Hindi. The benchmark highlights the need for specialized LLM development tailored to India’s diverse linguistic and cultural contexts.

Large language models, or LLMs, are rapidly changing the landscape of artificial intelligence, impacting fields from healthcare to finance. However, a significant challenge remains: most existing evaluation tools for these models are heavily focused on English and general knowledge. This creates a gap when assessing how well LLMs understand and perform in specific cultural and domain contexts, especially for regions like India with its rich diversity of languages and knowledge systems.

To address this need, a new benchmark called BhashaBench V1 has been introduced. It is presented as the first comprehensive, domain-specific, multi-task benchmark designed specifically for evaluating LLMs on India-centric knowledge. It focuses on four domains that are central to Indian society and the economy: Agriculture (BBK), Legal (BBL), Finance (BBF), and Ayurveda (BBA).

BhashaBench V1 is a substantial dataset, containing 74,166 carefully selected question-answer pairs. Of these, 52,494 are in English and 21,672 are in Hindi, reflecting the widespread use of both languages in India. The questions are not just theoretical; they are sourced from authentic government and professional examinations, ensuring they represent real-world scenarios and challenges faced by practitioners in these fields. The benchmark is highly detailed, covering over 90 subdomains and more than 500 specific topics, allowing for a very precise evaluation of model performance.
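As a quick sanity check on the figures above, the short sketch below confirms that the English and Hindi counts add up to the reported total and works out each language's share. The constant and function names are illustrative only and are not part of the BhashaBench release.

```python
# Counts as reported in the article; names here are illustrative.
TOTAL_PAIRS = 74_166
PAIRS_BY_LANGUAGE = {"English": 52_494, "Hindi": 21_672}


def language_shares(counts):
    """Return each language's share of all pairs, as a percentage."""
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}


# The two language subsets should account for every question-answer pair.
assert sum(PAIRS_BY_LANGUAGE.values()) == TOTAL_PAIRS

shares = language_shares(PAIRS_BY_LANGUAGE)
print(shares)
```

In other words, roughly seven in ten questions are in English and about three in ten are in Hindi.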

An extensive evaluation of more than 29 state-of-the-art LLMs was conducted using BhashaBench V1. The results revealed significant differences in how models perform across domains and languages. For example, even a top-performing model like GPT-4o showed a notable disparity, achieving 76.49% accuracy in the Legal domain but only 59.74% in Ayurveda. This highlights the particular difficulty models have with traditional Indian knowledge systems. Models also consistently performed better on English content than on Hindi content across all domains, indicating a language-specific performance gap.

Further analysis at the subdomain level provided more granular insights. Areas such as Cyber Law and International Finance generally saw stronger performance from LLMs. In contrast, traditional and specialized domains like Panchakarma (an Ayurvedic therapy), Seed Science, and Human Rights proved to be notably challenging for current models. The benchmark also categorized questions by difficulty (Easy, Medium, Hard) and type (Multiple Choice, Assertion/Reasoning, Fill in the Blanks, Match the Column, Reading Comprehension, Rearrange the Sequence). Larger and instruction-tuned models consistently outperformed smaller models, especially on harder questions and more complex question types like Assertion/Reasoning and Reading Comprehension.
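The kind of breakdown described above (by domain, language, difficulty, or question type) can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual evaluation code: the record fields (`domain`, `language`, `difficulty`, `question_type`, `answer`, `predicted`) are assumed for the example, and the four records are made-up toy data rather than BhashaBench questions or results.

```python
from collections import defaultdict


def accuracy_by(records, field):
    """Group graded records by `field` and return percent accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = record[field]
        totals[key] += 1
        # A record counts as correct when the prediction matches the answer key.
        correct[key] += int(record["predicted"] == record["answer"])
    return {key: 100 * correct[key] / totals[key] for key in totals}


# Toy records with a hypothetical schema; not real benchmark data.
records = [
    {"domain": "Legal", "language": "English", "difficulty": "Easy",
     "question_type": "Multiple Choice", "answer": "A", "predicted": "A"},
    {"domain": "Legal", "language": "Hindi", "difficulty": "Hard",
     "question_type": "Assertion/Reasoning", "answer": "B", "predicted": "C"},
    {"domain": "Ayurveda", "language": "English", "difficulty": "Medium",
     "question_type": "Multiple Choice", "answer": "D", "predicted": "D"},
    {"domain": "Ayurveda", "language": "Hindi", "difficulty": "Hard",
     "question_type": "Match the Column", "answer": "A", "predicted": "B"},
]

for field in ("domain", "language", "difficulty", "question_type"):
    print(field, accuracy_by(records, field))
```

Because the same function is reused for every field, a new way of slicing the results only needs a new column in the records, not new grading logic.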

Interestingly, when comparing models within the GPT family, the gpt-oss-120b model significantly outperformed GPT-4o in the Finance domain (71.05% vs. 54.97%). This suggests that raw parameter size alone doesn’t guarantee superior performance; architectural choices and training methodologies, particularly for mathematical reasoning, play a crucial role. For smaller models (under 4 billion parameters), Param-1 and Qwen2.5-3B emerged as leading performers, demonstrating that efficient architectures and targeted optimizations can still yield reasonable results in resource-constrained environments.
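To put the reported numbers side by side, the sketch below computes the gaps in percentage points, using only the accuracy figures quoted in this article. The dictionary layout and the helper name `gap_in_points` are illustrative assumptions.

```python
# Accuracy figures (percent) as quoted in the article.
REPORTED_ACCURACY = {
    ("GPT-4o", "Legal"): 76.49,
    ("GPT-4o", "Ayurveda"): 59.74,
    ("GPT-4o", "Finance"): 54.97,
    ("gpt-oss-120b", "Finance"): 71.05,
}


def gap_in_points(first, second):
    """Return accuracy(first) - accuracy(second) in percentage points."""
    return round(REPORTED_ACCURACY[first] - REPORTED_ACCURACY[second], 2)


# GPT-4o: Legal versus Ayurveda.
domain_gap = gap_in_points(("GPT-4o", "Legal"), ("GPT-4o", "Ayurveda"))

# Finance: gpt-oss-120b versus GPT-4o.
finance_gap = gap_in_points(("gpt-oss-120b", "Finance"), ("GPT-4o", "Finance"))

print(f"GPT-4o, Legal vs. Ayurveda: {domain_gap} points")
print(f"Finance, gpt-oss-120b vs. GPT-4o: {finance_gap} points")
```

Both gaps come out at roughly 16 to 17 percentage points, so the difference between two models in a single domain is about as large as the difference between one model's strongest and weakest reported domains.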

In conclusion, BhashaBench V1 serves as a vital tool for assessing and advancing LLMs for India-centric applications. It underscores the need for specialized models that integrate India-specific knowledge, cultural contexts, and robust multilingual capabilities. All code, benchmarks, and resources for BhashaBench V1 are publicly available to support open research and accelerate progress towards more inclusive and culturally aware language models.
