IIT-Bombay Unveils Extensive AI Datasets to Propel India-Centric Language and Cultural Research

TLDR: IIT-Bombay has released 16 culturally significant AI datasets on AIKosh, the central government’s AI repository, as part of the BharatGen initiative. These datasets, including digitized ancient texts, multilingual language resources, and audio-visual content, aim to build sovereign AI models for India, fostering innovation and research tailored to the nation’s diverse linguistic and cultural landscape.

Mumbai, India – In a landmark move to bolster India’s artificial intelligence (AI) capabilities, the Indian Institute of Technology (IIT) Bombay has announced the release of 16 diverse and culturally significant AI datasets on AIKosh, the central government’s official AI repository. This initiative marks a pivotal step towards developing AI models that deeply understand and cater to India’s unique linguistic and cultural nuances.

Launched in March, AIKosh is a national platform designed to support inclusive AI development across the country by providing a repository of datasets, models, and toolkits. IIT-Bombay is a leading contributor to the platform and heads the BharatGen consortium, a collaborative effort funded by the Department of Science and Technology. BharatGen comprises seven premier institutions: IIT-Bombay along with IIT Kanpur, IIT Madras, IIT Hyderabad, IIT Mandi, IIM Indore, and IIIT Hyderabad. Together they have contributed 37 models and datasets to AIKosh, and IIT-Bombay alone accounts for 16 of these.

The newly released datasets are curated to address critical gaps in India-centric AI research. A significant contribution is the digitization of 30 ancient textbooks, some as much as 18 centuries old, covering subjects like astronomy, medicine, and mathematics. This effort has yielded a dataset of approximately 218,000 sentences and 1.5 million words, now openly accessible to researchers. Other key datasets cover:

Language Translation: Over 53,000 sentences for English-Sanskrit translations, focusing on modern prose.

Speech Recognition: More than 78 hours of Sanskrit audio data to enhance speech recognition systems.

Multilingual Q&A: Question-answer sets in 11 Indian languages, including Hindi and English.

Reasoning: Math word problems in Hindi and English to improve AI’s reasoning capabilities.

Document Processing: Table detection datasets across 14 Indian languages, alongside handwritten and printed Indian scripts for advanced Optical Character Recognition (OCR) and Natural Language Processing (NLP).

Multimodal Content: Audio-visual data on practical skills such as upcycling discarded materials and organic farming, image-based question answering, and video-text recognition, including a unique dataset derived from the works of historian Dharampal.

Surveillance: Drone surveillance imagery to boost AI capabilities in smart agriculture, disaster management, and border security.

Professor Ganesh Ramakrishnan from IIT-Bombay, who leads this ambitious project, emphasized the strategic vision behind these efforts. “We are not only researching Large Language Models (LLMs) and other generative models for AI that are effective and data and compute efficient, but also building sovereign models for India from the ground up,” stated Prof. Ramakrishnan. He further added, “We are creating datasets for training these models and fine-tuning them for downstream tasks such as conversation and question-answering, while creating benchmarking datasets towards calibrating the performance of these models.”

This initiative is not merely about fine-tuning existing global models but about training new ones from scratch using Indian data, ensuring cultural and linguistic relevance. “This is about setting benchmarks for the AI ecosystem in India,” Prof. Ramakrishnan said, noting that these resources are openly available to researchers, enterprises, and academic institutions, which broadens access to AI across the country. The goal is to foster innovations that address local problems, from automating the processing of handwritten Indian forms to developing speech interfaces for rural populations, ultimately building inclusive AI models that reflect India’s socio-cultural realities.

The release of these datasets aligns seamlessly with the broader India AI Mission and the Ministry of Electronics and Information Technology’s (MeitY) Digital India initiative, aiming to build a self-reliant and inclusive AI ecosystem. By making high-quality, India-centric data openly accessible, IIT-Bombay and BharatGen are paving the way for a future where AI truly speaks India’s languages, understands its diverse contexts, and solves problems rooted in its unique environment.
