TLDR: IIT-Bombay has released 16 culturally significant AI datasets on AIKosh, the central government’s AI repository, as part of the BharatGen initiative. These datasets, including digitized ancient texts, multilingual language resources, and audio-visual content, aim to build sovereign AI models for India, fostering innovation and research tailored to the nation’s diverse linguistic and cultural landscape.
Mumbai, India – In a landmark move to bolster India’s artificial intelligence (AI) capabilities, the Indian Institute of Technology (IIT) Bombay has announced the release of 16 diverse and culturally significant AI datasets on AIKosh, the central government’s official AI repository. This initiative marks a pivotal step towards developing AI models that deeply understand and cater to India’s unique linguistic and cultural nuances.
Launched in March, AIKosh serves as a national platform designed to support inclusive AI development across the country by providing a comprehensive repository of datasets, models, and toolkits. IIT-Bombay is a leading contributor to the platform and spearheads the BharatGen consortium, a collaborative effort funded by the Department of Science and Technology. BharatGen comprises seven premier institutions: IIT-Bombay along with IIT Kanpur, IIT Madras, IIT Hyderabad, IIT Mandi, IIM Indore, and IIIT Hyderabad. Together they have contributed 37 models and datasets to AIKosh, with IIT-Bombay alone responsible for 16 of these culturally rich datasets.
The newly released datasets are curated to address critical gaps in India-centric AI research. A significant contribution is the digitization of 30 ancient textbooks, some as much as 18 centuries old, covering subjects like astronomy, medicine, and mathematics. This effort has yielded a dataset of approximately 218,000 sentences and 1.5 million words, now openly accessible to researchers. Other key datasets include:
- Language Translation: Over 53,000 sentences for English-Sanskrit translations, focusing on modern prose.
- Speech Recognition: More than 78 hours of Sanskrit audio data to enhance speech recognition systems.
- Multilingual Q&A: Question-answer sets in 11 Indian languages, including Hindi and English.
- Reasoning: Math word problems in Hindi and English to improve AI’s reasoning capabilities.
- Document Processing: Table detection datasets across 14 Indian languages, alongside handwritten and printed Indian scripts for advanced Optical Character Recognition (OCR) and Natural Language Processing (NLP).
- Multimodal Content: Audio-visual data on practical skills such as upcycling discarded materials and organic farming, image-based question answering, and video-text recognition, including a unique dataset derived from the works of historian Dharampal.
- Surveillance: Drone surveillance imagery to boost AI capabilities in smart agriculture, disaster management, and border security.
Professor Ganesh Ramakrishnan from IIT-Bombay, who leads this ambitious project, emphasized the strategic vision behind these efforts. “We are not only researching Large Language Models (LLMs) and other generative models for AI that are effective and data and compute efficient, but also building sovereign models for India from the ground up,” stated Prof. Ramakrishnan. He further added, “We are creating datasets for training these models and fine-tuning them for downstream tasks such as conversation and question-answering, while creating benchmarking datasets towards calibrating the performance of these models.”
This initiative is not merely about fine-tuning existing global models but about training new ones from scratch using Indian data, ensuring cultural and linguistic relevance. Prof. Ramakrishnan highlighted, “This is about setting benchmarks for the AI ecosystem in India,” noting that these resources are openly available to researchers, enterprises, and academic institutions, thereby democratizing AI access across the country. The goal is to foster innovations that address local problems, from automating handwritten Indian forms to developing speech interfaces for rural populations, ultimately building inclusive AI models that reflect India’s socio-cultural realities.
The release of these datasets aligns with the broader India AI Mission and the Ministry of Electronics and Information Technology’s (MeitY) Digital India initiative, both of which aim to build a self-reliant and inclusive AI ecosystem. By making high-quality, India-centric data openly accessible, IIT-Bombay and BharatGen are paving the way for AI that speaks India’s languages, understands its diverse contexts, and solves problems rooted in its unique environment.


