TL;DR: BanglaTalk is the first real-time speech assistant designed for diverse Bengali regional dialects. It uses a client-server architecture with the Real-time Transport Protocol (RTP) for low-latency communication and a dialect-aware ASR system, BRDialect, which significantly outperforms existing models. The system is bandwidth-efficient (24 kbps) and achieves an average end-to-end delay of 4.9 seconds, making speech technology more accessible and interactive for Bengali speakers.
A groundbreaking new system called BanglaTalk is set to transform how Bengali speakers interact with technology, offering the first real-time speech assistance specifically designed for the language’s rich tapestry of regional dialects. This innovation addresses a significant gap, as existing speech assistants primarily focus on standard Bengali and often struggle with the diverse linguistic variations spoken by approximately 260 million people worldwide.
The core challenge in developing speech assistants for Bengali lies in its status as a low-resource language with considerable regional dialectal diversity. Previous systems have not been optimized for real-time use and fail to accurately interpret queries in regional dialects, leading to frustrating user experiences.
How BanglaTalk Works
BanglaTalk operates on a client-server architecture, ensuring efficient and low-latency communication. It leverages the Real-time Transport Protocol (RTP) to achieve its real-time capabilities. A key innovation is its dialect-aware Automatic Speech Recognition (ASR) system, named BRDialect. This system was developed by fine-tuning the IndicWav2Vec model across ten distinct Bengali regional dialects, allowing it to understand and transcribe a wide range of spoken Bengali accurately.
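The paper does not publish transport-layer code, but the RTP framing it relies on is standardized in RFC 3550. As a minimal sketch, each encoded audio frame travels behind a 12-byte fixed header like the one packed below (the payload type 111 is a commonly used dynamic value for Opus, assumed here for illustration):

```python
import struct

def build_rtp_header(seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 111, marker: bool = False) -> bytes:
    """Pack the 12-byte fixed RTP header (RFC 3550).

    Byte 0: version=2, padding=0, extension=0, CSRC count=0 -> 0x80.
    Byte 1: marker bit plus 7-bit payload type.
    Then: 16-bit sequence number, 32-bit timestamp, 32-bit SSRC.
    """
    first = 0x80  # V=2, P=0, X=0, CC=0
    second = (0x80 if marker else 0x00) | (payload_type & 0x7F)
    return struct.pack("!BBHII", first, second,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)

# A 20 ms Opus frame would be appended to such a header before sending;
# the sequence number and timestamp let the receiver reorder and pace playback.
header = build_rtp_header(seq=1, timestamp=960, ssrc=0x12345678)
```

The sequence number and timestamp are what give RTP its low-latency character: the receiver can detect loss and jitter without waiting for retransmissions, which suits interactive speech.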
On the client side, where the user interacts, BanglaTalk integrates lightweight audio processing modules. These include noise cancellation to filter out background distractions, dynamic range compression to maintain consistent audio levels, and efficient audio encoding using the Opus codec. This ensures that the system can capture and prepare speech data effectively, even on devices with varying hardware capabilities.
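The paper names the client-side stages but not their implementations; Opus encoding in particular requires a native codec library. As a hedged sketch of just the dynamic range compression stage, a simple static compressor attenuates samples above a threshold so speech levels stay consistent (threshold and ratio values here are illustrative, not from the paper):

```python
import numpy as np

def compress_dynamic_range(audio: np.ndarray, threshold_db: float = -20.0,
                           ratio: float = 4.0) -> np.ndarray:
    """Static compressor: reduce gain on samples above the threshold.

    audio: float samples in [-1, 1]. Level above threshold_db is
    compressed by `ratio` (e.g. 4:1), a common way to level speech
    before encoding.
    """
    eps = 1e-12
    level_db = 20.0 * np.log10(np.abs(audio) + eps)
    over = np.maximum(level_db - threshold_db, 0.0)  # dB above threshold
    gain_db = -over * (1.0 - 1.0 / ratio)            # shrink the excess
    return audio * (10.0 ** (gain_db / 20.0))

out = compress_dynamic_range(np.array([0.9, -0.8, 0.05]))
# Loud samples are attenuated; the quiet 0.05 sample passes through unchanged.
```

Keeping levels consistent before the Opus encoder helps the codec spend its limited 24 kbps budget on speech content rather than volume swings.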
The server side handles the more computationally intensive tasks. After receiving and decoding the audio stream, a Voice Activity Detector (VAD) identifies speech segments, preventing unnecessary processing of silence. Once a complete user query is detected, BRDialect transcribes it into text. This text is then fed into a large language model (LLM), such as GPT-4.1-nano, which generates an appropriate response. Finally, a natural-sounding Text-to-Speech (TTS) system, specifically the VITS-Bengali model, converts the response back into speech, which is encoded and sent back to the client.
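The paper does not specify which VAD it uses, but the gating logic can be sketched with a simple energy-based detector: frames whose energy falls below a threshold are treated as silence and never reach the ASR model (frame size and threshold below are illustrative assumptions):

```python
import numpy as np

def detect_speech_frames(audio: np.ndarray, frame_len: int = 320,
                         energy_threshold: float = 1e-3) -> list:
    """Flag 20 ms frames (320 samples at 16 kHz) as speech or silence
    by mean frame energy. Production VADs (e.g. WebRTC's) are far more
    robust, but the control flow is the same: only frames flagged True
    are forwarded to the ASR model.
    """
    flags = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        flags.append(float(np.mean(frame ** 2)) > energy_threshold)
    return flags

silence = np.zeros(640)
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(640) / 16000)
flags = detect_speech_frames(np.concatenate([silence, tone]))
# → [False, False, True, True]
```

A run of False frames after a run of True frames is the usual signal that the user has finished speaking, triggering the transcription step.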
Key Advantages and Performance
One of BanglaTalk’s most significant advantages is its ability to operate at a low bandwidth of just 24 kbps. This makes the system highly accessible and cost-effective, particularly for users in regions with limited or expensive internet access. Despite this low bandwidth usage, BanglaTalk maintains an impressive average end-to-end delay of only 4.9 seconds, ensuring interactive and natural conversations.
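The practical meaning of 24 kbps is easy to work out. The arithmetic below (not from the paper, just unit conversion) shows the data cost of streaming at that bitrate:

```python
# Data cost of streaming audio at BanglaTalk's Opus bitrate (24 kbps).
bitrate_bps = 24_000
bytes_per_second = bitrate_bps / 8                    # 3000 bytes/s
ten_second_query_kb = bytes_per_second * 10 / 1000    # 30 KB per 10 s query
one_hour_mb = bytes_per_second * 3600 / 1e6           # 10.8 MB per hour

print(bytes_per_second, ten_second_query_kb, one_hour_mb)
# → 3000.0 30.0 10.8
```

An hour of continuous streaming costs roughly 10.8 MB in each direction, which is why the system remains viable on slow or metered connections.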
The BRDialect ASR system outperforms baseline ASR models such as Whisper-medium-Bengali and IndicWav2Vec-Bengali by a substantial margin. On the RegSpeech12 dataset, which covers twelve Bengali regional dialects, BRDialect achieved a Word Error Rate (WER) of 74.1% and a Character Error Rate (CER) of 40.6%; these absolute figures remain high, underscoring how difficult dialectal Bengali ASR still is, but they represent a clear improvement over the baselines. The VITS-Bengali TTS model further enhances the user experience, producing high-quality, natural-sounding speech with a Mean Opinion Score (MOS) of 4.49.
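For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the ASR output, divided by the reference length. A minimal self-contained implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # → 0.3333...
```

CER is computed the same way over characters instead of words, which is why it is lower here: dialectal variants often share most characters with the reference even when whole words differ.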
This research marks a crucial step towards inclusive and accessible speech technology for the diverse community of Bengali speakers, enabling them to interact with digital assistants in their native dialects. For more detailed information, refer to the full research paper.
Future Directions
While BanglaTalk represents a significant leap forward, the researchers acknowledge areas for future improvement. These include expanding the BRDialect ASR system to cover more regions of Bangladesh, adding the capability to handle user interruptions for more natural conversations, incorporating speaker verification to distinguish between speakers, and supporting multiple concurrent conversations to enhance system usability. A broader user study across all regions of Bangladesh is also planned to gather deeper insights into user acceptance and overall impact.


