TLDR: A research paper by NetoAI proposes a low-latency, end-to-end voice-to-voice communication pipeline for telecommunications. It integrates streaming Automatic Speech Recognition (ASR), a 4-bit quantized Large Language Model (LLM), Retrieval-Augmented Generation (RAG) over telecom documents, and real-time Text-to-Speech (TTS). The system achieves an average response time under 1 second, making it highly suitable for real-time interactive scenarios such as call center automation and conversational IVR systems.
In the rapidly evolving landscape of telecommunications, the demand for highly responsive and intelligent voice agents is growing. Traditional systems often suffer from noticeable delays, making conversations feel unnatural and frustrating. A recent research paper by Vignesh Ethiraj, Ashwath David, Sidhanth Menon, and Divya Vijay from NetoAI addresses this critical challenge by proposing a novel, low-latency, end-to-end voice-to-voice communication pipeline specifically designed for real-time interactive telecom scenarios.
The core innovation lies in integrating several advanced AI components: streaming Automatic Speech Recognition (ASR), a highly efficient 4-bit quantized Large Language Model (LLM), Retrieval-Augmented Generation (RAG) over telecom-specific documents, and real-time Text-to-Speech (TTS). This combination aims to deliver responsive, knowledge-grounded spoken interactions, ideal for applications like call center automation and conversational Interactive Voice Response (IVR) systems.
The Pipeline’s Architecture and Key Innovations
The proposed system is built on a modular, multi-threaded architecture that prioritizes minimizing delays at every step. Here’s a breakdown of its key components and the techniques employed:
- Streaming ASR: The pipeline utilizes NetoAI’s proprietary T-Transcribe Engine (TTE), a Conformer-based model optimized for real-time speech recognition. This ASR module efficiently transcribes audio into text, designed to balance accuracy with very low latency, making it suitable for continuous streaming.
- Retrieval-Augmented Generation (RAG): To ensure the LLM’s responses are accurate and contextually relevant, the system incorporates a RAG submodule. This uses FAISS for rapid similarity searches over a vast index of telecom documents. When a user speaks, the ASR transcript is used to retrieve relevant information from these documents, which then provides crucial context to the LLM. This process is designed to be extremely fast, maintaining sub-second retrieval latency.
- Quantized Large Language Model (LLM): At the heart of the system is NetoAI’s TSLAM-Mini-2B LLM. A significant innovation here is the use of 4-bit post-training quantization. This technique drastically reduces the LLM’s memory footprint and speeds up inference without significantly compromising the quality of its generated responses. The LLM is also designed for streaming generation, meaning it can send out sentences incrementally as they are formed.
- Real-Time Text-to-Speech (TTS): The final stage involves converting the LLM’s text responses back into natural-sounding speech using NetoAI’s T-SYNTH TTS model. This module is optimized for real-time synthesis, with techniques like a warm-up routine to reduce initial latency and ensure smooth, continuous audio output.
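To make the 4-bit quantization idea concrete, here is a minimal Python sketch of symmetric post-training quantization applied to a handful of weights. It is illustrative only: this summary does not say which quantization scheme TSLAM-Mini-2B uses, so the per-tensor scale, the signed range of -8 to 7, and the helper names (quantize_4bit, pack_nibbles, unpack_nibbles, dequantize) are all assumptions.

```python
def quantize_4bit(weights):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to code 7; guard against an all-zero tensor.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_nibbles(codes):
    """Store two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even number of codes
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_nibbles(packed, count):
    """Recover the first `count` signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize(codes, scale):
    """Approximate the original weights from codes and the shared scale."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.7, -0.3, 0.12, 0.0, -0.7]
codes, scale = quantize_4bit(weights)
packed = pack_nibbles(codes)
restored = dequantize(unpack_nibbles(packed, len(weights)), scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

In this toy case the five weights occupy 3 bytes instead of the 20 bytes they would need as 32-bit floats, which is where the memory saving comes from. The cost is a rounding error of at most half a quantization step per weight.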
A crucial aspect of this pipeline is its multi-threaded and concurrent execution. ASR, LLM, and TTS modules operate in parallel, coordinated by a non-blocking producer-consumer pattern. This means the LLM can start generating text while the ASR is still processing, and the TTS can begin synthesizing audio as soon as the LLM produces its first complete sentence. Techniques like binary serialization between the LLM and TTS further reduce overall pipeline time.
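The producer-consumer arrangement described above can be sketched with Python threads and queues. This is a rough illustration, not the authors' implementation: the three stage functions are hypothetical stand-ins (the "ASR" passes text through, the "LLM" yields a canned answer token by token, and the "TTS" just encodes text to bytes), and queue.Queue.get blocks only the consuming thread while the other stages keep running.

```python
import queue
import threading

SENTINEL = None  # marks end of stream on each queue


def asr_stage(audio_chunks, transcripts):
    """Hypothetical ASR stand-in: each 'audio chunk' is already text."""
    for chunk in audio_chunks:
        transcripts.put(chunk)
    transcripts.put(SENTINEL)


def fake_llm_tokens(prompt):
    """Hypothetical LLM stand-in that streams a canned answer token by token."""
    answer = f"You asked about {prompt}. Here is a short answer."
    for token in answer.split(" "):
        yield token


def llm_stage(transcripts, sentences):
    """Consume transcripts and emit each sentence as soon as it is complete."""
    while (text := transcripts.get()) is not SENTINEL:
        buffer = []
        for token in fake_llm_tokens(text):
            buffer.append(token)
            if token.endswith((".", "?", "!")):
                sentences.put(" ".join(buffer))
                buffer = []
        if buffer:  # flush a trailing fragment without end punctuation
            sentences.put(" ".join(buffer))
    sentences.put(SENTINEL)


def tts_stage(sentences, audio_out):
    """Hypothetical TTS stand-in: 'synthesizes' by encoding text to bytes."""
    while (sentence := sentences.get()) is not SENTINEL:
        audio_out.append(sentence.encode("utf-8"))


def run_pipeline(audio_chunks):
    """Run the three stages concurrently and return the 'audio' segments."""
    transcripts, sentences, audio_out = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=asr_stage, args=(audio_chunks, transcripts)),
        threading.Thread(target=llm_stage, args=(transcripts, sentences)),
        threading.Thread(target=tts_stage, args=(sentences, audio_out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return audio_out


audio = run_pipeline(["BGP route flapping"])
```

Because the LLM stage pushes a sentence onto the queue the moment it sees end punctuation, the TTS stage can start on the first sentence while the second is still being generated. That overlap is what shortens the wait before the first audio is heard.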
Performance and Evaluation
To rigorously test the system, the researchers created a custom dataset of 500 human-recorded telecommunications-related questions, sourced from RFC (Request for Comments) documents. This dataset allowed for a realistic evaluation of both latency and domain relevance, simulating real-world user queries to a telecom voice agent.
The results are promising: the pipeline achieved an average total latency of 0.94 seconds per utterance, just under the 1-second threshold typically expected of interactive systems. The ASR and TTS modules demonstrated very low mean latencies (around 0.05s and 0.28s, respectively), while the LLM, though the most computationally intensive, averaged 0.67s per generation. Retrieval latency was almost negligible at 0.008s.
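Per-stage mean latencies like these are usually collected by timing each stage call and averaging over utterances. The following Python sketch shows one way to do that with the standard library. The stage names, the timed helper, and the sleep calls standing in for real model calls are all assumptions, not details from the paper.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean

latencies = defaultdict(list)  # stage name -> list of durations in seconds


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[stage].append(time.perf_counter() - start)


def mean_latencies():
    """Average the recorded durations for each stage."""
    return {stage: mean(samples) for stage, samples in latencies.items()}


# Simulate three utterances; the sleeps are placeholders for real model calls.
for _ in range(3):
    with timed("asr"):
        time.sleep(0.001)
    with timed("llm"):
        time.sleep(0.002)

report = mean_latencies()
```

In a concurrent pipeline the stages overlap, so the end-to-end latency should be measured separately from first audio input to the end of playback rather than derived by summing the per-stage means.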
Streaming efficiency was also a key focus, with an average time-to-first-token (TTFT) of 0.106s and time-to-first-audio (TTFA) of 0.678s. This indicates that the system quickly begins generating text and audio after receiving input. The ASR processed at an average of 394 words/second, and the LLM generated tokens at 80 tokens/second, showcasing robust throughput for real-time applications. Semantic preservation, measured by cosine similarity between ASR transcripts and LLM outputs, averaged 0.87, indicating strong meaning retention.
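For readers unfamiliar with the semantic preservation metric, the sketch below computes cosine similarity between two texts in Python. The embedding model used in the paper is not named in this summary, so plain word counts stand in for embeddings here. That substitution is an assumption, as are the example sentences.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very rough stand-in for a sentence embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with an empty text as zero
    return dot / (norm_a * norm_b)


# Hypothetical transcript and responses, for illustration only.
transcript = bag_of_words("how does the bgp keepalive timer work")
related = bag_of_words("the bgp keepalive timer controls how often keepalive messages are sent")
unrelated = bag_of_words("please hold while we connect your call")

score_related = cosine_similarity(transcript, related)
score_unrelated = cosine_similarity(transcript, unrelated)
```

A response that stays on the topic of the transcript scores higher than one that shares no vocabulary with it. With real sentence embeddings the same formula also rewards paraphrases that use different words.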
Conclusion and Future Outlook
This research demonstrates a significant step forward in developing low-latency, end-to-end voice agents for telecommunications. By combining streaming ASR, quantized LLMs, efficient RAG, and real-time TTS within a multi-threaded framework, the pipeline effectively reduces total system latency while maintaining high response quality and semantic relevance. The system’s performance makes it well-suited for demanding real-time interactive voice scenarios in customer support, diagnostics, and IVR replacement.
The researchers acknowledge that ASR inaccuracies, particularly with domain-specific terms, can impact downstream RAG performance. Future work will focus on adopting more specialized ASR models to mitigate these errors and enhance overall robustness. The open-sourcing of their dataset and methodology also lays a foundation for further research in scalable, low-latency spoken dialogue systems. For more details, refer to the full research paper.


