TLDR: FLM-Audio is a 7-billion-parameter spoken dialogue model that enhances native full-duplex chatbots by introducing “natural monologues” and a “dual training paradigm.” Natural monologues let the model generate text continuously, which preserves its language ability and reduces data annotation costs compared with previous word-level alignment methods. Dual training, which alternates between the monologue leading and trailing the audio, teaches the model to manage the asynchronous nature of text and speech. FLM-Audio demonstrates superior responsiveness, naturalness, and robustness in real-time conversations, outperforming existing baselines despite being trained on less data, and is available as an open-source model.
Truly human-like conversation with AI systems is moving closer to reality. One of the most significant challenges in getting there is enabling AI chatbots to listen and speak simultaneously, much as humans do. This capability, known as full-duplex communication, is crucial for creating responsive and natural interactions. A new research paper introduces FLM-Audio, a 7-billion-parameter spoken dialogue model that aims to set a new standard for native full-duplex chatbots.
The Challenge of Real-Time AI Conversations
Traditional AI conversation models often suffer from noticeable delays because they process input and generate responses sequentially. This is akin to a walkie-talkie conversation, where each person must wait for the other to finish before responding. This method, called Time-Division Multiplexing (TDM), leads to high response latency, sometimes up to two seconds, making conversations feel unnatural and clunky. Some models have tried to improve on this, but they often face limitations in scalability and in the length of audio they can handle.
A more advanced approach is native full-duplexity, where the AI listens and speaks at the same time. This significantly reduces response latency, bringing it down to as little as 80 milliseconds. However, a major hurdle remains: how to accurately align the AI’s internal “thought process” (textual monologue) with the incoming and outgoing audio streams, which operate at very different speeds. Previous solutions often relied on word-by-word alignment, which can degrade the AI’s language understanding and generation abilities and requires extremely precise (and costly) timing information for every single word.
Introducing Natural Monologues and Dual Training
The researchers behind FLM-Audio propose an innovative solution called “natural monologues.” Instead of breaking down sentences into individual words and aligning them precisely with audio, FLM-Audio generates continuous sequences of text, like full sentences or paragraphs, in its internal monologue. This mimics how humans think and plan their speech, where thoughts often precede spoken words. The model then uses special “wait” tokens to fill any gaps until the corresponding speech is completed or interrupted.
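The layout described above can be illustrated with a small sketch. It is not the authors' implementation: the `<wait>` token string, the one-token-per-audio-frame layout, and the function name `build_monologue_channel` are all assumptions made for illustration. The point it shows is that a whole sentence is written contiguously into the text channel, and the remaining frames are filled with wait tokens while the audio catches up.

```python
from typing import List

# Hypothetical placeholder; the model's actual wait token may be named differently.
WAIT = "<wait>"


def build_monologue_channel(
    text_tokens: List[str], n_audio_frames: int, start_frame: int = 0
) -> List[str]:
    """Write a sentence contiguously into a text channel with one slot per audio frame.

    Every slot that does not hold a text token is filled with WAIT, so only the
    sentence's start position is needed, not a timestamp for every word.
    """
    end_frame = start_frame + len(text_tokens)
    if start_frame < 0 or end_frame > n_audio_frames:
        raise ValueError("sentence does not fit inside the audio span")
    channel = [WAIT] * n_audio_frames
    channel[start_frame:end_frame] = text_tokens
    return channel


if __name__ == "__main__":
    # Three text tokens spread over six audio frames: text first, then waits.
    print(build_monologue_channel(["Hello", "there", "!"], n_audio_frames=6))
```

Under these assumptions, a word-level scheme would instead need a start frame for each token, which is the annotation cost that sentence-level alignment avoids.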
This approach offers several key advantages. Firstly, it preserves the powerful language modeling capabilities of large pre-trained AI models, leading to more coherent and natural dialogue. Secondly, it drastically reduces the cost and complexity of data preparation, as it only requires sentence-level alignment between audio and text, rather than word-level timestamps. This also makes the system less prone to errors that can cascade from misaligned words.
To train FLM-Audio effectively with natural monologues, the team developed a “dual training paradigm.” This involves alternating the position of the natural monologue relative to the audio stream during different training stages. Sometimes the monologue “leads” the audio (as in text-to-speech generation), and other times it “follows” the audio (as in automatic speech recognition). This dual approach helps the model learn to handle the inherently asynchronous nature of text and speech, enabling it to generate both coherent internal monologues and human-like spoken responses.
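The two monologue positions can be sketched as two ways of laying out the same training example. This is a simplified illustration rather than the paper's data pipeline: the `<wait>` token, the boolean audio mask, the mode names, and the function `align_example` are hypothetical, and the real system may offset the streams differently.

```python
from typing import List, Tuple

# Hypothetical placeholder; the model's actual wait token may be named differently.
WAIT = "<wait>"


def align_example(
    text_tokens: List[str], n_audio_frames: int, mode: str
) -> Tuple[List[str], List[bool]]:
    """Return (text_channel, audio_active) for one sentence/audio pair.

    mode="lead":   TTS-style, the text starts with the audio and finishes early.
    mode="follow": ASR-style, the text is emitted only after the audio has ended.
    """
    n_text = len(text_tokens)
    if mode == "lead":
        total = max(n_audio_frames, n_text)
        text_channel = text_tokens + [WAIT] * (total - n_text)
        audio_active = [True] * n_audio_frames + [False] * (total - n_audio_frames)
    elif mode == "follow":
        text_channel = [WAIT] * n_audio_frames + text_tokens
        audio_active = [True] * n_audio_frames + [False] * n_text
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return text_channel, audio_active


if __name__ == "__main__":
    for mode in ("lead", "follow"):
        print(mode, align_example(["hi", "there"], n_audio_frames=4, mode=mode))
```

Training on both layouts exposes the model to text that runs ahead of speech and text that lags behind it, which is the asynchrony the paragraph above describes.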
FLM-Audio: Performance and Availability
FLM-Audio was tested against existing state-of-the-art full-duplex chatbots. Despite being trained on a significantly smaller dataset than some comparable models, it demonstrated superior responsiveness, duplexity, and overall chat experience. It showed strong performance in audio understanding tasks such as Automatic Speech Recognition (ASR) and spoken question answering, particularly in Chinese. In audio generation, it achieved word error rates comparable to those of specialized text-to-speech systems.
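For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic textbook computation, not the paper's evaluation script; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

For generated speech, the hypothesis would typically come from running a speech recognizer over the synthesized audio, so a lower value means the audio is more faithful to the intended text.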
Crucially, in human evaluations of full-duplex chatting, FLM-Audio was rated highly for its naturalness, responsiveness to interruptions, and robustness in noisy environments, matching or surpassing other leading streaming chatbots in overall quality. An ablation study confirmed the importance of the dual training paradigm, showing a significant drop in performance when the ASR-style supervision was omitted.
The researchers have made FLM-Audio an open-source model, along with its inference and interaction pipeline, encouraging further development and exploration in the field of native full-duplex AI. You can find more details in their research paper: FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training.
This work represents a significant step towards creating AI systems that can engage in conversations as fluidly and naturally as humans, paving the way for more advanced embodied AI and seamless human-AI interaction.


