TLDR: KAME is a novel hybrid architecture for real-time speech-to-speech (S2S) conversational AI that combines the low latency of S2S models with the deep knowledge of large language models (LLMs). It uses a front-end S2S transformer for immediate responses and concurrently feeds user queries to a back-end LLM. The LLM’s text-based responses are injected in real time to guide the S2S model, significantly improving response correctness without increasing latency and narrowing the performance gap between monolithic S2S models and high-latency cascaded systems.
In the rapidly evolving world of artificial intelligence, real-time conversational systems are at the forefront of creating more natural interactions between humans and machines. However, developing AI that can respond instantly while also possessing deep knowledge has been a significant challenge. Traditional real-time speech-to-speech (S2S) models offer low latency but often lack comprehensive understanding, while more knowledgeable cascaded systems, which combine speech recognition, a large language model (LLM), and text-to-speech, suffer from delays that disrupt conversation flow.
Understanding the Challenge
The core dilemma lies in balancing speed with intelligence. Monolithic S2S models, like Moshi, are fast because they process speech end-to-end without needing to synchronize with other systems. However, their capacity is stretched thin trying to capture both verbal content and non-verbal cues like emotion, making knowledge acquisition less efficient compared to text-only LLMs. On the other hand, cascaded systems, while excellent at integrating vast knowledge by leveraging advanced LLMs, introduce noticeable latency. This delay occurs because they must wait for a user’s complete utterance before processing it, leading to a less natural conversational experience.
KAME’s Innovative Approach
To bridge this gap, researchers from Sakana AI have introduced a novel hybrid architecture called KAME (Knowledge-Access Model Extension). KAME operates as a “tandem” system, combining the best of both worlds: the immediate responsiveness of an S2S model with the extensive knowledge of a powerful back-end LLM.
How KAME Operates
The KAME architecture features two main components: a front-end S2S transformer and a back-end text-based LLM. When a user speaks, the front-end S2S model immediately begins processing the speech and generating an initial response, ensuring low latency. Simultaneously, the user’s speech is streamed to the back-end LLM, which formulates a more knowledgeable and refined text-based response. This LLM-generated text, referred to as an “oracle stream,” is then injected in real time back into the front-end S2S model. The front-end model is specifically trained to condition its speech output on both its internal context and this incoming oracle guidance, infusing its output with rich knowledge without incurring the full latency penalty of a cascaded system. This allows KAME to start responding quickly and then refine its output as more information from the LLM becomes available.
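The tandem flow described above can be illustrated with a toy, step-based simulation. This is only a sketch under stated assumptions, not KAME’s actual implementation: `ToyBackend`, `ToyFrontend`, `run_tandem`, and the `query_every` and `backend_delay` parameters are hypothetical names, words stand in for audio frames, and the simulated delay stands in for real LLM latency. The point it shows is that the front end produces output on every step and simply picks up whichever oracle text has arrived most recently.

```python
from dataclasses import dataclass, field


class ToyBackend:
    """Hypothetical stand-in for the back-end LLM."""

    def reply(self, partial_transcript):
        # A real back end would return a text answer for the partial query.
        return f"oracle for {len(partial_transcript)} words"


@dataclass
class ToyFrontend:
    """Hypothetical stand-in for the front-end S2S model."""

    oracle: str = ""  # latest injected oracle text, empty until one arrives
    outputs: list = field(default_factory=list)

    def inject(self, oracle_text):
        self.oracle = oracle_text

    def step(self, frame):
        # A real model would emit audio tokens; this toy only records
        # which oracle text the step was conditioned on.
        self.outputs.append(f"{frame}|{self.oracle or 'no-oracle'}")


def run_tandem(frames, backend, frontend, query_every=2, backend_delay=1):
    """Run the front end on every frame while oracle text arrives late."""
    transcript = []
    pending = []  # (step at which the oracle arrives, oracle text)
    for t, frame in enumerate(frames):
        # Deliver oracle text whose simulated latency has elapsed.
        arrived = [text for due, text in pending if due <= t]
        pending = [(due, text) for due, text in pending if due > t]
        if arrived:
            frontend.inject(arrived[-1])
        # The front end never blocks on the back end.
        frontend.step(frame)
        transcript.append(frame)
        # Periodically send the partial transcript to the back end.
        if (t + 1) % query_every == 0:
            pending.append((t + 1 + backend_delay, backend.reply(transcript)))
    return frontend.outputs


if __name__ == "__main__":
    words = "what is the capital of france".split()
    for line in run_tandem(words, ToyBackend(), ToyFrontend()):
        print(line)
```

In this toy, the first few steps run with no oracle at all, and later steps are conditioned on progressively newer oracle text, which mirrors the "respond first, refine later" behavior described above.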
Training KAME for Real-World Conversations
A significant challenge in developing KAME was creating appropriate training data, as natural conversations with real-time evolving oracle tokens don’t readily exist. The researchers devised a clever solution: “simulated oracle augmentation.” They converted standard two-party dialogue datasets into the required format. This process involves generating simulated oracle text that mimics how a real-time LLM would behave. Early in a user’s utterance, the simulated oracle provides a general, plausible sentence. As more of the input is processed, the simulated oracle progressively refines, becoming more specific and accurate, eventually converging to the ground-truth response by the time the user finishes speaking. This method ensures the front-end S2S model learns to effectively integrate the evolving guidance from the back-end LLM.
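A minimal sketch of how such a training schedule could be assembled is shown below. It is an illustration under assumptions rather than the authors’ pipeline: `simulated_oracles`, `guess_fn`, and `num_updates` are hypothetical names, and the way the intermediate guesses are produced is left to a caller-supplied function because the article does not specify it. The only properties taken from the description above are that early oracles are based on a partial user utterance and that the final oracle equals the ground-truth response.

```python
def simulated_oracles(user_words, ground_truth, guess_fn, num_updates=4):
    """Return a list of (words_heard, oracle_text) pairs for one dialogue turn.

    guess_fn is a hypothetical callable that maps a partial user utterance
    to a plausible reply; the last entry is always the ground-truth reply.
    """
    n = len(user_words)
    schedule = []
    last_cut = 0
    for k in range(1, num_updates + 1):
        # Number of user words "heard" at this update point.
        cut = max(1, (n * k) // num_updates)
        if cut == last_cut:
            continue  # skip duplicate update points for short utterances
        last_cut = cut
        if cut >= n:
            oracle = ground_truth  # converge once the utterance is complete
        else:
            oracle = guess_fn(" ".join(user_words[:cut]))
        schedule.append((cut, oracle))
    return schedule


if __name__ == "__main__":
    def toy_guess(prefix):
        # Placeholder guesser; a real pipeline would produce a plausible reply.
        return f"(guess after: {prefix})"

    words = "what is the capital of france".split()
    for cut, oracle in simulated_oracles(words, "Paris.", toy_guess, num_updates=3):
        print(cut, oracle)
```

Each `(words_heard, oracle_text)` pair could then be aligned with the corresponding point in the user’s audio so the front-end model sees guidance that sharpens over the course of the utterance.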
Performance and Impact
Evaluations using a speech-synthesized variant of the MT-Bench benchmark demonstrated KAME’s effectiveness. The system substantially outperformed a baseline S2S model (Moshi) in response correctness, with its MT-Bench score improving from 2.05 to 6.43, while maintaining a median latency of 0.0 seconds, on par with the baseline. This means KAME can start responding before the user even finishes their question. While KAME’s quality score was lower than that of a fully cascaded system like Unmute (which achieved 7.70 with a 2.1-second latency), this difference is primarily attributed to KAME’s deliberate choice to generate early responses. Further analysis showed that the back-end LLM’s capability in KAME is comparable to that in cascaded systems, indicating the quality gap is due to the timing of its early, proactive responses rather than a lack of knowledge. KAME also proved to be “back-end agnostic,” allowing for flexible selection of different LLMs (e.g., GPT-4.1 or Claude-opus-4.1) based on specific application needs. More detail is available in the full research paper.
Conclusion
KAME represents a significant step forward in conversational AI, successfully integrating the advanced knowledge capabilities of LLMs with the crucial low-latency requirements of real-time speech-to-speech systems. By introducing oracle tokens and a practical training methodology, KAME offers an effective and balanced solution for building powerful, responsive, and intelligent conversational AI experiences.


