TLDR: MOSS-Speech is a large language model for direct speech-to-speech interaction that does not rely on intermediate text. It combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning ability of text LLMs while adding native speech capabilities. The model reports state-of-the-art results in spoken question answering while maintaining strong text performance, pointing toward more expressive and efficient spoken dialogue systems.
In the evolving landscape of artificial intelligence, spoken dialogue systems have long been a cornerstone of human-computer interaction. Traditionally, these systems operate through a cascaded pipeline: speech is first transcribed into text, a text-based large language model (LLM) processes it, and then the response is converted back into audio. While functional, this method often loses subtle paralinguistic cues like emotion and emphasis, and introduces latency.
Recent advancements have moved towards end-to-end methods, aiming to reduce latency and better preserve these cues. However, many still rely on text as an intermediate step, creating a fundamental bottleneck that limits expressivity and efficiency. Imagine trying to convey laughter or hesitation through text alone – it’s simply not the same.
A new research paper, MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance, proposes an approach designed to overcome these limitations. Developed by the SLM Team from institutions including Shanghai Innovation Institute, Fudan University, and MOSI, MOSS-Speech is a speech-to-speech large language model that understands and generates speech directly, without any text guidance.
A New Paradigm for Speech Interaction
The core idea behind MOSS-Speech is the combination of a modality-based layer-splitting architecture with a frozen pre-training strategy. This design lets the model retain the reasoning and knowledge of a pre-trained text LLM while adding native speech understanding and generation.
The researchers observed that in typical Transformer models, the alignment between speech and text representations tends to deteriorate in deeper layers. To address this, MOSS-Speech introduces a ‘modality-based layer split.’ After a certain number of shared layers where speech and text representations are fused, the model branches into modality-specific layers – one for text and one for speech. This ‘split-then-specialize’ approach ensures that the model can leverage joint multi-modality fusion in its lower layers while dedicating its final layers to modality-specific generation, enhancing cross-modality transfer.
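The split-then-specialize idea can be illustrated with a toy sketch. This is not the authors' implementation: the affine "layers" stand in for Transformer blocks, and the names SplitModel, make_layer, and target_modality are hypothetical. Routing a hidden state to a single branch by target modality is also an assumption made to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a Transformer block: an element-wise affine map."""
    def layer(h: Vector) -> Vector:
        return [scale * x + shift for x in h]
    return layer


@dataclass
class SplitModel:
    """Hypothetical layout: shared lower layers, then one branch per modality."""
    shared: List[Layer]
    text_branch: List[Layer]
    speech_branch: List[Layer]

    def forward(self, h: Vector, target_modality: str) -> Vector:
        # Lower layers are shared, so speech and text are fused here.
        for layer in self.shared:
            h = layer(h)
        # Upper layers are modality-specific (assumed routing by target).
        if target_modality == "text":
            branch = self.text_branch
        elif target_modality == "speech":
            branch = self.speech_branch
        else:
            raise ValueError(f"unknown modality: {target_modality!r}")
        for layer in branch:
            h = layer(h)
        return h


model = SplitModel(
    shared=[make_layer(2.0, 0.0)],
    text_branch=[make_layer(1.0, 1.0)],
    speech_branch=[make_layer(1.0, -1.0)],
)
text_hidden = model.forward([1.0, 2.0], "text")
speech_hidden = model.forward([1.0, 2.0], "speech")
```

Both calls pass through the same shared layer and diverge only in the final branch, which is the property the article attributes to the architecture.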
Intelligent Speech Tokenization and Training
For speech processing, MOSS-Speech employs a carefully designed speech tokenizer. This tokenizer aims for a single-codebook, low-bitrate representation for efficient processing, while maximizing semantic content and preserving paralinguistic details. The encoder is trained using an automatic speech recognition (ASR) objective, ensuring it captures meaningful linguistic information. The decoder, based on a flow-matching architecture, is optimized for streaming operations, significantly reducing latency for real-time dialogue.
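To make "single-codebook, low-bitrate" concrete, here is a minimal sketch of the general technique. It assumes a nearest-neighbor vector-quantization lookup, which is a common way to realize a single codebook but is not confirmed by the article as the tokenizer's actual mechanism. The function names, the toy codebook, and the frame rate and codebook size in the usage lines are all hypothetical and are not figures from the paper.

```python
import math
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each encoder frame to the index of its nearest codebook entry.

    With a single codebook, every frame becomes exactly one token id.
    """
    ids = []
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(frame, codebook[k])),
        )
        ids.append(nearest)
    return ids


def bitrate_bps(frame_rate_hz: float, codebook_size: int) -> float:
    """Bits per second for one token per frame drawn from one codebook."""
    return frame_rate_hz * math.log2(codebook_size)


# Toy values, chosen only for illustration.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_frames = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8]]
token_ids = quantize(toy_frames, toy_codebook)
toy_rate = bitrate_bps(frame_rate_hz=10.0, codebook_size=1024)
```

The bitrate formula shows why the design matters: lowering the frame rate or shrinking the codebook reduces the number of bits the language model must handle per second of audio, at the cost of detail the decoder has to reconstruct.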
The training of MOSS-Speech follows a two-stage pre-training pipeline built on a text LLM backbone, Qwen3-8B. In the first stage, the text backbone is ‘frozen,’ and only the new speech-related components are trained. This initializes the speech parameters and establishes a stable alignment with the existing text representations. The second stage unfreezes a larger portion of the model to allow cross-modal adaptation, and mixes in additional text-only data to prevent degradation of the model’s textual abilities.
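The two-stage schedule can be sketched as a rule for which parameter groups receive gradient updates. The group names below and the choice of which groups are unfrozen in the second stage are hypothetical; the article only says that a larger portion of the model is unfrozen, not which parts.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class ParamGroup:
    name: str
    is_speech_component: bool  # True for newly added speech parameters


def trainable_groups(groups: Sequence[ParamGroup],
                     stage: int,
                     stage2_unfrozen: FrozenSet[str]) -> List[str]:
    """Return the names of groups that are updated in a given stage."""
    if stage == 1:
        # Stage 1: text backbone frozen, only speech components train.
        return [g.name for g in groups if g.is_speech_component]
    if stage == 2:
        # Stage 2: additionally unfreeze a chosen part of the backbone.
        return [g.name for g in groups
                if g.is_speech_component or g.name in stage2_unfrozen]
    raise ValueError(f"unknown stage: {stage}")


# Hypothetical parameter groups, for illustration only.
GROUPS = [
    ParamGroup("speech_embeddings", True),
    ParamGroup("shared_layers", False),
    ParamGroup("text_branch", False),
    ParamGroup("speech_branch", True),
]

stage1 = trainable_groups(GROUPS, 1, frozenset())
stage2 = trainable_groups(GROUPS, 2, frozenset({"shared_layers"}))
```

In a real training framework the returned names would be used to toggle gradient tracking on the corresponding parameters before each stage begins.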
Following pre-training, the model undergoes supervised fine-tuning on a large, synthetically constructed multimodal dataset. The dataset is adapted from existing text-based question-answering datasets: content that does not lend itself to being spoken is converted into a speakable form, and the speech is synthesized with text-to-speech systems. The fine-tuning covers four input-output modality configurations (speech-to-speech, speech-to-text, text-to-speech, text-to-text), enabling the model to handle diverse interaction types within a unified framework.
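The four configurations are simply every pairing of input and output modality. The sketch below enumerates them and expands one text question-answer pair into four tagged training records. The record fields and the expand_example helper are hypothetical, and no speech is actually synthesized here; in a real pipeline the speech-side fields would hold audio produced by a text-to-speech system.

```python
from itertools import product
from typing import Dict, List

MODALITIES = ("speech", "text")


def modality_configs() -> List[str]:
    """All input-to-output pairings of the two modalities."""
    return [f"{src}-to-{dst}" for src, dst in product(MODALITIES, repeat=2)]


def expand_example(question: str, answer: str) -> List[Dict[str, str]]:
    """Create one tagged record per configuration from a text QA pair.

    The text is kept as-is; the modality tags mark which side would be
    rendered as speech by a TTS step that is not shown here.
    """
    return [
        {
            "input_modality": src,
            "output_modality": dst,
            "question": question,
            "answer": answer,
        }
        for src, dst in product(MODALITIES, repeat=2)
    ]


configs = modality_configs()
records = expand_example("What is the capital of France?", "Paris.")
```

Training on all four pairings from the same source material is what lets one model serve speech-only, text-only, and mixed interactions.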
Promising Results and Future Outlook
Experiments demonstrate that MOSS-Speech achieves state-of-the-art results in spoken question answering. It delivers performance comparable to existing text-guided systems in speech-to-speech tasks, all while maintaining competitive performance in text-based tasks. This indicates that the model successfully bridges the gap between text-guided and direct speech generation, offering a path to more expressive and efficient end-to-end speech interaction.
By directly understanding and generating speech, MOSS-Speech avoids the inherent limitations of text intermediates, such as latency and restricted expressivity. This work establishes a new paradigm for spoken dialogue systems, envisioning a future where human-AI interaction is seamless, multimodal, and truly speech-native across various languages and contexts.


