MOSS-Speech: A New Era for Spoken AI Dialogue

TLDR: MOSS-Speech is a novel large language model that enables direct speech-to-speech interaction without relying on intermediate text. It achieves this through a unique modality-based layer-splitting architecture and a frozen pre-training strategy, preserving the reasoning of text LLMs while adding native speech capabilities. The model demonstrates state-of-the-art performance in spoken question answering and maintains strong text performance, paving the way for more expressive and efficient spoken dialogue systems.

In the evolving landscape of artificial intelligence, spoken dialogue systems have long been a cornerstone of human-computer interaction. Traditionally, these systems operate through a cascaded pipeline: speech is first transcribed into text, a text-based large language model (LLM) processes it, and then the response is converted back into audio. While functional, this method often loses subtle paralinguistic cues like emotion and emphasis, and introduces latency.

Recent advancements have moved towards end-to-end methods, aiming to reduce latency and better preserve these cues. However, many still rely on text as an intermediate step, creating a fundamental bottleneck that limits expressivity and efficiency. Imagine trying to convey laughter or hesitation through text alone – it’s simply not the same.

A new research paper, "MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance," introduces an approach designed to overcome these limitations. Developed by the SLM Team from institutions including Shanghai Innovation Institute, Fudan University, and MOSI, MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without any text guidance.

A New Paradigm for Speech Interaction

The core innovation behind MOSS-Speech lies in its architecture and training strategy: a modality-based layer-splitting architecture combined with a frozen pre-training strategy. Together, these let the model retain the reasoning and knowledge of a pre-trained text LLM while adding native speech understanding and generation.

The researchers observed that in typical Transformer models, the alignment between speech and text representations tends to deteriorate in deeper layers. To address this, MOSS-Speech introduces a ‘modality-based layer split.’ After a certain number of shared layers where speech and text representations are fused, the model branches into modality-specific layers – one for text and one for speech. This ‘split-then-specialize’ approach ensures that the model can leverage joint multi-modality fusion in its lower layers while dedicating its final layers to modality-specific generation, enhancing cross-modality transfer.
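The split-then-specialize idea can be illustrated with a toy sketch. This is not the MOSS-Speech implementation: the class name `SplitModel`, the helper `make_layer`, and the affine stand-ins for Transformer blocks are all hypothetical, chosen only to show how shared lower layers feed one of two modality-specific branches.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for a Transformer block: an elementwise affine map."""
    def layer(hidden: Vector) -> Vector:
        return [scale * h + shift for h in hidden]
    return layer


@dataclass
class SplitModel:
    shared: List[Layer]         # lower layers used by both modalities
    text_branch: List[Layer]    # final layers used only when generating text
    speech_branch: List[Layer]  # final layers used only when generating speech

    def forward(self, hidden: Vector, output_modality: str) -> Vector:
        # 1) Shared layers: speech and text representations are fused here.
        for layer in self.shared:
            hidden = layer(hidden)
        # 2) Branch: pick the modality-specific stack for the final layers.
        if output_modality == "text":
            branch = self.text_branch
        elif output_modality == "speech":
            branch = self.speech_branch
        else:
            raise ValueError(f"unknown modality: {output_modality!r}")
        for layer in branch:
            hidden = layer(hidden)
        return hidden


model = SplitModel(
    shared=[make_layer(1.0, 1.0), make_layer(2.0, 0.0)],
    text_branch=[make_layer(1.0, 10.0)],
    speech_branch=[make_layer(1.0, -10.0)],
)
text_out = model.forward([0.0, 1.0], "text")      # [12.0, 14.0]
speech_out = model.forward([0.0, 1.0], "speech")  # [-8.0, -6.0]
```

Both calls pass through the same shared layers and diverge only in the final branch, which is the structural point of the split.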

Intelligent Speech Tokenization and Training

For speech processing, MOSS-Speech employs a carefully designed speech tokenizer. This tokenizer aims for a single-codebook, low-bitrate representation for efficient processing, while maximizing semantic content and preserving paralinguistic details. The encoder is trained using an automatic speech recognition (ASR) objective, ensuring it captures meaningful linguistic information. The decoder, based on a flow-matching architecture, is optimized for streaming operations, significantly reducing latency for real-time dialogue.
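To see why a single codebook helps keep the bitrate low, it is useful to work through the arithmetic. The token rate and codebook sizes below are hypothetical values picked for illustration, not figures reported for MOSS-Speech; the formula itself is the standard one for discrete codecs (tokens per second times bits per token).

```python
import math


def bitrate_bps(tokens_per_second: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bits per second for a discrete speech tokenizer.

    Each token from a codebook of size K carries log2(K) bits, and a
    multi-codebook tokenizer emits one token per codebook per frame.
    """
    return tokens_per_second * num_codebooks * math.log2(codebook_size)


# Hypothetical single-codebook tokenizer: 12.5 tokens/s, 16384 entries.
single = bitrate_bps(12.5, 16384)

# Hypothetical multi-codebook codec for comparison: 50 frames/s, 8 codebooks of 1024.
multi = bitrate_bps(50, 1024, num_codebooks=8)

print(single, multi)
```

With these assumed settings the single-codebook stream is far smaller, and it is also one flat token sequence, which is simpler for a language model to consume than several parallel streams.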

MOSS-Speech is trained with a two-stage pre-training pipeline that starts from a text LLM backbone, Qwen3-8B. In the first stage, the text backbone is frozen and only the new speech-related components are trained. This initializes the speech parameters and establishes a stable alignment with the existing text representations. The second stage unfreezes a larger portion of the model to allow cross-modal adaptation, and adds text-only data to prevent degradation of the model’s textual abilities.
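The two-stage schedule can be sketched as a small lookup of which parameter groups receive gradients in each stage. The group names (`text_backbone`, `speech_embeddings`, `speech_branch`) are hypothetical, and the sketch assumes stage two unfreezes everything, whereas the description above only says a larger portion is unfrozen.

```python
from typing import Dict

# Hypothetical parameter groups; the real model's grouping may differ.
PARAM_GROUPS = ("text_backbone", "speech_embeddings", "speech_branch")


def trainable_flags(stage: int) -> Dict[str, bool]:
    """Return which parameter groups are updated in a given pre-training stage."""
    if stage == 1:
        # Stage 1: freeze the text backbone, train only the new speech parts.
        return {name: name != "text_backbone" for name in PARAM_GROUPS}
    if stage == 2:
        # Stage 2 (assumption): unfreeze all groups for cross-modal adaptation.
        return {name: True for name in PARAM_GROUPS}
    raise ValueError(f"unknown stage: {stage}")


def uses_text_only_data(stage: int) -> bool:
    """Text-only data is mixed in during stage 2 to protect text ability."""
    return stage == 2


for stage in (1, 2):
    print(stage, trainable_flags(stage), uses_text_only_data(stage))
```

In a framework such as PyTorch, flags like these would typically be applied by setting `requires_grad` on each group's parameters before building the optimizer.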

Following pre-training, the model undergoes supervised fine-tuning using a large, synthetically constructed multimodal dataset. This dataset is meticulously adapted from existing text-based question-answering datasets, with non-vocal content converted and speech synthesized using advanced text-to-speech systems. Crucially, the fine-tuning incorporates four input-output modality configurations (speech-to-speech, speech-to-text, text-to-speech, text-to-text), enabling the model to handle diverse interaction types within a unified framework.
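The four configurations are simply every pairing of input and output modality. A minimal sketch makes this explicit; the variable names are illustrative only.

```python
from itertools import product

MODALITIES = ("speech", "text")

# Every (input, output) pairing of the two modalities.
configs = [f"{src}-to-{dst}" for src, dst in product(MODALITIES, repeat=2)]

print(configs)
# ['speech-to-speech', 'speech-to-text', 'text-to-speech', 'text-to-text']
```

Training on all four pairings in one fine-tuning mix is what lets a single model serve as a voice assistant, a transcriber-style responder, a text-prompted speaker, and a plain text chatbot.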


Promising Results and Future Outlook

Experiments demonstrate that MOSS-Speech achieves state-of-the-art results in spoken question answering. It delivers performance comparable to existing text-guided systems in speech-to-speech tasks, all while maintaining competitive performance in text-based tasks. This indicates that the model successfully bridges the gap between text-guided and direct speech generation, offering a path to more expressive and efficient end-to-end speech interaction.

By directly understanding and generating speech, MOSS-Speech avoids the inherent limitations of text intermediates, such as latency and restricted expressivity. This work establishes a new paradigm for spoken dialogue systems, envisioning a future where human-AI interaction is seamless, multimodal, and truly speech-native across various languages and contexts.
