TLDR: Schrödinger Bridge Mamba (SBM) is a new framework for speech enhancement that combines the Schrödinger Bridge training method with the efficient Mamba neural network architecture. It achieves superior speech quality and significantly faster, one-step inference compared to existing methods, making it highly efficient for real-time applications.
A new approach to speech enhancement, dubbed Schrödinger Bridge Mamba (SBM), has been introduced, promising high-quality audio restoration with single-step inference. Developed by researchers Jing Yang, Sirui Wang, Chao Wu, and Fan Fan from the Central Media Technology Institute at Huawei, SBM combines two powerful concepts: the Schrödinger Bridge (SB) training paradigm and the selective state-space model Mamba.
Speech enhancement, a critical task in audio processing, aims to remove unwanted noise and reverberation from degraded speech, producing clear, high-quality audio. While deep generative models have shown great promise in this area, a significant challenge has been their slow inference process, often requiring many computational steps to generate the enhanced output. This limitation has hindered their application in real-time scenarios or on devices with limited resources.
The SBM framework addresses this bottleneck by leveraging the inherent compatibility between the Schrödinger Bridge paradigm and the Mamba architecture. The Schrödinger Bridge paradigm is a theoretically grounded method for modeling the optimal path between the degraded and clean speech distributions using stochastic differential equations. Mamba, on the other hand, is a recently developed selective state-space model known for its efficiency and its ability to capture long-range dependencies in sequential data, which makes it well suited to audio signals.
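To make the "selective state-space" idea more concrete, here is a toy sketch of the kind of recurrence such models build on. It is not the Mamba implementation or the SBM backbone: the function name `selective_scan`, the parameters `a`, `w_b`, `w_c`, `w_dt`, and the scalar-input setup are all assumptions made for illustration. The point is only that the state update depends on the current input, and that the cost grows linearly with sequence length.

```python
import numpy as np

def selective_scan(x, a, w_b, w_c, w_dt):
    """Toy selective state-space recurrence over a 1-D signal (illustrative only).

    x    : (T,) input sequence
    a    : (N,) negative diagonal state matrix (hypothetical parameter)
    w_b  : (N,) weights that make the input projection depend on x_t
    w_c  : (N,) weights that make the output projection depend on x_t
    w_dt : scalar weight that makes the step size depend on x_t
    """
    h = np.zeros_like(a)           # hidden state, one value per state dimension
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        dt = np.log1p(np.exp(w_dt * x_t))  # softplus keeps the step size positive
        a_bar = np.exp(dt * a)             # discretized decay, in (0, 1) since a < 0
        b_t = w_b * x_t                    # input-dependent ("selective") B
        c_t = w_c * x_t                    # input-dependent ("selective") C
        h = a_bar * h + dt * b_t * x_t     # state update
        y[t] = np.dot(c_t, h)              # readout
    return y

if __name__ == "__main__":
    signal = np.sin(np.arange(16.0))
    out = selective_scan(signal, -np.ones(4), np.ones(4), np.ones(4), 1.0)
    print(out.shape)
```

Because the loop touches each time step once, the work scales with the length of the signal, which is one reason this family of models is attractive for long audio sequences.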
The core innovation of SBM lies in training a Mamba-based backbone model using the SB paradigm. This integration allows the model to “distill” the complex SB transformation into the efficient state-space dynamics of the Mamba architecture. The result is a generative model capable of producing high-quality clean speech in just a single inference step, a significant improvement over traditional SB-based methods that often require ten or more iterative steps.
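The practical difference between one-step and multi-step inference is the number of network evaluations. The sketch below illustrates that with a deliberately simplified sampler that moves along a straight line toward the model's clean-speech estimate. This is an assumption chosen for illustration, not the sampler used in the paper, and `enhance` and `toy_model` are hypothetical names.

```python
import numpy as np

def enhance(model, noisy, n_steps):
    """Walk from the noisy input (t=1) to the enhanced output (t=0).

    `model(x, t)` is assumed to return an estimate of the clean signal.
    Each loop iteration costs one network evaluation.
    """
    x = noisy.copy()
    times = np.linspace(1.0, 0.0, n_steps + 1)
    for t, t_next in zip(times[:-1], times[1:]):
        clean_estimate = model(x, t)
        # Move a fraction of the way toward the current clean estimate.
        x = x + ((t - t_next) / t) * (clean_estimate - x)
    return x

if __name__ == "__main__":
    calls = []

    def toy_model(x, t):
        calls.append(t)            # count network evaluations
        return np.zeros_like(x)    # stand-in for a trained network

    noisy = np.ones(4)
    enhance(toy_model, noisy, n_steps=10)
    print("multi-step evaluations:", len(calls))
    calls.clear()
    enhance(toy_model, noisy, n_steps=1)
    print("one-step evaluations:", len(calls))
```

With `n_steps=1` the output is simply the model's direct prediction from the noisy input, so inference cost drops to a single forward pass.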
Experiments conducted on a joint denoising and dereverberation task across four benchmark datasets demonstrated SBM’s superior performance. It consistently outperformed strong baselines, including conventional SB models (like SB-NCSN++) and other one-step SB variants (SBCTM, SB-UFOGen), as well as Mamba-based models trained with traditional predictive mapping. Notably, SBM achieved the best real-time factor (RTF), indicating its exceptional efficiency, while maintaining a comparatively small model size.
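For readers unfamiliar with the metric, the real-time factor is conventionally the processing time divided by the duration of the audio processed, so lower is better and values below 1 mean faster than real time. A minimal sketch of how it could be measured follows; the function names and the 16 kHz sample rate are assumptions for illustration, not details from the paper.

```python
import time

def rtf_from_timing(processing_seconds, n_samples, sample_rate):
    """Real-time factor: processing time divided by audio duration."""
    audio_seconds = n_samples / sample_rate
    return processing_seconds / audio_seconds

def measure_rtf(enhance_fn, audio, sample_rate):
    """Time a single call to `enhance_fn` and return its real-time factor."""
    start = time.perf_counter()
    enhance_fn(audio)
    elapsed = time.perf_counter() - start
    return rtf_from_timing(elapsed, len(audio), sample_rate)

if __name__ == "__main__":
    one_second_of_audio = [0.0] * 16000   # hypothetical 16 kHz input
    identity = lambda samples: samples    # stand-in for an enhancement model
    print(measure_rtf(identity, one_second_of_audio, 16000))
```

For example, a model that takes 0.5 seconds to enhance 2 seconds of audio has an RTF of 0.25.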
The researchers highlight that SBM’s success stems from aligning the training paradigm with the backbone architecture based on their underlying compatibility. This synergy not only enhances the performance of the Mamba backbone but also accelerates the inference of SB-framed models. The implications extend beyond speech enhancement, suggesting a promising direction for developing new deep generative models applicable to a wide range of tasks, including image, video, and multimodal generation.
For more technical details, you can refer to the original research paper.


