TLDR: BSMamba2 is a new music source separation model that uses Mamba2, a state space model, to effectively isolate vocals, especially intermittently occurring ones. It outperforms previous state-of-the-art models, achieving a cSDR of 11.03 dB, and demonstrates stable performance across various input lengths and vocal patterns, proving Mamba-based models’ efficiency for high-resolution audio.
Music source separation, the task of isolating individual components such as vocals, drums, or bass from a mixed song, has wide applications in remixing, music information retrieval, and education. Of these targets, vocals are particularly challenging to separate, especially at the high sampling rate (44.1 kHz) required for quality audio.
Recent advancements in this field have seen models like HT Demucs and BS-RoFormer achieve impressive results. However, a persistent challenge for Transformer-based models, such as BS-RoFormer, has been their struggle with vocals that appear intermittently. These models, relying on global attention, tend to distribute their focus uniformly across an entire sequence, often failing to adequately emphasize sparse but critical vocal segments.
Addressing this limitation, researchers Euiyeon Kim and Yong-Hoon Choi from Kwangwoon University have introduced a novel model called BSMamba2. This new approach leverages Mamba2, a cutting-edge state space model, to enhance the capture of long-range temporal dependencies in audio. Mamba2 is particularly adept at handling sequences with sparse events due to its selective state updates, which allow it to inject information strongly at important moments while suppressing irrelevant ones.
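To make the idea of selective state updates more concrete, here is a minimal sketch of a scalar recurrence whose step size depends on the input. It is a toy illustration, not Mamba2 itself and not the authors' code: the function names, the use of the input magnitude to drive the step size, and the parameters `w_delta` and `b_delta` are all assumptions chosen for readability.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, w_delta=4.0, b_delta=-4.0):
    """Toy scalar recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The retention factor a_t depends on the input, so the state is
    overwritten at salient inputs and carried along otherwise.
    w_delta and b_delta are hypothetical, hand-picked parameters.
    """
    h = 0.0
    states = []
    for x in xs:
        # Input-dependent step size: large for strong inputs, small for silence.
        delta = softplus(w_delta * abs(x) + b_delta)
        a = math.exp(-delta)          # retention factor in (0, 1)
        h = a * h + (1.0 - a) * x     # inject strongly only when delta is large
        states.append(h)
    return states


# A single sparse event surrounded by silence.
signal = [0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
states = selective_scan(signal)
print([round(s, 3) for s in states])
```

In this sketch the state stays at zero during the leading silence, jumps close to the event value when the event arrives, and then decays only slowly, which is the behavior the paragraph above describes for sparse but important moments.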
The BSMamba2 architecture builds upon the successful band-splitting strategy and dual-path processing seen in previous models like BS-RoFormer. The band-splitting module divides the audio spectrogram into multiple frequency sub-bands, processing each independently before combining them. The dual-path module then models dependencies along both time and sub-band axes using bidirectional Mamba2 blocks, allowing for a comprehensive understanding of the audio structure.
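The data flow of band splitting followed by dual-path processing can be sketched roughly as follows. This is an illustrative outline under several assumptions: the band widths are made up, each (band, frame) cell holds a single number rather than a feature vector, a causal running mean stands in for a Mamba2 block, and averaging the forward and backward passes is just one possible way to make a block bidirectional.

```python
def split_bands(num_bins, widths):
    """Partition frequency bins [0, num_bins) into contiguous sub-bands."""
    if sum(widths) != num_bins:
        raise ValueError("band widths must cover every frequency bin exactly once")
    edges, start = [], 0
    for w in widths:
        edges.append((start, start + w))
        start += w
    return edges


def running_mean(seq):
    """Placeholder sequence model (stand-in for a Mamba2 block)."""
    out, total = [], 0.0
    for i, x in enumerate(seq, start=1):
        total += x
        out.append(total / i)
    return out


def bidirectional(seq_model, seq):
    """Run the model forward and backward, then average (an assumption)."""
    fwd = seq_model(seq)
    bwd = seq_model(seq[::-1])[::-1]
    return [(f + b) / 2.0 for f, b in zip(fwd, bwd)]


def dual_path(grid, seq_model):
    """grid[band][frame] -> same shape, mixed along time, then along bands."""
    # Time path: each sub-band's sequence of frames is processed on its own.
    grid = [bidirectional(seq_model, row) for row in grid]
    # Band path: at each frame, the sequence of sub-bands is processed.
    cols = [bidirectional(seq_model, list(col)) for col in zip(*grid)]
    return [list(row) for row in zip(*cols)]


# Hypothetical layout: 16 frequency bins split into 4 sub-bands.
bands = split_bands(num_bins=16, widths=[2, 2, 4, 8])
# A single active cell at band 1, frame 2; everything else is silent.
features = [[float(b == 1 and t == 2) for t in range(4)] for b in range(len(bands))]
mixed = dual_path(features, running_mean)
```

After the two passes, energy from the single active cell has reached other frames and other sub-bands, which shows why alternating the two axes lets every cell see the whole time-frequency grid.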
Experiments conducted on the MUSDB18HQ dataset demonstrate BSMamba2’s significant leap in performance. It achieved a chunk-level Signal-to-Distortion Ratio (cSDR) of 11.03 dB, marking the best reported performance to date. This not only surpasses previous state-of-the-art models like SCNet-large but also shows substantial improvements in utterance-level SDR (uSDR).
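For readers unfamiliar with the metric, the sketch below shows a simplified SDR computation and how a chunk-level score differs from a whole-signal score. It assumes the plain energy-ratio definition of SDR and takes the median over non-overlapping chunks; the evaluation toolkit used for published benchmark numbers may compute chunk SDR differently, so treat this as an approximation. The function names and the `eps` stabilizer are illustrative choices.

```python
import math
from statistics import median


def sdr(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))


def chunk_sdr(reference, estimate, chunk_len):
    """Median SDR over non-overlapping chunks (simplified chunk-level score)."""
    scores = [
        sdr(reference[i:i + chunk_len], estimate[i:i + chunk_len])
        for i in range(0, len(reference) - chunk_len + 1, chunk_len)
    ]
    return median(scores)


# Synthetic example: the estimate is the reference scaled by 0.9,
# so the error energy is 1% of the signal energy (about 20 dB).
ref = [math.sin(0.1 * n) for n in range(400)]
est = [0.9 * r for r in ref]

whole = sdr(ref, est)                        # score over the full signal
chunked = chunk_sdr(ref, est, chunk_len=100)  # median over four chunks
print(round(whole, 2), round(chunked, 2))
```

Because the chunk-level score is a median, a few badly separated chunks affect it less than they would affect a score computed over the whole track.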
A key finding from the research is BSMamba2’s robust and consistent performance across varying input lengths and vocal occurrence patterns. Unlike BS-RoFormer, which saw its performance degrade significantly when vocals appeared intermittently or with very short durations, BSMamba2 maintained high separation quality. For instance, the performance gap was largest for short vocal segments (1-2 seconds), where BSMamba2 outperformed BS-RoFormer by 1.15 dB.
Furthermore, BSMamba2 achieves these superior results with fewer parameters than BS-RoFormer (48.1M vs. 72.2M, roughly a third fewer), highlighting its efficiency. This work underscores the effectiveness of Mamba-based models for high-resolution audio processing and opens new avenues for broader applications in audio research. For a deeper dive into the technical details, see the full research paper.


