TLDR: Fake-Mamba is a novel system for detecting synthetic speech in real-time. It utilizes a bidirectional Mamba architecture with an XLSR front-end, offering significant performance improvements and faster inference compared to existing methods like Conformer. The core innovation lies in its PN-BiMamba encoder, which effectively captures subtle deepfake cues, making it highly robust and practical for real-time anti-spoofing applications.
The rapid advancements in speech synthesis technologies, such as text-to-speech (TTS) and voice conversion (VC) systems, have made it possible to generate highly realistic artificial or modified speech. While these technologies offer benefits in areas like assistive technology and audiobooks, they also introduce significant security risks, including potential for financial fraud, legal perjury, and spoofing of voice biometric systems. This growing threat has spurred intensive research into real-time speech deepfake detection (SDD).
Traditional approaches to SDD often rely on models like Conformer, which combine convolutional neural networks (CNNs) with Transformer-style attention to capture both local and global features in speech. A key component of these models, Multi-Head Self-Attention (MHSA), is effective but has a notable limitation: its time complexity is quadratic in sequence length, so its computational cost grows rapidly on longer utterances. This is a major hurdle for real-time applications and memory-limited devices. Furthermore, Conformer-based methods can struggle with robustness and generalization, potentially overlooking subtle dependencies between temporal and channel dimensions that are crucial for detecting synthetic-speech artifacts.
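The scaling difference can be made concrete with a back-of-the-envelope operation count. The frame rate, model dimension, and state size below are illustrative assumptions, not figures from the Fake-Mamba paper:

```python
# Rough operation counts contrasting quadratic self-attention with a
# linear-time state-space scan. All constants here are assumptions
# chosen for illustration only.

def mhsa_ops(seq_len: int, dim: int) -> int:
    # The attention-score matrix QK^T and the weighted sum over values
    # both touch every pair of frames: O(T^2 * d).
    return 2 * seq_len * seq_len * dim

def ssm_ops(seq_len: int, dim: int, state: int = 16) -> int:
    # A selective state-space scan updates a fixed-size state once per
    # frame: O(T * d * N), linear in sequence length.
    return seq_len * dim * state

for seconds in (1, 10, 60):
    t = seconds * 50   # assume ~50 feature frames per second of audio
    d = 256            # hypothetical model dimension
    ratio = mhsa_ops(t, d) / ssm_ops(t, d)
    print(f"{seconds:>3}s audio: MHSA/SSM op ratio ~ {ratio:.0f}x")
```

The ratio grows linearly with utterance length, which is why the gap matters most for long, streaming audio.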
To address these challenges, a new framework called Fake-Mamba has been proposed. This innovative solution explores the potential of Mamba, a state-space model that has recently achieved state-of-the-art performance across various domains, including language modeling and computer vision. Mamba offers compelling advantages over Conformer-based approaches, notably its near-linear time complexity and a global receptive field. Unlike MHSA, Mamba’s input-dependent selection mechanism allows for more efficient information flow by dynamically controlling feature contributions to hidden states, minimizing irrelevant influences and enhancing the detection of crucial deepfake artifacts while significantly reducing computational overhead.
Fake-Mamba is the first framework to re-architect Transformer and Conformer modules by replacing multi-head self-attention with bidirectional state-space modeling for speech deepfake detection. The system integrates an XLSR front-end, a well-established foundational model pre-trained on a vast amount of speech data, to capture rich linguistic representations. This front-end is crucial for effectively identifying the subtle cues of synthetic speech. The core innovation of Fake-Mamba lies in its introduction of three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba.
Among these, the PN-BiMamba variant stands out. It employs Pre-LayerNorm stabilization and bidirectional feature fusion, which are critical for localizing subtle synthetic cues. This design allows Fake-Mamba to effectively capture both local and global artifacts present in deepfake speech. The overall pipeline of Fake-Mamba involves four stages: frame-level feature extraction using XLSR, processing by the chosen BiMamba backbone, utterance-level pooling, and finally, classification to determine if the speech is human or synthetic.
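The bidirectional idea can be sketched on toy scalar features. A real Mamba block uses learned, input-dependent parameters over high-dimensional hidden states; here the fixed `decay` and additive fusion are simplifying assumptions made purely to show how each frame comes to see both past and future context:

```python
# Toy sketch of a bidirectional state-space recurrence, the concept
# behind a BiMamba-style block. `decay` is a fixed assumption; real
# Mamba learns input-dependent dynamics.

def causal_scan(xs, decay=0.5):
    # h_t = decay * h_{t-1} + x_t : each output summarizes the past.
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out

def bimamba_scan(xs, decay=0.5):
    # Run a forward scan and a backward scan (a scan over the reversed
    # sequence), then fuse by addition so every position carries both
    # left and right context.
    fwd = causal_scan(xs, decay)
    bwd = causal_scan(xs[::-1], decay)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

feats = [1.0, 0.0, 0.0, 1.0]
print(bimamba_scan(feats))  # symmetric input -> symmetric output
```

In the full pipeline this scan would sit between the XLSR frame-level features and the utterance-level pooling stage.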
Evaluations on challenging benchmarks, including ASVspoof 2021 LA, 2021 DF, and the In-The-Wild datasets, demonstrate Fake-Mamba's superior performance. Specifically, Fake-Mamba achieved Equal Error Rates (EER) of 0.97%, 1.74%, and 5.85% on these datasets, respectively. These results represent substantial relative gains over existing state-of-the-art models like XLSR-Conformer and XLSR-Mamba. For instance, Fake-Mamba(L) reduced EER relative to XLSR-Conformer by 29.71%, 23.35%, and 28.92% on the three datasets.
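For readers unfamiliar with the metric, EER is the operating point where the false-acceptance rate (spoof scored as bona fide) equals the false-rejection rate (bona fide rejected). A minimal sketch, using the common min-over-thresholds-of-max(FAR, FRR) approximation and made-up scores:

```python
# Minimal Equal Error Rate (EER) computation. Scores and labels below
# are invented for illustration; they are not from any benchmark.

def eer(scores, labels):
    # labels: 1 = bona fide, 0 = spoof; higher score = more bona fide.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    best = 1.0
    fn = tn = 0  # counts falling below the moving threshold
    for _, label in pairs:
        if label == 1:
            fn += 1
        else:
            tn += 1
        frr = fn / pos          # bona fide wrongly rejected
        far = (neg - tn) / neg  # spoof wrongly accepted
        best = min(best, max(frr, far))
    return best

# Perfectly separated scores give EER = 0:
print(eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # 0.0
```

A lower EER means fewer errors at the balanced operating point, so the drops reported above translate directly into fewer missed deepfakes and fewer false alarms.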
Beyond its accuracy, Fake-Mamba maintains real-time inference capabilities across various utterance lengths, making it highly practical for real-world anti-spoofing applications such as call centers, teleconferencing, and internet audio streaming services. Its hardware-friendly design contributes to consistently lower Real-Time Factors (RTFs) compared to XLSR-Conformer, indicating greater efficiency. Ablation studies further confirmed the critical role of each component within the PN-BiMamba architecture, highlighting the importance of LayerNorm layers, the Feed-Forward Network (FFN), the bidirectional structure, and linear attention pooling for optimal performance.
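The Real-Time Factor mentioned above is simply processing time divided by audio duration; a value below 1 means the detector keeps up with the incoming stream. The timing numbers in this sketch are invented for illustration:

```python
# Real-Time Factor (RTF) = processing time / audio duration.
# RTF < 1 means the system can score audio faster than it arrives.
# The example timings below are hypothetical, not measured values.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# e.g. scoring a 10 s utterance in 0.2 s of compute:
rtf = real_time_factor(0.2, 10.0)
print(f"RTF = {rtf:.2f} -> "
      f"{'real-time capable' if rtf < 1 else 'slower than real-time'}")
```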
The research indicates that Mamba-based architectures are a viable and powerful alternative to traditional Transformers and Conformers for speech deepfake detection. The code for Fake-Mamba is publicly available, encouraging further research and development in this critical area of audio security. For more details, you can refer to the full research paper: Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention’s Alternative.