TLDR: FaRMamba is a new deep learning model for medical image segmentation that enhances Vision Mamba’s capabilities. It addresses common challenges like blurred boundaries and lost details by integrating two key modules: a Multi-Scale Frequency Transform Module (MSFM) that restores high-frequency information using various transforms (DWT, FFT, DCT) tailored to different image modalities, and a Self-Supervised Reconstruction Auxiliary Encoder (SSRAE) that recovers 2D spatial correlations through pixel-level reconstruction. This dual approach allows FaRMamba to achieve superior accuracy and detail preservation in medical image segmentation across diverse datasets like ultrasound, MRI, and endoscopy.
Medical image segmentation is a crucial task in healthcare, aiding in everything from tumor detection to organ recognition and surgical planning. However, it faces significant challenges: blurred lesion boundaries, the loss of fine high-frequency details, and difficulty in accurately modeling long-range anatomical structures within images.
Traditional methods like Convolutional Neural Networks (CNNs) are good at capturing local details but struggle with global context. Vision Transformers (ViTs), on the other hand, excel at global dependencies but can lose local pixel adjacency and fine details due to their patch-based approach. More recently, Vision Mamba models have emerged as a promising solution, efficiently modeling global dependencies with linear computational complexity, making them scalable for large medical images.
Despite their strengths, Vision Mamba models have their own limitations in medical imaging. They split an image into patches and process those patches as a one-dimensional sequence, which separates pixels that were neighbors in the 2D image and makes the model behave like a low-pass filter. As a result, local high-frequency information is poorly captured and two-dimensional spatial structure degrades, which worsens the existing problems of blurred boundaries and lost fine detail.
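To make the adjacency problem concrete, here is a minimal sketch that assumes a simple row-major (raster) scan over the patch grid. Actual Vision Mamba variants may use other scan orders, and the function name `sequence_gap` is hypothetical, chosen only for this illustration.

```python
import numpy as np

def sequence_gap(grid_h: int, grid_w: int) -> tuple[int, int]:
    """Return how far apart 2D-adjacent patches land in a row-major 1D sequence.

    Assumption: patches are flattened in raster order (row by row).
    """
    # positions[r, c] is the index of patch (r, c) in the flattened sequence.
    positions = np.arange(grid_h * grid_w).reshape(grid_h, grid_w)
    horizontal_gap = int(positions[0, 1] - positions[0, 0])  # neighbor to the right
    vertical_gap = int(positions[1, 0] - positions[0, 0])    # neighbor directly below
    return horizontal_gap, vertical_gap

h_gap, v_gap = sequence_gap(14, 14)
print(h_gap, v_gap)
```

Under this assumption, a patch and its right-hand neighbor stay next to each other in the sequence, while a patch and the one directly below it end up a full row width apart, even though both pairs are equally close in the image.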
To address these critical shortcomings, researchers have proposed FaRMamba, a novel extension to Vision Mamba. FaRMamba introduces two complementary modules designed to explicitly tackle the challenges of lost high-frequency details and degraded 2D spatial structures.
Multi-Scale Frequency Transform Module (MSFM)
The first module, MSFM, focuses on restoring the high-frequency information that often gets lost. It does this by transforming spatial image features into the frequency domain and then analyzing information across multiple spectral bands. FaRMamba explores three different frequency transforms within this module: Discrete Wavelet Transform (DWT), Fast Fourier Transform (FFT), and Discrete Cosine Transform (DCT). The choice of transform can be tailored to the specific type of medical image, as each has unique strengths. For instance, DWT is particularly effective for noisy ultrasound images, FFT aligns well with MRI’s native data structure, and DCT is best suited for the textured patterns found in endoscopic images.
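As a rough illustration of what analyzing separate spectral bands means, the sketch below splits a 2D array into low- and high-frequency components with NumPy's FFT. It is not the MSFM implementation: the function name `split_frequency_bands`, the radial cutoff, and the two-band split are assumptions made for this example.

```python
import numpy as np

def split_frequency_bands(image: np.ndarray, cutoff: float = 0.25):
    """Split a 2D array into low- and high-frequency parts using a radial FFT mask.

    Assumption: `cutoff` is a radius in normalized frequency (cycles per pixel).
    """
    h, w = image.shape
    # Centered 2D spectrum of the input.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Normalized frequency coordinates matching the shifted spectrum layout.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    low_mask = np.sqrt(fy**2 + fx**2) <= cutoff
    # Invert each band separately; the two parts sum back to the input.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

rng = np.random.default_rng(0)
feature_map = rng.random((32, 32))
low, high = split_frequency_bands(feature_map)
print(np.allclose(low + high, feature_map))
```

The high band carries edges and fine texture, which is the kind of information the article says is lost in sequence-based processing. A DWT or DCT variant would follow the same pattern with a different transform.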
Self-Supervised Reconstruction Auxiliary Encoder (SSRAE)
The second module, SSRAE, aims to recover the full two-dimensional spatial correlations that can be disrupted by Mamba’s one-dimensional processing. This module enforces pixel-level reconstruction on the shared Mamba encoder. By training the model to precisely restore degraded versions of the input images, SSRAE encourages the encoder to learn spatially coherent representations, which in turn enhances both fine textures and the overall global context of the image. This self-supervised approach helps the model understand and preserve geometric details and boundary fidelity.
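The idea of reconstructing a degraded input can be sketched as follows. The article does not say how the images are degraded, so random patch masking is used here purely as an assumption, and the helper names `mask_random_patches` and `reconstruction_loss` are hypothetical.

```python
import numpy as np

def mask_random_patches(image: np.ndarray, patch: int = 4, ratio: float = 0.5, seed: int = 0):
    """Zero out a random subset of square patches (assumed degradation, not the paper's).

    Assumption: image height and width are divisible by `patch`.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    keep = rng.random((h // patch, w // patch)) >= ratio  # True where a patch is kept
    # Expand the per-patch decision to a per-pixel mask.
    mask = np.repeat(np.repeat(keep, patch, axis=0), patch, axis=1)
    degraded = np.where(mask, image, 0.0)
    return degraded, mask

def reconstruction_loss(prediction: np.ndarray, target: np.ndarray) -> float:
    """Pixel-level mean squared error between a reconstruction and the clean image."""
    return float(np.mean((prediction - target) ** 2))

image = np.random.default_rng(1).random((16, 16))
degraded, mask = mask_random_patches(image)
# A model would map `degraded` back toward `image`; the degraded input itself
# serves as a stand-in prediction to show how the loss is computed.
print(reconstruction_loss(degraded, image))
```

In training, the prediction would come from a decoder attached to the shared encoder, so minimizing this loss pushes the encoder toward representations that keep 2D spatial structure.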
FaRMamba combines these two modules with a joint loss function that dynamically adjusts during training, ensuring that both segmentation accuracy and reconstruction quality are optimized.
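One way such a dynamically weighted objective could look is sketched below. The article does not describe the weighting rule, so the cosine schedule, its bounds, and the names `reconstruction_weight` and `joint_loss` are all assumptions for illustration only.

```python
import math

def reconstruction_weight(step: int, total_steps: int,
                          w_max: float = 1.0, w_min: float = 0.1) -> float:
    """Hypothetical cosine schedule: decays from w_max to w_min over training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))

def joint_loss(seg_loss: float, rec_loss: float, step: int, total_steps: int) -> float:
    """Segmentation loss plus a step-dependent multiple of the reconstruction loss."""
    return seg_loss + reconstruction_weight(step, total_steps) * rec_loss

for step in (0, 50, 100):
    print(step, joint_loss(seg_loss=0.8, rec_loss=0.4, step=step, total_steps=100))
```

With this particular schedule, reconstruction dominates early training and segmentation takes over later. Other rules, such as learned or uncertainty-based weights, would fit the same description.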
Extensive evaluations of FaRMamba were conducted on diverse medical datasets, including CAMUS echocardiography, MRI-based Mouse-cochlea, and Kvasir-Seg endoscopy. The results consistently showed that FaRMamba outperforms competitive CNN-Transformer hybrids and existing Mamba variants. It delivered superior boundary accuracy, better detail preservation, and improved global coherence without adding excessive computational burden.
This work represents a significant step forward, providing a flexible, frequency-aware framework for future medical image segmentation models that directly mitigates core challenges in the field. For more in-depth information, you can read the full research paper available at https://arxiv.org/pdf/2507.20056.


