TLDR: SMoEStereo is a new AI framework that significantly improves robust stereo matching by adaptively fine-tuning Vision Foundation Models (VFMs). It uses a selective Mixture-of-Experts (MoE) approach with adaptive Low-Rank Adaptation (LoRA) and Adapter layers, along with a lightweight decision network, to dynamically select optimal components for varying scene complexities. This enables state-of-the-art cross-domain and joint generalization performance across diverse real-world datasets with high efficiency and minimal learnable parameters.
In the rapidly evolving field of computer vision, stereo matching – the process of identifying pixel-wise correspondences between two images to determine depth – is crucial for applications like autonomous driving, robot navigation, and augmented reality. While recent advancements in learning-based stereo matching have shown impressive results on controlled benchmarks, their performance often falters in real-world scenarios due to significant variations in scenes and imbalanced disparity distributions across different datasets. This challenge, known as domain shift, leads to less robust and often distorted depth estimations.
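To make the link between correspondences and depth concrete, the short sketch below applies the standard pinhole relation for a rectified stereo pair, Z = f * B / d, where d is the disparity in pixels. The camera values (720 px focal length, 0.54 m baseline) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative
        # values are invalid, so both are treated as "infinitely far" here.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 720 px focal length, 0.54 m baseline.
depth_m = disparity_to_depth(36.0, focal_px=720.0, baseline_m=0.54)
print(f"{depth_m:.2f} m")
```

Because depth is inversely proportional to disparity, a small disparity error on a distant object produces a large depth error, which is one reason shifts in disparity distribution between datasets hurt so much.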
Addressing Real-World Challenges with Vision Foundation Models
A promising avenue to enhance the robustness of stereo matching lies in leveraging Vision Foundation Models (VFMs). These powerful models, such as DepthAnythingV2 for monocular depth estimation and SegmentAnything for segmentation, are pre-trained on vast and diverse datasets. They are excellent at extracting general-purpose deep features, which intuitively should improve robustness. However, directly applying these VFMs to stereo matching tasks has shown limited success in zero-shot performance, meaning they struggle with entirely new, unseen environments without specific training.
Furthermore, existing fine-tuning methods for VFMs, like Low-Rank Adaptation (LoRA), often use a fixed approach that doesn’t adapt well to the varying complexities of real-world stereo scenes. They treat all inputs uniformly, which limits their ability to dynamically adjust to scene-specific characteristics, leading to suboptimal generalization.
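For readers unfamiliar with LoRA, here is a minimal, dependency-free sketch of a single fixed-rank LoRA layer. It follows the usual formulation, y = xW + (alpha / r) * (xA)B, with the pretrained weight W frozen and only the low-rank factors A and B trained. The dimensions, the seed, and the names `lora_forward`, `down`, and `up` are illustrative assumptions, not the paper's code. Note that the rank is a single constant applied to every input, which is the uniform treatment described above.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, down, up, alpha):
    """y = x W + (alpha / r) * (x A) B; W is frozen, only A (down) and B (up) train."""
    rank = len(up)  # B has shape (r, d_out), so its row count is the rank
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, down), up)
    scale = alpha / rank
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


rng = random.Random(0)
d_in, d_out, rank = 4, 3, 2  # hypothetical sizes
w_frozen = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
down = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]  # A: random init
up = [[0.0] * d_out for _ in range(rank)]                                # B: zero init
x = [[1.0, 2.0, 3.0, 4.0]]

y = lora_forward(x, w_frozen, down, up, alpha=4.0)
```

With B initialized to zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behavior.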
Introducing SMoEStereo: Adaptive and Efficient Depth Perception
To overcome these limitations, researchers have introduced SMoEStereo, a framework that adapts VFMs for robust stereo matching. SMoEStereo combines Low-Rank Adaptation (LoRA) with Mixture-of-Experts (MoE) modules in a tailored, scene-specific way, so the model can dynamically select the most suitable experts for each input scene and adapt across diverse domains.
The framework introduces two key components: MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. MoE-LoRA holds LoRA experts of different ranks and dynamically selects among them to adapt to varying scenes across domains. MoE-Adapter holds convolutional adapter experts of different kernel sizes and injects a local inductive bias into the frozen VFM, which is essential for extracting geometric features. This hybrid design combines the strengths of CNNs (fine-grained local detail) and LoRA (long-range interactions), significantly reducing stereo matching errors compared with traditional VFM-LoRA baselines.
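The expert-selection idea can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: it assumes a softmax router with top-1 selection, three LoRA experts of ranks 1, 2, and 4, and constant hypothetical weights so the result is deterministic. The names `LoRAExpert` and `moe_lora` are made up for this example. A MoE-Adapter would follow the same routing pattern, with convolutional experts of different kernel sizes in place of the low-rank experts.

```python
import math


def matvec(mat, vec):
    """Matrix-vector product for small nested-list matrices."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LoRAExpert:
    """Low-rank update x -> up @ (down @ x) with a given rank."""

    def __init__(self, dim, rank, value=0.1):
        # Constant hypothetical weights keep the example deterministic.
        self.rank = rank
        self.down = [[value] * dim for _ in range(rank)]  # rank x dim
        self.up = [[value] * rank for _ in range(dim)]    # dim x rank

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


def moe_lora(x, router_weights, experts):
    """Top-1 routing: apply only the expert with the highest router score."""
    probs = softmax(matvec(router_weights, x))
    best = max(range(len(experts)), key=probs.__getitem__)
    return best, [probs[best] * u for u in experts[best](x)]


dim = 3
experts = [LoRAExpert(dim, rank) for rank in (1, 2, 4)]
router_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical
x = [2.0, 0.0, -1.0]

chosen, update = moe_lora(x, router_weights, experts)
print("chosen expert:", chosen, "rank:", experts[chosen].rank)
```

Because the router scores depend on the input, different scenes can be sent to experts of different capacity, in contrast to a single fixed-rank LoRA applied to every input.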
Balancing Efficiency and Accuracy with a Lightweight Decision Network
A critical aspect of SMoEStereo is its lightweight decision network. Integrating MoE modules into all Vision Transformer (ViT) blocks can introduce computational overhead. The decision network addresses this by selectively activating MoE modules based on the input complexity. For simpler samples, it discards redundant modules, while for complex ones, it utilizes more, striking a balance between efficiency and accuracy. This network is jointly optimized with the MoE modules, incorporating a usage loss to manage computational costs and encourage policies that reduce redundancy without sacrificing performance.
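A rough sketch of this gating idea is shown below. The article does not give the decision network's exact form, so everything here is an assumption made for illustration: a scalar "complexity" feature per input, one sigmoid gate per ViT block, a hard 0.5 threshold at inference, and a usage loss defined as the squared gap between average module usage and a target budget. The weights and biases are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def keep_probabilities(feature, weights, biases):
    """One keep-probability per ViT block, from a pooled scalar input feature."""
    return [sigmoid(w * feature + b) for w, b in zip(weights, biases)]


def usage_loss(keep_probs, target_ratio):
    """Penalize the gap between average module usage and a compute budget."""
    usage = sum(keep_probs) / len(keep_probs)
    return (usage - target_ratio) ** 2


def active_blocks(keep_probs, threshold=0.5):
    """Indices of blocks whose MoE module stays switched on at inference."""
    return [i for i, p in enumerate(keep_probs) if p > threshold]


# Hypothetical gate parameters for a four-block backbone.
weights = [1.0, 1.0, 1.0, 1.0]
biases = [-3.0, -1.0, 1.0, 3.0]

simple_scene = keep_probabilities(0.0, weights, biases)   # low-complexity input
complex_scene = keep_probabilities(2.0, weights, biases)  # high-complexity input

print(active_blocks(simple_scene), active_blocks(complex_scene))
print(usage_loss(simple_scene, target_ratio=0.5))
```

In joint training, a term like this would be added to the stereo matching loss with a weighting factor, so the gates learn to switch modules off when doing so costs little accuracy.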
State-of-the-Art Performance Across Diverse Benchmarks
Extensive experiments demonstrate that SMoEStereo achieves state-of-the-art cross-domain and joint generalization across multiple benchmarks, including KITTI, Middlebury, ETH3D, and DrivingStereo, without requiring dataset-specific adaptation. It significantly outperforms previous domain-generalized methods and other parameter-efficient fine-tuning techniques, often with fewer parameters and faster inference times. The framework is also versatile: it performs strongly with various VFM backbones, including DAM (DepthAnything), SAM (SegmentAnything), and DINOv2.
The dynamic expert selection mechanism of SMoEStereo is particularly effective, as different datasets exhibit distinct optimal LoRA and Adapter expert selection distributions. This empirical validation underscores SMoEStereo’s flexible adaptability, which is crucial for robust cross-domain generalization in real-world deployments.
For more technical details, refer to the full research paper.


