TLDR: A new self-supervised framework, GD-EMoDE, is proposed for monocular depth estimation in diverse endoscopic scenes. It uses a novel block-wise mixture of dynamic low-rank experts for efficient finetuning of foundation models and a self-supervised training framework to handle brightness and reflectance inconsistencies. The method achieves state-of-the-art performance and generalization on various endoscopic datasets, improving 3D perception for minimally invasive surgery, though it currently has limitations in inference speed and computational cost.
In the realm of minimally invasive surgery, endoscopy plays a crucial role, allowing medical professionals to perform procedures with reduced trauma and faster recovery times. However, the limited field of view and two-dimensional nature of traditional endoscopes make it challenging to perceive the three-dimensional structure of internal scenes. This is where depth estimation comes in, providing vital 3D information for tasks like surgical navigation and robotic tissue manipulation.
A recent research paper, titled “Generalizable Self-supervised Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes,” introduces a new framework designed to overcome the significant challenges of depth estimation in varied endoscopic environments. The paper, authored by Liangjing Shao, Benshuang Chen, Chenkang Du, Xueli Liu, and Xinrong Chen, addresses issues such as inconsistent lighting and the wide variety of tissue features encountered during endoscopic procedures.
Addressing Key Challenges in Endoscopic Depth Estimation
Current methods for self-supervised monocular depth estimation, while effective in natural scenes, often fall short in endoscopy. The primary hurdles include the dramatic variations in illumination, leading to brightness and reflectance inconsistencies, and the diverse visual features of different tissues and surgical tasks. These factors severely limit the accuracy and generalizability of existing depth estimation models.
The researchers propose a novel self-supervised framework, named GD-EMoDE, which tackles these problems head-on. It integrates two main innovations: a new parameter-efficient finetuning method and a specialized self-supervised training framework.
Block-wise Mixture of Low-Rank Experts (BW-MoLE)
One of the core components of GD-EMoDE is the Block-wise Mixture of Low-Rank Experts (BW-MoLE). This method efficiently adapts a pre-trained “foundation model” for depth estimation to the specific demands of endoscopic scenes. Unlike previous finetuning approaches that might struggle with feature diversity, BW-MoLE uses a dynamic system where different “experts” are adaptively selected based on the input features. These experts, each with a small number of trainable parameters, are allocated to different parts of the model based on how well each part generalizes during training. This intelligent allocation helps the model adapt more effectively to the wide range of visual information in endoscopy.
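To make the idea of input-dependent low-rank experts concrete, here is a minimal NumPy sketch of a single linear layer whose frozen weight is adapted by a gated mixture of low-rank experts. It is an illustration under assumptions, not the authors' implementation: the class name MoLELinear, the softmax router, the expert count, and the rank are all hypothetical choices, and the block-wise allocation of experts across the model is not shown.

```python
import numpy as np


class MoLELinear:
    """Frozen linear layer plus a gated mixture of low-rank experts (hypothetical sketch)."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen weight, standing in for one layer of a pre-trained foundation model.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Each expert is a low-rank pair: A projects down to `rank`, B projects back up.
        self.A = rng.standard_normal((num_experts, rank, d_in)) * 0.01
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.B = np.zeros((num_experts, d_out, rank))
        # Router: one logit per expert, computed from the input features.
        self.router = rng.standard_normal((num_experts, d_in)) * 0.01

    def gate(self, x):
        """Softmax weights over experts for each input row, shape (n, num_experts)."""
        logits = x @ self.router.T
        logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        return weights / weights.sum(axis=-1, keepdims=True)

    def __call__(self, x):
        base = x @ self.W.T                              # (n, d_out), frozen path
        g = self.gate(x)                                 # (n, E)
        down = np.einsum("erd,nd->ner", self.A, x)       # (n, E, rank)
        up = np.einsum("eor,ner->neo", self.B, down)     # (n, E, d_out)
        return base + np.einsum("ne,neo->no", g, up)     # gated sum of expert updates

    def trainable_params(self):
        """Parameters that would be trained: experts and router, not W."""
        return self.A.size + self.B.size + self.router.size


layer = MoLELinear(d_in=64, d_out=64, num_experts=4, rank=2)
features = np.random.default_rng(1).standard_normal((3, 64))
output = layer(features)  # shape (3, 64)
```

With these illustrative sizes the experts and router hold 1,280 trainable values against 4,096 in the frozen weight, which is the sense in which each expert has only a small number of trainable parameters.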
A Novel Self-supervised Training Framework
To combat brightness and reflectance inconsistencies, the paper introduces a new self-supervised training framework that handles both jointly. It includes an intrinsic image decomposition network that separates an image into its inherent surface color (albedo) and its lighting-dependent component (shading). With lighting factored out, the model is less easily misled by bright spots or shadows when inferring depth. The training process is divided into multiple stages so that different aspects of the model can each be optimized effectively.
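The value of separating albedo from shading can be seen in a toy example. The sketch below assumes the common multiplicative image model (image = albedo × shading) and uses the true shading as a stand-in for what a decomposition network would have to predict; the function names and numbers are illustrative, and how the paper actually uses the decomposed components in its losses is not covered here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-pixel surface color of a tiny 4x4 RGB patch (illustrative values).
albedo = rng.uniform(0.2, 0.9, size=(4, 4, 3))


def render(albedo, shading):
    """Assumed image model: each pixel is albedo scaled by a grayscale shading value."""
    return albedo * shading[..., None]


def l1(a, b):
    """Mean absolute difference, a simple photometric error."""
    return float(np.mean(np.abs(a - b)))


# Same tissue seen under dimmer and brighter light, as when the endoscope moves.
shading_dim = np.full((4, 4), 0.5)
shading_bright = np.full((4, 4), 0.8)
frame_dim = render(albedo, shading_dim)
frame_bright = render(albedo, shading_bright)

# Comparing raw pixels penalizes the lighting change even though the surface is identical.
raw_error = l1(frame_dim, frame_bright)

# Dividing out shading (here the true one, in place of a network prediction) removes it.
albedo_from_dim = frame_dim / shading_dim[..., None]
albedo_from_bright = frame_bright / shading_bright[..., None]
albedo_error = l1(albedo_from_dim, albedo_from_bright)
```

Here raw_error is clearly non-zero while albedo_error is zero up to floating-point rounding, which is why a comparison in albedo space is more robust to brightness changes between frames.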
Superior Performance and Generalization
The GD-EMoDE framework was evaluated on both real and simulated endoscopic datasets, including SCARED, SimCol, C3VD, Hamlyn, and SERV-CT. The reported results show the proposed method outperforming state-of-the-art techniques, with lower error and higher accuracy. It also generalizes better: it performs well on new, unseen endoscopic scenes without additional training (zero-shot depth estimation).
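This summary does not name the metrics behind "lower error" and "higher accuracy". As an assumption, the sketch below computes three measures that are standard in the monocular depth literature: absolute relative error, root mean squared error, and the share of pixels within a 1.25 ratio of the ground truth. It also applies median scaling, which is commonly used for self-supervised methods because they recover depth only up to scale. The function name and sample values are hypothetical.

```python
import numpy as np


def depth_metrics(pred, gt, median_scale=True):
    """Common depth metrics over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0                      # pixels without ground truth are skipped
    pred, gt = pred[valid], gt[valid]
    if median_scale:
        # Align the prediction's unknown global scale to the ground truth.
        pred = pred * np.median(gt) / np.median(pred)
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),   # lower is better
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),   # lower is better
        "delta_1.25": float(np.mean(ratio < 1.25)),          # higher is better
    }


# Illustrative values only: one pixel has no ground truth (0) and is ignored.
gt_depth = np.array([[1.0, 2.0], [4.0, 0.0]])
pred_depth = np.array([[3.0, 6.0], [12.0, 5.0]])  # correct up to a global factor of 3
scores = depth_metrics(pred_depth, gt_depth)
```

Because the example prediction differs from the ground truth only by a global scale, median scaling brings the two error terms to zero and the accuracy term to one; without it, the same prediction would score poorly.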
Beyond depth estimation, the framework also contributes to more accurate 3D reconstruction and ego-motion estimation (recovering the camera’s movement), both of which are vital for surgical planning and execution.
Future Directions
While GD-EMoDE marks a significant advancement, the authors acknowledge certain limitations: inference runs at around 20 to 30 frames per second, and training is computationally expensive, requiring approximately 23 GB of GPU memory. Addressing both would make the method more practical in clinical settings.


