TLDR: Surg-InvNeRF is a novel AI model using an invertible Neural Radiance Field (NeRF) for highly accurate 3D tracking and reconstruction of deformable tissues in robotic surgery. It significantly improves upon existing methods by efficiently optimizing for long-term 2D and 3D point tracking, incorporating camera kinematics, and enabling detailed 3D scene representation, crucial for advanced surgical applications.
Robotic minimally invasive surgery (RMIS) has revolutionized medical procedures, but a persistent challenge remains: accurately tracking points and surfaces in 3D within the dynamic surgical environment. This capability is vital for numerous applications, from estimating forces during an operation to guiding navigation and even enabling autonomous surgical tasks. However, current methods often struggle with intermittent blurriness, variable lighting, complex tissue deformations, and occlusions, making consistent 3D tracking difficult.
Traditional approaches to point tracking often fall short, either providing inconsistent motion data or being limited to 2D movements. While deep learning has made strides, many solutions are constrained to short image sequences and face difficulties in disentangling camera motion from actual tissue deformation. This is where a new research paper introduces a groundbreaking solution: Surg-InvNeRF, an invertible Neural Radiance Field for 3D tracking and reconstruction in surgical vision.
The core of Surg-InvNeRF is a novel test-time optimization (TTO) approach built on a NeRF-based architecture. Unlike previous TTO methods, which primarily aggregate 2D correspondences, Surg-InvNeRF introduces a new invertible Neural Radiance Field (InvNeRF) architecture that performs both 2D and, crucially, 3D tracking in surgical scenarios. The system uses a rendering-based approach, supervising the reprojection of pixel correspondences, and learns a bidirectional mapping between the deformable (surgical scene) space and a static canonical space. This enables efficient handling of a defined workspace and guides the density of the ‘rays’ that build the 3D representation.
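The paper does not publish its network details, but the key property of any invertible mapping between deformable and canonical spaces can be illustrated with an additive coupling step (the building block of RealNVP-style invertible networks): one subset of coordinates is shifted by a function of the others, so the transform can be undone exactly. The weights and the `tanh` "network" below are purely illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network" weights -- illustrative only, not the paper's architecture.
W = rng.normal(size=(2, 1))

def shift(xy, t):
    # Translation of the z-coordinate, predicted from the frozen (x, y)
    # coordinates and the time step t.
    return np.tanh(xy @ W + t)

def to_canonical(p, t):
    # Deformable space -> static canonical space.
    xy, z = p[:, :2], p[:, 2:]
    return np.concatenate([xy, z + shift(xy, t)], axis=1)

def to_deformable(q, t):
    # Exact inverse: canonical space -> deformable space at time t.
    xy, z = q[:, :2], q[:, 2:]
    return np.concatenate([xy, z - shift(xy, t)], axis=1)

pts = rng.normal(size=(5, 3))        # 3D points observed at some time t
canon = to_canonical(pts, t=0.3)     # map into the canonical space
back = to_deformable(canon, t=0.3)   # round-trip recovers the input exactly
print(np.allclose(back, pts))        # True
```

Because the inverse is exact by construction, a point observed at one time step can be carried through the canonical space to any other time step without drift from an approximate inverse, which is what makes long-term consistency tractable.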
Key Innovations of Surg-InvNeRF
The researchers highlight several key contributions that make Surg-InvNeRF stand out:
- Long-Term 3D Point Tracking: It proposes a novel TTO approach for long-term 3D point tracking by combining stereo depth estimation with reliable 2D short-term correspondences.
- Efficient Pixel Sampling: A new algorithm is introduced to avoid redundant pixel sampling, leading to faster optimization. It tracks a ‘light error map’ to identify areas where the model needs to focus its attention, significantly reducing training time.
- Joint Loss Function: A sophisticated loss function is designed to exploit different subspaces—the 2D image plane, the deformable 3D workspace, and the static canonical 3D space—minimizing projection and re-projection errors across all of them.
- Integration of Semantic Information: The system can incorporate tool masking, identifying and segmenting out surgical instruments. This allows the optimization process to focus solely on the target tissue, simplifying the task. It also disentangles camera motion from tissue deformation by using kinematic data from the robotic system.
- Multi-scale HexPlanes (MHP): For fast inference, the paper introduces multi-scale HexPlanes, which efficiently encode density and RGB values, allowing the model to capture fine details while reducing the number of parameters.
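The exact sampling algorithm is specific to the paper, but the idea behind the error-map-guided pixel sampling described above can be sketched as importance sampling: pixels are drawn with probability proportional to a running per-pixel error, so optimization concentrates on poorly fit regions instead of resampling already-converged ones. The map size, ray count, and momentum value here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, n_rays = 32, 32, 64
error_map = np.ones((H, W))  # start uniform: every pixel equally likely

def sample_pixels(error_map, n):
    # Draw pixels with probability proportional to their current error.
    p = error_map.ravel() / error_map.sum()
    idx = rng.choice(error_map.size, size=n, replace=False, p=p)
    return np.unravel_index(idx, error_map.shape)

def update_error_map(error_map, rows, cols, per_pixel_error, momentum=0.9):
    # Exponential moving average keeps the map smooth across iterations.
    error_map[rows, cols] = (momentum * error_map[rows, cols]
                             + (1 - momentum) * per_pixel_error)
    return error_map

rows, cols = sample_pixels(error_map, n_rays)
per_ray_error = rng.uniform(size=n_rays)  # stand-in for photometric error
error_map = update_error_map(error_map, rows, cols, per_ray_error)
```

Sampling without replacement within each batch is one simple way to avoid the redundant pixel sampling the authors highlight; high-error pixels still dominate across batches.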
How it Works
Surg-InvNeRF takes pairs of images from different time steps, along with camera parameters and short-term pixel correspondences. It calculates rays from these points and uses stereo correspondences to triangulate 3D points, which then guide the density of the rays during optimization. The invertible neural network performs a bidirectional mapping between the deformable workspace and a static canonical space. This means points from one time step can be mapped to the canonical space and then transformed to another time step, ensuring consistency and allowing for long-term tracking. The system optimizes by minimizing errors across the image planes, the deformable workspace, and the canonical space.
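The stereo triangulation step mentioned above is standard multi-view geometry: given a pixel correspondence between the left and right cameras and their projection matrices, the linear (DLT) method recovers the 3D point. The camera intrinsics, baseline, and test point below are made-up values for illustration, not taken from the paper.

```python
import numpy as np

def triangulate(P_left, P_right, uv_left, uv_right):
    # Linear (DLT) triangulation: each observed pixel contributes two
    # homogeneous constraints; the 3D point is the null space of A.
    A = np.stack([
        uv_left[0]  * P_left[2]  - P_left[0],
        uv_left[1]  * P_left[2]  - P_left[1],
        uv_right[0] * P_right[2] - P_right[0],
        uv_right[1] * P_right[2] - P_right[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy rectified stereo rig: identical intrinsics, 1 cm baseline along x.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.01], [0], [0]])])

X_true = np.array([0.02, -0.01, 0.15])  # ground-truth point (metres)
uv_l = P_left @ np.append(X_true, 1)
uv_l = uv_l[:2] / uv_l[2]
uv_r = P_right @ np.append(X_true, 1)
uv_r = uv_r[:2] / uv_r[2]
print(triangulate(P_left, P_right, uv_l, uv_r))  # ≈ [0.02, -0.01, 0.15]
```

Points recovered this way are metric (real-world scale), which is what lets the triangulated depths anchor the ray densities during optimization.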
Performance and Results
The model was rigorously evaluated on the STIR (Surgical Tattoos in Infrared) and SCARED (Stereo Correspondence and Reconstruction of Endoscopic Data) datasets. Results show that Surg-InvNeRF significantly outperforms other state-of-the-art test-time optimization methods for 2D tracking, achieving nearly a 50% improvement in average precision. It is also the first TTO approach to successfully tackle 3D point tracking, surpassing feed-forward methods while incorporating the benefits of deformable NeRF-based reconstruction.
The efficient pixel sampling algorithm demonstrated faster convergence and better utilization of precomputed correspondences, and the inclusion of tool masking further improved 2D tracking results. Surg-InvNeRF also retains the benefits of rendering-based approaches, providing accurate, metric estimates of object locations, unlike some previous methods that lacked real-world scaling. The model likewise showed consistent depth estimation when tested with camera motion data from the SCARED dataset.
Conclusion
Surg-InvNeRF marks a significant advancement in surgical vision. It is the first NeRF-based test-time optimization method to simultaneously address 2D and 3D tracking, and the first to introduce an algorithm for efficient pixel sampling in TTO. By encoding representations from structure, color, and 3D flow of the object, it provides accurate point tracking and related depth estimation, making it a powerful tool for future surgical applications. The researchers believe that any future improvements in 2D correspondences will further enhance the representations created by their InvNeRF method. For more details, you can refer to the full research paper.