TLDR: TransForSeg is a novel AI model using Vision Transformers for medical catheterization. It simultaneously performs stereo segmentation (localizing the catheter in X-ray images) and 3D force estimation (predicting the contact force at the catheter tip). This multitask approach provides both visual and tactile feedback from X-ray images, eliminating the need for physical sensors and outperforming existing methods in accuracy and efficiency, even under noisy conditions. The model’s design, featuring shared weights and cross-attention, makes it lightweight and robust for real-time applications.
Catheterization procedures are vital in modern medicine, allowing surgeons to navigate the cardiovascular system with precision for diagnostics and interventions. However, these delicate procedures require both visual and tactile feedback to ensure safety and accuracy. Surgeons traditionally rely on their haptic perception to avoid applying excessive pressure, while visual feedback helps in precise navigation through complex vascular pathways.
The challenge arises because most standard catheters lack integrated force sensors or micro-cameras at their tips, primarily due to cost. To bridge this gap, deep learning models have emerged, aiming to extract both visual and tactile information directly from X-ray images. These data-driven approaches can infer contact forces and catheter positioning, reducing the reliance on physical sensors.
Existing deep learning methods for this task often fall into two categories: 2D or 3D force estimators, and semantic segmentation models for catheter localization. More recently, multitask models have combined both segmentation and force estimation into a single framework, improving efficiency by eliminating the need for separate hardware or two-stage processing.
However, many current models are based on Convolutional Neural Networks (CNNs), which progressively expand their receptive fields through image downsampling. CNNs are effective, but the potential of Vision Transformer (ViT) models remained largely unexplored, especially for stereo segmentation in this application.
Introducing TransForSeg: A Novel Approach
This is where TransForSeg comes in. Proposed by Pedram Fekri, Mehrdad Zadeh, and Javad Dargahi, TransForSeg is a novel multitask encoder-decoder Vision Transformer architecture designed for simultaneous stereo catheter segmentation and 3D force estimation from X-ray images. It processes two input X-ray images, capturing long-range dependencies without the need for gradual receptive field expansion.
The model’s innovative design includes a transformer encoder and decoder that receive patch sequences from two X-ray images concurrently. These patches are projected into rich embeddings that capture the global context of the images. The embeddings are then fed into two shared segmentation heads to generate segmentation maps, while a regression head uses the fused information from the decoder for 3D force estimation.
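To make the idea of a "patch sequence" concrete, here is a minimal sketch of how one image can be cut into non-overlapping patches and flattened into tokens. The function name `patchify`, the 8x8 image size, and the 4x4 patch size are illustrative assumptions, not values from the paper; in a real ViT each flattened patch would then pass through a learned linear projection to become an embedding.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):          # walk the patch grid row by row
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

# Two hypothetical 8x8 views standing in for the stereo X-ray pair.
view_a = [[r * 8 + c for c in range(8)] for r in range(8)]
view_b = [[(r * 8 + c) % 7 for c in range(8)] for r in range(8)]

tokens_a = patchify(view_a, 4)   # one token sequence per view
tokens_b = patchify(view_b, 4)
print(len(tokens_a), len(tokens_a[0]))
```

Each view yields its own token sequence, so the two sequences can be handed to the encoder and decoder side by side.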
A key aspect of TransForSeg is its computational efficiency. The ViT decoder shares its weights with the ViT encoder, effectively mirroring its structure. Additionally, the CNN-based upsampling head, used to reconstruct the segmentation maps, is shared between the encoder and decoder, further reducing model complexity and parameter count.
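A rough back-of-the-envelope count shows why sharing weights matters. The sketch below counts the parameters of a generic transformer layer (four attention projections, a two-layer MLP, two layer norms) and compares two separate stacks with one shared stack. The sizes `DIM`, `MLP_DIM`, and `DEPTH` are hypothetical, and the comparison ignores any decoder-only parameters such as cross-attention projections, so it illustrates the principle rather than TransForSeg's actual parameter count.

```python
def transformer_layer_params(dim, mlp_dim):
    """Parameter count of one generic transformer layer (weights plus biases)."""
    attention = 4 * (dim * dim + dim)                        # Q, K, V, and output projections
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)  # two linear layers
    layer_norms = 2 * (2 * dim)                              # scale and shift, twice
    return attention + mlp + layer_norms

# Hypothetical sizes, chosen only for illustration.
DIM, MLP_DIM, DEPTH = 64, 256, 4

stack = DEPTH * transformer_layer_params(DIM, MLP_DIM)
separate = 2 * stack   # encoder and decoder each own their weights
shared = stack         # decoder reuses the encoder's weights
print(separate, shared)
```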
Key Advantages and Performance
TransForSeg offers several significant advantages:
- It can estimate contact forces directly from X-ray images, with the segmentation task guiding the network to focus on the catheter’s deflection shape rather than background variations.
- The shared weights and cross-attention mechanism enhance computational efficiency and enable accurate 3D contact force prediction by fusing tokens from X-ray images at both angles.
- It processes two input X-ray images and produces three outputs across two modalities: two segmentation maps and a force vector predicting contact forces along the x, y, and z axes.
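The cross-attention fusion mentioned above can be sketched in a few lines. In this simplified version, tokens from one view act as queries and tokens from the other view act as keys and values, using plain scaled dot-product attention. The learned projection matrices and multiple heads of a real transformer are omitted, and the function names and toy tokens are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query token attends over the key/value tokens of the other view."""
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy tokens: one query from view A, two key/value tokens from view B.
tokens_a = [[10.0, 0.0]]
tokens_b = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(tokens_a, tokens_b, tokens_b)
print(fused)
```

Every fused token is a weighted mix of the other view's tokens, which is how information from both angles can reach the force regression head.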
Extensive experiments on synthetic X-ray images, including those with various noise levels, demonstrated that TransForSeg consistently outperforms existing state-of-the-art models in both catheter segmentation and 3D force estimation. For instance, it achieved a lower mean squared error (MSE) in force estimation than previous multitask models, such as H-Net, across different datasets.
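For readers unfamiliar with the metric, MSE for a 3D force vector can be computed as below. This is a minimal sketch: the function name `force_mse` is hypothetical, and averaging over all samples and all three axes is an assumption, since the exact averaging convention used in the paper is not stated here.

```python
def force_mse(predicted, measured):
    """Mean squared error over all samples and all three force axes (x, y, z)."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same number of samples")
    squared_errors = [(p - m) ** 2
                      for pred_vec, meas_vec in zip(predicted, measured)
                      for p, m in zip(pred_vec, meas_vec)]
    return sum(squared_errors) / len(squared_errors)

# Made-up force vectors, purely to exercise the function.
predicted = [(0.0, 0.0, 0.0)]
measured = [(1.0, 2.0, 2.0)]
print(force_mse(predicted, measured))  # (1 + 4 + 4) / 3 = 3.0
```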
An ablation study confirmed the crucial role of the segmentation heads, especially in complex X-ray images, where they improved force estimation precision by helping the model focus on catheter deflections. While the model showed some sensitivity to certain noise types, namely Gaussian noise, motion blur, and defocus, it maintained robust performance on the X-Ray1 and X-Ray2 datasets under stripe, Poisson, and impulse noise, showcasing its generalization capabilities.
Conclusion
TransForSeg represents a significant advancement in sensor-free, learning-based 3D catheter force estimation and segmentation. Its lightweight and generalizable architecture makes it well-suited for real-time deployment in catheter-based interventions, potentially enhancing safety and precision for both human surgeons and autonomous robotic systems. Future work aims to adapt it for real-world clinical settings and integrate it with robotic platforms.


