TLDR: This research paper proposes a clinically grounded methodology and software framework for evaluating interactive medical image segmentation algorithms. It addresses current validation pitfalls, such as inconsistent input representation, neglect of challenging tasks, and inadequate metrics. By evaluating state-of-the-art algorithms across diverse and complex tasks, the study highlights the importance of minimizing information loss, using adaptive zooming strategies for robustness and rapid convergence, and considering the impact of differing prompting behaviors. It also identifies strengths and weaknesses of 2D versus 3D methods and non-medical domain models, emphasizing that volumetric context is crucial for large or irregularly shaped targets.
Interactive segmentation is a powerful technique used in medical imaging to help delineate anatomical structures and pathologies. This process is crucial for tasks like treatment planning, patient monitoring, and guided therapies. While fully automatic segmentation algorithms have made great strides, they often struggle with fine structures, diverse targets, or when annotated data is scarce. Interactive methods, which incorporate user input to guide and refine segmentations, offer a promising solution to these challenges by reducing reliance on purely image-derived features.
However, the way these interactive segmentation algorithms are currently evaluated often falls short. Inconsistent and clinically unrealistic evaluation practices can hinder fair comparisons between different methods and misrepresent their true performance in real-world clinical settings. This research paper introduces a new, clinically grounded methodology designed to define evaluation tasks and metrics more accurately. It also presents a software framework that allows for the construction of standardized evaluation pipelines.
Addressing Key Validation Challenges
The authors highlight several critical pitfalls in existing validation approaches. Many experiments fail to standardize how user inputs (prompts) and outputs are represented, often introducing pre-processing steps such as resampling images to model-specific resolutions or simulating prompts on restricted sub-regions. This can lead to misleading efficiency metrics, as the reported annotation effort might not reflect actual clinical deployment. The paper strongly recommends performing evaluations in the original image space to better reflect real-world scenarios.
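As a rough sketch of what evaluating in the original image space can look like in practice, the snippet below resamples a prediction made on a model-specific grid back onto the original voxel grid before any metric is computed. It assumes SimpleITK and a hypothetical `run_model` wrapper; it is not the paper's pipeline.

```python
# Minimal sketch: bring a prediction produced on a resampled copy of the image
# back onto the original voxel grid before computing metrics.
# Assumes SimpleITK; `run_model` is a hypothetical model wrapper.
import SimpleITK as sitk

def predict_in_original_space(original_img: sitk.Image, run_model) -> sitk.Image:
    # Many models internally resample to a fixed spacing; the prediction then
    # lives on that grid, not on the clinician's original image.
    prediction_model_space = run_model(original_img)  # sitk.Image label map

    # Map the label map back onto the original geometry (nearest neighbour keeps
    # labels intact), so accuracy is measured where it matters clinically.
    return sitk.Resample(
        prediction_model_space,
        original_img,              # reference grid: original size/spacing/origin/direction
        sitk.Transform(),          # identity transform; only the sampling grid changes
        sitk.sitkNearestNeighbor,
        0,                         # background value for voxels outside the model grid
        sitk.sitkUInt8,
    )
```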
Another issue is the focus on segmentation tasks that are already easily handled by automated methods, neglecting the more challenging cases clinicians face. These include targets with ambiguous boundaries (like tumors), very small structures (such as white matter lesions), geometrically complex or topologically constrained networks (like vascular systems), and multi-target segmentations where simple label merging isn’t sufficient. The paper advocates for prioritizing these difficult tasks and assessing multi-target segmentation more rigorously.
Furthermore, many studies overlook multi-modal or multi-sequence image data, which often provides essential clinical context. The proposed validation pipelines aim to include such data where applicable. The paper also points out that prompting methods (points, scribbles, boxes) vary in effectiveness and effort. For fair comparisons, evaluations should use prompt configurations supported by all algorithms being compared, and user effort should be estimated not just by interaction count but also by prompt placement effort, ideally through user studies.
Comprehensive Metrics for Better Assessment
To provide a more complete picture of performance, the research emphasizes the importance of reporting complementary metrics. Instead of relying solely on overlap-based metrics like Dice, which can misrepresent performance on small or complex structures, evaluations should also include boundary-aware metrics such as normalized surface Dice (NSD). Capturing performance variability during iterative refinement is also crucial. Metrics like Dice and NSD area under the curve (AUC), normalized by interaction count, can measure convergence speed and stability. The authors also suggest considering larger interaction budgets to identify emergent behaviors and using clinical criteria or specialist automatic baselines to determine when segmentation is complete.
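To illustrate the convergence metric described above, the following sketch computes an interaction-normalized area under the Dice (or NSD) curve: the metric value recorded after each simulated interaction is integrated and divided by the interaction budget, so faster convergence yields a higher score. The exact weighting used in the paper's framework may differ.

```python
# A hedged sketch of interaction-normalized AUC: methods that converge quickly
# score higher than methods reaching the same final value slowly.
import numpy as np

def interaction_auc(scores_per_step: list[float]) -> float:
    """scores_per_step[i] = Dice (or NSD) after i simulated interactions
    (index 0 = result from the initial prompt)."""
    scores = np.asarray(scores_per_step, dtype=float)
    # Trapezoidal area under the metric-vs-interaction curve, divided by the
    # interaction budget so the result stays in [0, 1].
    return float(np.trapz(scores, dx=1.0) / (len(scores) - 1))

# Example: fast convergence beats slow convergence with the same endpoint.
fast = interaction_auc([0.70, 0.85, 0.88, 0.89, 0.90])   # ~0.86
slow = interaction_auc([0.40, 0.55, 0.70, 0.82, 0.90])   # ~0.68
```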
A New Framework for Evaluation
The core of this work is a modular framework that separates the generation of segmentation requests (image patch, prompts, task description) and metric computation from the inference algorithm itself. This framework integrates a task selection pipeline. Algorithms are characterized by “fingerprints” that describe their capabilities: whether they adapt over repeated tasks, their supported inference modes, segmentation subtypes, training specificity, prompt compatibility, image patch configurations, and the modalities seen during training. These fingerprints are then cross-referenced with candidate tasks to select compatible experiments.
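As a rough illustration of how such fingerprints can drive task selection, the sketch below declares a handful of capability fields and checks them against a task's requirements. The field names and the Python representation are assumptions for exposition, not the framework's actual schema.

```python
# Simplified illustration of the "fingerprint" idea: capabilities are declared
# once per algorithm and cross-referenced against task requirements.
from dataclasses import dataclass

@dataclass
class AlgorithmFingerprint:
    supports_3d: bool          # volumetric vs. slice-wise inference
    prompt_types: set[str]     # e.g. {"point", "box", "scribble"}
    modalities_seen: set[str]  # modalities present during training, e.g. {"CT", "MRI"}
    adapts_over_tasks: bool    # whether the model adapts over repeated tasks

@dataclass
class TaskRequirements:
    needs_3d: bool
    required_prompts: set[str]
    modality: str

def is_compatible(fp: AlgorithmFingerprint, task: TaskRequirements) -> bool:
    # Schedule an experiment only if the algorithm can honour the task's inputs;
    # for fair comparisons, prompts must be supported by every algorithm compared.
    return (fp.supports_3d or not task.needs_3d) and task.required_prompts <= fp.prompt_types
```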
Several state-of-the-art algorithms, including SAM2, SAM-Med2D, SAM-Med3D, and SegVol, were integrated into the framework and evaluated. Each model was adapted to the framework’s segmentation request definition, with careful attention paid to how images and prompt coordinates are handled across 2D and 3D architectures.
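To make this concrete, here is a minimal, hypothetical sketch (not the paper's actual adapter code) of the coordinate bookkeeping needed when a volumetric click prompt is fed to a 2D, SAM-style model: the click is split into a slice index plus in-plane coordinates rescaled to the model's fixed input resolution. The function name, axis ordering, and the 1024-pixel input size are illustrative assumptions.

```python
# Hypothetical sketch: map a click in the original voxel grid to the slice index
# and in-plane pixel coordinates expected by a 2D, SAM-style model.
def to_2d_model_space(click_zyx, volume_shape_zyx, model_size=1024):
    z, y, x = click_zyx
    depth, height, width = volume_shape_zyx
    slice_index = z                    # which axial slice the 2D model sees
    scale_y = model_size / height      # in-plane rescaling to the model's
    scale_x = model_size / width       # fixed input resolution
    return slice_index, (y * scale_y, x * scale_x)

# Example: a click at voxel (40, 200, 150) in an 80x512x512 volume lands on
# slice 40 at pixel (400.0, 300.0) in a 1024x1024 model input.
print(to_2d_model_space((40, 200, 150), (80, 512, 512)))
```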
Key Experimental Findings
The evaluation focused on challenges vital for clinical deployment, conducting assessments in native image spaces across four axes of algorithmic complexity: voxel count (volume size), image spacing/anisotropy, target geometry (spherical vs. irregular), and target size variation. Tasks were chosen from the Medical Segmentation Decathlon, including hippocampus, brain tumor core, pancreas, prostate, and lung lesion.
The experiments revealed several important insights:
- Minimizing information loss during prompt processing and using adaptive zooming strategies are critical for robustness across varying volume and target sizes.
- Adaptive zooming mechanisms also lead to faster convergence.
- Performance can degrade significantly if the prompting behavior or interaction budgets used during validation differ from those used during training.
- 2D methods perform well on slab-like images and coarse targets, but 3D context is highly beneficial for large or irregularly shaped targets.
- Non-medical domain models, such as SAM2, can perform well but may struggle with tissue-ambiguous targets (like brain tumors) when using simple point prompts.
For instance, in tasks with large volumes like the pancreas, SegVol consistently outperformed other methods, demonstrating rapid and consistent convergence due to its zoom-out zoom-in mechanism. For highly anisotropic images (like the prostate), 2D methods improved rapidly after a few interactions, eventually surpassing 3D methods in Dice and NSD scores. However, for isotropic images with irregular targets (brain tumor core), volumetric context proved key for rapid and consistent convergence.
Looking Ahead
This research provides a robust framework and valuable insights into evaluating interactive segmentation algorithms. Future work will explore the stability of metrics, expand the range of integrated algorithms, and broaden evaluation tasks. The framework also paves the way for future user studies to design more realistic prompting simulations and assess prompt placement time across different prompt types and target geometries, which are crucial for accurate effort estimation. A critical future step will also be to ensure that validation datasets were not used for model pre-training, a common limitation in works building on foundation models.
For more detailed information, you can read the full research paper here.


