TLDR: The paper introduces MARIS, the first large-scale, fine-grained benchmark for open-vocabulary underwater instance segmentation. It also proposes a novel framework with two modules: the Geometric Prior Enhancement Module (GPEM) to handle visual degradation and the Semantic Alignment Injection Mechanism (SAIM) to address semantic ambiguity. This framework significantly improves object recognition and segmentation in challenging underwater environments, outperforming existing methods.
Underwater environments pose significant hurdles for artificial intelligence systems designed to identify and segment objects, because of visual challenges such as color attenuation, low contrast, and light scattering. Traditional methods for underwater instance segmentation, the task of precisely outlining and categorizing every object in an image, have been limited by a restricted vocabulary of recognizable marine species and a scarcity of detailed annotated data. As a result, they often struggle to identify new or fine-grained marine categories, a capability that is crucial for applications like marine biodiversity monitoring and autonomous underwater vehicles.
A new research paper introduces a groundbreaking solution to these challenges: MARIS (Marine Open-Vocabulary Instance Segmentation). This work not only presents the first large-scale, fine-grained benchmark dataset for open-vocabulary segmentation in underwater settings but also proposes a novel framework designed to overcome the inherent difficulties of underwater imagery.
The MARIS Dataset: A New Standard for Underwater Data
One of the primary contributions of this research is the MARIS dataset itself. Existing underwater datasets typically contain fewer than 20 annotated categories, often grouping diverse organisms into broad classes like “fish” or “plants.” This coarse labeling severely restricts the ability of AI models to generalize to unseen or highly specific marine species. To address this, MARIS was meticulously curated from multiple sources, re-annotated, and expanded to include over 16,000 underwater images categorized into 9 super-classes and 158 fine-grained subclasses. For instance, the “fish” super-class is refined into 76 distinct species. All annotations are provided at the instance level with pixel-accurate masks, making MARIS the first benchmark to support rigorous evaluation of open-vocabulary instance segmentation in underwater environments.
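To make the cost of coarse labeling concrete, here is a minimal Python sketch of a two-level label hierarchy. Only the "fish" super-class and the "plastic bag" category are named in this article; the species names, the "debris" super-class, and the `FINE_TO_SUPER` and `coarsen` names are hypothetical placeholders, not the dataset's actual label set.

```python
from collections import Counter

# Hypothetical slice of a two-level taxonomy: fine-grained subclass -> super-class.
# Apart from "fish" and "plastic bag", every label below is a placeholder.
FINE_TO_SUPER = {
    "clownfish": "fish",
    "lionfish": "fish",
    "moray eel": "fish",
    "plastic bag": "debris",
}


def coarsen(labels):
    """Map fine-grained instance labels to their super-classes."""
    return [FINE_TO_SUPER[label] for label in labels]


# Labels of four annotated instances in one imaginary image.
instance_labels = ["clownfish", "lionfish", "plastic bag", "clownfish"]

fine_counts = Counter(instance_labels)             # three distinct fine-grained categories
coarse_counts = Counter(coarsen(instance_labels))  # collapses to two super-classes
```

With coarse labels, the two species become indistinguishable: a model trained only on "fish" never receives a signal that separates a clownfish from a lionfish.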
A Unified Framework: GPEM and SAIM
Beyond the dataset, the researchers propose a unified framework with two complementary components to tackle the core issues of visual degradation and semantic ambiguity in underwater images:
- Geometric Prior Enhancement Module (GPEM): Underwater images suffer from severe visual degradation, making visual appearance cues unstable. However, many underwater objects retain stable geometric properties (e.g., body shapes, fin structures). The GPEM leverages these stable part-level and structural cues to maintain object consistency even under degraded visual conditions. It fuses multi-scale visual features with depth-derived geometric priors, enhancing representations with crucial structural information.
- Semantic Alignment Injection Mechanism (SAIM): Current vision-language models (VLMs), primarily trained on terrestrial data, often fail to capture the fine-grained semantics specific to underwater environments. This leads to semantic ambiguity. The SAIM enriches language embeddings with domain-specific priors by introducing “underwater prompts.” These prompts encode five complementary aspects of underwater scenes: environmental context, water medium and visibility, illumination and perception, depth cues, and scene interactions. By guiding the model with these enriched underwater semantics, SAIM mitigates category ambiguity and significantly improves the recognition of unseen categories.
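The idea of "underwater prompts" can be illustrated with a small Python sketch. The five aspect names come from the description above, but the wording of each phrase is invented here, and averaging the prompt embeddings is a common prompt-ensembling technique used as a stand-in, not necessarily how SAIM injects its priors. The `encode` argument is a placeholder for whatever text encoder a vision-language model provides.

```python
# The five aspects are named in the article; the phrase attached to each is invented.
UNDERWATER_ASPECTS = {
    "environmental context": "in a marine habitat",
    "water medium and visibility": "seen through turbid, color-attenuated water",
    "illumination and perception": "under dim, uneven lighting",
    "depth cues": "at varying depth",
    "scene interactions": "among surrounding marine life and objects",
}


def build_underwater_prompts(category):
    """Return one text prompt per underwater aspect for a category name."""
    return {
        aspect: f"a photo of a {category} {phrase}"
        for aspect, phrase in UNDERWATER_ASPECTS.items()
    }


def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def enriched_embedding(category, encode):
    """Encode every aspect prompt and average the results (assumed ensembling step).

    `encode` is any callable mapping a string to a list of floats; it stands in
    for a real text encoder and is not part of the described method.
    """
    prompts = build_underwater_prompts(category)
    return average_embeddings([encode(text) for text in prompts.values()])
```

For example, `build_underwater_prompts("plastic bag")` yields five prompts, one per aspect, each placing the category name in an underwater context before it reaches the text encoder.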
Performance and Impact
Experiments conducted on the MARIS dataset demonstrate that this new framework consistently outperforms existing open-vocabulary segmentation baselines. This holds true for both “in-domain” evaluations (models trained and tested on MARIS) and “cross-domain” evaluations (models trained on a generic dataset like COCO and tested on MARIS). The framework shows notable gains in accuracy and robustness, particularly for more precise mask predictions.
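The article does not name the evaluation metric. Assuming the usual practice of scoring instance masks by intersection over union (IoU) at several thresholds, the following Python sketch shows why a stricter threshold rewards more precise masks. The function names and the toy masks are illustrative only.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equal-sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 0.0


def matches_at(pred, target, thresholds=(0.5, 0.75)):
    """Report whether a predicted mask counts as a match at each IoU threshold."""
    iou = mask_iou(pred, target)
    return {threshold: iou >= threshold for threshold in thresholds}


# Ground-truth object covering all 8 pixels of a tiny 2x4 image.
target = [[1, 1, 1, 1],
          [1, 1, 1, 1]]

# Covers 5 of the 8 pixels: IoU = 5/8 = 0.625.
loose = [[1, 1, 1, 0],
         [1, 1, 0, 0]]

# Covers 7 of the 8 pixels: IoU = 7/8 = 0.875.
tight = [[1, 1, 1, 1],
         [1, 1, 1, 0]]
```

Here the loose mask counts as a match at a threshold of 0.5 but not at 0.75, while the tight mask passes both, so gains at the stricter threshold reflect sharper object boundaries.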
The research also highlights the efficiency of the proposed method, which achieves higher accuracy with lower computational complexity and significantly fewer trainable parameters than previous approaches. While the model generally performs better in in-domain settings, it also shows effective cross-domain recognition, especially for objects that appear in both natural and underwater scenes (like a “plastic bag”).
In conclusion, the MARIS dataset and the proposed GPEM and SAIM framework establish a strong foundation for future underwater perception research. This work paves the way for more accurate and adaptable AI systems that can better understand and interact with the complex and visually challenging marine world.


