TLDR: This research compares the performance of YOLOv8, YOLOv9, YOLOv10, and YOLOv11 models on two underwater datasets (coral disease and fish species). It finds that while accuracy improvements are minimal after YOLOv9 in marine environments, inference speed significantly improves across versions, with YOLOv10 offering the best speed-accuracy balance for autonomous underwater vehicles. The study also highlights challenges in interpreting these complex models using Grad-CAM.
Autonomous underwater vehicles (AUVs) are becoming increasingly vital for tasks like mapping marine habitats, monitoring ecosystems, and inspecting underwater infrastructure. These vehicles rely heavily on computer vision systems to understand their surroundings. However, the underwater environment presents unique challenges for these systems, including poor lighting, murky water, and often small, densely packed objects like marine organisms. Additionally, AUVs have limited computational power, making efficient computer vision models crucial.
Understanding the Challenges of Underwater Vision
Traditional computer vision models, especially two-stage detectors, can be too slow and computationally demanding for real-time deployment on AUVs. This is where the YOLO (You Only Look Once) family of models comes in. YOLO models are known for their ability to combine object localization and classification into a single, fast network, making them ideal for time-sensitive applications like autonomous navigation.
While YOLO models have shown impressive performance on land-based benchmarks like COCO and PASCAL VOC, their effectiveness in the marine domain has been less explored. The significant differences between terrestrial and underwater imagery mean that performance on one doesn’t necessarily translate to the other. This research paper addresses this gap by providing a controlled comparison of recent YOLO versions in underwater settings.
The Study’s Approach
Researchers curated two publicly available datasets to evaluate YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s. The first, a Coral Disease dataset, contained 4,480 images across 18 classes, while the second, a Fish Species dataset, had 7,500 images with 20 distinct classes. To understand how data availability affects performance, models were trained using 25%, 50%, 75%, and 100% of the training images, while validation and test sets remained consistent.
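As a rough illustration of how such data-fraction splits can be built, the sketch below samples a subset of training image paths into a list file that a dataset configuration could point to. The directory layout, file names, and fractions are placeholders for illustration, not details taken from the paper.

```python
# Minimal sketch: build 25/50/75/100 % training subsets as image-path list files.
import random
from pathlib import Path

def make_subset(train_dir: str, fraction: float, out_file: str, seed: int = 0) -> None:
    """Write a reproducible random fraction of the training images to a list file."""
    images = sorted(Path(train_dir).glob("*.jpg"))          # all training frames
    random.Random(seed).shuffle(images)                      # fixed seed for repeatability
    keep = images[: max(1, int(len(images) * fraction))]     # keep the requested fraction
    Path(out_file).write_text("\n".join(str(p) for p in keep))

for frac in (0.25, 0.50, 0.75, 1.00):
    make_subset("datasets/coral/images/train", frac, f"train_{int(frac * 100)}.txt")
```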
All models were trained with identical settings (100 epochs, 640 px input, batch size 16, on a T4 GPU) and evaluated on standard accuracy metrics (precision, recall, mAP50, and mAP50-95) as well as per-image inference time and frames per second (FPS) to assess speed. The study also used Grad-CAM visualizations to understand which features the models were focusing on during their predictions.
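For readers who want to reproduce a comparable setup, a minimal sketch using the Ultralytics Python API might look like the following. The weight file and dataset YAML names are illustrative assumptions in Ultralytics style, not the paper's exact configuration.

```python
# Hedged sketch of the training setup described above (Ultralytics API assumed).
from ultralytics import YOLO

for weights in ["yolov8s.pt", "yolov9s.pt", "yolov10s.pt", "yolo11s.pt"]:
    model = YOLO(weights)                 # small ("s") variant of each version
    model.train(
        data="coral_disease.yaml",        # placeholder dataset config
        epochs=100,
        imgsz=640,
        batch=16,
        device=0,                         # single GPU, mirroring the T4 setup
    )
    metrics = model.val(split="test")     # precision, recall, mAP50, mAP50-95
```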
Each YOLO version introduces its own architectural innovations:
- YOLOv8 moved to an anchor-free head and decoupled the regression and classification tasks for better small-object detection.
- YOLOv9 introduced a hybrid detection head and a Generalized Efficient Layer Aggregation Network (GELAN) for improved accuracy and efficiency, along with a Dynamic Receptive Field Selection (DRS) block to handle closely packed objects.
- YOLOv10 is a lightweight model optimized for edge devices, using Neural Architecture Search (NAS) and an improved C3 module.
- YOLOv11, the latest version, incorporates CNN-based backbones with attention mechanisms, a dynamic scaling mechanism for choosing width and depth, and lightweight transformers in its neck design for faster context understanding.
Key Findings: Accuracy vs. Speed
The study revealed clear trends. Across both the Coral Disease and Fish Species datasets, the accuracy of the YOLO models, as measured by mAP50 and mAP50-95, tended to saturate after YOLOv9. This suggests that while newer versions introduce architectural innovations, these primarily target efficiency rather than significant accuracy gains in marine environments. In many cases, YOLOv8 and YOLOv9 even achieved comparable or slightly better accuracy than YOLOv10 and YOLOv11 across the different training-data fractions.
However, inference speed showed a marked improvement across successive YOLO versions. YOLOv8 was the slowest, while YOLOv10 consistently demonstrated the best inference speed, often outperforming YOLOv11. This indicates that YOLOv10 offers the most favorable speed-accuracy trade-off, making it particularly suitable for deployment on resource-constrained AUVs.
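As a rough sketch, per-image latency and FPS can be estimated by timing predictions over a set of test frames, as below. The model weights and image folder are placeholders, and the numbers produced this way would depend on the hardware used.

```python
# Rough latency/FPS measurement sketch (Ultralytics API assumed, paths are placeholders).
import time
from pathlib import Path
from ultralytics import YOLO

model = YOLO("yolov10s.pt")
frames = sorted(Path("datasets/fish/images/test").glob("*.jpg"))

# Warm-up pass so one-time initialisation does not skew the timing.
model.predict(frames[0], imgsz=640, verbose=False)

start = time.perf_counter()
for frame in frames:
    model.predict(frame, imgsz=640, verbose=False)
elapsed = time.perf_counter() - start

per_image_ms = 1000 * elapsed / len(frames)
fps = len(frames) / elapsed
print(f"{per_image_ms:.1f} ms/image  ->  {fps:.1f} FPS")
```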
Insights from Visual Attention
To understand how these models make predictions, Grad-CAM visualizations were employed. These heatmaps highlight the regions of an image that a model considers most important for its classification. The analysis showed that despite advancements, YOLO models could still rely on irrelevant or ‘spurious’ features, sometimes focusing on background elements rather than the object itself. This was particularly evident in the Coral Disease dataset, where performance did not always improve with more training data and results were inconsistent.
The researchers also noted the inherent limitations of applying Grad-CAM to complex, regression-based object detectors like YOLO. Grad-CAM assumes a single class prediction, whereas YOLO generates outputs for every grid cell, often leading to heatmaps that highlight background noise. This discrepancy suggests that current explainability techniques may not fully capture the intricate workings of multi-task models like YOLO.
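To make the mechanism concrete, the sketch below computes a Grad-CAM heatmap for a standard image classifier using forward and backward hooks. It uses a torchvision ResNet-18 as a stand-in rather than the paper's YOLO models, precisely because YOLO's per-cell regression and classification outputs offer no single obvious class score to backpropagate, which is the limitation discussed above.

```python
# Minimal Grad-CAM sketch on a standard classifier (illustrative stand-in, not the paper's setup).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
target_layer = model.layer4[-1]                          # last convolutional block

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                          # stand-in for a preprocessed frame
scores = model(x)
scores[0, scores.argmax()].backward()                    # gradient of the top class score

weights = grads["a"].mean(dim=(2, 3), keepdim=True)      # channel-wise importance weights
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8) # normalised heatmap in [0, 1]
```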
In conclusion, while newer YOLO versions offer significant improvements in inference speed, their accuracy gains in underwater object detection are minimal beyond YOLOv9. YOLOv10 stands out for its optimal balance of speed and accuracy, making it a strong candidate for AUV deployment. The study also underscores the need for better explainability metrics tailored for complex, multi-task computer vision models. For a deeper dive into the methodology and detailed results, you can access the full research paper here.


