TLDR: AquaVLM is a novel underwater communication system that uses mobile Vision-Language Models (VLMs) on smartphones to generate and transmit context-aware messages. It addresses limitations of traditional systems by analyzing images and sensor data to create relevant messages, and employs error-resilient fine-tuning for reliable transmission. Evaluated through VR simulations and real-world tests, AquaVLM significantly improves diver situational awareness and communication effectiveness, demonstrating the potential of on-device AI in extreme environments.
Exploring the underwater world, whether for recreation or scientific research, is an incredible experience. However, maintaining safety and effective communication among divers has always been a significant challenge. Traditional underwater communication systems are often cumbersome, expensive, or rely on predefined messages that lack the crucial context of the surrounding environment.
Imagine being able to effortlessly share detailed observations and critical status updates with your dive buddy, just by tapping your smartphone. This is precisely the vision behind AquaVLM, a groundbreaking new system designed to enhance underwater situational awareness using the power of mobile Vision-Language Models (VLMs).
AquaVLM transforms ubiquitous smartphones into smart underwater communication devices. It allows divers to ‘tap-and-send’ messages that are automatically generated and highly context-aware. Instead of relying on a limited set of pre-programmed phrases, AquaVLM analyzes multimodal data – including images captured by the phone’s camera and sensor readings from the phone or a diving watch – to understand the current diving situation. Based on this understanding, it generates suitable message options for the diver to choose from.
The system works in two main stages. First, an existing mobile VLM is specially ‘instruct-tuned’ for underwater scenarios. This involves training it on a custom dataset of underwater conversations, which helps it understand context, generate relevant messages, and even recover corrupted messages. This fine-tuning process incorporates different communication purposes, such as safety alerts or environmental descriptions, allowing the VLM to produce fewer, yet highly relevant, message options, thus reducing computational load on the mobile device.
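To make the idea of purpose-conditioned training samples concrete, here is a minimal sketch of how one such prompt might be assembled from multimodal context. The purpose names, field names, and prompt format below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

# Hypothetical purpose labels; the paper mentions safety alerts and
# environmental descriptions among the communication purposes.
PURPOSES = ["safety_alert", "environment_description", "status_update"]

@dataclass
class DiveContext:
    depth_m: float        # e.g. from the phone or a dive watch
    air_bar: int          # remaining tank pressure
    image_caption: str    # stand-in for the camera frame fed to the VLM

def build_instruction(ctx: DiveContext, purpose: str) -> str:
    """Assemble one instruct-tuning prompt from dive context and a purpose."""
    assert purpose in PURPOSES, f"unknown purpose: {purpose}"
    return (
        f"Purpose: {purpose}\n"
        f"Depth: {ctx.depth_m} m, Air: {ctx.air_bar} bar\n"
        f"Scene: {ctx.image_caption}\n"
        "Generate three short messages the diver could send."
    )

prompt = build_instruction(
    DiveContext(depth_m=18.0, air_bar=90, image_caption="reef shark approaching"),
    "safety_alert",
)
```

Conditioning on an explicit purpose is what lets the model emit a handful of targeted options instead of a long open-ended list, which keeps on-device inference cheap.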
Second, AquaVLM features ‘error-resilient fine-tuning’. Underwater acoustic transmission is notoriously prone to errors. To combat this, the mobile VLM is further trained on datasets containing randomly corrupted messages. This unique approach allows the VLM to interpret and recover messages even when they contain a certain degree of character corruption, much like how humans can understand text with typos.
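The corruption-augmentation step can be sketched as follows: randomly substitute characters in clean messages to mimic acoustic channel errors, then train on (corrupted, clean) pairs. The error rate and alphabet here are illustrative assumptions, not values from the paper:

```python
import random

def corrupt_message(text: str, error_rate: float = 0.1, seed: int = 0) -> str:
    """Randomly substitute characters to simulate acoustic transmission errors."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    chars = list(text)
    for i in range(len(chars)):
        if rng.random() < error_rate:
            chars[i] = rng.choice(alphabet)
    return "".join(chars)

# Build (corrupted, clean) training pairs for error-resilient fine-tuning.
clean = "low on air, ascending now"
pairs = [(corrupt_message(clean, error_rate=0.1, seed=s), clean) for s in range(4)]
```

Training on pairs like these teaches the model to map a garbled received string back to its most plausible clean form, the "typo tolerance" the article describes.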
To evaluate AquaVLM’s effectiveness, the researchers developed both a virtual reality (VR) simulator and a fully functional prototype on the iOS platform. The VR simulator allowed users to experience AquaVLM in a realistic underwater environment, encountering various events like shark encounters or equipment malfunctions, and communicating with virtual divers. This subjective evaluation showed an impressive 80% ‘purpose-align rate’, meaning the generated messages largely matched the users’ intended communication goals.
Real-world experiments in a lake tested the system’s reliability at distances up to 20 meters. The results were promising: AquaVLM maintained an average of 90% semantic similarity between sent and received messages at ranges up to 15 meters, meaning the messages’ meaning was largely preserved despite the challenges of underwater acoustic transmission. The system also showed low Bit Error Rates (BER) and latency acceptable for a messaging application.
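For readers unfamiliar with the BER metric, it is simply the fraction of transmitted bits that arrive flipped. A minimal sketch of computing it over two equal-length payloads (the example strings are made up, not from the paper's experiments):

```python
def bit_error_rate(sent: bytes, received: bytes) -> float:
    """Fraction of bits that differ between two equal-length payloads."""
    assert len(sent) == len(received), "payloads must be the same length"
    # XOR each byte pair and count the set bits (flipped bits).
    errors = sum(bin(a ^ b).count("1") for a, b in zip(sent, received))
    return errors / (8 * len(sent))

ber = bit_error_rate(b"low air", b"low oir")  # one corrupted character
```

Semantic similarity is a softer measure: it compares the meaning of the recovered text against the original (typically via text embeddings), so a message can score high even when some characters are corrupted, which is exactly the regime AquaVLM's error-resilient fine-tuning targets.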
Compared to existing methods, AquaVLM stands out by offering context-rich, informative messaging through readily available smartphones, without the need for bulky or expensive specialized equipment. It represents a significant leap forward from traditional hand signals, basic diving computer messages, or costly underwater talking devices.
The development of AquaVLM showcases the immense potential of deploying large language models on mobile devices, not just for everyday tasks, but for critical applications in challenging environments. While the current system has an effective transmission distance of around 20 meters and some latency due to VLM inference and transmission, future improvements could include smaller, more efficient models and lightweight underwater modems for greater range and speed.
AquaVLM is more than just a communication tool; it’s a step towards a future where divers can have a richer, safer, and more informed experience exploring the underwater world, all powered by the device in their pocket. For more technical details, you can refer to the full research paper here.


