TLDR: A new end-to-end cochlear implant (CI) system, AVSE-ECS, integrates audio-visual speech enhancement with a deep-learning-based sound coding strategy. By combining visual cues such as lip movements with joint training of the enhancement and coding stages, the system significantly improves objective speech intelligibility for CI users in noisy environments, outperforming previous audio-only methods.
Cochlear implants (CIs) are remarkable devices that allow individuals with severe-to-profound hearing loss to perceive sound. They work by converting speech into electrical signals that stimulate the auditory nerve. While modern CIs have made significant strides, understanding speech in noisy or reverberant environments remains a major hurdle for users.
Recent advances in deep learning offer promising avenues to enhance CI capabilities. Beyond simply replicating traditional signal processing with neural networks, deep learning makes it possible to integrate visual cues as an additional input for multimodal speech processing. This paper introduces AVSE-ECS, a novel CI system designed to suppress noise.
Introducing AVSE-ECS: A New End-to-End System
The AVSE-ECS system utilizes an audio-visual speech enhancement (AVSE) model as a pre-processing step for ElectrodeNet-CS (ECS), a deep-learning-based sound coding strategy. Essentially, it’s an end-to-end CI system where both the enhancement and coding stages are trained together. The core idea is to leverage visual information, such as lip movements, to help the system better understand and process speech, especially when background noise is present.
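To make the data flow concrete, here is a minimal PyTorch sketch of the two-stage pipeline. The module names, interfaces, and tensor shapes are illustrative assumptions for this post, not the authors' implementation:

```python
import torch.nn as nn

class AVSE_ECS(nn.Module):
    """Sketch of the two-stage pipeline: an audio-visual enhancement
    front end feeding a deep sound coding back end. The submodules
    are placeholders, not the paper's exact architecture."""

    def __init__(self, avse: nn.Module, ecs: nn.Module):
        super().__init__()
        self.avse = avse  # audio-visual speech enhancement front end
        self.ecs = ecs    # ElectrodeNet-CS style sound coder

    def forward(self, noisy_audio, lip_frames):
        # Stage 1: suppress noise using both audio and visual inputs.
        enhanced = self.avse(noisy_audio, lip_frames)
        # Stage 2: map enhanced speech to electrode stimulation patterns.
        electrodogram = self.ecs(enhanced)
        return enhanced, electrodogram
```

Because both stages live in one differentiable graph, gradients can flow from the electrode output back into the enhancement front end, which is what enables the joint training described below.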
The AVSE component takes both audio and visual inputs. The visual encoder, a Temporal Convolutional Network (TCN), operates on the mouth region of interest (ROI) to extract relevant visual features. These visual features are then fused with the audio representation through a cross-attention mechanism, allowing the system to dynamically weight the most informative parts of each modality.
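A cross-attention fusion step can be sketched as follows; the feature dimension, head count, and residual design here are assumptions, not details taken from the paper:

```python
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-attention fusion: audio frames act as
    queries over visual (lip) features, so each audio time step can
    attend to the visual evidence most relevant to it."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, dim); visual_feats: (batch, T_video, dim)
        fused, _ = self.attn(query=audio_feats,
                             key=visual_feats,
                             value=visual_feats)
        # A residual connection keeps the audio stream intact when the
        # visual cue is uninformative (e.g., an occluded mouth).
        return self.norm(audio_feats + fused)
```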
The enhanced speech from the AVSE module is then fed into the ECS model. ECS is a deep neural network that mimics the essential functions of traditional CI coding strategies, like envelope detection and channel selection, but in a way that can be integrated and optimized within a deep learning framework.
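For intuition, the classical operations that ECS learns to mimic look roughly like the toy n-of-m channel selection below, in the spirit of strategies such as ACE. This is the conventional recipe, not the ECS network itself, and the tensor layout is an assumption:

```python
import torch

def envelope_and_select(band_signals: torch.Tensor, n_select: int = 8):
    """Toy n-of-m channel selection: take a crude per-channel envelope
    (magnitude), then keep only the n strongest channels per frame.
    band_signals: (batch, n_channels, n_frames) band-pass outputs."""
    envelopes = band_signals.abs()
    # Indices of the n strongest channels in each frame.
    topk = envelopes.topk(n_select, dim=1).indices
    mask = torch.zeros_like(envelopes)
    mask.scatter_(1, topk, 1.0)
    # Zero out unselected channels; the result is an electrodogram-like map.
    return envelopes * mask
```

ElectrodeNet-CS folds an equivalent, differentiable version of this envelope-and-select step into a neural network, which is what allows it to be optimized jointly with the enhancement stage.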
Joint Training for Enhanced Performance
A key innovation of this research is the joint training approach. Instead of training the AVSE and ECS models separately, the entire AVSE-ECS network is optimized simultaneously using two loss functions: a spectrogram loss, which pushes the enhanced speech toward the clean reference, and an electrodogram loss, which refines the output electrode patterns to be more distinct and recognizable for the CI. Trained end-to-end, the AVSE module learns to produce enhanced speech that is specifically optimized for the CI's sound coding strategy, leading to better speech intelligibility.
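A joint objective of this shape could be written as follows; the MSE distance and the weighting scheme are assumptions for illustration, not the paper's exact loss definitions:

```python
import torch.nn.functional as F

def joint_loss(enhanced_spec, clean_spec,
               pred_electrodogram, target_electrodogram,
               alpha: float = 0.5):
    """Sketch of a joint objective: a spectrogram term pulls the
    enhanced speech toward the clean reference, and an electrodogram
    term pulls the predicted stimulation pattern toward the one the
    coder would produce from clean speech."""
    spec_loss = F.mse_loss(enhanced_spec, clean_spec)
    elec_loss = F.mse_loss(pred_electrodogram, target_electrodogram)
    # alpha trades off waveform fidelity against electrode-pattern fidelity.
    return alpha * spec_loss + (1.0 - alpha) * elec_loss
```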
Promising Results in Noisy Conditions
Experimental results demonstrate that the proposed AVSE-ECS method significantly outperforms previous ECS strategies, particularly in noisy conditions. When compared to audio-only speech enhancement systems and traditional CI coding strategies like ACE, AVSE-ECS showed improved objective speech intelligibility scores. The addition of visual cues proved crucial, further enhancing the system’s ability to process speech in challenging environments. The joint training method, in particular, achieved the highest scores, validating its effectiveness in refining the electrode stimulation patterns.
Future Directions and Impact
While the objective evaluations are promising, the researchers plan to conduct subjective listening tests with both normal-hearing individuals using CI simulations and actual CI users to assess the perceptual benefits. Further studies will also explore the system’s generalization across different languages and datasets, as well as investigate more lightweight models for potential real-time implementation on CI hardware or external edge devices like smartphones or smart glasses. This study’s findings highlight the feasibility and potential of integrating deep learning and multimodal processing for advanced CI sound coding strategies. You can read the full research paper here: End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments.


