Improving Suicide Risk Assessment in Adolescents with Dynamic Multimodal Speech Analysis

TLDR: A new research paper introduces a lightweight, multi-branch multimodal network for detecting suicide risk in adolescents. The system integrates time-domain acoustic, time-frequency domain acoustic, and textual features, using a dynamic fusion mechanism to adaptively combine them. By simplifying existing models like Wav2vec 2.0 and BERT, the researchers achieved a 78% reduction in model parameters and a 5% improvement in accuracy compared to the challenge baseline. This approach offers a more efficient and accurate method for speech-based mental health assessment, crucial for early intervention in adolescent suicide prevention.

Suicide remains a leading cause of death among adolescents, making timely identification and intervention crucial. Historically, methods for detecting suicidal tendencies have relied heavily on clinical observations, assessments, or self-reported expressions, which are often time-consuming, labor-intensive, and dependent on extensive medical experience. Machine learning has improved efficiency, but it has primarily focused on structured data such as medical records and has struggled with unstructured information such as chat logs, social media posts, or voice interactions.

In recent years, deep learning has shown remarkable capabilities in processing unstructured data, including text, speech, and behavioral cues. However, much of this research has concentrated on textual data, leaving speech-based analysis relatively unexplored. Speech offers unique advantages for suicide risk monitoring, being cost-effective and enabling continuous, non-invasive assessments. Studies have shown that individuals with suicidal tendencies often exhibit distinct speech patterns, such as reduced efficiency, flattened prosody, monotonic delivery, and a general lack of vocal energy. Spectral characteristics, like variations in energy distribution, pitch, and harmonic content, can also reflect subtle psychomotor and emotional cues associated with suicidal ideation.

To address these challenges, a new research paper, “Dynamic Fusion Multimodal Network for SpeechWellness Detection”, introduces an innovative approach. This study, conducted in the context of the 1st SpeechWellness detection challenge, proposes a lightweight, multi-branch multimodal system designed to detect suicide risk in adolescents. The system integrates information from three distinct modalities: time-domain acoustic features, time-frequency (TF) domain acoustic features, and semantic (textual) representations.

A Comprehensive Multimodal Approach

The proposed network is built upon three main branches, each dedicated to processing a specific type of information:

  • Acoustic Branch in Time Domain: This branch utilizes a lightweight version of Wav2vec 2.0, a powerful pre-trained model that learns high-level acoustic features directly from raw audio waveforms. To enhance computational efficiency, the researchers significantly reduced the model’s size by retaining only the first four layers of its original 24-layer Transformer encoder, achieving an approximate 80% parameter reduction.
  • Acoustic Branch in Time-Frequency (TF) Domain: Recognizing that frequency domain acoustic features are strongly linked to mental health conditions, this branch incorporates a Convolutional Recurrent Neural Network (CRNN). This CRNN extracts rich representations from Mel-spectrograms, which are better aligned with human auditory perception. Mel-spectrograms effectively capture variations in energy distribution, pitch, and prosodic contours that are indicative of suicidal ideation.
  • Semantic Branch: Textual content provides strong cues for assessing suicide risk. This branch first translates speech into text using a state-of-the-art automatic speech recognition model called Paraformer. Subsequently, a lightweight version of BERT (Bidirectional Encoder Representations from Transformers), pre-trained on Chinese corpora, is used to extract deep contextual dependencies from the text. Similar to the Wav2vec 2.0 modification, the BERT model was simplified to reduce its parameter count by about 76%.
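As a rough sanity check on the approximately 80% figure quoted for the time-domain branch, the sketch below estimates how many parameters disappear when only the first four of 24 Transformer layers are kept. The article states only the layer counts; the hidden size (1024), feed-forward size (4096), and total of about 317 million parameters are assumptions taken from the commonly cited Wav2vec 2.0 Large configuration, and the function names are hypothetical.

```python
def transformer_layer_params(d_model: int, d_ffn: int) -> int:
    """Approximate parameter count of one standard Transformer encoder layer."""
    # Query, key, value and output projections, each with weights and biases.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers of the position-wise feed-forward block.
    feed_forward = (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


def reduction_after_truncation(total_params: int, d_model: int, d_ffn: int,
                               n_layers: int, n_kept: int) -> float:
    """Fraction of all parameters removed when only the first n_kept layers remain."""
    removed = (n_layers - n_kept) * transformer_layer_params(d_model, d_ffn)
    return removed / total_params


# ASSUMPTION: Wav2vec 2.0 Large-style sizes; the article does not give these numbers.
ASSUMED_TOTAL_PARAMS = 317_000_000
fraction_removed = reduction_after_truncation(
    ASSUMED_TOTAL_PARAMS, d_model=1024, d_ffn=4096, n_layers=24, n_kept=4
)
print(f"estimated parameter reduction: {fraction_removed:.1%}")
```

Under these assumptions the estimate lands close to the reported reduction, which suggests that most of the savings come from dropping encoder layers rather than from changes to the convolutional feature extractor.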

Dynamic Fusion for Enhanced Accuracy

A key innovation of this system is the Dynamic Fusion Block. Instead of simply combining the feature vectors from the three branches, this block adaptively integrates the multimodal information. It assigns a learnable scalar weight to each modality (time-domain acoustic, TF-domain acoustic, and semantic). These weights are optimized during training, allowing the model to dynamically adjust the relative importance of each modality based on its contribution to the final prediction. This adaptive approach enhances robustness, especially when certain modalities might be more or less informative in different contexts.
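The idea of one learnable scalar per modality can be illustrated with a minimal forward-pass sketch. It assumes the three branch embeddings have the same dimension, that the scalars are normalized with a softmax, and that fusion is a weighted sum; the article does not specify these details, so the paper's actual block may differ (for example, by concatenating scaled features). The function names and example values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_fusion(embeddings, modality_scores):
    """Fuse same-sized modality embeddings using one scalar weight per modality.

    embeddings:      dict mapping modality name -> list of floats
    modality_scores: dict mapping modality name -> scalar (the learnable parameter)
    Returns the fused vector and the normalized weights.
    """
    names = list(embeddings)
    weights = softmax([modality_scores[name] for name in names])
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        vector = embeddings[name]
        if len(vector) != dim:
            raise ValueError(f"embedding for {name!r} has a different size")
        for i, value in enumerate(vector):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


# Toy embeddings standing in for the three branch outputs (ASSUMED values).
branch_outputs = {
    "time_domain": [1.0, 0.0],
    "tf_domain": [0.0, 1.0],
    "semantic": [1.0, 1.0],
}
# Equal scores mean equal weights; training would move these scalars apart.
initial_scores = {"time_domain": 0.0, "tf_domain": 0.0, "semantic": 0.0}
fused_vector, fusion_weights = dynamic_fusion(branch_outputs, initial_scores)
print(fused_vector, fusion_weights)
```

In a real model the scalars would be trainable parameters updated by backpropagation together with the rest of the network, so a modality that helps the prediction more ends up with a larger weight.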

Experimental Validation and Impact

The system was evaluated using a dataset from the 1st SpeechWellness challenge, which includes speech recordings from 600 Chinese teenagers aged 10 to 18, with half identified as at risk of suicide. The experiments demonstrated several key findings:

  • Mel-spectrograms proved slightly more effective than MFCCs (Mel-frequency cepstral coefficients) as TF-domain features.
  • Multimodal systems consistently outperformed monomodal systems, highlighting the benefits of combining different types of information.
  • The proposed model outperformed the official challenge baseline, delivering a 5% improvement in accuracy while reducing total model parameters by 78%.
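To make the comparison between the two TF-domain features more concrete, the sketch below shows how they are conventionally related: MFCCs are obtained by applying a discrete cosine transform (DCT-II) to the log Mel energies of a frame and keeping only the first few coefficients. This is the textbook relationship, not the paper's feature-extraction code; the function names, the unnormalized DCT, and the toy frame values are assumptions for illustration.

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II of a sequence, returning the first n_coeffs coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]


def mfcc_from_log_mel(log_mel_frame, n_coeffs=13):
    """Derive MFCCs for one frame from its log Mel filterbank energies."""
    # Never request more coefficients than there are Mel bands.
    n_coeffs = min(n_coeffs, len(log_mel_frame))
    return dct_ii(log_mel_frame, n_coeffs)


# One toy frame of 8 log Mel band energies (ASSUMED values, not real speech).
log_mel_frame = [2.0, 2.5, 3.0, 2.8, 2.1, 1.5, 1.2, 1.0]
mfcc_frame = mfcc_from_log_mel(log_mel_frame, n_coeffs=4)
print(len(log_mel_frame), "Mel bands ->", len(mfcc_frame), "MFCCs")
```

Because truncating the DCT keeps only the smooth spectral envelope, a Mel-spectrogram retains finer spectral detail than the MFCCs derived from it, at the cost of a larger input.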

These results underscore the value of incorporating richer acoustic representations and employing efficient fusion strategies in speech-based mental health assessment. By creating a lightweight yet highly effective system, this research paves the way for more practical and scalable deployment of AI tools on resource-constrained devices, ultimately aiding in the timely detection and prevention of adolescent suicide.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
