
Advancing Depression Detection: A Multimodal Comparison of Machine Learning and Language Models

TLDR: This research paper evaluates XGBoost, transformer-based models, and large language models (LLMs) for multimodal depression detection using audio, video, and text features from the MPDD Challenge dataset. The study found that transformer models generally outperformed other approaches, especially for younger populations and complex classification tasks, while XGBoost showed strong performance in binary classification. LLMs, despite their size, underperformed, suggesting that model complexity doesn’t always translate to better results in this specific application.

Depression remains a significant global health challenge, with a large share of cases going undiagnosed. Traditional detection methods, which often rely on self-reported questionnaires, struggle to capture the complex and dynamic nature of the condition and are prone to bias. In response, the computing community has been actively developing automatic depression detection systems that combine multimodal data from sources such as audio, video, and text.

A recent study, presented at the 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), delves into this critical area. Titled “Exploring Machine Learning and Language Models for Multimodal Depression Detection,” the research evaluates and compares different computational approaches for identifying depression. The authors, Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng from the Singapore Institute of Technology, and Xiaoxiao Miao from Duke Kunshan University, investigated the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on a novel dataset.

The study was inspired by the first Multimodal Personality-Aware Depression Detection (MPDD) Challenge, which introduced a rich dataset featuring audio and visual recordings of participants in real-world scenarios. This dataset is uniquely annotated with depression severity using the PHQ-9 scale, Big Five personality traits, and detailed demographic information, offering a more diverse and in-depth resource for modeling compared to previous datasets.

Understanding the MPDD Dataset

The MPDD dataset is divided into two tracks: MPDD-Elderly and MPDD-Young, designed for age-specific analysis. The MPDD-Elderly track includes data from older participants (average age 62.8 years) collected during semi-structured hospital interviews, with depression severity assessed using PHQ-9 and HAMD-24 scales. It also includes additional annotations like Big Five personality traits, physical health conditions, financial stress, and family structure. The MPDD-Young track focuses on a younger population (average age 20.0 years) from non-clinical environments, with data collected through self-introductions, questionnaires, and scripted reading tasks. This track includes Big Five traits, age, gender, and place of origin.

The researchers utilized a variety of features from these modalities. Audio features included Mel-frequency cepstral coefficients (MFCCs), OpenSMILE acoustic descriptors, and deep learning-based representations from Wav2Vec 2.0. Visual features comprised deep CNN-based facial embeddings from DenseNet and ResNet, alongside facial behavior analysis from OpenFace. For text, RoBERTa-based embeddings were derived from raw personality trait descriptions.
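As a rough illustration of what such feature extraction can look like in practice, the Python sketch below pulls MFCCs with librosa and pooled Wav2Vec 2.0 and RoBERTa embeddings with Hugging Face transformers. The checkpoints, file name, example text, and mean-pooling strategy are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of the kinds of features described above.
# Checkpoints and mean-pooling are illustrative assumptions,
# not necessarily the configuration used in the paper.
import librosa
import torch
from transformers import AutoModel, AutoTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Model

# --- Audio: MFCCs via librosa ---
waveform, sr = librosa.load("participant_clip.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)      # shape: (13, n_frames)

# --- Audio: Wav2Vec 2.0 deep representations ---
w2v_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
inputs = w2v_extractor(waveform, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    audio_emb = w2v_model(**inputs).last_hidden_state.mean(dim=1)  # (1, 768)

# --- Text: RoBERTa embedding of a personality trait description ---
tok = AutoTokenizer.from_pretrained("roberta-base")
roberta = AutoModel.from_pretrained("roberta-base")
enc = tok("High openness, moderate extraversion, low neuroticism.", return_tensors="pt")
with torch.no_grad():
    text_emb = roberta(**enc).last_hidden_state.mean(dim=1)        # (1, 768)
```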

Investigated Approaches

The paper systematically evaluated three distinct model classes:

  • XGBoost-Based Model: This traditional machine learning approach uses gradient-boosted decision trees. The researchers applied Principal Component Analysis (PCA) to reduce the dimensionality of the audio and visual features, then concatenated them as input, and used class weighting to address imbalances in the dataset (a minimal sketch of this pipeline appears after the list).
  • Transformer-Based Model: A deep learning model that fuses audio, visual, and text features. It projects inputs into a shared latent space, passes them through modality-specific transformer encoders, and applies attention pooling to produce fixed-length representations (see the attention-pooling sketch after this list). Mixup data augmentation and cross-validation were employed to prevent overfitting on the relatively small dataset.
  • LLM-Based Model: Inspired by Emotion-LLaMA, this approach adapts a LLaMA backbone. It integrates audio, visual, and textual cues into a shared embedding space through linear projections, and the model is fine-tuned with a task-specific prompt in a multiple-choice question format.
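The sketch below illustrates the XGBoost pipeline as described: per-modality PCA, feature concatenation, and class weighting. The feature dimensions, component counts, and hyperparameters are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of the XGBoost pipeline: PCA on each modality,
# early fusion by concatenation, and class weighting. Shapes and
# hyperparameters are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 768))   # e.g. pooled Wav2Vec 2.0 features
X_video = rng.normal(size=(200, 1024))  # e.g. pooled DenseNet embeddings
y = rng.integers(0, 2, size=200)        # binary depression labels

# Reduce each modality separately, then concatenate into one input vector.
X = np.hstack([
    PCA(n_components=50).fit_transform(X_audio),
    PCA(n_components=50).fit_transform(X_video),
])

# Per-sample weights counteract class imbalance in the training data.
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
clf.fit(X, y, sample_weight=weights)
```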
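Attention pooling, mentioned in the Transformer bullet, can be sketched generically as a learned weighted average over frame embeddings; this is a common formulation, not necessarily the paper's exact module.

```python
# Generic attention pooling: collapse a variable-length sequence of
# frame embeddings into one fixed-length vector via learned weights.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned relevance score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) from a modality-specific encoder
        weights = torch.softmax(self.score(x), dim=1)  # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)                # (batch, dim)

pooled = AttentionPooling(256)(torch.randn(8, 50, 256))  # -> (8, 256)
```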

Key Findings

Experiments were conducted on both the MPDD-Elderly and MPDD-Young datasets across binary, ternary, and quinary classification tasks, using 1-second and 5-second time windows. An ablation study helped optimize each system’s configuration, confirming the benefits of PCA and class weighting for XGBoost, and of Mixup with cross-validation for the Transformer model.
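Mixup, one of the regularizers credited above, blends pairs of training examples and their labels into convex combinations, which discourages overfitting on small datasets. The sketch below shows a common formulation; the alpha value and tensor shapes are chosen for illustration, not taken from the paper.

```python
# Minimal sketch of Mixup regularization: each example is blended with
# a randomly chosen partner in the batch, and labels are blended too.
import numpy as np
import torch

def mixup(x: torch.Tensor, y_onehot: torch.Tensor, alpha: float = 0.2):
    """Return convex combinations of input pairs and their soft labels."""
    lam = np.random.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = torch.randperm(x.size(0))            # random partner assignment
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed

x = torch.randn(16, 256)                                          # fused features
y = torch.nn.functional.one_hot(torch.randint(0, 3, (16,)), 3).float()
x_aug, y_aug = mixup(x, y)                                        # train on soft labels
```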

The results highlighted the strengths and limitations of each model:

  • On the MPDD-Elderly dataset, XGBoost performed exceptionally well for the 5-second binary classification task, indicating its effectiveness with longer audio segments and well-engineered features. The Transformer model, however, showed superior performance on shorter 1-second segments and more complex classification tasks (ternary and quinary), demonstrating its ability to capture fine-grained information.
  • For the MPDD-Young dataset, the Transformer model consistently delivered the best results across all classification tasks and time windows, achieving high weighted F1 scores. XGBoost also performed well but generally lagged behind the Transformer, especially in ternary tasks.
  • Interestingly, despite having far more parameters (roughly 6.8 billion, versus 1.06 million for the Transformer and about 2,000 for XGBoost), the LLM-based approach generally underperformed across most tasks. This suggests that larger models do not always guarantee better performance, especially on specific, nuanced tasks like multimodal depression detection with specialized datasets.

Overall, the Transformer model emerged as the most effective approach for multimodal depression detection in this study, particularly for younger speakers and shorter audio windows. XGBoost proved a strong contender for binary classification, showcasing the power of simpler, well-tuned models on specific tasks. The research offers valuable insights into effective multimodal representation strategies for mental health prediction, paving the way for more accurate real-world depression recognition systems. For more details, see the full research paper.
