
Advancing Depression Detection: A Multimodal Comparison of Machine Learning and Language Models

TLDR: This research paper evaluates XGBoost, transformer-based models, and large language models (LLMs) for multimodal depression detection using audio, video, and text features from the MPDD Challenge dataset. The study found that transformer models generally outperformed other approaches, especially for younger populations and complex classification tasks, while XGBoost showed strong performance in binary classification. LLMs, despite their size, underperformed, suggesting that model complexity doesn’t always translate to better results in this specific application.

Depression remains a significant global health challenge, with a large share of cases going undiagnosed. Traditional detection methods, which often rely on self-reported questionnaires, struggle to capture the complex and dynamic nature of the condition and are prone to bias. In response, the computing community has been actively developing automatic depression detection systems that combine multimodal data from sources such as audio, video, and text.

A recent study, presented at the 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), delves into this critical area. Titled “Exploring Machine Learning and Language Models for Multimodal Depression Detection,” the research evaluates and compares different computational approaches for identifying depression. The authors, Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng from the Singapore Institute of Technology, and Xiaoxiao Miao from Duke Kunshan University, investigated the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on a novel dataset.

The study was inspired by the first Multimodal Personality-Aware Depression Detection (MPDD) Challenge, which introduced a rich dataset featuring audio and visual recordings of participants in real-world scenarios. This dataset is uniquely annotated with depression severity using the PHQ-9 scale, Big Five personality traits, and detailed demographic information, offering a more diverse and in-depth resource for modeling compared to previous datasets.

Understanding the MPDD Dataset

The MPDD dataset is divided into two tracks: MPDD-Elderly and MPDD-Young, designed for age-specific analysis. The MPDD-Elderly track includes data from older participants (average age 62.8 years) collected during semi-structured hospital interviews, with depression severity assessed using PHQ-9 and HAMD-24 scales. It also includes additional annotations like Big Five personality traits, physical health conditions, financial stress, and family structure. The MPDD-Young track focuses on a younger population (average age 20.0 years) from non-clinical environments, with data collected through self-introductions, questionnaires, and scripted reading tasks. This track includes Big Five traits, age, gender, and place of origin.

The researchers utilized a variety of features from these modalities. Audio features included Mel-frequency cepstral coefficients (MFCCs), OpenSMILE acoustic descriptors, and deep learning-based representations from Wav2Vec 2.0. Visual features comprised deep CNN-based facial embeddings from DenseNet and ResNet, alongside facial behavior analysis from OpenFace. For text, RoBERTa-based embeddings were derived from raw personality trait descriptions.
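As a rough illustration of what such feature extraction can look like in practice, the Python sketch below pulls MFCCs with librosa and pooled Wav2Vec 2.0 and RoBERTa embeddings with Hugging Face transformers. The checkpoints, file name, example text, and mean-pooling strategy are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of the kinds of features described above.
# Checkpoints and mean-pooling are illustrative assumptions,
# not necessarily the configuration used in the paper.
import librosa
import torch
from transformers import AutoModel, AutoTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Model

# --- Audio: MFCCs via librosa ---
waveform, sr = librosa.load("participant_clip.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)      # shape: (13, n_frames)

# --- Audio: Wav2Vec 2.0 deep representations ---
w2v_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
inputs = w2v_extractor(waveform, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    audio_emb = w2v_model(**inputs).last_hidden_state.mean(dim=1)  # (1, 768)

# --- Text: RoBERTa embedding of a personality trait description ---
tok = AutoTokenizer.from_pretrained("roberta-base")
roberta = AutoModel.from_pretrained("roberta-base")
enc = tok("High openness, moderate extraversion, low neuroticism.", return_tensors="pt")
with torch.no_grad():
    text_emb = roberta(**enc).last_hidden_state.mean(dim=1)        # (1, 768)
```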

Investigated Approaches

The paper systematically evaluated three distinct model classes:

  • XGBoost-Based Model: This traditional machine learning approach uses gradient-boosted decision trees. The researchers applied Principal Component Analysis (PCA) to reduce the dimensionality of the audio and visual features, then concatenated them as input, and used class weighting to address imbalances in the dataset (a minimal sketch of this pipeline appears after the list).
  • Transformer-Based Model: A deep learning model that fuses audio, visual, and text features. It projects inputs into a shared latent space, passes them through modality-specific transformer encoders, and applies attention pooling to produce fixed-length representations (see the attention-pooling sketch after this list). Mixup data augmentation and cross-validation were employed to prevent overfitting on the relatively small dataset.
  • LLM-Based Model: Inspired by Emotion-LLaMA, this approach adapts a LLaMA backbone. It integrates audio, visual, and textual cues into a shared embedding space through linear projections, and the model is fine-tuned with a task-specific prompt in a multiple-choice question format.
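The sketch below illustrates the XGBoost pipeline as described: per-modality PCA, feature concatenation, and class weighting. The feature dimensions, component counts, and hyperparameters are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of the XGBoost pipeline: PCA on each modality,
# early fusion by concatenation, and class weighting. Shapes and
# hyperparameters are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 768))   # e.g. pooled Wav2Vec 2.0 features
X_video = rng.normal(size=(200, 1024))  # e.g. pooled DenseNet embeddings
y = rng.integers(0, 2, size=200)        # binary depression labels

# Reduce each modality separately, then concatenate into one input vector.
X = np.hstack([
    PCA(n_components=50).fit_transform(X_audio),
    PCA(n_components=50).fit_transform(X_video),
])

# Per-sample weights counteract class imbalance in the training data.
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
clf.fit(X, y, sample_weight=weights)
```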
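Attention pooling, mentioned in the Transformer bullet, can be sketched generically as a learned weighted average over frame embeddings; this is a common formulation, not necessarily the paper's exact module.

```python
# Generic attention pooling: collapse a variable-length sequence of
# frame embeddings into one fixed-length vector via learned weights.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned relevance score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) from a modality-specific encoder
        weights = torch.softmax(self.score(x), dim=1)  # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)                # (batch, dim)

pooled = AttentionPooling(256)(torch.randn(8, 50, 256))  # -> (8, 256)
```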

Key Findings

Experiments were conducted on both the MPDD-Elderly and MPDD-Young datasets across binary, ternary, and quinary classification tasks, using 1-second and 5-second time windows. An ablation study helped optimize each system’s configuration, confirming the benefits of PCA and class weighting for XGBoost, and of Mixup with cross-validation for the Transformer model.
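Mixup, one of the regularizers credited above, blends pairs of training examples and their labels into convex combinations, which discourages overfitting on small datasets. The sketch below shows a common formulation; the alpha value and tensor shapes are chosen for illustration, not taken from the paper.

```python
# Minimal sketch of Mixup regularization: each example is blended with
# a randomly chosen partner in the batch, and labels are blended too.
import numpy as np
import torch

def mixup(x: torch.Tensor, y_onehot: torch.Tensor, alpha: float = 0.2):
    """Return convex combinations of input pairs and their soft labels."""
    lam = np.random.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = torch.randperm(x.size(0))            # random partner assignment
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed

x = torch.randn(16, 256)                                          # fused features
y = torch.nn.functional.one_hot(torch.randint(0, 3, (16,)), 3).float()
x_aug, y_aug = mixup(x, y)                                        # train on soft labels
```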

The results highlighted the strengths and limitations of each model:

  • On the MPDD-Elderly dataset, XGBoost performed exceptionally well for the 5-second binary classification task, indicating its effectiveness with longer audio segments and well-engineered features. The Transformer model, however, showed superior performance on shorter 1-second segments and more complex classification tasks (ternary and quinary), demonstrating its ability to capture fine-grained information.
  • For the MPDD-Young dataset, the Transformer model consistently delivered the best results across all classification tasks and time windows, achieving high weighted F1 scores. XGBoost also performed well but generally lagged behind the Transformer, especially in ternary tasks.
  • Interestingly, despite having far more parameters (roughly 6.8 billion, versus 1.06 million for the Transformer and about 2,000 for XGBoost), the LLM-based approach generally underperformed across most tasks. This suggests that larger models do not always guarantee better performance, especially on specific, nuanced tasks like multimodal depression detection with specialized datasets.

Overall, the Transformer model emerged as the most effective approach for multimodal depression detection in this study, particularly for younger speakers and shorter audio windows. XGBoost proved a strong contender for binary classification, showcasing the power of simpler, well-tuned models on specific tasks. The research offers valuable insights into effective multimodal representation strategies for mental health prediction, paving the way for more accurate real-world depression recognition systems. For more details, see the full research paper.
