TLDR: This research paper introduces TRI-DEP, a comprehensive study on detecting depression using a combination of EEG, speech, and text data. It systematically compares different feature extraction methods (handcrafted vs. pre-trained embeddings), neural network architectures, and fusion strategies. The study found that pre-trained embeddings consistently outperform handcrafted features, and that combining all three modalities (EEG, speech, and text) through a majority voting fusion strategy achieves state-of-the-art performance in depression detection, highlighting the complementary nature of these signals.
Depression is a widespread mental health condition that poses significant challenges for automatic detection. While unimodal (single-source) and multimodal (multiple-source) approaches have been explored, existing studies often have limitations such as narrow scope, inconsistent feature comparisons, and varied evaluation methods. A new research paper, TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG, addresses these gaps by systematically investigating feature representations and modeling strategies across three modalities: Electroencephalography (EEG), speech, and text.
The researchers, Annisaa Fitri Nurfidausi, Eleonora Mancini, and Paolo Torroni from the University of Bologna, Italy, aimed to provide a robust and reproducible benchmark for depression detection. Their work involved evaluating handcrafted features against pre-trained embeddings, assessing various neural encoders, comparing unimodal, bimodal, and trimodal configurations, and analyzing different fusion strategies, with a particular focus on the role of EEG.
The study utilized the Multi-modal Open Dataset for Mental-disorder Analysis (MODMA), which includes 5-minute resting-state EEG recordings and audio from structured clinical interviews. Since MODMA lacked text transcriptions, the team generated them automatically using speech-to-text models. To ensure reliable results, they employed stratified 5-fold subject-level cross-validation, preventing data leakage that can inflate performance.
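To illustrate what subject-level cross-validation looks like in practice, here is a minimal sketch using scikit-learn's StratifiedGroupKFold, which keeps every subject's segments confined to a single fold. The names (X, y, subject_ids, build_model) are illustrative placeholders, not code from the paper.

```python
# Minimal sketch of stratified 5-fold subject-level cross-validation,
# assuming per-segment features X, binary labels y, and one subject ID per segment.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def subject_level_cv(X, y, subject_ids, build_model, n_splits=5):
    """Split by subject so no participant appears in both train and test folds."""
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in cv.split(X, y, groups=subject_ids):
        model = build_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```

Grouping by subject is what prevents the leakage the authors warn about: segments from the same person are highly correlated, so letting them straddle the train/test boundary inflates scores.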
Exploring Modalities and Features
For each modality, the researchers explored different feature extraction methods. For EEG, they used both handcrafted descriptors (statistical, spectral, entropy) and embeddings from large pre-trained models like LaBraM and CBraMod. Speech features included handcrafted MFCCs (Mel-frequency cepstral coefficients) and prosodic features, as well as embeddings from advanced models like XLSR-53 and Chinese HuBERT Large. For text, they relied on embeddings from pre-trained language models such as Chinese BERT Base, MacBERT, XLNet, and MPNet Multilingual.
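As a concrete illustration of the two feature families for speech, the hedged sketch below extracts handcrafted MFCCs with librosa and frame-level embeddings from an XLSR-53 checkpoint via Hugging Face Transformers. The model ID is a commonly used public hosting of XLSR-53, not a detail taken from the paper, and the 16 kHz mono assumption is likewise illustrative.

```python
# Handcrafted vs. pre-trained speech features (illustrative, assuming 16 kHz mono audio).
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def mfcc_features(path, n_mfcc=13):
    signal, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)     # (n_mfcc, frames)

def xlsr_embeddings(path, model_id="facebook/wav2vec2-large-xlsr-53"):
    signal, sr = librosa.load(path, sr=16000)
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = Wav2Vec2Model.from_pretrained(model_id).eval()
    inputs = extractor(signal, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state                  # (1, frames, hidden_dim)
    return hidden.squeeze(0)
```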
These features were then fed into neural network architectures tailored to each modality. For EEG, they considered CNN+LSTM and GRU+Attention encoders. For speech, shallow CNNs were followed by either pooling or GRU/BiGRU layers with attention, and then an LSTM. Text features were processed by LSTM or CNN modules.
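The snippet below is a minimal PyTorch sketch of a CNN+LSTM encoder of the kind described for EEG, assuming input of shape (batch, channels, time). Layer sizes and the channel count are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative CNN+LSTM encoder for EEG segments (not the authors' exact architecture).
import torch
import torch.nn as nn

class CNNLSTMEncoder(nn.Module):
    def __init__(self, in_channels=128, hidden=64, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, channels, time)
        feats = self.conv(x)               # (batch, 64, reduced_time)
        feats = feats.transpose(1, 2)      # (batch, reduced_time, 64)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])       # class logits from the final time step
```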
Multimodal Fusion Strategies
A crucial aspect of the study was the investigation of multimodal fusion. The researchers selected the best-performing feature-model pair for each modality and combined their predictions using late fusion strategies. They explored three main schemes: Bayesian fusion, soft voting (averaging the class probabilities across modalities), and weighted averaging. Because the unimodal predictors were held fixed, any performance gain could be attributed directly to the fusion strategy itself.
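The following hedged sketch shows what late fusion looks like on each unimodal model's class-probability outputs (arrays of shape (n_samples, n_classes)): soft voting and weighted averaging as described above, plus the majority-voting variant mentioned in the results. The weights are illustrative and would normally be tuned on validation folds.

```python
# Late-fusion schemes over per-modality class probabilities (illustrative sketch).
import numpy as np

def soft_voting(prob_list):
    """Mean of class probabilities across modalities, then argmax."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

def weighted_averaging(prob_list, weights):
    """Weighted mean of class probabilities; weights are assumed, e.g. tuned on validation data."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(prob_list, axis=0)             # (n_modalities, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1).argmax(axis=1)

def majority_voting(prob_list):
    """Each modality casts a hard vote; the most frequent class wins per sample."""
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=0)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```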
Key Findings and State-of-the-Art Performance
The results highlighted several important insights. In unimodal settings, text proved to be the most informative single modality, with speech embeddings also showing strong performance. EEG, while valuable, was less predictive in isolation. Pre-trained embeddings consistently outperformed handcrafted features across all modalities, indicating their ability to capture richer, more nuanced information.
When combining modalities, fusion strategies significantly boosted performance. The trimodal configuration, integrating EEG, speech, and text, consistently improved robustness and yielded the strongest overall results. Specifically, a carefully designed trimodal model employing majority voting across all three modalities achieved an F1-score of 0.874, establishing a new state-of-the-art in multimodal depression detection.
The study concludes by proposing an experimental framework that fixes the optimal unimodal predictors and systematically evaluates alternative fusion strategies. This framework serves as a valuable reference for future research, guiding the development of more accurate and robust systems for automatic depression detection.


