TLDR: This research paper introduces TRI-DEP, a comprehensive study on detecting depression using a combination of EEG, speech, and text data. It systematically compares different feature extraction methods (handcrafted vs. pre-trained embeddings), neural network architectures, and fusion strategies. The study found that pre-trained embeddings consistently outperform handcrafted features, and that combining all three modalities (EEG, speech, and text) through a majority voting fusion strategy achieves state-of-the-art performance in depression detection, highlighting the complementary nature of these signals.
Depression is a widespread mental health condition that poses significant challenges for automatic detection. While unimodal (single-source) and multimodal (multiple-source) approaches have been explored, existing studies often have limitations such as narrow scope, inconsistent feature comparisons, and varied evaluation methods. A new research paper, TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG, addresses these gaps by systematically investigating feature representations and modeling strategies across three modalities: Electroencephalography (EEG), speech, and text.
The researchers, Annisaa Fitri Nurfidausi, Eleonora Mancini, and Paolo Torroni from the University of Bologna, Italy, aimed to provide a robust and reproducible benchmark for depression detection. Their work involved evaluating handcrafted features against pre-trained embeddings, assessing various neural encoders, comparing unimodal, bimodal, and trimodal configurations, and analyzing different fusion strategies, with a particular focus on the role of EEG.
The study utilized the Multi-modal Open Dataset for Mental-disorder Analysis (MODMA), which includes 5-minute resting-state EEG recordings and audio from structured clinical interviews. Since MODMA lacked text transcriptions, the team generated them automatically using speech-to-text models. To ensure reliable results, they employed stratified 5-fold subject-level cross-validation, preventing data leakage that can inflate performance.
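To illustrate what subject-level cross-validation looks like in practice, here is a minimal sketch using scikit-learn's StratifiedGroupKFold, which keeps every subject's segments confined to a single fold. The names (X, y, subject_ids, build_model) are illustrative placeholders, not code from the paper.

```python
# Minimal sketch of stratified 5-fold subject-level cross-validation,
# assuming per-segment features X, binary labels y, and one subject ID per segment.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def subject_level_cv(X, y, subject_ids, build_model, n_splits=5):
    """Split by subject so no participant appears in both train and test folds."""
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in cv.split(X, y, groups=subject_ids):
        model = build_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```

Grouping by subject is what prevents the leakage the authors warn about: segments from the same person are highly correlated, so letting them straddle the train/test boundary inflates scores.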
Exploring Modalities and Features
For each modality, the researchers explored different feature extraction methods. For EEG, they used both handcrafted descriptors (statistical, spectral, entropy) and embeddings from large pre-trained models like LaBraM and CBraMod. Speech features included handcrafted MFCCs (Mel-frequency cepstral coefficients) and prosodic features, as well as embeddings from advanced models like XLSR-53 and Chinese HuBERT Large. For text, they relied on embeddings from pre-trained language models such as Chinese BERT Base, MacBERT, XLNet, and MPNet Multilingual.
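As a concrete illustration of the two feature families for speech, the hedged sketch below extracts handcrafted MFCCs with librosa and frame-level embeddings from an XLSR-53 checkpoint via Hugging Face Transformers. The model ID is a commonly used public hosting of XLSR-53, not a detail taken from the paper, and the 16 kHz mono assumption is likewise illustrative.

```python
# Handcrafted vs. pre-trained speech features (illustrative, assuming 16 kHz mono audio).
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def mfcc_features(path, n_mfcc=13):
    signal, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)     # (n_mfcc, frames)

def xlsr_embeddings(path, model_id="facebook/wav2vec2-large-xlsr-53"):
    signal, sr = librosa.load(path, sr=16000)
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = Wav2Vec2Model.from_pretrained(model_id).eval()
    inputs = extractor(signal, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state                  # (1, frames, hidden_dim)
    return hidden.squeeze(0)
```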
These features were then fed into neural network architectures tailored to each modality. For EEG, they considered CNN+LSTM and GRU+Attention encoders. For speech, shallow CNNs were followed by either pooling or GRU/BiGRU layers with attention, and then an LSTM. Text features were processed by LSTM or CNN modules.
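The snippet below is a minimal PyTorch sketch of a CNN+LSTM encoder of the kind described for EEG, assuming input of shape (batch, channels, time). Layer sizes and the channel count are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative CNN+LSTM encoder for EEG segments (not the authors' exact architecture).
import torch
import torch.nn as nn

class CNNLSTMEncoder(nn.Module):
    def __init__(self, in_channels=128, hidden=64, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, channels, time)
        feats = self.conv(x)               # (batch, 64, reduced_time)
        feats = feats.transpose(1, 2)      # (batch, reduced_time, 64)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])       # class logits from the final time step
```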
Multimodal Fusion Strategies
A crucial aspect of the study was the investigation of multimodal fusion. The researchers selected the best-performing feature-model pair for each modality and combined their predictions using late fusion strategies. They explored three main schemes: Bayesian fusion, soft voting (averaging the class probabilities across modalities), and weighted averaging. Because the unimodal predictors were held fixed, any performance gain could be attributed directly to the fusion strategy itself.
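The following hedged sketch shows what late fusion looks like on each unimodal model's class-probability outputs (arrays of shape (n_samples, n_classes)): soft voting and weighted averaging as described above, plus the majority-voting variant mentioned in the results. The weights are illustrative and would normally be tuned on validation folds.

```python
# Late-fusion schemes over per-modality class probabilities (illustrative sketch).
import numpy as np

def soft_voting(prob_list):
    """Mean of class probabilities across modalities, then argmax."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

def weighted_averaging(prob_list, weights):
    """Weighted mean of class probabilities; weights are assumed, e.g. tuned on validation data."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(prob_list, axis=0)             # (n_modalities, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1).argmax(axis=1)

def majority_voting(prob_list):
    """Each modality casts a hard vote; the most frequent class wins per sample."""
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=0)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```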
Key Findings and State-of-the-Art Performance
The results highlighted several important insights. In unimodal settings, text proved to be the most informative single modality, with speech embeddings also showing strong performance. EEG, while valuable, was less predictive in isolation. Pre-trained embeddings consistently outperformed handcrafted features across all modalities, indicating their ability to capture richer, more nuanced information.
When combining modalities, fusion strategies significantly boosted performance. The trimodal configuration, integrating EEG, speech, and text, consistently improved robustness and yielded the strongest overall results. Specifically, a carefully designed trimodal model employing majority voting across all three modalities achieved an F1-score of 0.874, establishing a new state-of-the-art in multimodal depression detection.
The study concludes by proposing an experimental framework that fixes the optimal unimodal predictors and systematically evaluates alternative fusion strategies. This framework serves as a valuable reference for future research, guiding the development of more accurate and robust systems for automatic depression detection.


