LTD-Conformer: Speech Depression Detection with Speaking and Listening Perspectives

Title: LTD-Conformer: Speech Depression Detection with Speaking and Listening Perspectives

Authors: Jihun Lee, Jisun Hong, Daegil Choi and Jaehyo Jung

Conference: IEEE CBMS 2025

Tags: Audio, Conformer, Depression, HuBERT, Long-Term Dilated-Conformer and Mel-spectrogram

Abstract:

Depression is a serious mental health problem worldwide that requires quick and accurate diagnosis. Machine learning and deep learning techniques have recently been actively applied to depression diagnosis, and audio signals in particular are attracting attention as a non-invasive and economical source of diagnostic information. This study proposes the Long-Term Dilated-Conformer (LTD-Conformer), an extension of the existing Conformer model designed to use audio signals for more accurate depression detection. The LTD-Conformer employs dilated depthwise convolution to achieve a wide receptive field and integrates a Long-Term Module to capture sequential information in audio features. The model thereby captures local, global, and sequential patterns in audio signals. In addition, we combine listening (Mel-spectrogram) features and speaking (HuBERT) features to analyze both perspectives of the audio signal. Experiments were conducted on the DAIC-WOZ dataset, where the LTD-Conformer achieved an accuracy of 87.04% and an F1-score of 0.87, a 4% improvement in accuracy and a 0.04 increase in F1-score over the existing Conformer model. These results suggest that the audio-based LTD-Conformer can be applied effectively to depression diagnosis and has the potential to develop into a strong audio-based depression diagnosis model.
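
The dilated depthwise convolution mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `dilated_depthwise_conv1d` and `receptive_field`, the kernel size of 3, and the dilation of 4 are all hypothetical choices made for illustration, since the abstract does not state the actual values. The sketch only shows the general idea that each channel is filtered by its own kernel (depthwise) and that spacing the kernel taps apart (dilation) widens the receptive field without adding weights.

```python
def receptive_field(kernel_size, dilation):
    """Number of input frames spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_depthwise_conv1d(x, kernels, dilation=1):
    """Depthwise 1D convolution with dilation and zero 'same' padding.

    x:       list of channels, each a list of floats of length T
    kernels: one kernel (list of floats) per channel
    Assumption for illustration: odd kernel sizes, stride 1, no bias.
    """
    if len(x) != len(kernels):
        raise ValueError("depthwise convolution needs one kernel per channel")
    out = []
    for channel, kernel in zip(x, kernels):
        span = (len(kernel) - 1) * dilation  # distance from first to last tap
        pad = span // 2                      # centre the kernel on frame t
        t_len = len(channel)
        y = []
        for t in range(t_len):
            acc = 0.0
            for j, w in enumerate(kernel):
                idx = t - pad + j * dilation  # taps are `dilation` frames apart
                if 0 <= idx < t_len:          # frames outside the signal count as zero
                    acc += w * channel[idx]
            y.append(acc)
        out.append(y)
    return out


# A single impulse shows which output frames "see" input frame 4.
impulse = [[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]
narrow = dilated_depthwise_conv1d(impulse, [[1.0, 1.0, 1.0]], dilation=1)
wide = dilated_depthwise_conv1d(impulse, [[1.0, 1.0, 1.0]], dilation=4)
```

With the same three weights, the undilated kernel spans 3 frames while the dilated one spans 9, which is the sense in which dilation gives a wider receptive field at no extra parameter cost.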