
Article

Multimodal Data Fusion for Depression Detection Approach


Mariia Nykoniuk 1, Oleh Basystiuk 1,*, Nataliya Shakhovska 1,2 and Nataliia Melnykova 1

1 Department of Artificial Intelligence, Lviv Polytechnic National University, Stepan Bandera 12,
79013 Lviv, Ukraine; [email protected] (M.N.); [email protected] (N.S.);
[email protected] (N.M.)
2 Department of Civil and Environmental Engineering, Brunel University of London, Uxbridge UB8 3PH, UK
* Correspondence: [email protected]

Abstract: Depression is one of the most common mental health disorders in the world, affecting millions of people. Early detection of depression is crucial for effective medical intervention. Multimodal networks can greatly assist in the detection of depression, especially in situations wherein patients are not always aware of or able to express their symptoms. By analyzing text and audio data, such networks are able to automatically identify patterns in speech and behavior that indicate a depressive state. In this study, we propose two multimodal information fusion networks: early and late fusion. These networks were developed using convolutional neural network (CNN) layers to learn local patterns, a bidirectional LSTM (Bi-LSTM) to process sequences, and a self-attention mechanism to improve focus on key parts of the data. The DAIC-WOZ and EDAIC-WOZ datasets were used for the experiments. The experiments compared the precision, recall, f1-score, and accuracy metrics for the cases of using early and late multimodal data fusion and found that the early information fusion multimodal network achieved higher classification accuracy results. On the test dataset, this network achieved an f1-score of 0.79 and an overall classification accuracy of 0.86, indicating its effectiveness in detecting depression.

Keywords: depression detection; multimodal networks; early fusion; late fusion; mental
health; deep learning

Academic Editors: Dmytro Chumachenko and Sergiy Yakovlev

Received: 28 November 2024; Revised: 22 December 2024; Accepted: 25 December 2024; Published: 2 January 2025

Citation: Nykoniuk, M.; Basystiuk, O.; Shakhovska, N.; Melnykova, N. Multimodal Data Fusion for Depression Detection Approach. Computation 2025, 13, 9. https://doi.org/10.3390/computation13010009

Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Depression is one of the most common mental health disorders worldwide. According to the World Health Organization (WHO) [1], more than 280 million people, or approximately 5% of the adult population, suffer from this condition. The impact of depression goes beyond emotional distress, significantly affecting a person's ability to perform daily tasks, maintain personal relationships, and stay physically and mentally healthy. Symptoms of depression can include feelings of persistent sadness, loss of interest in usual activities, fatigue, sleep problems, changes in appetite, and difficulties with concentration [1]. Identifying depression early is critical, as this mental disorder tends to worsen without proper treatment. Undiagnosed or untreated depression can lead to serious physical and mental complications, such as chronic illness, reduced quality of life, and even suicidal thoughts or actions. People suffering from depression often lose the ability to work productively, maintain healthy relationships, and even perform basic daily tasks.
Artificial intelligence can be of great help in detecting depression, especially in situations wherein patients are not always aware of or able to express their symptoms. Multimodal approaches combine different types of data, such as analyzing how a person speaks (audio) and what they say (text), to identify patterns related to depression. The
multimodal approach (combining audio and text) was chosen over unimodal analysis
because it provides more complete data about the participant’s emotional and mental state.
While text-only analysis focuses on the content of what is said, it does not take into account
important paralinguistic cues present in the audio, such as tone, pitch, and rhythm, which
can be strong indicators of emotional distress. Similarly, audio-only analysis does not take
into account linguistic patterns in spoken language, such as specific word usage or sentence
structure, which can reveal cognitive signs of depression.
By integrating audio and textual modalities, the model can detect subtle clues that
would otherwise be missed if it focused on only one type of data. This approach provides
a more holistic view of the participant’s state, significantly increasing the accuracy of
depression detection, especially in cases wherein one modality alone may not be sufficient
to detect depression.
The contribution of this work is highlighted as follows:
• This work outlines an effective data preprocessing pipeline, addressing the issue of
class imbalance.
• The study proposes and develops two multimodal networks for depression detection—
early fusion and late fusion models. These models combine both audio and text
modalities to capture a more holistic set of features relevant to identifying depres-
sive symptoms.
• The performance of the developed models is rigorously evaluated using multiple
performance measures, including accuracy, recall, precision, and f1-score.
• A detailed interpretation of the best model's results on both validation and test datasets is provided, illustrating its effectiveness in detecting depression across various scenarios.

2. Related Work
Depression detection using artificial intelligence has been an active area of research
in recent years. Various approaches have been applied, utilizing different types of data
such as text, audio, and even visual information. Early works in this domain focused
predominantly on single-modality data (either text or audio) for classification tasks.
This article could be improved by integrating a broader and more representative range
of literature sources that cover the development and evaluation of depression detection
models using artificial intelligence. The current review primarily references a limited
selection of studies and does not fully capture the diversity of approaches in the field,
particularly when it comes to multimodal detection, dataset balancing, and model perfor-
mance evaluation. To demonstrate a more comprehensive understanding of the field, this
article could incorporate references to key papers that explore recent advancements in both
single-modality and multimodal approaches.
While studies such as [2–5] provide valuable insights into the use of speech-related
features like log spectrograms, MFCC, COVAREP, and HOSA for detecting depression,
this article overlooks some of the more recent methods in this area, such as newer feature
extraction techniques or improved neural network architectures. Including references to
works that explore deep learning models beyond CNNs and RNNs [6], such as attention-
based mechanisms or self-supervised learning for speech-based depression detection,
would broaden the discussion.
This article acknowledges the use of traditional NLP methods such as Bag of Words
(BoW) and TF-IDF, as well as more recent deep learning models like LSTM for text-based
depression detection. However, there is a growing body of work on pre-trained transformer
models (e.g., BERT, RoBERTa, S-RoBERTa), which have shown significant improvements in
understanding linguistic cues from textual data. Citing studies that explore these advanced
transformer models for better accuracy in depression detection through text analysis would
enhance this article’s coverage of text-based approaches.
While studies like [7–9] are cited for their multimodal fusion approaches, this article
could benefit from incorporating additional works that investigate the use of multimodal
systems combining not only speech and text but also visual cues, such as facial expressions
and body language, for more holistic depression detection. Including studies that explore
early fusion or hybrid fusion models could expand the discussion on how different data
modalities can be leveraged together.
This article rightly points out the issue of dataset imbalance in studies like [7,8], but
it could strengthen this point by discussing techniques for addressing class imbalance,
such as oversampling, undersampling, and data augmentation, which are commonly
used in depression detection research. Additionally, incorporating literature on fairness
and model generalizability would highlight the importance of ensuring that models are
not biased toward over-represented classes and can effectively perform across diverse
patient populations.
Some of the studies reviewed (e.g., [9]) achieve low accuracy, indicating that such
models may not be reliable enough for practical use in clinical settings or everyday use.
Finally, this article could deepen its analysis of model performance, particularly
with reference to accuracy and reliability in clinical or real-world settings. While studies
like [9] report lower performance metrics, a more nuanced discussion of evaluation metrics
beyond accuracy—such as precision, recall, f1-score, and area under the ROC curve—would
provide a clearer picture of how these models perform in various contexts. By expanding
the citation range and including these relevant works, this article would demonstrate a more
thorough understanding of the state of the field and provide a more balanced overview of
the challenges and opportunities in depression detection using artificial intelligence.
Thus, the main challenge in the field of depression diagnosis is to develop reliable,
accurate, and fair models that can effectively detect signs of depression. Existing method-
ologies have several serious problems, including unbalanced datasets, dependence on
a single data modality, and generally low prediction accuracy. Addressing these issues
will significantly increase the reliability of depression detection tools, making them more
effective for clinical and personal health monitoring.

3. Proposed Approach
3.1. Dataset
This study uses the DAIC-WOZ (Depression and Anxiety Interview Corpus—Wizard
of Oz) dataset [10], which is one of the key datasets used for research in the field of detecting
depression and anxiety disorders based on multimodal data.
DAIC-WOZ contains recordings of interviews between participants and a virtual
agent named Ellie, who was programmed as an interlocutor to collect information about a
person’s mental state. The virtual agent asks participants a variety of questions about their
mental health, emotional state, life events, etc.
The DAIC-WOZ dataset contains multimodal interview recordings, including
the following:
• Textual data. Interview transcripts that provide a detailed record of verbal commu-
nication between the patient and the virtual interviewer. The transcripts contain the
exact start and end time of each cue, which makes it possible to synchronize the audio
with the text.
• Audio recordings. The set contains audio files that record the voice of patients dur-
ing interviews.
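As a hedged illustration of how these time annotations can be used, the sketch below keeps only the participant's speech by slicing the audio at the annotated start and end times. The column names (start_time, stop_time, speaker, value) and the tab-separated layout are assumptions based on the DAIC-WOZ transcript files, not something specified in this article.

```python
# Hypothetical illustration: extract participant-only audio using transcript time annotations.
# Column names and the tab-separated format are assumed from the DAIC-WOZ transcript layout.
import pandas as pd
from pydub import AudioSegment

def participant_audio(transcript_path: str, wav_path: str) -> AudioSegment:
    transcript = pd.read_csv(transcript_path, sep="\t")   # assumed tab-separated transcript
    audio = AudioSegment.from_wav(wav_path)
    participant = AudioSegment.empty()
    for _, row in transcript.iterrows():
        if str(row["speaker"]).strip().lower() == "participant":
            # Annotation times are in seconds; pydub slices in milliseconds.
            start_ms = int(float(row["start_time"]) * 1000)
            stop_ms = int(float(row["stop_time"]) * 1000)
            participant += audio[start_ms:stop_ms]
    return participant
```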

The DAIC-WOZ contains several important features that make it unique for a study of depression detection:
• The dataset includes 189 interviews, which is a small dataset size for model training.
• The length of the interview audio recordings can vary from a few minutes to half an hour, depending on the pace of the conversation and the number of questions asked to the patient.
• Each text file has time annotations that allow you to synchronize text and audio. These annotations are important for analysis because they allow us to more accurately identify specific moments in the interview where the patient shows signs of depression. They also simplify the procedure of removing interviewer questions from the text and audio, as these annotations contain the beginning and end of each line.
One of the key problems with the DAIC-WOZ dataset is the imbalance of classes between patients with and without depression.
The extended dataset EDAIC-WOZ (Extended Depression and Anxiety Interview Corpus—Wizard of Oz) [10] is an additional resource to the basic DAIC-WOZ dataset. It contains more records of interviews with patients diagnosed with depression, which were not included in the original dataset (Figure 1).

Figure 1. Comparative plot of depression class distribution in DAIC-WOZ and EDAIC-WOZ datasets.
The purpose of this extended dataset is to increase the amount of data available for depression analysis and provide researchers with more opportunities to train models on multimodal data.
At the same time, the extended EDAIC-WOZ dataset has some drawbacks related to transcript quality. Although the transcripts in this dataset contain the exact start and end times of each utterance, there are several serious problems that make them difficult to process:
1. Lack of speaker separation. The EDAIC-WOZ transcripts do not clearly indicate whether a particular line belongs to the patient or the interviewer. This creates confusion, as sometimes the interviewer's question and the patient's answer may be combined in one text segment.
2. Inaccuracy of transcripts. Transcripts in the extended dataset may be inaccurate. This means that what is written in the transcript does not always accurately reflect what was actually said by the interviewee. This mismatch between the actual audio data and the text complicates the data preparation process, as it is necessary to manually double-check the transcripts against the audio. This is especially important for the accurate extraction of textual features, which can significantly affect the overall quality of the model.
For this study, only those recordings that are labeled as depression and do not belong to the regular DAIC-WOZ dataset were taken from the extended EDAIC-WOZ dataset (Figure 2).

Figure 2. Graph of the final distribution of the depression classes.


The final data distribution consists of 120 non-depression cases from the DAIC-WOZ dataset, 40 depression cases from the DAIC-WOZ dataset, and 20 depression cases from the EDAIC-WOZ dataset. This approach effectively increases the number of examples in the depression class without duplicating existing data from the original DAIC-WOZ set. By including additional depression cases from the extended EDAIC-WOZ dataset, this method addresses the problem of class imbalance, where the depression class is underrepresented compared to non-depression cases. The adjusted distribution ensures a more balanced representation of both classes for analysis.
This will effectively increase the number of examples of the class "depression" without duplicating existing data from the original set. This approach can help to address the problem of imbalanced classes, wherein data with depression are significantly less represented compared to data without depression.
3.2. Data Processing
Audio. When working with audio data in depression detection tasks, it is important to properly prepare the data to ensure the accuracy of the model. The audio preprocessing process includes several important steps, each of which is aimed at improving data quality and increasing the efficiency of further feature analysis (Figure 3).

Figure 3. Audio data preprocessing algorithm.



According to the diagram above, the stages of audio data preprocessing include
the following:
1. Removal of interviewer segments. First of all, all segments where the interviewer
speaks are removed. This is important for analyzing only the participant’s speech, as
we are interested in identifying signs of depression in the patients’ responses.
2. Dividing into training and test samples. In the next step, the data are divided into
training and test samples for training and evaluation of the model in the ratio of 80:20.
The training set is used to train the model, and the test set is used to check its accuracy
on new data.
3. Training sample augmentation (tone change). To increase the amount of data in the
training set and balance the classes, audio data are augmented by changing the tone
of speech, which allows you to create new variations of data without losing meaning.
In this case, the tone is changed by shifting the signal frequencies by 1.5 semitones.
This means that the pitch of the participant’s voice was lowered by 1.5 semitones.
It is important to note that changing the pitch within a small range, in this case by
1.5 semitones, allows us to preserve the naturalness of the voice while providing new
audio variations.
4. Removal of extraneous noise. At this stage, background noise is removed from the
audio. This includes clearing the audio of unwanted sounds such as noise, hum, or
extraneous conversations.
5. Divide into segments of 1 min duration. Audio recordings are divided into 1 min seg-
ments. Segments that are too short (e.g., 5–10 s) may not contain enough information
about the emotional state or tone of the respondent. When broken down into one-
minute segments, the context of the conversation is preserved, which is important for
understanding changes in the participant’s tone and mood during the conversation.
6. Calculating audio features (Mel spectrograms, MFCC, spectral contrast). For each
segment, audio features such as Mel spectrograms, MFCC coefficients (Mel frequency
cepstral coefficients), and spectral contrast are calculated.
◦ A Mel spectrogram is a spectrum of audio frequencies calculated on the Mel
scale [11]. The Mel spectrogram shows the distribution of energy across fre-
quencies over time, reflecting changes in voice and intonation in speech. It is
well suited for analyzing temporal changes in audio.
◦ MFCC are coefficients derived from the Mel spectrogram, which is a numerical
representation of the sound spectrum on the Mel scale [12]. They allow for
even more accurate characterization of audio signals. MFCCs are well suited
for analyzing speech data because they capture the features of the speaker’s
articulation. They are able to distinguish tone, rhythm, intonation, and other
speech aspects.
◦ Spectral contrast is a method that measures the difference between the max-
imum and minimum amplitudes in different frequency bands [13]. In the
case of speech analysis for depression, it is useful for recognizing changes in
the voice, such as monotony or lack of expression, which are typical signs
of depression.
These features make it possible to identify voice and speech characteristics that may
indicate a depressive state. These features will be used as input for training the model.
7. Normalize the data. This step transforms the data so that they have a mean value of 0
and a standard deviation of 1. As a result, all features are brought to the same scale,
which is critical for some machine learning models.
8. Dividing the training sample into training and validation samples. The training
sample is further divided into training and validation samples in the ratio of 80:20.
This makes it possible to evaluate the model performance on the validation data
during the training process in order to adjust the model parameters.
Thus, this sequence of steps helps to prepare the audio data properly, which ensures
high-quality training and subsequent classification of depression signs.
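As an illustration of steps 3-7, the following minimal sketch uses librosa and noisereduce. The library calls are real, but the sampling rate, feature sizes, and noise-reduction settings are assumptions made for the sake of a runnable example rather than the authors' exact configuration.

```python
# Illustrative sketch of steps 3-7: pitch-shift augmentation, noise reduction,
# 1-minute segmentation, feature extraction, and normalization.
import numpy as np
import librosa
import noisereduce as nr

SR = 16000            # assumed sampling rate
SEGMENT_SEC = 60      # 1-minute segments, as described above

def preprocess_audio(path: str, augment: bool = False):
    y, sr = librosa.load(path, sr=SR)
    if augment:
        # Step 3: lower the pitch by 1.5 semitones to create a new training variant.
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=-1.5)
    # Step 4: suppress background noise.
    y = nr.reduce_noise(y=y, sr=sr)
    # Step 5: split into 1-minute segments, dropping an incomplete tail segment.
    seg_len = SEGMENT_SEC * sr
    segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_len)]
    features = []
    for seg in segments:
        # Step 6: Mel spectrogram, MFCC, and spectral contrast per segment.
        mel = librosa.power_to_db(librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=64))
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
        contrast = librosa.feature.spectral_contrast(y=seg, sr=sr)
        feat = np.vstack([mel, mfcc, contrast]).T      # shape: (time_frames, n_features)
        # Step 7: normalize each feature to zero mean and unit variance.
        feat = (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-8)
        features.append(feat)
    return features
```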
Text. Text data preprocessing (Figure 4) is an important step as it prepares textual cues for further feature extraction and removes redundant or uninformative information. This helps the model to focus on the relevant data, which improves the quality of training and the accuracy of the results.

Figure 4. Text data preprocessing algorithm.
According to the diagram above, the stages of text data preprocessing include the following:
1. Removal of interviewer's remarks. The first step is to remove all questions asked by the interviewer. For data from the extended dataset (EDAIC-WOZ), where the annotations are inaccurate, the interviewer's remarks are not separated and completely removed. In this case, the removal of interviewer questions is conducted manually to ensure consistency between text and audio and to focus the analysis on the participants' responses.
2. Correction of errors in the participant's answers. Transcripts from the extended dataset sometimes contain errors or inconsistencies between what was said and what was recorded. At this stage, a manual check is performed to correct these errors and bring the text in line with the actual answers. This provides a more accurate analysis of the text data.
3. Dividing into training and test samples. To ensure the reliability of the results, the data are divided into training and test samples in the ratio of 80:20. This helps to check how the model will work on new data and prevents overfitting.
4. Training sample augmentation. Text augmentation is used to increase the amount of data and improve model accuracy. In this case, replacing words with synonyms makes it possible to generate new text variants while maintaining the meaning. Each text fragment is analyzed to identify words that can be replaced with synonyms. For this purpose, the WordNet lexical database is used, which contains synonyms for various words. In addition, not all words can be replaced. The text is processed in parts of 100 words, and the probability of replacing words with their synonyms is 15%. The 15% probability was chosen as a compromise that allows for adding variability to the data without changing the text too radically. A higher percentage can change the semantics of the text, which is undesirable, especially when analyzing mental states, where every detail is important. In addition, stop words (e.g., "I", "you", "depression") are specifically excluded from the list of replacement words. This is done to avoid changing keywords that may have a significant impact on the model or are critical to the analysis of depression. Augmentation is applied to all data with the depression class in the training set (an illustrative code sketch for this step and for the TF-IDF weighting in step 8 is given at the end of this subsection).
5. Split into segments by length. Based on the number of segments when splitting audio data, the text is divided into the same number of segments by word count.
6. Expanding contractions. During the interview, contractions may be used; these need to be expanded so that the model can better understand the text.
7. Removal of punctuation marks, symbols, numbers, and words with less than two
characters. This step involves removing non-informative characters such as punctua-
tion, numbers, and words that are too short and do not have a meaningful impact on
the analysis.
8. Calculation of textual features (TF-IDF). At this stage, textual features are calculated
by increasing the weight of words related to depression and applying the TF-IDF
(term frequency–inverse document frequency) method, which measures the frequency
of each word in the text and its importance relative to other texts.
To increase the weight of words related to depression, a set of keywords (symptoms)
that are directly related to the depressive state and pre-trained GloVe vectors (Global
Vectors for Word Representation) are used, which represent words as numerical vectors in
a multidimensional space. For each keyword, GloVe vectors are used to find semantically
similar words. This is achieved by calculating the cosine similarity between the word
vectors, which allows us to determine how similar two words are in their contexts in the
texts. After that, for each keyword, the top 10 most similar words are determined and
added to the main list.
When the list of words related to depression is supplemented with similar words,
these words receive increased weight during text processing. This is achieved by repeating
the word several times so that the model gives them more weight compared to other words.
After the weights of important words are increased, TF-IDF is used to vectorize
the text.
9. Dividing the training sample into training and validation samples. After all the stages
of text data preprocessing are completed, the training data are divided into training
and validation samples in the ratio of 80:20. The training set is used to train the
model, while the validation set is used to evaluate its performance and adjust the
hyperparameters.
After the text data preprocessing is completed, the data are ready for further use in the
model training process. It is important to ensure that the quality of the cleaned data is high,
as even minor errors at this stage can lead to distortion of the results during modelling and
reduce the accuracy of predictions.
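As a hedged illustration of steps 4 and 8, the sketch below combines WordNet synonym augmentation (via nlpaug) with TF-IDF vectorization over keyword-boosted text. The seed keyword list, the particular pre-trained GloVe model loaded through gensim, and the repetition factor used to boost keyword weight are all assumptions for the example; the article does not specify them.

```python
# Illustrative sketch of steps 4 and 8: synonym augmentation and TF-IDF with
# GloVe-based expansion of depression-related keywords.
import nlpaug.augmenter.word as naw          # requires nltk WordNet data to be downloaded
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

EXCLUDED = ["i", "you", "depression"]         # words protected from replacement (step 4)
augmenter = naw.SynonymAug(aug_src="wordnet", aug_p=0.15, stopwords=EXCLUDED)

def augment_text(text: str) -> str:
    """Return one synonym-augmented variant of a transcript fragment (step 4)."""
    out = augmenter.augment(text)
    return out[0] if isinstance(out, list) else out   # newer nlpaug versions return a list

# Step 8: expand each symptom keyword with its 10 nearest GloVe neighbours.
glove = api.load("glove-wiki-gigaword-100")   # assumed pre-trained GloVe model
SEED_KEYWORDS = ["sad", "tired", "hopeless", "worthless"]   # illustrative symptom words
expanded = set(SEED_KEYWORDS)
for word in SEED_KEYWORDS:
    if word in glove:
        expanded.update(w for w, _ in glove.most_similar(word, topn=10))

def boost_keywords(text: str, repeat: int = 3) -> str:
    """Repeat depression-related words so TF-IDF assigns them a higher weight."""
    boosted = []
    for token in text.split():
        boosted.extend([token] * (repeat if token.lower() in expanded else 1))
    return " ".join(boosted)

# TF-IDF vectorization of keyword-boosted transcripts.
corpus = [boost_keywords(augment_text("i feel sad and tired every day"))]
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(corpus)
```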

3.3. A Multimodal Network for Depression Detection


Multimodal networks are neural architectures that combine different types of data to
obtain deeper analysis and make predictions with increased accuracy [14,15]. In the context
of detecting participants’ depression, a multimodal network is a tool that can combine data
from different formats, such as text interviews and audio recordings. This combination
allows for capturing not only the content of what was said (through textual data), but
also how it was said (through audio), which is especially important in the diagnosis of
depression. In particular, textual data can contain important depressive keywords or
phrases, while audio can provide information about the patient’s emotional state through
tone, speech rate, pauses, etc. An important aspect of multimodal networks is that different
data sources, such as audio and textual data, can interact with each other, providing the
model with more context for decision making.
There are two main approaches to data fusion in multimodal neural networks: early and late fusion. These methods determine at what stage of processing features from different modalities, such as audio and text, are combined.

3.3.1. Early Fusion Model
The architecture of the early fusion model shown in Figure 5 combines audio and textual features at an early stage, allowing for the efficient fusion of different types of data for more accurate depression classification.

Figure 5. Early fusion multimodal network architecture.

Input data:
• Audio: the model is fed with extracted features from the audio, such as Mel spectrograms, MFCC (Mel frequency cepstral coefficients), and spectral contrast. These features reflect the acoustic characteristics of the signal.
• Text: text data are represented as feature vectors extracted using TF-IDF.
The model integrates audio and text features through a concatenation layer, allowing for the combined processing of both data types. A dense layer with 128 neurons and ReLU activation reduces the complexity of the input, capturing nonlinear feature relationships. Next, a convolutional layer extracts important patterns, especially from audio data, with batch normalization helping stabilize the learning process. Dropout layers (30% and 50%) prevent overfitting, while the Bi-LSTM layer with 64 neurons captures sequential dependencies in both directions. A four-headed self-attention mechanism helps the model focus on key aspects of the input. Global average pooling reduces data dimensionality, followed by a final dense layer with 64 neurons, which prepares the data for the binary classification output via the Sigmoid function, determining the presence of depression.
This architecture combines the power of convolutional networks to learn local patterns, Bi-LSTM for contextual sequence processing, and a self-attention mechanism to improve focus on key parts of the data.

3.3.2. Late Fusion Model
The architecture of the late fusion model consists of two separate branches for processing audio and text features, which are combined at a later stage (Figure 6). This makes it possible to first extract specific features from each modality (audio and text), which improves the quality of processing for each data type, and then combine these features for joint analysis.

Figure 6. Late fusion multimodal network architecture.

The audio data are processed using a series of convolutional layers (CNNs) to extract local features, followed by Bi-LSTM to capture long-term dependencies in the sequence. The initial convolutional layers apply filters to detect patterns in the audio, with batch normalization and dropout layers to maintain stability and prevent overfitting. The Bi-LSTM layer, with 128 units, processes the sequential nature of audio data, extracting temporal relationships from different segments. A final dense layer further reduces the dimensionality of extracted features, preparing them for combination with text features.
For text data, a Bi-LSTM layer with 64 units is used to capture context in both forward and backward directions. The self-attention mechanism then focuses on the most relevant parts of the text, making it easier for the model to identify crucial indicators of depression. Global average pooling reduces the number of features, retaining the essential information for classification.
After separate processing, the audio and text features are combined using a concatenation layer in a late fusion stage. The resulting feature set is then passed through dense layers with ReLU activation to refine the combined features. A final output layer with a Sigmoid activation function is used to classify the combined representation as indicating the presence or absence of depression.
The chosen architecture makes it possible to first maximize the use of specific audio and text features and then combine them efficiently, which can contribute to better overall model performance.
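The following tf.keras sketch is a hedged illustration of the two fusion strategies described above. It follows the layer sizes given in the text (128- and 64-unit dense layers, 64- and 128-unit Bi-LSTMs, four-headed self-attention, 30% and 50% dropout, Sigmoid outputs), while the input shapes, kernel sizes, and the way fused vectors are reshaped into sequences are assumptions made to keep the example runnable, not the authors' exact configuration.

```python
# Minimal sketch of the early and late fusion architectures (assumed input shapes).
import tensorflow as tf
from tensorflow.keras import layers, Model

AUDIO_SHAPE = (1875, 84)   # assumed: ~1-min segment frames x stacked audio features
TEXT_SHAPE = (512,)        # assumed: reduced TF-IDF vector size for illustration

def early_fusion_model() -> Model:
    audio_in = layers.Input(shape=AUDIO_SHAPE, name="audio")
    text_in = layers.Input(shape=TEXT_SHAPE, name="text")
    # Early fusion: concatenate raw feature vectors before any shared processing.
    fused = layers.Concatenate()([layers.Flatten()(audio_in), text_in])
    x = layers.Dense(128, activation="relu")(fused)
    x = layers.Reshape((128, 1))(x)                 # treat the joint vector as a short sequence
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)   # self-attention
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return Model([audio_in, text_in], out)

def late_fusion_model() -> Model:
    # Audio branch: CNN + Bi-LSTM.
    audio_in = layers.Input(shape=AUDIO_SHAPE, name="audio")
    a = layers.Conv1D(64, 3, padding="same", activation="relu")(audio_in)
    a = layers.BatchNormalization()(a)
    a = layers.Dropout(0.3)(a)
    a = layers.Bidirectional(layers.LSTM(128))(a)
    a = layers.Dense(64, activation="relu")(a)
    # Text branch: Bi-LSTM + self-attention over the TF-IDF vector treated as a sequence.
    text_in = layers.Input(shape=TEXT_SHAPE, name="text")
    t = layers.Reshape((TEXT_SHAPE[0], 1))(text_in)
    t = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(t)
    t = layers.MultiHeadAttention(num_heads=4, key_dim=32)(t, t)
    t = layers.GlobalAveragePooling1D()(t)
    # Late fusion: concatenate the branch outputs just before the classifier.
    fused = layers.Concatenate()([a, t])
    x = layers.Dense(64, activation="relu")(fused)
    out = layers.Dense(1, activation="sigmoid")(x)
    return Model([audio_in, text_in], out)
```

In both sketches the Concatenate layer is the fusion point: in the early fusion model it appears before any shared processing, whereas in the late fusion model it joins the per-modality branch outputs only at the classification stage.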
4. Results
To implement multimodal data fusion models, the Google Colab environment was used, which is a powerful tool for conducting experiments and training neural networks. In this work, a T4 GPU with 15 GB of GPU RAM was used to train the models. Python 3.8 was chosen as the programming language for the project. The following libraries were used in the process: NumPy, Pandas, nltk (Natural Language Toolkit), nlpaug, noisereduce, pydub, librosa, matplotlib, seaborn, scikit-learn, tensorflow, keras.

4.1. Performance of a Multimodal Early Fusion Network
The early fusion model was trained using the binary_crossentropy loss function for binary classification (depression/non-depression) and the accuracy metric for evaluation of the model. The model was trained over 25 epochs. In addition, to ensure a stable learning process, the Adam optimizer was used with a predefined learning rate set at 0.00001.
The plots (Figure 7) show the accuracy and loss of the model on the training and validation data over the training epochs.
• Graph of accuracy: The first graph shows how the accuracy of the model increases over the epochs. The training set has a steady increase in accuracy to over 95%. The validation sample also shows a high level of accuracy, exceeding 90%, which indicates a good generalization of the model.
• Graph of losses: The second graph shows the reduction in model loss on the training and validation data. There is a significant decrease in loss during the first epochs, which is an indicator that the model is learning well. The loss of the validation data also decreases, indicating that there is no overfitting.

Figure 7. Training progress plots for the multimodal early fusion model.
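A minimal sketch of this training configuration follows; the batch size and the names of the preprocessed input arrays are assumptions, and the model constructor is assumed to be built as in the architecture sketch shown earlier.

```python
# Training configuration for the early fusion model as described above:
# binary cross-entropy loss, accuracy metric, Adam with learning rate 1e-5, 25 epochs.
import tensorflow as tf

model = early_fusion_model()   # assumed constructor from the earlier architecture sketch
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(
    [X_audio_train, X_text_train], y_train,                 # assumed preprocessed arrays
    validation_data=([X_audio_val, X_text_val], y_val),
    epochs=25,
    batch_size=32,                                           # assumed batch size
)
```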

Considering the confusion matrix for the validation data (Figure 8), we can see that the model correctly predicted 138 samples from class 0 (without depression) and made 10 false predictions. For class 1 (with depression), the model correctly classified 135 samples, and nine samples were classified incorrectly.

Figure 8. Confusion matrix for validation data (early fusion model).
Table 1 shows the obtained values of the metrics for evaluating the early fusion model performance on the validation data. The model has a precision of 0.94 for class 0 and 0.93 for class 1, which means that the model makes few errors among its predictions. For both classes, the model demonstrates high recall values (0.93 and 0.94), which shows that most samples in each class were correctly classified. The high f1-score values for both classes (0.94 and 0.93) indicate a balanced model that performs well in both precision and recall. The area under the curve (AUC) score of 0.90 indicates good discriminative power, and the Matthews correlation coefficient (MCC) of 0.87 indicates a strong correlation between predictions and true labels.

Table 1. Report on early fusion model performance metrics on validation data.

Metrics      Non-Depression    Depression
Precision    0.94              0.93
Recall       0.93              0.94
F1-score     0.94              0.93
Accuracy     0.93
AUC          0.90
MCC          0.87
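For reference, metrics of this kind can be computed from the validation predictions with scikit-learn; the array and model names below are placeholders rather than the authors' code.

```python
# Computing classification metrics, AUC, and MCC from validation predictions (placeholder inputs).
from sklearn.metrics import classification_report, roc_auc_score, matthews_corrcoef

y_prob = model.predict([X_audio_val, X_text_val]).ravel()   # assumed validation inputs
y_pred = (y_prob >= 0.5).astype(int)

print(classification_report(y_val, y_pred, target_names=["non-depression", "depression"]))
print("AUC:", roc_auc_score(y_val, y_prob))
print("MCC:", matthews_corrcoef(y_val, y_pred))
```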
In general, the model demonstrates high classification accuracy for both classes, which indicates its effectiveness in solving this task.

4.2. Performance of a Multimodal Late Fusion Network
The late fusion model was trained for 50 epochs. The Adam optimizer was used to train the model with the learning rate set to 0.01, which was configured to train faster than the previous early fusion model. In addition, the binary_crossentropy loss function, which is widely used for binary classification, was applied. The accuracy metric was used to evaluate the model's performance.
Figure 9 shows the improvement in accuracy and reduction in error over 50 epochs of training.
• Accuracy: The accuracy graph shows an increase for both training and validation data, although there are fluctuations, especially in the validation sample, which may indicate some problems in generalizing the model. The final accuracy of the model for the validation data is about 85%.
• Loss: The loss plot shows a gradual decrease for both training and validation data, which is a positive signal. However, again, there are some fluctuations, which may indicate the complexity of the data or the insufficient number of epochs for full stabilization.

Figure 9. Training progress plots for the multimodal late fusion model.
Thus, both graphs show that the model has gradually learned to distinguish between the depression classes, and although there are some fluctuations, the overall trend indicates good progress and satisfactory accuracy.
At the same time, the training results of the late fusion model are somewhat worse compared to the early fusion model. Although the validation accuracy of the late fusion model reaches about 85%, it is inferior to the accuracy of the early fusion model, which has shown much more stable results.
The confusion matrix (Figure 10) shows the distribution of model results on the validation sample. The model correctly classified 130 samples into class 0 (no depression) and 118 samples into class 1 (depression). However, there were errors: 18 samples in class 0 were misclassified as 1, and 26 samples in class 1 were misclassified as 0.

Figure 10. Confusion matrix for validation data (late fusion model).

These results indicate


These results indicate some
some difficulties
difficulties in
in correctly
correctly classifying
classifying depression,
depression, but but overall
overall
the model showed good accuracy in detecting
the model showed good accuracy in detecting both classes. both classes.
Analyzing
Analyzing the the metrics
metrics report
report (Table
(Table 2),
2), we
we can
can note
note the
the following:
following:
For class 0 (without depression), the precision value is 0.83, which indicates a slightly higher number of false positive predictions. In class 1 (with depression), the precision value is higher (0.87), i.e., the model predicts depression somewhat better, without too many false positive predictions. The recall value for class 0 is 0.88, which means that the model recognizes most patients without depression quite well. For class 1, the recall value is slightly lower at 0.82, indicating that there are some cases of depression that the model could not detect. For class 0, the f1-score value is 0.86, and for class 1, it is 0.84. This means that the model performs slightly better in detecting the absence of depression than its presence. The overall accuracy of the model is 0.85, which is a fairly good result given the complexity of the depression classification task. The area under the curve (AUC) score of 0.78 indicates moderate discriminative power, and the Matthews correlation coefficient (MCC) of 0.70 indicates a strong correlation between the predicted and actual classes.

Table 2. Report on late fusion model performance metrics on validation data.

                        Class
Metrics          Non-Depression    Depression
Precision        0.83              0.87
Recall           0.88              0.82
F1-score         0.86              0.84
Accuracy         0.85
AUC              0.78
MCC              0.70
In general, the model performs well, although there is a slight difference in accuracy
between the two classes.
Thus, comparing the training results of the two developed models (early and late fusion), we can see that the early fusion model shows better results in terms of metrics. It demonstrates higher precision, recall, and f1-score, which indicates that it is more effective in detecting depression.

4.3. Interpretation of the Results of the Best Model
In order to interpret the results of the best model, the multimodal early fusion network, it is necessary to conduct testing on a test dataset. Since the data of each participant were divided into segments, the results for each participant must be summarized to obtain a single answer regarding their status (presence or absence of depression).
The process of summarizing the results for each participant is performed using the method of aggregating segment probabilities. In this case, the average value of the segment probabilities for each participant is used. This allows us to determine a single forecast for the participant based on individual segments. If the average probability exceeds the threshold of 0.5, the conclusion is made that depression is present. Two functions are used for this purpose (a code sketch illustrating them is given at the end of this subsection):
1. Aggregate results function: for each participant, the average probability value is calculated based on the forecasts for each segment.
2. Model evaluation function: after prediction on the segments of the test sample, the results are aggregated to obtain a prediction at the participant level, after which the model is evaluated by metrics such as accuracy, confusion matrix, and classification report.
This approach allows us to obtain more generalized results and correctly evaluate the model’s performance at the participant level rather than at the individual segment level.
The confusion matrix (Figure 11) of the model testing on unknown data shows that the model correctly classified 26 participants without depression and 11 participants with depression. At the same time, the model made four errors when participants without depression were classified as having depression and two errors when participants with depression were classified as healthy.

Figure 11. Confusion matrix for test data.

Based on the classification report (Table 3) from testing on unknown data, it can be seen that the model achieved an overall accuracy of 85%. Furthermore, the area under the curve (AUC) score of 0.73 highlights moderate discriminative power, suggesting that the model can distinguish between the two classes better than random guessing but leaves room for improvement. The Matthews correlation coefficient (MCC) of 0.68 reflects a
substantial level of correlation between the predicted and actual classifications, providing a
robust evaluation of model performance even in cases of imbalanced datasets.

Table 3. Report on model performance metrics on test data.

                        Class
Metrics          Non-Depression    Depression
Precision        0.83              0.87
Recall           0.88              0.82
F1-score         0.86              0.84
Accuracy         0.85
AUC              0.73
MCC              0.68

The results by class demonstrate the following metrics:


For class 0 (no depression):
• Precision: 0.93, which means that the model does a good job of correctly classifying
examples without depression.
• Recall: 0.87, which means that the model detects most examples of this class.
• F1-score: 0.90, which indicates a good balance between precision and recall.
For class 1 (with depression):
• Precision: 0.73, which indicates some errors in predicting this class (possibly due to
the smaller amount of data).
• Recall: 0.85, the model is good at detecting examples of depression.
• F1-score: 0.79, which still indicates relatively high efficiency.
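As a quick check, these f1-scores follow directly from the reported precision and recall values via f1 = 2PR/(P + R): for class 0, 2(0.93)(0.87)/(0.93 + 0.87) ≈ 0.90, and for class 1, 2(0.73)(0.85)/(0.73 + 0.85) ≈ 0.79.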
Thus, the model demonstrates stable results with high accuracy for both classes,
although the metrics for class 1 are slightly lower, which may reflect the difficulty of classifying depression given the smaller amount of data for this class.
Nevertheless, the overall performance of the model on the test data is quite high, indicating
its potential in detecting depression based on multimodal information (audio and text).
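To make the participant-level aggregation described at the beginning of this subsection concrete, the following is a minimal Python sketch of the two functions. It is not the authors' original code; names such as segment_probs, participant_ids, and participant_labels are assumptions introduced for illustration.

import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, matthews_corrcoef, roc_auc_score)

def aggregate_results(segment_probs, participant_ids):
    # Group segment-level probabilities by participant and average them.
    per_participant = {}
    for pid, prob in zip(participant_ids, segment_probs):
        per_participant.setdefault(pid, []).append(prob)
    pids = sorted(per_participant)
    mean_probs = np.array([np.mean(per_participant[pid]) for pid in pids])
    return pids, mean_probs

def evaluate_model(segment_probs, participant_ids, participant_labels, threshold=0.5):
    # Aggregate segment predictions to the participant level and report metrics.
    pids, mean_probs = aggregate_results(segment_probs, participant_ids)
    y_true = np.array([participant_labels[pid] for pid in pids])
    y_pred = (mean_probs > threshold).astype(int)  # 1 = depression when the mean probability exceeds the threshold
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=["non-depression", "depression"]))
    print("AUC:", roc_auc_score(y_true, mean_probs))
    print("MCC:", matthews_corrcoef(y_true, y_pred))

In this sketch, a participant is labelled as depressed when the mean probability over their segments exceeds the 0.5 threshold, matching the decision rule described above.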

5. Discussion
The experimental results of this research have demonstrated the potential of multi-
modal networks for depression detection. Specifically, the early fusion model has outper-
formed the late fusion model in terms of multiple evaluation metrics, including accuracy,
precision, recall, and f1-score. The early fusion model achieved an f1-score of 0.93 on the
validation set and 0.79 on the test set, compared to the late fusion model’s f1-score of 0.84
on validation data.
One of the key reasons for the superior performance of the early fusion model is the
immediate integration of audio and text data at the feature extraction stage. By combining
these modalities early, the model can learn joint feature representations, allowing it to
capture correlations between the audio and text modalities that might be missed when
processing them independently. This integration leads to a more holistic understanding
of the data, where both verbal (text) and non-verbal (audio) cues contribute to depression
detection. In addition, an important advantage of the early fusion model is the smaller
number of parameters, which reduces computational costs and training time, making it
more optimal for practical implementation.
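As an illustration of this early integration, the following is a minimal Keras sketch of an early fusion head, assuming fixed-length audio and text feature sequences; the input shapes and layer sizes are illustrative only, and the attention blocks used in the full model are omitted for brevity.

from tensorflow.keras import layers, Model

audio_in = layers.Input(shape=(300, 40), name="audio_features")   # e.g., frame-level acoustic features (assumed shape)
text_in = layers.Input(shape=(200, 128), name="text_embeddings")  # e.g., token embeddings (assumed shape)

# Early fusion: each modality is projected to a common feature size and the
# sequences are concatenated before the joint CNN/Bi-LSTM layers, so
# cross-modal patterns can be learned jointly.
audio_feat = layers.Conv1D(64, 3, padding="same", activation="relu")(audio_in)
text_feat = layers.Conv1D(64, 3, padding="same", activation="relu")(text_in)
joint = layers.Concatenate(axis=1)([audio_feat, text_feat])
joint = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(joint)
joint = layers.GlobalMaxPooling1D()(joint)
output = layers.Dense(1, activation="sigmoid")(joint)

early_fusion_model = Model([audio_in, text_in], output)
early_fusion_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])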
However, the early fusion approach also has certain limitations. The early fusion
model demands more sophisticated handling of data preprocessing and input alignment
since both modalities must be preprocessed and fed into the network simultaneously.
On the other hand, the late fusion model separates audio and text processing, allowing
for more specialized feature extraction for each modality. This modular approach offers
flexibility in learning modality-specific patterns but at the cost of not leveraging cross-
modality correlations as effectively as early fusion. While late fusion ensures that audio
and text features are treated independently, it loses valuable information that comes from
integrating them earlier in the process.
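For comparison, a late fusion variant under the same assumptions keeps a separate feature extractor per modality and merges the modality-specific representations only before the final classifier:

from tensorflow.keras import layers, Model

def modality_branch(inputs):
    # Each modality gets its own CNN + Bi-LSTM feature extractor.
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)
    return layers.GlobalMaxPooling1D()(x)

audio_in = layers.Input(shape=(300, 40), name="audio_features")   # assumed shape
text_in = layers.Input(shape=(200, 128), name="text_embeddings")  # assumed shape

# Late fusion: modality-specific representations are merged only at the end.
merged = layers.Concatenate()([modality_branch(audio_in), modality_branch(text_in)])
output = layers.Dense(1, activation="sigmoid")(merged)

late_fusion_model = Model([audio_in, text_in], output)
late_fusion_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])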
The practical implementation of depression detection models, especially those that use
multimodal approaches such as early and late fusion, must take into account a number of
ethical and privacy issues. In practice, the implementation of these models requires robust
safeguards to ensure patient confidentiality and avoid stigmatization.
When working with medical data, it is also important to pay attention to the imple-
mentation of secure data collection and storage protocols, anonymization of patient data,
and compliance with regulations such as GDPR or HIPAA. Additionally, it is worth noting
that the model results serve an advisory function and do not constitute a full-fledged diagnosis; they require additional validation by specialists (doctors) to prevent errors or misinterpretations that may increase stigma.
Further improvements to the model can be made in several areas. First, additional
methods of data augmentation can be considered. This may help to reduce the problem
of limited data size. Secondly, to improve the model’s performance, more sophisticated
self-attention mechanisms, such as transformer or attention-based models, which can better
account for the interaction between text and audio data, should be integrated. We could also
consider using more advanced audio processing techniques, such as calculating additional
features that would better convey the emotional state of the speaker.

6. Conclusions
This research has demonstrated the effectiveness of multimodal networks for depres-
sion detection, specifically through the development and comparison of early and late
fusion models. The early fusion model, which integrates audio and text data at the feature
extraction stage, outperformed the late fusion model with higher accuracy and f1-score,
showcasing the advantage of early cross-modal interaction. Both models were designed
using advanced neural network techniques, including CNNs, Bi-LSTM layers, and attention
mechanisms, to capture relevant patterns from the audio and text modalities.
The key findings of this research highlight the importance of combining multiple
modalities to improve the robustness and accuracy of depression detection. The early
fusion model achieved an f1-score of 0.93 on validation data, while the late fusion model
scored 0.84. On the test set, the best-performing model (early fusion) achieved an f1-score
of 0.79 and an accuracy of 0.86.
The implications of this study extend beyond the immediate technical contributions:
1. Clinical Applications: The developed model provides a foundation for automated
depression screening systems. By identifying depressive signs from speech and text
data, it has the potential to assist clinicians in diagnosing depressive disorders early
and more efficiently. This can be particularly valuable in resource-constrained settings
where mental health professionals are limited.
2. Early Intervention: The system could empower individuals to monitor their psycho-
emotional states, encouraging them to seek help earlier when depressive symptoms
emerge. This aligns with public health goals of early intervention to mitigate the
progression of mental health disorders.
3. Integration into Broader Systems: Beyond clinical use, the model can be integrated
into wellness applications, virtual assistants, or workplace mental health monitoring
systems, fostering a more proactive approach to mental health care.
Future work could focus on expanding the dataset size to include more diverse
populations and exploring different fusion architectures. The developed model can serve as
a basis for further development of automated depression screening systems that can detect
signs of depressive disorders in patients based on their speech and voice, helping doctors
and people themselves diagnose the disease at early stages. In addition, the system can be
used as part of integrated solutions for monitoring the psycho-emotional state of people.

Author Contributions: Conceptualization, M.N. and N.M.; methodology, O.B.; software, M.N.; vali-
dation, M.N. and N.S.; formal analysis, O.B.; investigation, M.N.; resources, O.B.; data curation, N.M.;
writing—original draft preparation, M.N.; writing—review and editing, O.B.; visualization, M.N.;
supervision, N.S.; project administration, N.S. All authors have read and agreed to the published
version of the manuscript.

Funding: This research received no external funding.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: Dataset available on request from https://dcapswoz.ict.usc.edu/ (accessed on 28 November 2024).

Acknowledgments: The authors would like to thank the Armed Forces of Ukraine for providing the
security to perform this work. This work was only possible because of the resilience and courage
of the Ukrainian Army. The third author would like to acknowledge the financial support from the
British Academy for this research (RaR\100727).

Conflicts of Interest: The authors declare no conflicts of interest.


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
