
Wiley

Applied Computational Intelligence and Soft Computing


Volume 2024, Article ID 8729440, 11 pages
https://doi.org/10.1155/2024/8729440

Research Article
Design and Implement Deepfake Video Detection Using VGG-16
and Long Short-Term Memory

Laor Boongasame,1,2 Jindaphon Boonpluk,2 Sunisa Soponmanee,2 Jirapond Muangprathub,3 and Karanrat Thammarak4

1 Business Innovation and Investment Laboratory (B2I-Lab), Faculty of Science, King Mongkut's Institute of Technology Ladkrabang, Bangkok 10520, Thailand
2 Department of Mathematics, Faculty of Science, King Mongkut's Institute of Technology Ladkrabang, Bangkok 10520, Thailand
3 Faculty of Science and Industrial Technology, Prince of Songkla University, Surat Thani Campus, Surat Thani 84000, Thailand
4 Department of Computer Engineering and Electronics, School of Engineering and Technology, Walailak University, Nakhon Si Thammarat 80160, Thailand

Correspondence should be addressed to Karanrat Thammarak; [email protected]

Received 24 February 2023; Revised 17 May 2024; Accepted 13 June 2024

Academic Editor: Youcef Belkhier

Copyright © 2024 Laor Boongasame et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This study aims to design and implement deepfake video detection using VGG-16 in combination with long short-term memory (LSTM). In contrast to other studies, this study compares VGG-16, VGG-19, and the newest model, ResNet-101, each combined with LSTM. All the models were tested using the Celeb-DF video dataset. The results showed that the VGG-16 model trained for 15 epochs with a batch size of 32 exhibited the highest performance, with 96.25% accuracy, 93.04% recall, 99.20% specificity, and 99.07% precision. In conclusion, this model can be implemented practically.

1. Introduction

Artificial intelligence (AI) and machine learning (ML) technologies have grown exponentially. In recent years, they have become more accessible and easier to use, indicating their continued expansion and growth. AI is widely applied in various fields, including medicine [1], education [2], business [3], industry [4, 5], and social media [6]. However, the risk of criminal exploitation increases as the capabilities and applications of AI technology expand. Caldwell et al. [7] conducted a study to identify the potential criminal applications of AI and related technologies, which resulted in a taxonomy comprising approximately 18 threat categories. One of the most concerning emerging hazards is the spread of deepfake technologies.

Deepfakes [8, 9] use AI and ML techniques to seamlessly merge and overlay existing footage onto original images or videos, thereby creating nearly indistinguishable, deceptively realistic content. Recent technological advancements have simplified the creation of deepfake videos that replicate human faces [10] and mimic speech, movement patterns [11], and tones of voice with remarkable accuracy [12]. Consequently, these manipulated videos have entered social media platforms that play an ever-growing role in daily communication and information sharing. Although certain deepfake content is created for entertainment purposes [13], the potential for malicious use is considerable. Misuses of deepfakes include impersonation scams, such as criminals posing as law enforcement officers during video calls to deceive victims into transferring money [14]. In addition, a Europol report [15] highlighted the potential illicit applications of deepfake technology, including CEO fraud, evidence tampering, and the creation of nonconsensual pornography, underscoring the urgent requirement for effective detection and prevention measures.

Techniques such as generative adversarial networks (GANs), convolutional neural networks (CNNs), and common fake feature networks (CFFNs) are widely used to detect deepfakes. The complexity of video data compounds the difficulty of detecting deepfakes in real time and with high accuracy in video streaming. This study examined the complex problem of detecting deepfakes in the video formats that are frequently encountered on social media platforms. Transfer learning models and long short-term memory (LSTM) networks are well suited for addressing the temporal aspects of video data. Accordingly, LSTM in combination with VGG-16, VGG-19, and ResNet-101 was evaluated on the Celeb-DF video dataset. This study supports the continued effectiveness of detection techniques by fostering innovation against diverse manipulation models in the ongoing battle against emerging deepfake threats. The following sections present a detailed account of the methodology used in the study, showing how the combination of LSTM and established transfer learning techniques can significantly enhance the accuracy and efficiency of deepfake detection. This study emphasizes its relevance in the ever-changing landscape of detecting synthesized video content, underscoring its commitment to staying ahead of the curve in countering deepfake proliferation.

The remainder of the paper is organized as follows: Section 2 presents a literature review. Section 3 presents the research methodology, including the operational process, dataset, data analysis, and model. Section 4 presents the results of the proposed method and a discussion. The last section presents conclusions and suggestions for future research.

2. Background and Related Works

2.1. Deepfake Detection. Deepfake [16] is derived from two terms: deep learning and fake. Deep learning is a branch of AI that learns deeply the features of a specific person, such as face, skin tone, body shape, gestures, movement, speech, and tone of voice [17]. Deepfake refers to the falsification of the identity of a person by AI, such as editing the face of a celebrity into a pornographic film, video spoofing in which people speak differently from the actual video, or realistic sound spoofing. People who are unfamiliar with these technologies cannot discern them with the naked eye. The procedure for creating a deepfake begins with an encoder learning to encode the properties of each image, such as facial expressions, facial shape, skin tone, and lighting. The encoder output is then presented to a decoder network, where the decoder learns to reconstruct the image by extracting the essential features. After training, the face is swapped, and a new image is created using the properties of the original image. This technique is repeated for each frame to replicate a video.

Deepfake detection addresses this overlaying and merging of images or videos from one batch onto another. Nguyen et al. [18] divided deepfake detection methods into two categories and four subcategories. The first category is fake image detection, which includes handcrafted and deep features-based technique subcategories such as generative adversarial networks (GANs), convolutional neural networks (CNNs), and common fake feature networks (CFFNs). The second category is fake video detection, which is divided into two subcategories: temporal features across video frames and visual artifacts within video frames. Based on temporal features across video frames, this group includes CNNs, long short-term memory (LSTM), and long-term recurrent convolutional networks (LRCNs); most of these methods use the spatiotemporal features of video streams, analyzing the video frame-by-frame to detect deepfakes. The visual artifacts within video frames subgroup focuses on approaches that decompose videos into frames and explore visual artifacts within single frames to obtain discriminant features, which are then passed to a deep or shallow classifier to differentiate between fake and authentic videos.

2.2. Transfer Learning. Transfer learning [19] is a machine learning technique that uses a trained CNN as a structure and provides new information for teaching. Currently, transfer learning is typically applied to image classification and prediction, such as classifying texts and flower types. In addition, the detection of blinking eyes, smiling, and facial distortion can use different models such as VGG [20], ResNet [21], and Inception-V3 [22]. Each model exhibits different performance depending on the analyzed dataset. Moreover, only a few researchers have utilized transfer learning in their experiments, and no research has compared newer transfer learning formats such as VGG-16, VGG-19 [23], and ResNet-101.

The VGG-16 neural network was developed by researchers in the Visual Geometry Group at Oxford [24]. VGG-16 was among the best-performing architectures in the ILSVRC 2014 challenge, with a top-5 classification error of 7.32%. VGG-16 is a transfer learning method of worldwide interest. In most use cases, VGG-16 is used for image recognition or classification [25], image detection and localization [26, 27], and image embedding vectors [28]. The input of the VGG-16 neural network is an image with dimensions of (224, 224, 3). A max pooling layer follows each of the first through fifth groups of convolutional layers. The endpoint is then sent to the fully connected layers with 4,096 nodes each, and the last layer is assigned 1,000 nodes. Results are calculated using the softmax activation function.

The VGG-19 neural network [29] has a structure similar to that of VGG-16; it consists of 19 layers, irrespective of the top layer. The 19-layer network is divided into five groups of convolutional layers, and a max pooling layer follows each of the first through fifth groups. The endpoint is then sent to the fully connected layers of 4,096 nodes, with the final layer set to 1,000 nodes, calculating the result value using the softmax activation function. The main difference between VGG-16 and VGG-19 lies in the number of layers, where VGG-19 has more layers. It is trained using more than 1,000,000 images and can group images into more than 1,000 classes, such as people, animals, and objects.

The residual network (ResNet) is an artificial neural network launched in 2015 by He et al. [30]. ResNet was created to solve the vanishing gradient problem. This method adds a shortcut to the neural network to bypass layers. Without such shortcuts, certain parameters are reduced toward zero as the model learns by repeatedly updating them; in particular, the parameters in the first layers, closest to the input layer, receive a loss-function gradient that changes only slightly at each update, so the model is not as efficient as expected. The most widely accepted models in the ResNet family are ResNet-34, ResNet-50, and ResNet-101. ResNet-101 is a CNN that is 101 layers deep. It can classify images into 1,000 object categories, such as dogs, cats, and pencils. The network has an image input size of 224-by-224.
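To make the reuse of a pretrained backbone concrete, the following is a minimal Keras sketch (an illustration assuming a TensorFlow/Keras environment, not the exact configuration used in this study) that loads VGG-16 without its 1,000-class head and uses it as a frozen feature extractor; VGG19 and ResNet101 from tensorflow.keras.applications can be swapped in the same way:

```python
# Minimal sketch: reuse a pretrained VGG-16 as a frozen feature extractor.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# include_top=False drops the 4,096-node fully connected layers and the
# 1,000-class softmax, keeping only the convolutional feature hierarchy.
backbone = VGG16(weights="imagenet", include_top=False, pooling="max",
                 input_shape=(224, 224, 3))
backbone.trainable = False  # transfer learning: freeze the pretrained weights

# One dummy RGB image standing in for a cropped face frame.
frame = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0
features = backbone.predict(preprocess_input(frame))
print(features.shape)  # (1, 512): a feature vector for a downstream classifier
```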
2.3. Long Short-Term Memory (LSTM). LSTM [31, 32] is a particular recurrent neural network (RNN) that is capable of learning long-term dependencies. LSTM is well suited for classifying, processing, and predicting based on time series data. Moreover, it is more efficient because it contains a standard unit comprising a cell, an input gate, an output gate, and a forget gate. A memory cell state can declare which information is essential to remember or eliminate. Within an LSTM network, various functions are used to create a particular behavior, such as data reading, data entry, data updating, and information loss. These functions work similarly to gates for routing incoming data in the desired direction, so LSTM can delete or add data to a cell state in a carefully controlled way. In a cell of an LSTM network, the forget gate is used to decide whether the cell should forget the information from the previous timestamp. The equation for the forget gate is

f_t = \sigma(X_t \times U_t + H_{t-1} \times W_t),   (1)

where X_t is the input at the current timestamp, U_t is the weight associated with the input, H_{t-1} is the hidden state of the previous timestamp, and W_t is the weight matrix associated with the hidden state. If f_t = 0, then C_{t-1} \times f_t = 0 and the network forgets everything; if f_t = 1, then C_{t-1} \times f_t = C_{t-1} and the network forgets nothing.

The other gates are the input and output gates. The input gate is used to quantify the importance of the new information carried by the input. The output gate predicts the current output from the hidden state of the long-term memory. The formulas for these gates are

i_t = \sigma(X_t \times U_i + H_{t-1} \times W_i),   (2)

o_t = \sigma(X_t \times U_o + H_{t-1} \times W_o).   (3)
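To make the gate equations concrete, the following is a minimal NumPy sketch of a single LSTM step. The candidate and cell-state updates are the standard LSTM formulation, added here only to complete the step; all dimensions and weights are arbitrary:

```python
# Minimal sketch of a single LSTM step, following the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3                      # input and hidden sizes (arbitrary)
x_t = rng.normal(size=d_in)             # X_t: input at the current timestamp
h_prev = np.zeros(d_hid)                # H_{t-1}: previous hidden state
c_prev = np.zeros(d_hid)                # C_{t-1}: previous cell state

# One (U, W) weight pair per gate, matching the structure of the equations.
U = {g: rng.normal(size=(d_hid, d_in)) for g in "fio"}
W = {g: rng.normal(size=(d_hid, d_hid)) for g in "fio"}

f_t = sigmoid(U["f"] @ x_t + W["f"] @ h_prev)   # forget gate, eq. (1)
i_t = sigmoid(U["i"] @ x_t + W["i"] @ h_prev)   # input gate, eq. (2)
o_t = sigmoid(U["o"] @ x_t + W["o"] @ h_prev)   # output gate, eq. (3)

c_tilde = np.tanh(rng.normal(size=(d_hid, d_in)) @ x_t)  # candidate cell state
c_t = f_t * c_prev + i_t * c_tilde   # f_t = 0 forgets C_{t-1}; f_t = 1 keeps it
h_t = o_t * np.tanh(c_t)             # new hidden state
print(h_t)
```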
2.4. Literature Review. Deepfake detection can be accomplished using deep learning methods such as RNNs, CNNs, and LSTM. In addition, many studies have combined techniques to enhance the accuracy of the model. For example, Stanciu and Ionescu [33] applied facial features and a CNN-LSTM model to implement deepfake video detection. Güera and Delp [34] used an RNN to classify videos subjected to manipulation, using various transfer learning techniques [23, 35] for better detection of fake faces. Table 1 lists the research on deepfakes. It can be classified into CNNs and RNNs, and CNNs can be divided into several types, such as VGG and ResNet. Performance testing uses standard metrics such as accuracy rate, precision, recall, F1-score, and AUC.

According to Table 1, related research has shown that transfer learning techniques can be used in several ways. For example, Li and Lyu [27] applied VGG-16, ResNet-50, ResNet-101, and ResNet-152 to detect face warping artifacts. Chang et al. [36] used an LSTM and NA-VGG to detect fake faces in videos. In 2021, Rosli et al. [42] presented a comparison of multitransfer learning models for a deepfake image recognizer using VGG-19 and ResNet-50. Notably, the use of deep learning techniques in combination with different transfer learning techniques achieves a high accuracy of 80-100% that varies depending on the dataset used. The commonly used datasets include Celeb-DF [48, 49], DFDC [8], and FaceForensics++ [49]. Current research often combines CNN and RNN techniques. For example, Kaur et al. [39] compared several methods, such as face-warping features + CNN, CNN + GRU, and CNN + RNN. Güera and Delp [34] applied a CNN and RNN to the HOHA dataset with a high accuracy of 97.1%. Elpeltagy et al. [46] proposed CQT, XceptionNet, and GRU for detecting deepfakes on the FakeAVCeleb multimodal video dataset with 97.52% accuracy.

Spatiotemporal convolutional networks that combine spatial information with interframe temporal information also have different uses. For example, Zhang et al. [43] used fixed-length frame volumes collected from a video to feed into a three-dimensional CNN (3DCNN) to detect whether faces were real or fake. A 3D-attentional inception network was suggested by Lu et al. [44] as a method for deepfake detection. De Lima et al. [45] created a performance benchmark for spatiotemporal convolutional methods using the Celeb-DF dataset that included RCN, R3D, MC3, R2Plus1D, and I3D; this study demonstrated that R3D outperformed the other networks. Moreover, Ismail et al. [47] proposed the combination of CNN, HOG, XceptionNet, and GRUs with Celeb-DF and FaceForensics++ (c23); the resulting AUROC was 95.53%.

3. Methods

Figure 1 presents an overview of the operation, beginning with providing the video data, consisting of real and fake videos, to the data preparation phase. In this phase, the video is converted into sequenced image frames. Then, each frame is cropped to show only the face. Subsequently, the prepared data are entered into the model by dividing them into two sets: train and test. The result is a prediction of whether the video is real or fake.

Table 1: Methodology and performance of related works on deepfake detection.

Method | Dataset | Performance | Reference
VGG network based on noise and image augmentation (NA-VGG) | Celeb-DF | AUC = 85.7% | Chang et al. [36]
XceptionNet and VGG-16 | DFFD, UADFV, and Celeb-DF | AUC = 99.67% | Dang et al. [26]
Deep dual-level network | Celeb-DF (v2) and FaceForensics++ | Accuracy = 96.5%; AUC = 98.9% | Pu et al. [37]
CNNs such as VGG-16, ResNet-50, ResNet-101, and ResNet-152 | UADFV and DeepfakeTIMIT | AUC = 97.4% | Li and Lyu [27]
CNN and RNN | HOHA | Accuracy = 97.1% | Güera and Delp [34]
LSTM and Inception-ResNet-V2 | FaceForensics and Celeb-DF | Accuracy = 91.48% | Yadav et al. [38]
Face warping features + CNN and CNN + RNN | FaceForensics++ and Celeb-DF | AUC = 97.06% | Stanciu and Ionescu [33]
Face warping features + CNN, CNN + RNN, and CNN + LSTM | Celeb-DF, DFDC ground truth, DF-TIMIT, and UADFV | Accuracy = 96.46% | Kaur et al. [39]
CNN + LSTM | Celeb-DF, DFDC preview, and FaceForensics++ | AUC = 99.00% | Rowan and Pears [40]
YOLO + CNN + XGBoost | Celeb-DF-FaceForensics++ (c23) | AUC = 90.62%; Accuracy = 90.73% | Ismail et al. [41]
VGG-19 and ResNet-50 | The first image forensic challenge in 2013 | VGG-19: accuracy = 94.65%; ResNet-50: accuracy = 95.08% | Rosli et al. [42]
TD-3DCNN | Celeb-DF | AUC = 88.83%; Accuracy = 81.08% | Zhang et al. [43]
I3D att unifAug | Celeb-DF | AUC = 82.78% | Lu et al. [44]
RCN, R2Plus1D, I3D, MC3, and R3D | Celeb-DF | RCN: AUC = 74.87%, accuracy = 76.25%; R2Plus1D: AUC = 99.43%, accuracy = 98.07%; I3D: AUC = 97.59%, accuracy = 92.28%; MC3: AUC = 99.30%, accuracy = 97.49%; R3D: AUC = 99.73%, accuracy = 98.26% | De Lima et al. [45]
CQT XceptionNet and GRU | FakeAVCeleb multimodal video dataset | Accuracy = 97.52%; AUROC = 98.43%; Precision = 97.66%; Recall = 98.04%; F1-score = 97.66%; Specificity = 97.30% | Elpeltagy et al. [46]
CNN + HOG + XceptionNet + GRUs | Celeb-DF-FaceForensics++ (c23) | AUROC = 95.53%; Running time (h) = 13.31 | Ismail et al. [47]

[Figure 1: Operation overview. Input phase: real and fake videos. Preparing phase: converting the video to frames, face cropping, and separating the dataset into train and test sets. Modeling phase: transfer learning + LSTM. Output phase: real or fake prediction.]
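As a rough outline, the flow in Figure 1 can be sketched with placeholder data; the real frames come from the conversion and cropping steps described in Sections 3.1-3.3, and the array sizes here are deliberately small stand-ins:

```python
# Minimal runnable sketch of the Figure 1 flow, with small random arrays
# standing in for face crops (the study used ten 128 x 128 crops per video).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Input phase: 240 videos (119 real, 121 fake), each reduced to ten face
# crops; tiny 16 x 16 stand-ins keep this example lightweight.
faces = rng.random((240, 10, 16, 16, 3), dtype=np.float32)
labels = np.array([1] * 119 + [0] * 121)   # 1 = real, 0 = fake

# Preparing phase: separate the dataset into train and test sets. An 80/20
# video split reproduces the paper's 1,920/480 image split (192/48 videos).
x_train, x_test, y_train, y_test = train_test_split(
    faces, labels, test_size=0.2, random_state=42, stratify=labels)

# Modeling phase: a transfer learning + LSTM model is fit on x_train and
# evaluated on x_test; the output phase is a real/fake prediction per video.
print(x_train.shape, x_test.shape)   # (192, 10, 16, 16, 3) (48, 10, 16, 16, 3)
```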

3.1. Dataset and Data Analysis. This study compiled a collection of Celeb-DF video data available at https://paperswithcode.com/dataset/celeb-df. The Celeb-DF video dataset is a large dataset containing both original and deepfake videos. This study assembled 240 videos divided into two types: (1) 119 original videos and (2) 121 synthesized videos. The video-to-image sequencing and face detection process provided a visual dataset of 2,400 images, including real and fake images. These image collections were then divided into 1,920 training and 480 test images.

To determine the efficacy of the model, we used a confusion matrix, a fundamental and indispensable instrument for machine learning and classification tasks. The primary purpose of this investigation was to determine crucial performance metrics, such as accuracy, recall (completeness), specificity, and precision. The matrix is shown in Figure 2. Thus,

True positive (TP): the model predicts that the image is real (1), and the actual label is real (1).
True negative (TN): the model predicts that the image is fake (0), and the actual label is fake (0).
False positive (FP): the model predicts that the image is real (1), and the actual label is fake (0).
False negative (FN): the model predicts that the image is fake (0), and the actual label is real (1).

[Figure 2: Confusion matrix configuration values. Rows are the true label (0 = fake, 1 = real) and columns are the predicted label, giving the quadrants TN, FP, FN, and TP.]
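From these four counts, the metrics used in this study follow directly. A minimal sketch of the standard formulas under the label convention above:

```python
# Standard metrics from the confusion-matrix counts defined above
# (real = positive class = 1, fake = negative class = 0).
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):          # completeness: share of real images found
    return tp / (tp + fn)

def specificity(tn, fp):     # share of fake images correctly rejected
    return tn / (tn + fp)

def precision(tp, fp):       # share of "real" predictions that are correct
    return tp / (tp + fp)
```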
3.2. Implementation Environment. In this study, Google Colab was used to implement the aforementioned methods, using a Turing-based Tesla T4 GPU as the backend of the Google Compute Engine with Python 3.10. Two folders were created to hold the real and fake datasets. The RAM used was 12.7 GB, and the total disk capacity was 78.2 GB. In addition, a preprocessing file was created to convert the videos and crop the faces. Three files were created to implement the models: LSTM with VGG-16, LSTM with VGG-19, and LSTM with ResNet-101.
pooling layer was selected as the input vector for the
3.3. VDO Converting and Face Cropping. The video was converted into a sequence of images to detect only the facial area. This study sampled one image frame every 0.5 s, with a limit of ten image frames per video. Figure 3 shows a preview video converted into ten frames. The next step is face cropping: the images captured from video sequencing were used to detect faces, and each image was cropped to retain only the face area, preparing it for the next step. The face was cropped using the Python MTCNN library. Figure 4 shows an example of image cropping using the MTCNN library in Python. After video conversion and face cropping, 2,400 images were entered into the model, consisting of 1,920 training and 480 test images.

[Figure 3: Example of converting VDO to frames (ten frames sampled from one video).]

[Figure 4: Example of image cropping using the MTCNN library in Python (full frame on the left, cropped face region on the right).]
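A minimal sketch of this preparation step, assuming the opencv-python and mtcnn packages are installed (the paper confirms MTCNN is used; the exact sampling code and the file path here are illustrative):

```python
# Minimal sketch: sample one frame every 0.5 s (up to ten per video) and crop
# the first detected face with MTCNN, as described in Section 3.3.
import cv2
from mtcnn import MTCNN

def extract_face_crops(video_path, interval_s=0.5, max_frames=10):
    detector = MTCNN()
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(fps * interval_s), 1)   # frames to skip between samples
    crops, idx = [], 0
    while len(crops) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MTCNN expects RGB
            faces = detector.detect_faces(rgb)
            if faces:                      # keep only frames with a detected face
                x, y, w, h = faces[0]["box"]
                x, y = max(x, 0), max(y, 0)
                crops.append(cv2.resize(rgb[y:y + h, x:x + w], (128, 128)))
        idx += 1
    cap.release()
    return crops  # list of 128 x 128 RGB face crops

# Example call (illustrative path):
# crops = extract_face_crops("celeb_df/real/0001.mp4")
```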

[Figure 5: Combination flow between transfer learning models and LSTM. Frame features X1, X2, X3, ..., Xi from VGG-16, VGG-19, or ResNet-101 feed an LSTM producing hidden states h1, h2, h3, ..., ht; the final state passes through ReLU, dropout, and softmax to output 1 (real) or 0 (fake).]

3.4. Implementing Models. In this study, the VGG-16, VGG-19, and ResNet-101 models were applied in combination with LSTM to analyze deepfake videos. The objective is to examine the performance of each model and propose modifications for future deepfake videos. Owing to the limited input size of each model, all entered photographs were resized to 128 × 128 dimensions. The entered data were sent to the feature extraction network, where the features were extracted; the extracted features would then be categorized using a classification network. Here, each of these transfer learning models was used only to generate the input feature vectors for the LSTM model: the classification network was eliminated, and the output vector from the final max-pooling layer was selected as the input vector for the LSTM. The structure of the LSTM model used in this study is shown in Figure 5.

In Figure 5, X_i represents the i-th time step of the time series. The model used time steps set to a value of 10 and contained a binary classification task with two classes: Class 1 representing the real class and Class 0 representing the fake class. The number of hidden units per state was selected such that the model could effectively capture complex shapes while avoiding overfitting; the selected number of hidden units was 1,024. The final hidden state, h_t, was then fed into the ReLU function to include a nonlinear relationship between the characteristics of the hidden state. Prior to the fully connected layer, a dropout layer with a dropout rate of 0.1 was used for regularization purposes. Weight decay was not used in this layer because it may hinder the ability of the model to identify time-dependent patterns. The fully connected layer consisted of two output neurons. These neurons were then subjected to a softmax function that assigns a probability to each class in the output layer. Table 2 presents the parameter configurations used to implement the model.
4. Experiment Results and Discussion

The model performance evaluation was divided into two phases: training and testing. The training phase benchmarked the models using LSTM in combination with transfer learning methods, including VGG-16, VGG-19, and ResNet-101, to select the most effective model for detecting fake faces in the subsequent testing phase. This phase defined three evaluation factors: accuracy, loss function, and learning time. The testing phase validated the selected model against the test dataset and used accuracy, precision, recall, and specificity as the evaluation parameters. During the training phase, model performance was validated by determining its accuracy, loss function, and learning time to compare the performance of each model, as listed in Table 3.

Table 2: Initial parameters for implementing the model.

Parameter | Value
Time steps | 10
Number of classes (real = 1, fake = 0) | 2
Input dimension | 128, 128, 3
Dropout rate | 0.1
Recurrent dropout | 0.1
Hidden layer | Dense(1024, activation = ReLU)
Epoch | 15
Batch size | 32
Output layer | Dense(number of classes, activation = softmax)
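As an illustration, the following is a minimal Keras sketch wired according to Figure 5 and Table 2. It is a plausible reconstruction rather than the authors' released code: the optimizer and loss are assumptions (the paper does not state them), and the exact way frame features enter the LSTM may differ in the original implementation.

```python
# Minimal sketch of the VGG-16 + LSTM configuration from Figure 5 / Table 2.
# TimeDistributed applies the frozen VGG-16 feature extractor to each of the
# 10 frames, and the resulting feature sequence feeds the LSTM.
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import VGG16

backbone = VGG16(weights="imagenet", include_top=False, pooling="max",
                 input_shape=(128, 128, 3))   # classification head removed
backbone.trainable = False

frames = Input(shape=(10, 128, 128, 3))        # 10 time steps of face crops
x = layers.TimeDistributed(backbone)(frames)   # (batch, 10, 512) features
x = layers.LSTM(1024, recurrent_dropout=0.1)(x)  # 1,024 hidden units
x = layers.Activation("relu")(x)
x = layers.Dropout(0.1)(x)                     # dropout 0.1, no weight decay
outputs = layers.Dense(2, activation="softmax")(x)  # real = 1, fake = 0

model = Model(frames, outputs)
# Optimizer and loss are assumptions; the paper does not specify them.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would then follow the Table 2 settings, e.g.:
# model.fit(x_train, y_train, epochs=15, batch_size=32)
```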

Table 3: Results of model performance.

Transfer learning | Epoch | Batch size | Accuracy (%) | Loss function (%) | Learning time (min)
VGG-16 + LSTM | 15 | 32 | 97.01 | 8.60 | 4.22
VGG-16 + LSTM | 15 | 64 | 95.38 | 11.65 | 8.20
VGG-19 + LSTM | 15 | 32 | 96.29 | 9.88 | 9.45
VGG-19 + LSTM | 15 | 64 | 93.23 | 17.09 | 4.37
ResNet-101 + LSTM | 15 | 32 | 42.46 | 81.32 | 4.35
ResNet-101 + LSTM | 15 | 64 | 76.24 | 50.17 | 4.13
ResNet-101 + LSTM | 50 | 32 | 89.32 | 30.97 | 13.53
ResNet-101 + LSTM | 50 | 64 | 88.54 | 30.86 | 25.13
ResNet-101 + LSTM | 100 | 32 | 92.45 | 19.54 | 27.08
ResNet-101 + LSTM | 100 | 64 | 91.02 | 23.15 | 25.42

According to Table 3, the VGG-16 combined LSTM technique trained with 15 epochs and a batch size of 32 had the highest accuracy of 97.01%, a loss function of 8.60%, and a learning time of 4.22 min. The accuracy results for VGG-16 and VGG-19 were similar in both situations; however, when trained with a batch size of 64 and 15 epochs, the models were slightly less accurate than in the previous training (95.38% and 93.23%, respectively). Evidently, a batch size of 32 and 15 epochs were suitable parameter values for these models. However, ResNet-101 had a lower accuracy (42.46%) compared with the other models when trained for 15 epochs with a batch size of 32. Indeed, the accuracy increased when the batch size was increased to 64, yielding an accuracy rate of 76.24%.

Considering the noteworthy enhancement in accuracy observed in the model employing LSTM with ResNet-101 as the batch size increased from 32 to 64, the model was further examined by adjusting the epoch values to 50 and 100 for the specified batch sizes. The experimental results indicated that increasing the number of epochs to 50 and 100 improved the accuracy of the model. When trained with 100 epochs and a batch size of 32, the model yielded its most accurate results at 92.45%; however, when trained with 100 epochs and a batch size of 64, the accuracy decreased. This model demonstrated strong performance when applied to larger batch sizes and a greater number of epochs, whereas the performance of the VGG-16 and VGG-19 models was satisfactory with small batch sizes and a reduced number of epochs. Nonetheless, the LSTM with ResNet-101 exhibited a lower accuracy than the LSTM with VGG-16 and VGG-19, and the model employing LSTM in conjunction with VGG-16 required a considerably shorter training time, 4.22 min versus 27.08 min, as presented in Table 3.

The effectiveness of the proposed model was evaluated by combining the LSTM with the VGG-16 architecture for specific hyperparameters (15 epochs and a batch size of 32). The model was evaluated using a test dataset comprising 480 images obtained from the Celeb-DF dataset. The performance of the model was assessed using four key parameters, accuracy, precision, recall, and specificity, to determine its efficacy. The actual and predicted values for each quadrant of the confusion matrix are listed in Table 4.

The results showed that this model could effectively detect fake faces with a high accuracy of 96.25%, recall of 93.04%, specificity of 99.20%, and precision of 99.07%, as presented in Table 5.

Table 4: Confusion matrix generated using the VGG-16 + LSTM model with a batch size of 32 and 15 epochs.

| Predicted fake | Predicted real
Actual fake | 248 (TN) | 2 (FP)
Actual real | 16 (FN) | 214 (TP)
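As a quick check, the Table 4 counts reproduce the Table 5 metrics using the formulas from Section 3.1:

```python
# Verifying Table 5 from the Table 4 confusion matrix.
tn, fp, fn, tp = 248, 2, 16, 214

print(f"accuracy    = {(tp + tn) / (tp + tn + fp + fn):.4f}")  # 0.9625 -> 96.25%
print(f"recall      = {tp / (tp + fn):.4f}")                   # 0.9304 -> 93.04%
print(f"specificity = {tn / (tn + fp):.4f}")                   # 0.9920 -> 99.20%
print(f"precision   = {tp / (tp + fp):.4f}")                   # 0.9907 -> 99.07%
```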

Table 5: Evaluation results of the VGG-16 + LSTM model with a batch size of 32 and 15 epochs.

Model | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%) | Loss function (%)
VGG-16 + LSTM | 96.25 | 93.04 | 99.20 | 99.07 | 6.54

Table 6: Performance results of the proposed method and reviewed articles using the same Celeb-DF dataset.

Models | Accuracy (%) | Precision | Recall
C-LSTM [39] | 96.46 | 99.62 | 93.91
Deep dual-level networks [37] | 96.50 | 95.3 | 98.3
LSTM and Inception-ResNet [38] | 91.48 | N/A | N/A
CNN + VGG-19 [42] | 94.65 | N/A | N/A
CNN + ResNet-50 [42] | 95.08 | N/A | N/A
YOLO + CNN + XGBoost [41] | 90.73 | 0.87 | 0.85
TD-3DCNN [43] | 81.08 | 68.98 | 99.41
RCN [45] | 76.25 | N/A | N/A
R2Plus1D [45] | 98.07 | N/A | N/A
I3D [45] | 92.28 | N/A | N/A
MC3 [45] | 97.49 | N/A | N/A
R3D [45] | 98.26 | N/A | N/A
LSTM + VGG-16 (proposed) | 97.01 | 99.07% | 93.04%

The performance values of the method proposed in this article are shown in bold in the original layout to set them apart from the performance values of the other articles.

4.1. Discussion of Performance Compared to Relevant Research. This section focuses on seven studies that used similar techniques and examines their results, methods, and related research. In addition, performance was assessed using the same public dataset, Celeb-DF, as listed in Table 6.

The results showed that the proposed method had a higher accuracy than other comparable models tested on the same standard dataset. Deep dual-level networks [37] had the nearest accuracy at the video level at 96.5%; however, this decreased slightly to 94.8% at the frame level. Kaur et al. [39] proposed a C-LSTM model that combines a CNN and LSTM; its accuracy rate was 96.46% with a learning time of 56.91 min, which was 52.69 min more than that of the proposed method. In addition, the proposed method had a higher accuracy than the models reported by Rosli et al. [42] and Ismail et al. [41], which used a CNN in combination with other techniques such as XGBoost, VGG-16, and ResNet-50.

The predictive capabilities of various models were compared using recall and precision values. The proposed model had recall and precision values of 93.04% and 99.07%, respectively. Compared with those in the study by Kaur et al. [39], these values surpassed those of the competing models. In conclusion, the proposed model had relatively high prediction accuracy for effectively differentiating between real and fake faces.

Yadav et al. [38] presented LSTM and Inception-ResNet models that achieved 84.75% and 91.48% accuracy for 20 and 40 epochs, respectively. This is consistent with the test results presented in Table 3, where LSTM + ResNet-101 provides a higher accuracy when the epoch count is configured at a high rate. The proposed method also outperformed other methods, such as TD-3DCNN [43], RCN [45], and I3D [45], in terms of accuracy.

However, this model required a long learning time, and its accuracy was slightly reduced when tested with actual detection. The model might be effective when extended to develop various applications, particularly those running on devices with lightweight processors such as mobile and embedded systems.

5. Conclusions

This study used LSTM with transfer learning models, including VGG-16, VGG-19, and ResNet-101, to determine a suitable deepfake detection model. Notably, the VGG-16 model had 97.01% accuracy and an 8.60% loss function, with a total learning time of 4.22 min; its accuracy was higher than those of the other models, and it required the least time to learn. In addition, this study implemented LSTM with the VGG-16 model to detect fake faces using a standard dataset. The result showed that the model achieved 96.25% accuracy, 93.04% recall, 99.20% specificity, and 99.07% precision. In conclusion, this model can be practically implemented. In the future, this study will focus on reducing the learning time and increasing the detection accuracy by adding high-quality datasets to train the model. This model will continue to significantly improve cybersecurity and crime prevention applications to support a sustainable life.

Data Availability

The study compiled a collection of Celeb-DF video datasets available at https://paperswithcode.com/dataset/celeb-df. Previously reported deepfake VDO data were used to support this study and are available at https://paperswithcode.com/dataset/celeb-df.

Disclosure

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

An article publishing grant was provided by King Mongkut's Institute of Technology Ladkrabang and Walailak University, Nakhon Si Thammarat, Thailand.

References

[1] J. Stewart, P. Sprivulis, and G. Dwivedi, "Artificial intelligence and machine learning in emergency medicine," Emergency Medicine Australasia, vol. 30, no. 6, pp. 870–874, 2018.
[2] V. Kuleto, M. Ilić, M. Dumangiu et al., "Exploring opportunities and challenges of artificial intelligence and machine learning in higher education institutions," Sustainability, vol. 13, no. 18, pp. 10424–10516, 2021.
[3] M. Woschank, E. Rauch, and H. Zsifkovits, "A review of further directions for artificial intelligence, machine learning, and deep learning in smart logistics," Sustainability, vol. 12, no. 9, pp. 3760–3823, 2020.
[4] R. Cioffi, M. Travaglioni, G. Piscitelli, A. Petrillo, and F. De Felice, "Artificial intelligence and machine learning applications in smart production: progress, trends, and directions," Sustainability, vol. 12, no. 2, pp. 492–526, 2020.
[5] S. Aslam, H. Herodotou, S. M. Mohsin, N. Javaid, N. Ashraf, and S. Aslam, "A survey on deep learning methods for power load and renewable energy forecasting in smart microgrids," Renewable and Sustainable Energy Reviews, vol. 144, 2021.
[6] M. N. Sadiku, T. J. Ashaolu, A. Ajayi-Majebi, and S. M. Musa, "Artificial intelligence in social media," International Journal of Scientific Advances, vol. 2, no. 1, pp. 15–20, 2021.
[7] M. Caldwell, J. T. Andrews, T. Tanay, and L. D. Griffin, "AI-enabled future crime," Crime Science, vol. 9, pp. 14–13, 2020.
[8] B. Dolhansky, J. Bitton, B. Pflaum et al., "The DeepFake Detection Challenge (DFDC) dataset," 2020, https://arxiv.org/abs/2006.07397.
[9] A. O. Kwok and S. G. Koh, "Deepfake: a social construction of technology perspective," Current Issues in Tourism, vol. 24, no. 13, pp. 1798–1802, 2021.
[10] S. Das, S. Seferbekov, A. Datta, M. S. Islam, and M. R. Amin, "Towards solving the deepfake problem: an analysis on improving deepfake detection using dynamic face augmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3776–3785, Virtual, October 2021.
[11] M. Li, B. Liu, Y. Hu, and Y. Wang, "Exposing deepfake videos by tracking eye movements," in Proceedings of the 25th International Conference on Pattern Recognition (ICPR), pp. 5184–5189, Milan, Italy, January 2021.
[12] T. Chen, A. Kumar, P. Nagarsheth, G. Sivaraman, and E. Khoury, "Generalization of audio deepfake detection," in Proceedings of Odyssey 2020: The Speaker and Language Recognition Workshop, pp. 132–137, Tokyo, Japan, November 2020.
[13] The Guardian, "'I don't want to upset people': Tom Cruise deepfake creator speaks out," 2022, https://www.theguardian.com/technology/2021/mar/05/how-started-tom-cruise-deepfake-tiktok-videos.
[14] Security Systems, "New scam: call center gangs use deepfake to create fake cops that can move their faces," 2022, https://www.securitysystems.in.th/2022/03/criminal-use-deepfake-create-fake-police/.
[15] Europol, "Europol Innovation Lab: facing reality? Law enforcement and the challenge of deepfakes," 2022, https://www.europol.europa.eu/cms/sites/default/files/documents/Europol_Innovation_Lab_Facing_Reality_Law_Enforcement_And_The_Challenge_Of_Deepfakes.pdf.
[16] M. Westerlund, "The emergence of deepfake technology: a review," Technology Innovation Management Review, vol. 9, no. 11, pp. 39–52, 2019.
[17] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[18] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen et al., "Deep learning for deepfakes creation and detection: a survey," Computer Vision and Image Understanding, vol. 223, pp. 103525–103614, 2022.
[19] N. Agarwal, A. Sondhi, K. Chopra, and G. Singh, "Transfer learning: survey and classification," Smart Innovations in Communication and Computational Sciences, vol. 145, pp. 145–155, 2020.
[20] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, "Going deeper in spiking neural networks: VGG and residual architectures," Frontiers in Neuroscience, vol. 13, p. 95, 2019.
[21] Z. Wu, C. Shen, and A. Van Den Hengel, "Wider or deeper: revisiting the ResNet model for visual recognition," Pattern Recognition, vol. 90, pp. 119–133, 2019.
[22] X. Xia, C. Xu, and B. Nan, "Inception-v3 for flower classification," in Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), pp. 783–787, Chengdu, China, April 2017.
[23] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, pp. 9–40, 2016.
[24] K. Gopalakrishnan, S. K. Khaitan, A. Choudhary, and A. Agrawal, "Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection," Construction and Building Materials, vol. 157, pp. 322–330, 2017.
[25] H. Wang, "Garbage recognition and classification system based on convolutional neural network VGG16," in Proceedings of the 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), pp. 252–255, Shenzhen, China, April 2020.
[26] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain, "On the detection of digital face manipulation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5781–5790, Seattle, WA, USA, June 2020.
[27] Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping artifacts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 46–52, Long Beach, CA, USA, June 2019.
[28] C. Alippi, S. Disabato, and M. Roveri, "Moving convolutional neural networks to embedded systems: the AlexNet and VGG-16 case," in Proceedings of the 17th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp. 212–223, Porto, Portugal, April 2018.
[29] L. Wen, X. Li, X. Li, and L. Gao, "A new transfer learning based on VGG-19 network for fault diagnosis," in Proceedings of the 23rd International Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 205–209, Porto, Portugal, May 2019.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, Nevada, July 2016.
[31] J. Brownlee, Long Short-Term Memory Networks with Python: Develop Sequence Prediction Models with Deep Learning, Machine Learning Mastery, 2017.
[32] K. Smagulova and A. P. James, "Overview of long short-term memory neural networks," in Deep Learning Classifiers with Memristive Networks: Theory and Applications, Springer, Cham, 2020.
[33] D. C. Stanciu and B. Ionescu, "Deepfake video detection with facial features and long-short term memory deep networks," in Proceedings of the 2021 International Symposium on Signals, Circuits and Systems (ISSCS), pp. 1–4, Iasi, Romania, July 2021.
[34] D. Güera and E. J. Delp, "Deepfake video detection using recurrent neural networks," in Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, Auckland, New Zealand, November 2018.
[35] H. Qassim, A. Verma, and D. Feinzimer, "Compressed residual-VGG16 CNN model for big data places image recognition," in Proceedings of the 8th Annual Computing and Communication Workshop and Conference (CCWC), pp. 169–175, Las Vegas, NV, USA, January 2018.
[36] X. Chang, J. Wu, T. Yang, and G. Feng, "Deepfake face image detection based on improved VGG convolutional neural network," in Proceedings of the 39th Chinese Control Conference (CCC), pp. 7252–7256, Shenyang, China, July 2020.
[37] W. Pu, J. Hu, X. Wang et al., "Learning a deep dual-level network for robust DeepFake detection," Pattern Recognition, vol. 130, pp. 108832–108915, 2022.
[38] P. Yadav, I. Jaswal, J. Maravi, V. Choudhary, and G. Khanna, "DeepFake detection using InceptionResNetV2 and LSTM," in Proceedings of the International Conference on Emerging Technologies: AI, IoT, and CPS for Science Technology Applications, Virtual, September 2021.
[39] S. Kaur, P. Kumar, and P. Kumaraguru, "Deepfakes: temporal sequential analysis to detect face-swapped video clips using convolutional long short-term memory," Journal of Electronic Imaging, vol. 29, no. 03, p. 1, 2020.
[40] W. Rowan and N. Pears, "The effectiveness of temporal dependency in deepfake video detection," 2022, https://arxiv.org/abs/2205.06684.
[41] A. Ismail, M. Elpeltagy, S. Zaki, and K. Eldahshan, "A new deep learning-based methodology for video deepfake detection using XGBoost," Sensors, vol. 21, pp. 2–15, 2021.
[42] N. A. Rosli, S. N. H. Sheikh Abdullah, A. N. Zamani, A. Ghazvini, N. S. Md Othman, and N. A. A. A. Muarif Tajuddin, "Comparison multi transfer learning models for deep fake image recognizer," in Proceedings of the 2021 3rd International Cyber Resilience Conference (CRC), pp. 1–6, Langkawi Island, Malaysia, January 2021.
[43] D. Zhang, C. Li, F. Lin, D. Zeng, and S. Ge, "Detecting deepfake videos with temporal dropout 3DCNN," in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pp. 1288–1294, Montreal (virtual), August 2021.
[44] C. Lu, B. Liu, W. Zhou, Q. Chu, and N. Yu, "Deepfake video detection using 3D-attentional inception convolutional neural network," in Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), pp. 3572–3576, Anchorage, AL, USA, September 2021.
[45] O. De Lima, S. Franklin, S. Basu, B. Karwoski, and A. George, "Deepfake detection using spatiotemporal convolutional networks," 2021, https://arxiv.org/abs/2006.14749.
[46] M. Elpeltagy, A. Ismail, M. S. Zaki, and K. Eldahshan, "A novel smart deepfake video detection system," International Journal of Advanced Computer Science and Applications, vol. 14, no. 1, pp. 407–419, 2023.
[47] A. Ismail, M. Elpeltagy, M. S. Zaki, and K. Eldahshan, "An integrated spatiotemporal-based methodology for deepfake detection," Neural Computing and Applications, vol. 34, no. 24, pp. 21777–21791, 2022.
[48] M. C. Weerawardana and T. G. I. Fernando, "Deepfakes detection methods: a literature survey," in Proceedings of the 10th International Conference on Information and Automation for Sustainability (ICIAfS), pp. 76–81, Virtual, August 2021.
[49] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: learning to detect manipulated facial images," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1–11, Seoul, Korea (South), October 2019.
