Deepfake Detection: A Comparative Analysis
This paper presents a comprehensive comparative analysis of supervised and self-supervised models for
deepfake detection. We evaluate eight supervised deep learning architectures and two transformer-based
models pre-trained using self-supervised strategies (DINO, CLIP) on four benchmarks (FakeAVCeleb, CelebDF-
V2, DFDC, and FaceForensics++). Our analysis includes intra-dataset and inter-dataset evaluations, examining
the best performing models, generalisation capabilities, and impact of augmentations. We also investigate the
trade-off between model size and performance. Our main goal is to provide insights into the effectiveness of
different deep learning architectures (transformers, CNNs), training strategies (supervised, self-supervised),
and deepfake detection benchmarks. These insights can help guide the development of more accurate and
reliable deepfake detection systems, which are crucial in mitigating the harmful impact of deepfakes on
individuals and society.
Additional Key Words and Phrases: deepfakes; visual content verification; convolutional neural networks;
transformers; video processing.
ACM Reference Format:
Sohail Ahmed Khan and Duc-Tien Dang-Nguyen. 2023. Deepfake Detection: A Comparative Analysis. 1, 1
(August 2023), 28 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Deepfakes, or deepfake media, are digital media that have been generated or modified using deep
learning algorithms. They have gained notoriety in recent years due to their potential to manipulate
and deceive using artificial intelligence (AI) techniques. While deepfakes can be used for harmless
or even humorous purposes, they can also pose a serious threat when used for malicious purposes
such as creating convincing fake media to manipulate public opinion, influence elections, or incite
violence.
The research community has been working on proposing AI-based automated systems to detect
deepfakes. However, one of the major challenges in detecting deepfakes is that the deepfake
generation systems are constantly evolving and improving. With the availability of cheap compute
resources, and open-source software, it is becoming easier (even for people with limited technical
knowledge and expertise) to create realistic deepfakes that are harder to distinguish from the
real content. In addition, deepfake detection and generation resemble a cat-and-mouse game [12]: researchers propose detection tools by exploiting certain shortcomings of the generation systems, and soon after a detection system is released, the generation techniques are reinforced to overcome the exploited vulnerabilities, rendering them undetectable by the previously proposed detectors. For example, in [25] researchers proposed a deepfake detector which used eye blinking as a cue to detect deepfake media (they observed that faces in deepfake
videos do not blink). Soon after the study was released, newer deepfake generation algorithms produced videos with realistic eye blinking, rendering that detection system ineffective.
Another prominent problem of existing deepfake detection algorithms is their lack of generalisation capability. Detection systems perform excellently on deepfakes drawn from the same data distribution as their training data; however, when exposed to deepfakes produced by generation systems other than the one that created the training samples, they fail to achieve similar performance. Numerous studies have proposed novel strategies for building detection systems, yet they share one common weakness: poor generalisation to unseen data.
To address these issues, this study aims to provide insights into the challenge of detecting
deepfake media by comparing multiple deep learning models and deepfake detection benchmarks.
Specifically, we evaluate several different well-known image and video recognition architectures
for their effectiveness in detecting deepfakes. Our primary objective is to identify which of these
models perform well on unseen data as compared to other participating models.
To achieve this, we train all participating models on four deepfake detection datasets, including a
newly released dataset, and evaluate them in both intra-dataset and inter-dataset configurations
(see Figure 1). Additionally, we evaluate the difficulty level of each benchmark and investigate
whether a more challenging benchmark leads to better generalisation performance on unseen
data. To this end, we train participating models on all four datasets twice: first, without any image
augmentations, and then with various image augmentations to improve their performance.
We also analyse self-supervised Vision Transformer (ViT) architectures pre-trained using two
well-known strategies: DINO [5] and CLIP [31]. To study these models, we use self-supervised
ViT-Base models as feature extractors and train a classification head on top of them. It is important
to note that we only train the classification head and freeze the weights of the feature extractors to
avoid backpropagating gradients through them.
Overall, our study aims to answer several questions, such as which model has the highest
generalisation capability on unseen data, which dataset is most challenging for the models to learn,
which dataset enables the models to achieve the best generalisation capability on unseen data, and
which of the participating models and architectures are most successful for detecting deepfakes.
The rest of this paper is organised as follows. In Section 2 we present a brief literature review on the topic of deepfake detection. Section 3 presents the proposed framework. In Section 4 we present the results and a discussion of our findings, and finally Section 5 concludes this study by summarising our analysis and outlining future research directions.
2 LITERATURE REVIEW
In recent years, a large number of research studies on deepfake media detection have been proposed. Most of them employ CNN models to detect deepfake media. The proposed studies also employ different strategies, e.g., novel augmentation techniques, ensemble models, behavioural features, multimodal features, temporal features alongside spatial information, recurrent networks, and transformer models, to detect deepfake images/videos while trying to increase the models' generalisation capabilities. Below we present some well-known, as well as some recently proposed, deepfake detection studies.
In one of the earliest studies on deepfake media detection, Afchar et al. proposed two different
CNN models namely (1) Meso-4, and (2) MesoInception-4 [3]. Both of the proposed CNN networks
comprised a very small number of layers and focused on mesoscopic image details. The authors tested their models on one of the available deepfake detection benchmarks and additionally collected a custom dataset, achieving excellent results on both.
In [33], Sabir et al. proposed to detect deepfake media using a novel recurrent convolutional network. The authors combined a DenseNet CNN with a gated recurrent network to learn temporal features along with spatial features, motivated by detecting inconsistencies between neighbouring frames of a video. They evaluated their model on the widely known FaceForensics++ [32] deepfake detection benchmark, showing promising results.
Rossler et al. in [32] proposed a deepfake detection benchmark, called FaceForensics++. Along
with the benchmark, the authors proposed a simple CNN-based deepfake detection technique using XceptionNet [7]. They trained and evaluated XceptionNet on the FaceForensics++ benchmark and reported excellent performance on the high-quality versions of its four subsets [32]; however, performance dropped when evaluated on low-quality videos.
In [28] Nguyen et al. proposed to employ capsule networks for deepfake detection. The proposed
technique was the first of its kind to employ capsule networks, in contrast to most other techniques at the time, which relied on CNN models. The capsule-network-based detector was evaluated on four different deepfake detection datasets comprising a wide variety of fake videos and images, and the authors reported excellent evaluation statistics in comparison to other deepfake detection techniques.
Ciftci et al. in [8] developed a novel CNN- and SVM-based deepfake media detection model trained on biological signals (i.e., photoplethysmography or PPG signals). The CNN and SVM models make individual predictions, which are then fused together to obtain a final classification score. This deepfake detection model achieved promising results when tested on a number of deepfake detection benchmarks, including the CelebDF [26], FaceForensics, and FaceForensics++ [32] datasets.
Zhu et al. in [47] proposed a deepfake detection system which employed 3D face decomposition
features to detect deepfakes. The authors showed that merging 3D identity texture and direct light features significantly improved detection performance, while also helping the model generalise well on unseen data when evaluated in a cross-dataset setting. In this study the authors also employed the XceptionNet CNN architecture for feature extraction, and both a cropped face image and its associated 3D detail were used to train the detection model. They also carried out an in-depth analysis of several feature fusion strategies. The proposed model was trained on the FaceForensics++ [32] benchmark and evaluated on (1) FaceForensics++, (2) the Google Deepfake Detection dataset, and (3) the DFDC [11] dataset. Promising evaluation statistics were reported for all three datasets, demonstrating the generalisation capability of the model compared to previously proposed deepfake detection systems.
In [23], Khan et al. proposed to employ a transformer architecture for the task of deepfake media detection. The authors proposed a novel video-based model for deepfake detection trained on 3D face features as well as standard cropped face images, and showed that their model was capable of incrementally learning from new data without catastrophically forgetting what it had learned earlier. They evaluated their models on widely used deepfake detection benchmarks, including FaceForensics++, DFDC, and DFD, and achieved excellent results on all of the participating datasets.
The authors of [41] introduce a Multi-modal Multi-scale TRansformer (M2TR), which processes patches of multiple sizes to identify local abnormalities in a given image at multiple spatial levels. M2TR also utilises frequency-domain information along with RGB information, using a sophisticated cross-modality information fusion block to better detect forgery-related artifacts. Through extensive experiments, the authors establish the effectiveness of M2TR and show that their model outperforms SOTA deepfake detection models by acceptable margins.
Fig. 1. The proposed framework. The process involves several steps, starting with the extraction and cropping of face frames from videos, followed by augmentation, normalisation, and resizing. The pre-trained models are then used as feature extractors, with a new classification head (linear layer) added on top for supervised models. During training, the weights of both the feature extractor and the classification head are updated for supervised models, while only the newly added classification head is updated for self-supervised models. The models are evaluated through both intra-dataset and inter-dataset evaluations to test their performance and generalisation capabilities. For image models, the input data is a single cropped face image, while for video models, it is a tensor containing eight consecutive cropped face images from a given video.
The authors of [9] propose a video deepfake detection model employing a hybrid transformer architecture. They used an EfficientNet-B0 as feature extractor, and the extracted features were then used to train two different types of Vision Transformer models, namely (1) Efficient ViT and (2) Convolutional Cross ViT. Through experimentation, the authors established that the model combining the EfficientNet-B0 feature extractor with the Convolutional Cross ViT achieved the best performance among the models they tested.
In [46], an Interpretable Spatial-Temporal Video Transformer (ISTVT) for deepfake detection was
proposed. The proposed model incorporates a novel decomposed spatio-temporal self-attention as well as a self-subtract mechanism to learn forgery-related spatial artifacts and temporal inconsistencies. ISTVT can also visualise the discriminative regions along both spatial and temporal dimensions using the relevance propagation algorithm [46]. Extensive experiments on large-scale datasets show strong performance of ISTVT in both intra-dataset and inter-dataset deepfake detection, establishing the effectiveness and robustness of the proposed model.
This literature review makes it apparent that the research community actively employs deep learning models, along with other techniques, to develop robust and efficient deepfake detectors. However, a careful reading of these studies also reveals that the models perform poorly on unseen data. Moreover, there is a lack of comparative studies that aim to identify which family of deep learning architectures detects deepfakes better than the others, and it remains difficult to judge which datasets equip models with the generalisation capability needed to classify unseen data. To address this, in this study we employ some of the architectures most frequently used in the deepfake detection literature (EfficientNets, Xception, Vision Transformers). We also employ widely known datasets for experimentation and try to find out which datasets offer the best generalisation capabilities to the models. Finally, we analyse somewhat understudied approaches for deepfake detection, i.e., we train and evaluate self-supervised models and compare their performance with supervised models.
3 PROPOSED FRAMEWORK
3.1 Datasets
In this study we train and evaluate several different deep learning models on four deepfake detection
datasets/benchmarks: FakeAVCeleb [22], CelebDF-V2 [26], DFDC [11], and FaceForensics++ [32].
All four datasets comprise real and fake videos, where the fake videos are generated using different deepfake generation methods. Below, we present a brief description of each dataset.
FaceForensics++ [32] is one of the most widely studied deepfake detection benchmarks. It comprises 1000 real video sequences (mostly from YouTube) of mostly frontal faces without occlusions. These real videos were then manipulated using four different face manipulation methods: (1) FaceSwap [2], (2) Deepfakes [1], (3) Face2Face [39], and (4) NeuralTextures [38], resulting in four subsets of 1000 videos each. In total, the dataset contains 5000 videos, i.e., 1000 real and 4000 fake. FaceForensics++ offers three different data qualities: (1) Raw, (2) High-Quality, and (3) Low-Quality. In our study, we experiment with the high-quality videos.
The FaceSwap and Deepfakes subsets contain videos generated using face swapping. As the name suggests, the face of the target person is replaced with the face of the source person, transferring the identity of the source onto the target. The Face2Face and NeuralTextures subsets are generated by a different process called face re-enactment. In contrast to face swapping, face re-enactment transfers the facial expressions of the source onto the target while keeping the original identity of the target face (see Figure 2).
Fig. 2. Model size and its performance (Top-1 accuracy) on ImageNet [10].
The Deepfake Detection Challenge (DFDC) dataset [11] comprises around 128k videos, of which around 104k are fake. Similar to FaceForensics++, the DFDC dataset comprises videos generated using more than one face manipulation algorithm. Five different methods were employed to generate fake videos, namely, (1) Deepfake Autoencoder [11], (2) MM/NN [18], (3) NTH [45], (4) FSGAN [29], and (5) StyleGAN [21]. In addition, a random selection of videos also underwent a simple sharpening post-processing operation, which increases the videos' perceptual quality. Unlike FaceForensics++, the DFDC dataset also contains videos that have undergone audio swapping; however, in this study we do not use audio features to train and evaluate our models.
Since the DFDC dataset is huge and we have limited resources, we use only a subset of it to train and evaluate our models. For training, we use around 19.5K videos (around 16.5K fake and 3.1K real), from which we extract 100k face-cropped images (50k real and 50k fake). We use 20k images as the validation set. For testing, we use 4000 image frames randomly selected from 3.5K videos (3.2K fake and 0.3K real).
CelebDF-V2 [26] contains 5639 fake and 590 real videos. The real videos are collected from YouTube and contain interviews of 59 celebrities with diverse ethnic backgrounds, genders, and age groups. The fake videos in CelebDF-V2 are generated using encoder-decoder models. Post-processing operations are also employed to reduce color mismatch, temporal flickering, and inaccurate face masks.
FakeAVCeleb [22] is the most recently proposed of the four deepfake detection datasets. It contains 19,500 fake and 500 real videos. The dataset also includes an audio modality and manipulates audio as well as video content to generate deepfake videos. For video manipulation, the FaceSwap [24] and FSGAN [29] algorithms are used; for audio manipulation, a real-time voice cloning tool called SV2TTS [19] and Wav2Lip [30] are used. The dataset is divided into four subsets, i.e., (1) FakeVideo/FakeAudio, (2) RealVideo/RealAudio, (3) FakeVideo/RealAudio, and (4) RealVideo/FakeAudio.
In this study, we only employ two of these subsets to train our models, i.e., (1) FakeVideo/FakeAudio and (2) RealVideo/RealAudio.
Table 1. The amount of real/fake images used to train, validate, and test our image models.

Dataset                 Train (Real / Fake)    Validation (Real / Fake)    Test (Real / Fake)
FakeAVCeleb [22]        47,099 / 45,912        9,301 / 9,301               2,000 / 2,000
CelebDF-V2 [26]         50,000 / 50,000        10,000 / 10,000             1,000 / 1,000
DFDC [11]               50,000 / 50,000        10,000 / 10,000             2,000 / 2,000
FaceForensics++ [32]    50,000 / 50,000        10,000 / 10,000             2,000 / 2,000
3.4 Models
We experiment with six image recognition models trained using a supervised strategy: three are CNNs and three are transformer-based models. We also evaluate two variants of transformer models pre-trained using self-supervised strategies, namely (1) DINO [5] and (2) CLIP [31]. Besides the image classification models, we also train and evaluate two video classification models: (1) ResNet-3D [16], a CNN model for video classification, and (2) TimeSformer [4], a transformer model for video classification.
We select models based on their performance on the ImageNet dataset [10], their number of parameters, and, for some models such as Xception [7], their previously reported performance on the task of deepfake detection.
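To make the supervised setup more concrete, the sketch below shows how such backbones could be instantiated with a binary (real/fake) classification head using the timm library. This is an illustrative example rather than our exact training code; the model identifiers are indicative, and the exact names depend on the installed timm version.

```python
import timm
import torch.nn as nn

# Indicative timm identifiers for the supervised image backbones; exact names
# (e.g. "xception" vs "legacy_xception", availability of "mvitv2_base") depend
# on the installed timm version.
BACKBONES = {
    "xception":        "xception",
    "res2net101":      "res2net101_26w_4s",
    "efficientnet_b7": "tf_efficientnet_b7",
    "vit_base":        "vit_base_patch16_224",
    "swin_base":       "swin_base_patch4_window7_224",
    "mvitv2_base":     "mvitv2_base",
}

def build_detector(name: str) -> nn.Module:
    """ImageNet-pretrained backbone with a fresh 2-way (real/fake) head."""
    return timm.create_model(BACKBONES[name], pretrained=True, num_classes=2)

model = build_detector("swin_base")   # trained end to end in the supervised setting
```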
3.4.1 Image Models. Deepfake detection is typically treated as an image classification task, where a deep learning model is trained and evaluated on individual images (image by image). This is in contrast to video-based deepfake detection, where models are trained and evaluated on consecutive video frames to detect the temporal inconsistencies between frames along with spatial cues; image-based detection models focus only on learning the spatial inconsistencies present in the images.
Fig. 3. Visual representation of the models used for analysis in this study. Due to space limitations, only basic,
key concepts for each model are illustrated instead of the whole model. For optimal understanding of the
essential components of each model, we recommend viewing this figure in color and at a higher magnification.
• Xception [7] is a convolutional neural network (CNN) architecture built upon the Inception
architecture [36], but proposes to use depth-wise separable convolutions instead of the traditional
Inception modules. Xception has fewer trainable parameters than some of the other widely used deep CNN models, yet shows comparable performance on the ImageNet benchmark [10]. Due to its smaller parameter count, Xception is less prone to over-fitting and performs fewer computations, resulting in a more efficient model. Depth-wise convolution is depicted in the top left corner of Figure 3. Besides its good performance on ImageNet, Xception has also been shown to achieve excellent results on the deepfake detection task in past studies [32, 47], which motivates us to analyse this architecture in this study as well.
• Res2Net-101 [15] is a convolutional neural network architecture. The main motivation behind
Res2Net is to improve upon the popular ResNet architecture [17] by introducing a new type of
building block called the "Res2Net Block", replacing the traditional bottleneck residual blocks of
ResNet. The Res2Net architecture represents multi-scale features at a granular level and increases
the range of receptive fields of each of the network layers. This results in a more efficient and
powerful network that can achieve better performance on a wide range of computer vision tasks,
such as, image classification, segmentation and object detection [15]. The proposed Res2Net block
can be easily incorporated into other state-of-the-art backbone CNN models, e.g., ResNet[17],
DLA [44], BigLittleNet [6], and ResNeXt [43]. The Res2Net block is illustrated in the top right corner of Figure 3. We employ Res2Net-101 in this study to analyse whether multi-scale CNN features improve deepfake detection performance and, if so, whether they also improve cross-dataset performance (generalisation capability).
• EfficientNet-B7 [37] is a convolutional neural network (CNN) architecture. The main idea behind
the EfficientNet architecture is to increase the efficiency of convolutional neural networks by scaling
the model’s architecture, and parameters in a systematic manner. The authors proposed a new
scaling technique that uniformly scales the depth, width, and resolution using a straightforward yet
highly effective compound coefficient. In simple words, instead of arbitrarily scaling up model width,
depth or resolution, the compound scaling strategy uniformly scales each dimension with a certain
fixed set of scaling coefficients. Using this method the authors proposed seven different models of
various scales [37]. The EfficientNet architecture achieves SoTA performance on a number of image
classification benchmarks while being more computationally efficient than other architectures such
as ResNet and Inception [37]. As with Xception, a variant of the EfficientNet architecture, specifically EfficientNet-B7, has also been shown to perform excellently on the deepfake detection task; the winning solution of the Deepfake Detection Challenge (DFDC) was based on EfficientNet-B7 models [11]. We therefore choose to study this model in this paper.
• Vision Transformer (ViT Base) [40] is a class of neural network architectures based on the
transformer architecture, which was initially designed for natural language processing tasks. In
the context of computer vision, the Vision Transformer or simply ViT was the first transformer
based architecture to be made available for image classification task [13]. It uses the self-attention
mechanism to process visual data. The ViT uses a simple yet powerful approach, which is to divide
the image into small patches and feed them into a transformer model at once. The small patches
are then assigned positional embeddings in order to have an idea of the position of the image patch
in the original image. A classification token is then inserted at the start of this input, which is
then processed by the transformer encoder (similar to the encoder used in text related transformer
models). The model learns to attend to different patches of the image simultaneously when making predictions. By doing so, the network better captures the context and relationships between different parts of the image, achieving performance comparable to SOTA CNN models on ImageNet after pre-training on huge datasets such as ImageNet-21k [10] or JFT-300M [35]. The ViT architecture is presented in Figure 3, second row on the right side. In this study, we train and evaluate the base version of the Vision Transformer (ViT-Base)
model on the task of deepfake detection and compare its performance with other participating
models.
• Swin Transformer (Swin Base) [27] is a class of Vision Transformer models. It generates
hierarchical feature maps by combining image patches in deeper layers. It is computationally
efficient compared to other vision transformer models, as it only performs self-attention within each local window, resulting in computational complexity that is linear in the size of the input image. In contrast, vanilla Vision Transformers produce feature maps of a single low resolution and have computational complexity that is quadratic in the input size, due to global self-attention computation. The Swin Transformer achieves performance comparable to other SoTA image classification models such as the EfficientNets [37]. Besides image classification, Swin Transformers also perform well on tasks such as image segmentation and object detection. Figure 3, third row on the right, illustrates the window generation and attention calculation of the Swin Transformer. Because of the excellent performance Swin Transformers achieve on ImageNet, we use this architecture for the task of deepfake detection and study how it performs compared to the other participating models.
• Multiscale Vision Transformer (MViT-V2 Base) [14] is another class of vision transformer
model. Unlike traditional vision transformers, the MViTs have multiple stages that vary in both
channel capacity and resolution. These stages create a hierarchical pyramid of features, where
initial shallow layers focus on capturing low-level visual information with high spatial resolution,
while deeper layers extract complex, high-dimensional features at a coarser spatial resolution. This
approach allows the network to capture the context and relationships between different parts of the
image in a better way, which results in improved performance on a broad range of computer vision
tasks including image classification, image segmentation. A broad overview of the architecture of
MViT is shown on the left side, second row in Figure 3. Since MViTs are relatively new and achieve
excellent performance on different vision tasks, we employ these in our study to analyse how well
they perform on the task of deepfake detection.
• DINO [5] is a simple self-supervised training method, whose name derives from a form of self-DIstillation with NO labels. The authors adapted self-supervised training to the ViT [13] architecture and compared the resulting models with ViTs trained using supervised strategies. They make the following observations: (1) self-supervised ViT features contain explicit information useful for computer vision tasks such as semantic segmentation, which does not emerge as clearly with supervised ViTs, nor with CNNs; (2) self-supervised ViT features also achieve excellent performance when used as k-NN classifiers, attaining 78.3% top-1 accuracy on ImageNet with a ViT-Small architecture. For more details about the strategy, please see [5]. The DINO training strategy is shown in the bottom right of Figure 3. Inspired by these findings, we also employ a ViT-Base [13] architecture pre-trained using DINO [5]. In our study, we use the ViT-Base as a feature extractor and add a classification head on top. We only train the added classification head on the participating deepfake detection datasets while freezing the weights of the ViT-Base feature extractor, i.e., we do not train the feature extractor, only the classification head.
• Contrastive Language-Image Pre-Training (CLIP) [31] is a neural network that has been
trained on a diverse set of (image, text) pairs in a self-supervised manner. It has the ability to infer
the most suitable text excerpt for a given image using natural language, without explicit supervision
for this task. It exhibits zero-shot capabilities similar to those of GPT-2/GPT-3. In CLIP's original research paper, the authors show that it achieves performance equivalent to the original ResNet50 CNN model [17] when evaluated on ImageNet [10] in a "zero-shot" fashion, i.e., even though CLIP does not use any of the 1.28 million labelled examples from the original dataset, it matches a ResNet50 trained on ImageNet in a supervised manner. CLIP is illustrated in the bottom left corner of Figure 3. For more details on CLIP, we refer readers to [31]. We employ a ViT-Base model trained using CLIP as a feature extractor for our study. Similar to DINO, we add a classification head on top of the CLIP ViT-Base. For our analysis, we only train the classification head and keep the CLIP ViT-Base feature extractor frozen, i.e., we do not update its weights during training.
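As an illustration of this linear-probe setup, the sketch below loads a DINO-pretrained ViT-B/16 from torch.hub, freezes it, and trains only a linear head; the CLIP image encoder could be plugged in analogously (e.g., via the openai clip package). The optimiser and learning rate are placeholder choices, not the exact hyperparameters used in our experiments.

```python
import torch
import torch.nn as nn

# ViT-B/16 pre-trained with DINO, loaded from the official torch.hub entry point.
# The CLIP image encoder can be used analogously (e.g. via the openai `clip` package).
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")
for p in backbone.parameters():
    p.requires_grad = False          # feature extractor stays frozen

head = nn.Linear(768, 2)             # ViT-Base [CLS] dimension -> real / fake

def classify(images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():            # no gradients flow through the backbone
        feats = backbone(images)     # (B, 768) [CLS] features
    return head(feats)

# Only the linear head receives gradient updates; the lr is a placeholder value.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```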
3.4.2 Video Models. We study two different video classification models: (1) ResNet-3D, a CNN-based video classifier, and (2) TimeSformer, a transformer-based video classification model. We evaluate both models in intra-dataset as well as inter-dataset settings on the four well-known deepfake detection benchmarks. We choose to study video-based models in addition to the image-based detection models in order to find out whether temporal information helps in the detection task. Below we briefly describe these models.
• ResNet-3D [16] is based on the same principles as the original ResNet architecture [17], but is specifically designed to work with 3D data, such as videos and volumetric medical images.
These models use 3D convolutions, instead of 2D layers, for feature extraction. In addition to that,
ResNet-3D models generally use a large number of layers, which allows them to learn complex and
abstract features in the data. ResNet-3D models have been utilised for a variety of computer vision
tasks, including video classification, action recognition, and medical image segmentation. They
have been shown to achieve SoTA performance on a number of different benchmarks, however,
it is also worth noting that the ResNet-3D models are computationally costly, and need a large
amount of data to train. For reference, we illustrate both 2D and 3D convolutions in Figure 3, on
the left side of the third row. We choose to employ the ResNet-3D model in our study because (1) it is widely studied for video recognition, (2) pre-trained models are available, and (3) our available compute resources are not sufficient for training bigger video recognition models that use more frames per video. We use a ResNet-3D model pre-trained on 8 frames per video; in contrast, available video classification models are typically trained on 16/32 frames per video and tend to perform better than models trained using 8 frames.
• TimeSformer [4] is a video recognition model based on the transformer architecture. TimeS-
former utilises self-attention over space and time, instead of traditional convolutional layers, or
the spatial attention as employed by ViT for image recognition. The TimeSformer model modifies
the transformer architecture, generally used for image recognition, by directly learning the spatio-
temporal features from a sequence of frame-level patches. This is accomplished by extending the
self-attention mechanism from the image space to the 3D space-time volume. Similar to the Vision
Transformer (ViT) model, the TimeSformer employs linear mapping and positional information
to interpret the ordering of the resulting sequence of features. In the TimeSformer paper [4], the authors experimented with different self-attention schemes; the "divided attention" scheme, which calculates temporal and spatial attention separately within each block, was found to perform better than the other schemes, and we therefore analyse this variant in our study. Divided space-time attention is illustrated in Figure 3, in the middle of the second row. We evaluate TimeSformer on the task of deepfake detection and compare it with the convolutional video classification network, ResNet-3D. We also use the 8-frames-per-video version of the TimeSformer model, the same as for the ResNet-3D model described above.
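To illustrate how the video models consume their input, the sketch below builds an 8-frame clip tensor and passes it through torchvision's r3d_18, used here only as a stand-in for the ResNet-3D family; TimeSformer would consume a clip tensor of the same layout. The spatial resolution and the way pretrained weights are requested are illustrative and depend on the installed torchvision version.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in 3D ResNet (18 layers). Depending on the torchvision version, pretrained
# Kinetics weights can be requested, e.g. via `weights="KINETICS400_V1"`.
model = r3d_18()
model.fc = nn.Linear(model.fc.in_features, 2)   # replace the head: real / fake

# A clip of eight consecutive face crops: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 8, 112, 112)
logits = model(clip)          # -> tensor of shape (1, 2)
print(logits.shape)
```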
3.5 Evaluation Metrics
3.5.1 LogLoss. The binary cross-entropy (LogLoss) is defined as

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \quad (1)
where L is the LogLoss, N is the total number of samples in the dataset, y_i is the true label of the i-th sample, and p_i is the predicted probability for the i-th sample.
It is worth noting that LogLoss is a widely used evaluation metric in machine learning competitions, such as Kaggle competitions, as it reflects not only whether predictions are correct but also how confident they are. We use LogLoss as one of the evaluation metrics in this study because previously proposed deepfake detection studies often report it, which allows us to compare our results with theirs.
3.5.2 Area Under the Curve (AUC). is also a widely known metric used to evaluate classification models. AUC is the two-dimensional area under the Receiver Operating Characteristic (ROC) curve and indicates how well a model separates the two classes; the higher the area under the ROC curve, the better the model discriminates between "real" and "fake" samples in our case. Most recently proposed deepfake detection studies employ AUC as the evaluation metric to study the performance of their models.
Note that the ROC curve is created by varying the decision threshold from 0 to 1, so the AUC provides a summary of the model's performance across all possible thresholds.
3.5.3 Accuracy. is also a widely used metric in the classification domain. The accuracy score measures the proportion of correct predictions made by a model relative to all predictions made. Unlike LogLoss and AUC, accuracy does not indicate how confident a model is in a given classification. The accuracy score is obtained by dividing the number of correct predictions by the total number of predictions:
Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \quad (2)
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
Accuracy is a common evaluation metric in binary classification tasks; however, it can be misleading when the classes (real, fake) are imbalanced or when the costs associated with false positives and false negatives differ. In such cases, other metrics such as F1 score, precision, recall, or AUC may provide a more accurate evaluation of a classifier's performance [34]. In our study, however, since we have a balanced number of samples for the real and fake classes, we can use accuracy as one of the evaluation metrics.
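As a small worked example of the three metrics, the following sketch computes LogLoss, AUC, and accuracy from predicted "fake" probabilities using scikit-learn; the labels and probabilities are dummy values, and a 0.5 decision threshold is assumed for accuracy.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0])                    # 0 = real, 1 = fake
y_prob = np.array([0.10, 0.40, 0.90, 0.80, 0.60, 0.30])  # predicted P(fake)

print("LogLoss :", log_loss(y_true, y_prob))                             # Eq. (1)
print("AUC     :", roc_auc_score(y_true, y_prob))                        # area under ROC
print("Accuracy:", accuracy_score(y_true, (y_prob >= 0.5).astype(int)))  # Eq. (2)
```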
head on top of the self-supervised feature extractors, i.e., DINO and CLIP. For image augmentations, we rely on the imgaug [20] library.
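As an illustration, a typical imgaug pipeline could look like the sketch below. The specific augmenters and parameter ranges shown here are placeholders for illustration and are not necessarily the exact set used in our training runs.

```python
import numpy as np
import imgaug.augmenters as iaa

# Illustrative pipeline; the exact augmenters and parameter ranges used in our
# training runs may differ.
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                           # horizontal flip
    iaa.Sometimes(0.3, iaa.GaussianBlur(sigma=(0.0, 1.5))),
    iaa.Sometimes(0.3, iaa.JpegCompression(compression=(60, 90))),
    iaa.Sometimes(0.3, iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255))),
], random_order=True)

faces = np.random.randint(0, 255, size=(16, 224, 224, 3), dtype=np.uint8)
augmented = augmenter(images=faces)    # returns a batch of augmented uint8 images
```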
Fig. 4. Performance (accuracy) comparison of participating models on all datasets. The reported scores were
achieved in an intra-dataset evaluation. Results in this figure are obtained by (1) evaluating each model separately on each dataset, and (2) averaging the achieved scores, i.e., adding the four accuracy scores and dividing by four.
4 RESULTS
We conducted extensive experiments and evaluated six image deepfake detection models, as well as two video deepfake detection models, on four different benchmarks (the details were discussed in Section 3). In addition, we evaluated two Vision Transformer (ViT-Base) models pre-trained using the self-supervised techniques mentioned in Section 3. We evaluated all models in both intra-dataset and inter-dataset settings. In the following sections, we report the performance of all the participating models in both the intra-dataset setting (trained and evaluated on the same dataset) and the inter-dataset setting (trained on one dataset and evaluated on the remaining datasets, excluding the training dataset).
In this section we distinguish between supervised and self-supervised models. Supervised models refer to the eight models comprising six image models and two video models. Self-supervised models refer to DINO, CLIP, and a supervised ViT-Base (which is used as a frozen feature extractor for comparison with the DINO- and CLIP-based ViT-Base). Supervised models are trained end to end, i.e., the weights of the feature extractor as well as the classification head are updated during training. In the self-supervised setting, including DINO, CLIP, and the supervised ViT-Base, the weights of the feature extractors are kept frozen during training and only the classification head is trained.
DINO and CLIP are also ViT-Base models; the only difference is that they are pre-trained using self-supervised training strategies, whereas the supervised ViT-Base is pre-trained using a supervised strategy. By training a classification head on top of these three models, we aim to find out whether self-supervised features provide better feature representations than supervised features.
Table 2. Intra-dataset comparison of image models. The table below presents scores achieved by image
models when trained and evaluated on FakeAVCeleb [22] dataset. Best results are highlighted in yellow.
FakeAVCeleb
With Augs No Augs
Model LogLoss AUC ACC LogLoss AUC ACC
Xception 0.0047 100.00% 99.93% 0.0040 100.00% 99.85%
Res2Net-101 0.0008 100.00% 99.98% 0.0037 100.00% 99.93%
EfficientNet-B7 0.0132 100.00% 99.63% 0.0047 100.00% 99.83%
ViT 0.2073 99.29% 94.60% 0.3768 98.78% 92.43%
Swin 0.0033 100.00% 99.88% 0.0058 100.00% 99.83%
MViT 0.0008 100.00% 100.00% 0.0023 100.00% 99.95%
ResNet-3D 0.0041 100.00% 100.00% 0.0066 100.00% 100.00%
TimeSformer 0.0796 99.96% 97.50% 0.1238 99.94% 97.00%
4.1 FakeAVCeleb
FakeAVCeleb [22] is a newly proposed deepfake detection dataset containing four different cate-
gories of videos i.e., (1) FakeVideo/FakeAudio, (2) RealVideo/RealAudio, (3) FakeVideo/RealAudio,
and (4) RealVideo/FakeAudio. Since we focus only on visual deepfakes in this study, we do not use the audio tracks (real or fake) for training and evaluating our models. Out of the four subsets of the FakeAVCeleb dataset, we only use two for our experiments, i.e., (1) FakeVideo/FakeAudio and (2) RealVideo/RealAudio.
Table 2 shows that all models perform very well in distinguishing between fake and real faces. All of the participating models achieved close to 99% AUC or higher and very low LogLoss scores when tested in an intra-dataset configuration. The numbers in Table 2 suggest that the FakeAVCeleb dataset is relatively easy, and thus the models can accurately distinguish between real and fake samples.
In Table 10 we report the results achieved by all the models when trained on FakeAVCeleb and evaluated on the remaining three datasets. Looking at the numbers in Table 10, it is apparent that almost all of the models perform poorly on the other datasets. In terms of accuracy, the models are essentially making random guesses; the LogLoss and AUC scores are also unremarkable in the inter-dataset evaluation.
For the self-supervised models, the results are not as good as those of the supervised models, because the self-supervised models are not trained end to end, as mentioned earlier. However, on the FakeAVCeleb dataset, even though only the classification head is trained, DINO and the supervised ViT-Base feature extractor still achieve good performance, with DINO performing significantly better than the other two models, as shown in Table 8. CLIP does not achieve good performance; this might be because CLIP was originally pre-trained on images together with their associated text captions, whereas in this study we use only CLIP's image encoder without any text. We aim to investigate this issue, along with an inter-dataset analysis of the self-supervised models, in future research.
From the results we can infer that the FakeAVCeleb dataset is not challenging for the models to learn: both supervised and self-supervised models find it fairly easy to distinguish between its fake and real samples. In addition, the dataset does not enhance the models' ability to learn distinguishing features that transfer to other data; in other words, it fails to instil generalisation capability into the models, as is apparent from Tables 9 and 10.
Table 3. Intra-dataset comparison of image models. The table below presents scores achieved by image
models when trained and evaluated on CelebDF-V2 [26] dataset.
CelebDF
With Augs No Augs
Model LogLoss AUC ACC LogLoss AUC ACC
Xception 0.0712 99.73% 97.00% 0.0367 99.95% 98.55%
Res2Net-101 0.0237 100.00% 98.95% 0.0185 99.99% 99.45%
EfficientNet-B7 0.0433 99.95% 98.40% 0.0340 99.98% 98.75%
ViT 0.0336 99.96% 98.60% 0.0350 99.95% 98.60%
Swin 0.0340 99.94% 98.80% 0.0202 99.97% 99.40%
MViT 0.0075 100.00% 99.70% 0.0096 100.00% 99.70%
ResNet-3D 0.0748 99.68% 97.00% 0.1525 98.68% 95.00%
TimeSformer 0.0309 100.00% 98.00% 0.0220 99.96% 99.00%
4.2 CelebDF-V2
Table 3 presents the performance of supervised models when trained and evaluated on CelebDF-
V2 [26] dataset. As with the FakeAVCeleb dataset, almost all of the participating models achieve excellent scores, i.e., more than 97% accuracy and more than 99% AUC, with very small LogLoss values. We can thus infer that the models quite comfortably learnt to discriminate between the real and fake samples of the CelebDF-V2 dataset, similar to FakeAVCeleb.
To find out how helpful the dataset is in making the models learn robust, generalisable features, we also conduct an extensive inter-dataset evaluation of all the participating models trained on CelebDF-V2, and report the results in Table 11. Similar to the models trained on FakeAVCeleb, the models trained on CelebDF-V2 perform poorly when evaluated in an inter-dataset setting. This might be because CelebDF-V2 is not a very challenging dataset for the models: they can classify almost every real/fake sample nearly perfectly, yet this does not make them robust to unseen data, as can be seen from the performance scores reported in Table 11.
The observation that CelebDF-V2 is not a challenging dataset is further supported by the results achieved by the self-supervised models, presented in Table 8: even when only a classification head is trained on top of the frozen feature extractors, the models still achieve good results. However, in this case as well, CLIP does not perform as well as DINO and the supervised ViT.
Table 4. Intra-dataset comparison of image models. The table below presents scores achieved by image
models when trained and evaluated on FaceForensics++ [32] dataset.
FaceForensics++
With Augs No Augs
Model LogLoss AUC ACC LogLoss AUC ACC
Xception 0.2342 96.96% 91.05% 0.2957 95.85% 89.03%
Res2Net-101 0.2165 97.87% 93.48% 0.3213 97.30% 91.85%
EfficientNet-B7 0.3111 96.92% 90.33% 0.3737 94.02% 86.95%
ViT 0.2445 97.27% 92.18% 0.3571 94.04% 85.15%
Swin 0.1573 98.58% 94.90% 0.2191 97.60% 92.18%
MViT 0.1828 98.34% 94.10% 0.1918 97.63% 93.00%
ResNet-3D 0.3224 96.42% 90.36% 0.3085 96.19% 91.07%
TimeSformer 0.2807 97.10% 90.00% 0.2451 96.76% 90.71%
4.3 FaceForensics++
Table 4 reports the performance metrics of all the supervised models when trained and evaluated
on the FaceForensics++ [32] dataset. The results are not as good as they were for the previous two datasets, FakeAVCeleb and CelebDF-V2: none of the models achieved more than 95% accuracy, and the LogLoss scores are also worse. We can thus infer that FaceForensics++ is a relatively challenging dataset on which to distinguish between real and fake samples. The self-supervised models are also unable to achieve excellent results on FaceForensics++, as is apparent from the numbers in Table 8, confirming that it is indeed challenging to properly distinguish between fake and real faces from this dataset. What we would now like to see is whether a more challenging dataset leads to better generalisation capability.
We thus evaluate all of the supervised models trained on the FaceForensics++ dataset in an inter-dataset setting and report the results in Table 12. In this case, the models perform in a somewhat acceptable manner even on unseen data from other datasets. For example, MViT trained on FaceForensics++ and evaluated on the FakeAVCeleb dataset achieves more than 80% accuracy and more than 90% AUC. This also supports our claim that the FakeAVCeleb and CelebDF-V2 datasets are not very challenging, and that the models can easily learn to distinguish real/fake videos from these datasets.
Furthermore, we see somewhat better performance from all of the participating supervised models not only on FakeAVCeleb but also on the other two datasets, i.e., CelebDF-V2 and DFDC. Nevertheless, it must be noted how much all the models suffer when tested on unseen data; this lack of generalisation is a big problem that even current, more sophisticated deepfake detection systems suffer from. The results in Table 12 somewhat support the statement that more challenging datasets lead to better generalisation capability, a statement we revisit after analysing the metrics of the models trained on the DFDC [11] dataset in the section below.
4.4 DFDC
DFDC is one of the biggest and most challenging deepfake detection benchmarks. This is apparent
by the results we present in Table 5: only one of the supervised models (EfficientNet-B7) managed
Table 5. Intra-dataset comparison of image models. The table below presents scores achieved by image
models when trained and evaluated on DFDC [11] dataset.
DFDC
With Augs No Augs
Model LogLoss AUC ACC LogLoss AUC ACC
Xception 0.5613 88.75% 77.63% 0.5120 91.68% 80.65%
Res2Net-101 0.5570 90.64% 79.98% 0.5691 91.78% 83.45%
EfficientNet-B7 0.5542 89.97% 79.30% 0.4263 93.30% 84.15%
ViT 0.4696 91.89% 81.08% 0.5709 89.44% 78.35%
Swin 0.5602 90.89% 82.60% 0.6650 87.77% 79.05%
MViT 0.6079 88.41% 78.90% 0.5491 90.65% 82.40%
ResNet-3D 0.5865 85.64% 75.75% 0.6739 84.69% 73.50%
TimeSformer 0.4870 91.18% 83.25% 0.6176 92.30% 81.75%
to exceed 84% accuracy and 93% AUC on the DFDC dataset. The self-supervised models also achieve relatively low scores when trained and evaluated on DFDC, as is apparent from Table 8. This establishes DFDC as the most challenging of the four datasets in this study.
In Table 13 we present the inter-dataset evaluation scores achieved by the supervised models trained on the DFDC dataset. The results show that models trained on DFDC still achieve acceptable performance on unseen data, compared to the scores achieved by the models trained on FakeAVCeleb and CelebDF-V2. These results also somewhat affirm the statement that models trained on more challenging datasets generalise better. We say somewhat because, within the scope of this study, even though DFDC is more challenging for the models to learn, better generalisation is offered by FaceForensics++, which is relatively less challenging to learn.
4.5 Discussion
In Figure 4 we compare all the participating models on the basis of the accuracies achieved in an intra-dataset setting (i.e., models are trained and tested on the same dataset). From the figure, it is apparent that there is not a large performance difference between the participating models; in most cases, the models achieve around 92% to 94% accuracy. The figure also shows that image augmentations are not significantly helpful in all cases: for example, XceptionNet, Res2Net-101, MViT-V2-Base, and EfficientNet-B7 achieved better scores when trained without image augmentations than with them. The difference between the accuracies achieved with and without image augmentations is small, except for the ViT: the ViT trained with image augmentations achieved 91.62% accuracy, while the ViT trained without them achieved 88.63%. However, Figure 4 also shows that all of the transformer models perform better when trained with augmentations, and the video models likewise perform better when trained with augmentations. Another reason not to disregard image augmentations is that the best performing model, Swin-Base, achieved the highest accuracy while trained with image augmentations.
Fig. 5. ROC curves of each model when evaluated on each of the four participating datasets in an intra-dataset evaluation setting.
It can also be noted that the transformer models (Swin-Base and MViT-V2-Base) outperform the CNN-based models. The Res2Net-101 model also achieves excellent performance in the intra-dataset evaluation setting, while having roughly half as many parameters (43 million) as the best performing Swin-Base model (87 million). From Figure 4 and Table 6 we make one useful observation: models with multi-scale feature processing capabilities (Res2Net, MViT-V2, and Swin Transformer) are the best performing models overall.
Moving to the inter-dataset analysis, Figure 7 illustrates the results achieved by the supervised models when evaluated in an inter-dataset setting. From the figure it is apparent that models perform significantly worse in inter-dataset evaluation than in intra-dataset evaluation. This is understandable, as detection models lose performance when tested on data from a different distribution. However, there is a useful finding in Figure 7: for all datasets, the best performing models are transformers.
For the self-supervised models, analysing the results in Tables 7 and 8 suggests that self-supervised features (DINO) indeed provide better feature representations than supervised ones. DINO outperforms both the supervised ViT and CLIP, as is also apparent from the ROC curves illustrated in Figure 8.
We also visualise t-SNE plots for all the participating models in Figure 6 to get an idea of how the models separate real faces from fake ones, and of how they group faces from the same dataset closer together than faces from different datasets. The t-SNE plots also help us visualise which datasets are more challenging than others. For example, looking at the plots in Figure 6, we can see that the models tend to separate the easier datasets (FakeAVCeleb and CelebDF-V2) better than the more challenging datasets (FaceForensics++ and DFDC). The Res2Net-101 model performs this separation better than the other models, even better than the best performing Swin-Base model, although we do not have a clear explanation for this. The plots also support our finding that FaceForensics++ and DFDC are indeed challenging datasets: the models are not as good at separating the fake and real faces from these datasets as they are at separating those from FakeAVCeleb and CelebDF-V2.
Fig. 6. TSNE visualisations of the participating detection models. We chose the best performing models on all datasets (with/without image augmentations).
Another apparent finding is that the image models perform this separation somewhat better than the video models. This is understandable: as pointed out above, video models generally need larger amounts of training data (in our case we train both the image and video models on the same amount of data). Also, the video models we use are not the newest or most powerful available; we chose them because of limited compute resources and for the sake of experimentation.
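For readers who wish to reproduce such plots, the sketch below shows how a t-SNE projection of detector features can be produced with scikit-learn and matplotlib; the feature matrix and labels are random stand-ins for the penultimate-layer embeddings and real/fake labels used in Figure 6.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# `features` stands in for penultimate-layer embeddings of test images and
# `labels` for their real/fake labels; both are random placeholders here.
features = np.random.randn(500, 768)
labels = np.random.randint(0, 2, size=500)

embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="coolwarm", s=5)
plt.title("t-SNE of detector features (0 = real, 1 = fake)")
plt.show()
```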
We also present the ROC curves of the participating models evaluated in an intra-dataset setting
in Figure 5. The AUC scores achieved by the models also show that FakeAVCeleb and CelebDF-V2
datasets are easier to learn for the models, as compared to FaceForensics++ and DFDC datasets. This
also suggests that, when training models for deepfake detection, training on the more challenging datasets rather than the easier ones results in better generalisation capability.
Score = \frac{s_1 + s_2 + s_3 + s_4}{4} \quad (3)
where s_1 refers to the score (LogLoss, AUC, or ACC) achieved by a model when trained and evaluated on the first dataset, s_2 refers to the score achieved when trained and evaluated on the second dataset, and so on. The scores reported in Tables 6 and 7 for each model are calculated using this equation.
Table 6. This table compares the performance of all the participating (supervised) models. We present scores after averaging the scores (LogLoss, AUC, Accuracy) achieved by each model when evaluated in an intra-dataset setting, as given in Equation 3.
Table 7. This table compares the performance of the self-supervised models. We present scores after averaging the scores (LogLoss, AUC, Accuracy) achieved by each model when evaluated in an intra-dataset setting, as given in Equation 3. In this table, Supervised refers to the ViT-Base model pre-trained using a supervised training scheme, DINO refers to the ViT-Base model pre-trained using the self-supervised scheme proposed in [5], and CLIP refers to the ViT-Base model pre-trained using the self-supervised scheme proposed in [31]. All of these ViT-Base models are used as feature extractors: we only train a classification head on top of each feature extractor and freeze the weights of the feature extractors.
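A minimal PyTorch sketch of this frozen-backbone setup is given below; the timm checkpoint name (a DINO-pretrained ViT-Base) is an assumption for illustration, not necessarily the exact checkpoint used in our experiments.

```python
# Hedged sketch of a frozen ViT-Base feature extractor with a trainable
# binary (real/fake) classification head. Checkpoint name is illustrative.
import timm
import torch
import torch.nn as nn

class FrozenViTClassifier(nn.Module):
    def __init__(self, backbone_name="vit_base_patch16_224.dino"):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model(backbone_name, pretrained=True,
                                          num_classes=0)
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the feature extractor
        self.head = nn.Linear(self.backbone.num_features, 2)  # real vs fake

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)  # pooled ViT-Base features
        return self.head(feats)
```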
5 CONCLUSIONS
In conclusion, this paper investigates the performance of various image and video classification architectures (supervised and self-supervised) on the task of deepfake detection when trained and evaluated on four different datasets. We aimed to identify which models perform better than the other participating models and which models generalise best to unseen data. Through experimentation and analysis of the results, we conclude that models capable of processing multi-scale features (Res2Net-101, MViT-V2, and the Swin Transformer) achieve better overall performance in the intra-dataset comparison. For the inter-dataset comparison, in other words the comparison of generalisation capability, we infer from the results that transformer models perform better than CNN models. It is also apparent from both the inter-dataset and intra-dataset comparisons that image augmentations do not always help achieve better performance scores.

Table 8. This table compares the performance of all the participating (self-supervised) models when evaluated in an intra-dataset setting. The statistics of this table are illustrated in Figure 8.
Through the intra-dataset comparisons we establish that DFDC is the most challenging dataset for the models to learn, with FaceForensics++ ranked second. However, through the inter-dataset evaluation, we establish that the FaceForensics++ dataset offers the best generalisation capability to the models compared to the other datasets, with DFDC ranking second in this regard. The remaining two datasets, FakeAVCeleb and CelebDF-V2, appear to be fairly easy for the models to learn, and the models achieve excellent performance on them in the intra-dataset comparison. However, they do not provide the models with any generalisation capability, i.e., models trained on these datasets perform poorly when evaluated on other datasets.
In addition to analysing supervised image/video recognition models, we also explore the performance of self-supervised models for deepfake detection in an intra-dataset setting. Through our experiments we find that the ViT-Base model pre-trained using DINO [5] achieves better performance than both the supervised ViT-Base and the self-supervised CLIP ViT-Base. We also find in these experiments that these models achieve better performance when trained without image augmentations.
All in all, we present a detailed analysis of the performance achieved by several different deepfake detection architectures on four deepfake detection benchmarks. We carry out extensive experiments and provide detailed results, along with visualisations that help the reader understand the overall contributions of this paper. We regard this study as an entry point for researchers entering the field of deepfake detection who are trying to make sense of the different architectures and datasets in order to develop their own solutions. We are confident that this study provides useful insights into the problem of deepfake detection.
In future work, we aim to analyse an even more diverse set of architectures and newer datasets. In addition, we plan to focus more on self-supervised training strategies, and to incorporate knowledge distillation and domain adaptation strategies to help models classify unseen samples correctly.
ACKNOWLEDGMENTS
This research was supported by industry partners and the Research Council of Norway with funding
to MediaFutures: Research Centre for Responsible Media Technology and Innovation, through the
Centres for Research-based Innovation scheme, project number 309339.
REFERENCES
[1] Accessed: 2022-12-02. Deepfakes github. https://github.com/deepfakes/faceswap.
[2] Accessed: 2022-12-02. FaceSwap github. https://github.com/MarekKowalski/FaceSwap/.
[3] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. 2018. MesoNet: a Compact Facial Video Forgery
Detection Network. In Proceedings of IEEE International Workshop on Information Forensics and Security (WIFS). IEEE.
https://arxiv.org/abs/1809.00888
[4] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Under-
standing?. In International Conference on Machine Learning.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021.
Emerging Properties in Self-Supervised Vision Transformers. 2021 IEEE/CVF International Conference on Computer
Vision (ICCV) (2021), 9630–9640.
[6] Chun-Fu Chen, Quanfu Fan, Neil Rohit Mallinar, Tom Sercu, and Rogério Schmidt Feris. 2018. Big-Little Net: An
Efficient Multi-Scale Feature Representation for Visual and Speech Recognition. ArXiv abs/1807.03848 (2018).
[7] François Chollet. 2017. Xception: Deep Learning with Depthwise Separable Convolutions. 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (2017), 1800–1807.
[8] Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. 2020. FakeCatcher: Detection of Synthetic Portrait Videos using Biological
Signals. In IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE. https://arxiv.org/abs/1901.02212
[9] Davide Alessandro Coccomini, Nicola Messina, Claudio Gennaro, and F. Falchi. 2021. Combining EfficientNet and
Vision Transformers for Video Deepfake Detection. In International Conference on Image Analysis and Processing.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image
database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), 248–255.
[11] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. 2020.
The DeepFake Detection Challenge (DFDC) Dataset. arXiv: Computer Vision and Pattern Recognition (2020).
[12] Luke Dormehl. 2021. Inside the rapidly escalating war between deepfakes and deepfake detectors. https://www.
digitaltrends.com/cool-tech/deepfake-detection-war/.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is
Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv abs/2010.11929 (2021).
[14] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer.
2021. Multiscale Vision Transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021),
6804–6815.
[15] Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xinyu Zhang, Ming-Hsuan Yang, and Philip H. S. Torr. 2021. Res2Net: A
New Multi-Scale Backbone Architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2021),
652–662.
[16] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2017. Learning Spatio-Temporal Features with 3D Residual
Networks for Action Recognition. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) (2017),
3154–3160.
[17] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 770–778.
[18] Dong Huang and Fernando De la Torre. 2012. Facial Action Transfer with Personalized Bilinear Regression. In European
Conference on Computer Vision.
[19] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Z. Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez-Moreno, and Yonghui Wu. 2018. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. In Advances in Neural Information Processing Systems.
[43] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2016. Aggregated Residual Transformations
for Deep Neural Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 5987–5995.
[44] Fisher Yu, Dequan Wang, and Trevor Darrell. 2017. Deep Layer Aggregation. 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2017), 2403–2412.
[45] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor S. Lempitsky. 2019. Few-Shot Adversarial Learning
of Realistic Neural Talking Head Models. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019),
9458–9467.
[46] Cairong Zhao, Chutian Wang, Guosheng Hu, Haonan Chen, Chun Liu, and Jinhui Tang. 2023. ISTVT: Interpretable
Spatial-Temporal Video Transformer for Deepfake Detection. IEEE Transactions on Information Forensics and Security
18 (2023), 1335–1348.
[47] Xiangyu Zhu, Hao Wang, Hongyan Fei, Zhen Lei, and S. Li. 2021. Face Forgery Detection by 3D Decomposition. In
Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2928–2938.
Fig. 7. Performance (accuracy) comparison of participating models evaluated using the inter-dataset scheme. Results in this figure are obtained by (1) evaluating each model trained on one dataset on each of the three remaining datasets, and (2) averaging the achieved scores, i.e., adding the three accuracy scores and dividing by three.
Table 9. This table compares the performance of all the participating (supervised) models evaluated in an inter-dataset setting. We present scores after averaging the scores (LogLoss, AUC, Accuracy) achieved by each model on each of the datasets. Figure 7 illustrates the statistics of this table.
[Table header: Inter-Dataset Evaluation — for each Model, LogLoss, AUC, and ACC are reported under "With Augs" and "No Augs", grouped by Training Dataset.]
Table 10. Inter-dataset evaluation scores of models trained on FakeAVCeleb [22] dataset and evaluated on
the remaining three datasets.
Table 11. Inter-dataset evaluation scores of models trained on CelebDF-V2 [26] dataset and evaluated on the
remaining three datasets.
Table 12. Inter-dataset evaluation scores of models trained on FaceForensics++ [32] dataset and evaluated on
the remaining three datasets.
Table 13. Inter-dataset evaluation scores of models trained on DFDC [11] dataset and evaluated on the
remaining three datasets.
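To make the inter-dataset protocol behind Tables 10–13 and Figure 7 concrete, the following is a minimal sketch (our own illustration, with a hypothetical `evaluate` helper) of averaging a model's score over the three datasets it was not trained on.

```python
# Minimal sketch of the inter-dataset evaluation protocol: a model trained on
# one benchmark is scored on each of the remaining three benchmarks, and the
# three scores are averaged (as in Figure 7). The `evaluate` callable is a
# hypothetical helper returning a metric for a given (model, dataset) pair.
DATASETS = ["FakeAVCeleb", "CelebDF-V2", "FaceForensics++", "DFDC"]

def cross_dataset_score(model, train_dataset, evaluate):
    unseen = [d for d in DATASETS if d != train_dataset]
    scores = [evaluate(model, d) for d in unseen]
    return sum(scores) / len(scores)  # average over the 3 unseen datasets
```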
Fig. 8. ROC curves of self-supervised models trained and evaluated on each dataset using the intra-dataset
evaluation scheme.