TAENet Two-Branch Autoencoder
DFRWS APAC 2024 - Selected Papers from the 4th Annual Digital Forensics Research Conference APAC
ARTICLE INFO

Keywords: Deepfake detection; Interpretability representation; Image forensic; Disentanglement learning

ABSTRACT

Deepfake detection attracts increasing attention due to the serious security issues caused by facial manipulation techniques. Recently, deep learning-based detectors have achieved promising performance. However, these detectors are hard to trust because they lack interpretability. Thus, it is essential to work on the interpretability of deepfake detectors to improve the reliability and traceability of digital evidence. In this work, we propose a two-branch autoencoder network named TAENet for interpretable deepfake detection. TAENet is composed of Content Feature Disentanglement (CFD), Content Map Generation (CMG), and Classification. CFD extracts latent features of real and forged content with a dual encoder and a feature discriminator. CMG employs a Pixel-level Content Map Generation Loss (PCMGL) to guide the dual decoder in visualizing the latent representations of real and forged contents as a real-map and a fake-map. In the classification module, the Auxiliary Classifier (AC) serves as a map amplifier to improve the accuracy of real-map extraction. Finally, the learned model decouples the input image into two maps that have the same size as the input, providing visualized evidence for deepfake detection. Extensive experiments demonstrate that TAENet can offer interpretability in deepfake detection without compromising accuracy.
1. Introduction

Benefiting from the successful application of generative models in the field of computer vision, deepfake technologies, represented by Autoencoders (AE) (Badrinarayanan et al., 2017) and Generative Adversarial Networks (GAN) (Goodfellow et al., 2020), have rapidly developed and garnered widespread attention. Deepfakes are characterized by their low threshold for creation and the high realism of the forged images, which increases their risk of misuse. The abuse of deepfake technology can lead to the dissemination of false information, online fraud, privacy violations, and political manipulation. In recent years, many deep learning-based deepfake detection models (Rossler et al., 2019; Chollet, 2017; Afchar et al., 2018; Zhou et al., 2017; Qian et al., 2020; Zhao et al., 2021a) have been proposed, demonstrating significant detection accuracy. However, limited by the black-box nature of deep learning, these models struggle to explain their detection results, i.e., they cannot elucidate why an image is deemed fake or identify which parts of the image form the decision-making region (Wang et al., 2022a). The lack of interpretability in these models implies low trustworthiness, making practical deployment challenging. As a result, developing effective and interpretable deepfake detection algorithms is essential.

Some recent works have focused on this pressing problem and attempted to provide reasonable explanations to improve the interpretability of deepfake detection. These works can be roughly categorized into two branches: 1) Saliency Map-Based methods, which highlight the most important pixels for deepfake detection algorithms (Alqaraawi et al., 2020). Typically, these approaches augment the original detection model with class activation map modules, displaying the regions of interest through heatmaps to show the areas the model focuses on when making decisions. However, the highlighted areas are not necessarily the actual forged regions, and these decision areas can
* Corresponding author. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China.
** Corresponding author.
E-mail addresses: [email protected] (M. Yu), [email protected] (B. Li).
https://doi.org/10.1016/j.fsidi.2024.301808
Fig. 1. Image disentanglement. The real content is visualized as the real-map, and the fake content is visualized as the fake-map. The fake-maps of a real image and a fake image differ.
also be highlighted in real facial images, making it impossible to distinguish them from regions in fake faces. Therefore, they cannot be used as evidence of forgery. 2) Forgery Clue-Based methods, which identify artifacts (Li et al., 2022; Hua et al., 2023; Zhao et al., 2021b), splicing traces (Li et al., 2020a), noise (Wang and Chow, 2023), and other features in images as evidence of forgery, providing a certain level of explanation for the model's decisions. However, these methods have limitations. For example, Hua et al. (2023) propose an interpretability method for deepfake detection, but it mainly explains by visualizing the traces of forgery and cannot explain the forged content. Zhao et al. (2021b) explain forged regions by using the cue of source feature inconsistency within the forged images. However, when faced with high-quality forged images, both detection performance and interpretability decrease significantly.

Li et al. (2020a) cannot detect or explain forged images that do not involve blending operations. Li et al. (2022) and Wang and Chow (2023) are not capable of explaining high-quality forged images. Although the above explanation methods can, to some extent, indicate the presence of forgery, they cannot pinpoint the specific forged regions. We believe that, while ensuring accuracy, being able to distinguish between forged and non-forged contents and answer the question of where the forgery occurred would provide better interpretability.

To address the aforementioned issues, we aim to construct a deepfake detection framework that, without sacrificing accuracy, can split an image into forged and non-forged parts, which represent the forgery-related and forgery-irrelevant content respectively, thereby providing interpretability for the detection. Our approach is inspired by disentangled representation learning (Wang et al., 2022b), which can decouple an image and extract the target contents. Specifically, any image can be disentangled into real and forged contents. In particular, the forged content of a real image is empty and can be considered a zero-image. Thus, the forged contents of deepfake and real images exhibit significant differences. If these differences can be extracted and visualized, it would allow us to distinguish between real and fake images while simultaneously providing interpretability. Fig. 1 illustrates the differences in forged content between real images and deepfakes.

However, separating the real and forged content of a face is not a trivial task. The lack of labels for the forged regions makes it difficult to extract features of forged and non-forged content. Additionally, visualizing the contents to provide interpretability for the detection poses another challenge.

In this paper, the real and forged contents are visualized as a real-map and a fake-map, respectively. These maps have the same size as the input image. Ideally, the real-map of a real image is the image itself, and the fake-map is a zero image. For a forged image, the real-map and fake-map are two non-zero maps that, when combined, represent the original forged image. In order to learn the two maps of an image, we propose a Two-branch Autoencoder Network (TAENet) to decouple real and forged content features and visualize these features as a real-map and a fake-map. TAENet is composed of Content Feature Disentanglement (CFD), Content Map Generation (CMG), and Classification (C). (1) CFD learns hidden representations of real and forged content features from an input image with a dual encoder. A discriminator is employed to distinguish these two features, achieving the disentanglement of real and forged content features. (2) CMG is implemented using a dual decoder. To obtain the real-map and fake-map, we design a Pixel-level Content Map Generation Loss (PCMGL) to guide the dual decoder in generating an accurate real-map and fake-map. (3) Classification is composed of an Auxiliary Classifier (AC) and a Prediction Classifier (C). The difference between real and fake images exists not only in fake-maps, but also in real-maps. Therefore, we introduce an auxiliary classifier to distinguish the real latent features extracted from real and fake images, which can further improve the accuracy of detection and map estimation. As a result, TAENet maintains high accuracy while predicting forged and real contents, providing visual and interpretable evidence for deepfake detection.

Our contributions can be summarized as follows.

● We propose a novel interpretable deepfake detection framework named Two-branch Autoencoder Network (TAENet), which disentangles the real and forged contents of an input image to improve deepfake detection results and provides convincing evidence by visualizing real and fake contents.
● The proposed Pixel-level Content Map Generation Loss (PCMGL) is designed for efficient pixel-level supervised training to obtain accurate estimates of fake and real contents.
● Extensive experiments demonstrate the effectiveness of our approach in interpreting face forgery detection while preserving accuracy.

The remainder of this paper is organized as follows: In Section 2, we provide a brief review of related works. Section 3 presents the details of the proposed Two-branch Autoencoder Network (TAENet) framework.
F. Du et al. Forensic Science International: Digital Investigation 50 (2024) 301808
Fig. 2. Overview of the proposed framework. The framework consists of Content Feature Disentanglement (CFD), Content Map Generation (CMG), and Classification. The CFD extracts latent features of real and forged content with a dual encoder and a feature discriminator. In the CMG, the dual decoder generates an interpretable real-map and fake-map for deepfake detection. A Pixel-level Content Map Generation Loss (PCMGL) is designed to facilitate the learning of CMG. Finally, the prediction of the model is produced by the Classifier (C).
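The data flow summarized in the Fig. 2 caption can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the authors' implementation: the class name `TAENetSketch`, its field names, and the toy stand-in functions are all assumptions, images are flattened lists of floats rather than n × n × 3 tensors, and the training-only parts (feature discriminator, auxiliary classifier) are omitted. Feeding the fake-content features to the classifier follows the paper's description of the Prediction Classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[float]     # flattened pixel values; toy stand-in for an n x n x 3 image
Features = List[float]  # latent feature vector


@dataclass
class TAENetSketch:
    """Hypothetical container for the inference-time components of Fig. 2."""
    encode_real: Callable[[Image], Features]  # real-content encoder
    encode_fake: Callable[[Image], Features]  # fake-content encoder
    decode_real: Callable[[Features], Image]  # decoder producing the real-map
    decode_fake: Callable[[Features], Image]  # decoder producing the fake-map
    classify: Callable[[Features], float]     # prediction classifier, returns a fake score

    def forward(self, image: Image) -> Tuple[Image, Image, float]:
        # Two branches: each encoder feeds its own decoder.
        f_real = self.encode_real(image)
        f_fake = self.encode_fake(image)
        real_map = self.decode_real(f_real)
        fake_map = self.decode_fake(f_fake)
        # The prediction is made from the fake-content features.
        y_hat = self.classify(f_fake)
        return real_map, fake_map, y_hat


# Toy stand-ins (assumptions) that mimic the ideal behaviour on a REAL image:
# the real branch reproduces the input and the fake branch outputs zeros.
ideal_on_real = TAENetSketch(
    encode_real=lambda img: list(img),
    encode_fake=lambda img: [0.0] * len(img),
    decode_real=lambda feats: list(feats),
    decode_fake=lambda feats: list(feats),
    classify=lambda feats: 1.0 if any(v != 0.0 for v in feats) else 0.0,
)

image = [0.2, 0.5, 0.9]
real_map, fake_map, y_hat = ideal_on_real.forward(image)
```

With these stand-ins the real-map equals the input, the fake-map is all zeros, and the fake score is 0.0, which is the ideal decomposition the paper describes for a real image. A trained model would replace each lambda with a learned network.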
To assess the effectiveness of our approach, we perform experiments in Section 4. Finally, Section 5 concludes this work.

2. Related work

2.1. Deepfake detection

Deepfake detection has become a crucial topic of research due to the increasing prevalence and sophistication of deepfake technologies. To address this problem, various methods have been developed to identify manipulated faces. Early work focused on extracting handcrafted features (Bai et al., 2023), such as blinking (Jung et al., 2020), head inconsistencies (Yang et al., 2019), and visual artifacts (Bappy et al., 2019; Li and Lyu, 2018). As deepfake technology has advanced, these features have become increasingly difficult to detect. Researchers have applied deep learning techniques to deepfake detection tasks, achieving notable results. These methods are mainly divided into detection methods based on spatial artifacts (Afchar et al., 2018; Zhao et al., 2021a) and those based on frequency domain artifacts (Qian et al., 2020; Frank et al., 2020; Li et al., 2021). However, these methods fail when encountering high-quality forged images. More importantly, deep learning has a black-box nature, making it impossible to know how the model makes decisions, leading to a lack of interpretability in deepfake detection models and making them difficult to apply in practice. This paper proposes an interpretable deepfake detection method to address this shortcoming.

2.2. Model interpretability

Model interpretability aims to describe the internal working mechanism of deep neural networks in terms understandable to humans (Hua et al., 2023). To achieve this goal, numerous works have been proposed (Cheng et al., 2020; Zhang et al., 2019, 2020). However, these works cannot be directly applied to the deepfake detection task (Hua et al., 2023). The interpretability of deepfake detection is challenging (Wang et al., 2022a). The main aspects of interpretability in deepfake detection involve answering why an image is judged as fake and identifying which parts of the image led to that decision. Only by understanding the decision mechanism of deepfake detection can we deploy these models effectively. Therefore, the interpretability of deepfake detection is crucial. Some methods have addressed this issue, such as Saliency Map-Based methods (Alqaraawi et al., 2020; Hua et al., 2023) and Forgery Clue-Based methods (Li et al., 2020a, 2022; Wang and Chow, 2023). However, saliency maps only indicate the regions the model focuses on, which may not necessarily relate to the forgery. Forgery Clue-Based methods rely on image artifacts or splicing traces, which are challenging to identify in high-quality forged images. Thus, this paper proposes a method to decouple the real and fake content in images and visualize this content to achieve interpretable deepfake detection.

3. The proposed approach

In this section, we present our Two-branch Autoencoder Network for interpretable deepfake detection. We first give a brief overview of the problem formulation and then provide a detailed description of the approach.

3.1. Problem formulation

We define an image as composed of real content and fake content, formalized as:

$$I = I_r + I_f \qquad (1)$$

where $I$ is an image, $I_r$ is the real content of the image, and $I_f$ is the fake content of the image. Specifically, the fake content of a real image is considered empty, denoted as a zero-image. Additionally, we divide the implicit representation of the image into real features and fake features. Real features are the implicit representation of real content, while fake features are the implicit representation of fake content. For a forged image, real features are related to the parts that are irrelevant to the forgery, such as the image background. We define the image domain as $I \subseteq X^{n \times n \times 3}$, where $n$ represents the image size. Our goal is to learn an interpretable deepfake detection model $F(I; w)$ that, while maintaining high accuracy, decouples the real part $I_r$ and the fake part $I_f$ of the image. This is formalized as:

$$F(I; w) = (M_r, M_f, \hat{y}), \quad I, M_r, M_f \subseteq X \qquad (2)$$

where $w$ denotes the model parameters, $M_r$ is the visualized image of $I_r$ (real-map), $M_f$ is the visualized image of $I_f$ (fake-map), and $\hat{y}$ represents the predicted result of the input image.

To achieve the above objectives, we propose a two-branch
autoencoder network (TAENet). TAENet consists of Content Feature Disentanglement (CFD), Content Map Generation (CMG), and Classification (C). The learning process of the model is shown in Fig. 2. CFD extracts real content features and fake content features. CMG visualizes the real content and fake content, offering interpretable detection. The prediction result of the image is given by Classification.

3.2. Content Feature Disentanglement

The purpose of Content Feature Disentanglement (CFD) is to extract and separate the real and fake features of the image, solving the problem of highly coupled features between real and fake content. As shown in Fig. 2, CFD consists of a dual-branch encoder and a feature discriminator. For an input image, the dual-branch encoder extracts real content features and fake content features, respectively. Then, the feature discriminator is used to distinguish between these two features.

Dual Encoder. The two encoders $E_r$ and $E_f$ employ the same CNN-based backbone and are used to extract the real content features $F_r$ and the fake content features $F_f$ of an input image, respectively. Thus, $F_r$ is the implicit representation of real content, while $F_f$ is the implicit representation of fake content. This can be formalized as:

$$F_r = E_r(I) \qquad (3)$$

$$F_f = E_f(I) \qquad (4)$$

where $I$ is the input image.

Feature Discriminator. After the dual encoder, we obtain real content features $F_r$ and fake content features $F_f$. Since the real content features and fake content features are different, we label all the features extracted by the $E_r$ encoder with the label "1", and the features extracted by the $E_f$ encoder with the label "0". In order to supervise these two types of features, we introduce a feature discriminator $D$ to determine the classes of $F_r$ and $F_f$. The discriminator is a fully connected classifier. We use a binary cross-entropy loss $L_d$, which encourages the model to decouple the real content features and fake content features. $L_d$ is formalized as:

$$L_d = L_{rce}(D_r(F_r), 1) + L_{fce}(D_f(F_f), 0) \qquad (5)$$

where $L_{rce}$ represents the cross-entropy loss of real content features, and $L_{fce}$ represents the cross-entropy loss of fake content features.

3.3. Content Map Generation

We utilize the decoupled real content features and forged content features of the input image to generate a real content map and a forged content map. This is a key aspect of interpretability in our approach. In this module, we employ a dual decoder to generate the maps.

Dual Decoder. The dual decoder consists of two identical decoders, namely the Real Content Map Decoder $D_r$ and the Fake Content Map Decoder $D_f$. The input of the dual decoder is the real content features $F_r$ and the fake content features $F_f$ obtained from the previous module. Through upsampling and convolutional layers, $D_r$ and $D_f$ respectively generate the real content map (real-map) and the fake content map (fake-map), with the same size as the input image. It can be formalized as:

$$M_r = D_r(F_r) \qquad (6)$$

$$M_f = D_f(F_f) \qquad (7)$$

where $M_r$ is the real-map and $M_f$ is the fake-map.

To generate interpretable fake-maps that reflect the differences between real and forged images, we designed a Pixel-level Content Map Generation Loss (PCMGL) to facilitate the generation of interpretable real and forged content maps by the dual decoder. According to Equation (1), we divide an input image into real content and forged content. Through our model, the real content is visualized as the real-map, and the forged content is visualized as the fake-map. For a real image, we know that the real-map should be the image itself, while the fake-map should be a zero map. For a forged image, since both the real and forged contents are unknown, constraints cannot be set for the real-map and fake-map individually. Additionally, for any image, whether real or forged, the sum of the real-map and fake-map should be the image itself. Therefore, the PCMGL is formalized as:

$$L_g = L_f + L_r + L_{recon} \qquad (8)$$

Here, $L_f$ and $L_r$ are pixel-level L1 loss functions, which constrain the real-map and fake-map of a real image, formalized as:

$$L_f = \lVert M_r - I_{real} \rVert_1 \qquad (9)$$

$$L_r = \lVert M_f \rVert_1 \qquad (10)$$

where $I_{real}$ is an input real image, $M_r$ is the real-map of the real image, and $M_f$ is the fake-map of the real image. $L_{recon}$ is an L2 loss function, which constrains the fake-map and real-map for any image, formalized as:

$$L_{recon} = \lVert M_r + M_f - I \rVert_2 \qquad (11)$$

where $I$ is the input image, and $M_r$ and $M_f$ are the real-map and fake-map of the image, respectively.

3.4. Classification

The preceding modules show that the real content and fake content of an input image can be decoupled. However, we also need to consider how to predict the authenticity of the input image. Since there are significant differences in fake-maps between real and forged images, the corresponding forged content features have characteristics that can distinguish between real and forged images. Therefore, we introduce a classifier to distinguish these differences and detect the authenticity of images. Similarly, the real content features of real and forged images are also different. Leveraging this characteristic, we introduce an auxiliary classifier.

Auxiliary Classifier. The auxiliary classifier ($C_{aux}$) is a binary classifier implemented by fully connected layers, aiming to improve the accuracy of the predicted feature maps during the training phase. The input is the real content features, and the output is the prediction result for the input image. We use a binary cross-entropy loss in the training phase, as follows:

$$L_{aux} = L_{ce}(C_{aux}(F_r), y) \qquad (12)$$

where $F_r$ is the real content features of an input image, $y$ is the ground truth, and $L_{ce}$ is the cross-entropy loss function.

Prediction Classifier. As mentioned earlier, we use the differences in forged content between real and fake images as the basis for our model to judge the authenticity of images. We use the same fully connected structure as the auxiliary classifier to build a prediction classifier, which provides the final prediction of the model for the authenticity of the input image. The loss function for this classifier is:

$$L_p = L_{ce}(C(F_f), y) \qquad (13)$$

where $F_f$ is the forged content features of an input image, $y$ is the ground truth, and $L_{ce}$ is the cross-entropy loss function.

The final loss function of the training phase is the weighted sum of the above loss functions.

$$L = \lambda_1 L_d + \lambda_2 L_g + \lambda_3 L_{aux} + \lambda_4 L_p$$
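The training losses of Section 3 (the discriminator loss of Eq. (5), the PCMGL of Eqs. (8)-(11), and the weighted total) can be made concrete with a small numeric sketch. It is a minimal illustration in plain Python under stated assumptions, not the authors' code: images and maps are flattened lists of floats, the norms are plain sums without batch averaging, the discriminator and classifier outputs are passed in as already-computed probabilities, and all function names are hypothetical. Following the text, the two L1 terms are applied only to real images, while the reconstruction term is applied to every image.

```python
import math
from typing import Sequence


def l1_norm(values: Sequence[float]) -> float:
    """Sum of absolute values."""
    return sum(abs(v) for v in values)


def l2_norm(values: Sequence[float]) -> float:
    """Square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in values))


def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def discriminator_loss(p_real_feat: float, p_fake_feat: float) -> float:
    """Eq. (5): real-branch features carry label 1, fake-branch features label 0."""
    return binary_cross_entropy(p_real_feat, 1) + binary_cross_entropy(p_fake_feat, 0)


def pcmgl(image: Sequence[float], real_map: Sequence[float],
          fake_map: Sequence[float], is_real: bool) -> float:
    """Eqs. (8)-(11): reconstruction term for any image, L1 terms for real images."""
    l_recon = l2_norm([r + f - i for r, f, i in zip(real_map, fake_map, image)])
    if not is_real:
        return l_recon  # no per-map target is known for a forged image
    l_f = l1_norm([r - i for r, i in zip(real_map, image)])  # Eq. (9)
    l_r = l1_norm(fake_map)                                  # Eq. (10)
    return l_f + l_r + l_recon


def total_loss(l_d: float, l_g: float, l_aux: float, l_p: float,
               lambdas: Sequence[float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the four losses; the paper sets every weight to 1."""
    return lambdas[0] * l_d + lambdas[1] * l_g + lambdas[2] * l_aux + lambdas[3] * l_p


# Real image decomposed ideally: real-map equals the image, fake-map is zero.
image = [0.2, 0.5]
loss_ideal = pcmgl(image, real_map=[0.2, 0.5], fake_map=[0.0, 0.0], is_real=True)

# Same real image with the two maps swapped: the sum still reconstructs the
# input, so only the two L1 terms penalize the mistake.
loss_swapped = pcmgl(image, real_map=[0.0, 0.0], fake_map=[0.2, 0.5], is_real=True)
```

The swapped case shows why the reconstruction term alone is not enough: both decompositions sum to the input, and only the L1 terms on real images separate them.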
Table 1
Detection results (%) on DeepFakes, FaceSwap, Face2Face, and NeuralTextures.

Train Set        Method     DeepFakes       FaceSwap        Face2Face       NeuralTextures
DeepFakes        ResNet18   95.85  98.49    48.64  36.02    54.12  67.54    54.82  70.97
                 Ours       95.63  98.41    48.87  39.55    53.61  69.27    53.58  69.23
FaceSwap         ResNet18   50.75  50.85    95.00  98.71    51.39  61.30    49.65  51.51
                 Ours       50.70  52.79    95.99  98.86    51.31  62.69    50.19  53.74
Face2Face        ResNet18   54.43  71.22    50.73  52.31    96.01  98.29    51.63  63.47
                 Ours       54.92  73.06    50.82  46.78    95.79  98.40    51.11  60.98
NeuralTextures   ResNet18   63.26  77.59    50.25  51.42    55.94  67.94    89.45  95.90
                 Ours       61.49  75.21    50.27  50.25    55.58  66.89    89.38  95.64
Here, λ1, λ2, λ3, λ4 are hyperparameters that balance the training losses. Empirically, we set λ1 = 1, λ2 = 1, λ3 = 1, λ4 = 1 during experiments.

4. Experiments

4.1. Experimental settings

Datasets. To evaluate the effectiveness of our proposed method, we conducted experiments on three large-scale mainstream benchmark datasets: FaceForensics++ (FF++) (Rossler et al., 2019), Celeb-DF-v2 (Celeb-DF) (Li et al., 2020b), and DeepFake Detection Challenge (DFDC) (Dolhansky et al., 2019). FF++ comprises 1000 real videos and 4000 forged videos. The forged videos are generated using four different methods: DeepFakes (DF) (Korshunov and Marcel, 2018), FaceSwap (FS) (Rossler et al., 2019), Face2Face (F2F) (Thies et al., 2016), and NeuralTextures (NT) (Thies et al., 2019). Each method corresponds to 1000 fake videos. We used the HQ version of FF++ with C23 compression and extracted 30 frames from each video. The training, validation and testing sets are divided according to the official guidelines. Celeb-DF consists of 590 real videos, with 390 used for training, 115 for validation, and 115 for testing. There are 5639 forged videos. Each real video is randomly sampled for 5 frames, and each forged video is randomly sampled for 50 frames to balance the dataset labels. DFDC contains 119,146 videos, with 19,154 real and 99,992 forged videos, which are divided into training, validation, and testing sets in a 6:2:2 ratio. To balance the labels, each real video is randomly sampled for 5 frames, and each forged video is randomly sampled for 1 frame. These datasets provide a diverse range of real and forged videos, allowing us to thoroughly evaluate the performance of our method across different scenarios and challenges in deepfake detection.

Experimental Details. We use ResNet18 as the backbone (He et al., 2016). The backbone was pretrained on ImageNet. Face extraction and alignment are performed using DLIB (Sagonas et al., 2016). The aligned faces are resized to 224 × 224 for both training and testing. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and a batch size of 128.

Evaluation Metrics. To evaluate the effectiveness of our approach, we check both the detection performance and the interpretability performance. We use area under curve (AUC) and accuracy (ACC) as detection evaluation metrics, which is consistent with the evaluation approach adopted in previous works (Cao et al., 2022; Liu et al., 2021). The interpretability evaluation is carried out through visual analysis of the maps.

4.2. Detection performance

Main Objective Accuracy. TAENet is a framework in which the backbone networks of different base models can serve as encoders. Consequently, we first evaluated the accuracy of the baseline model (ResNet18) and its corresponding TAENet. To this end, all models are trained on DeepFakes, FaceSwap, Face2Face and NeuralTextures with a ResNet18 backbone pretrained on ImageNet. The results are presented in Table 1. Our interpretable models are comparable to the baseline models on ACC and AUC. This indicates that our model provides interpretability without compromising the accuracy of the original baseline models.

Comparison of Accuracy with Competing Methods. To further assess the comprehensive detection capabilities of our framework, we reproduced four state-of-the-art methods, including Meso4 (Afchar et al., 2018), MesoInception4 (Afchar et al., 2018), Xception (Rossler et al., 2019), and SPSL (Liu et al., 2021), under the same conditions. We trained these models on FF++ and Celeb-DF and tested in-dataset, evaluating by AUC and ACC. The experimental results are presented in Table 2. Our method is significantly better than Meso4, MesoInception4, and Xception, achieving the second highest AUC. Compared to the SOTA method SPSL (Liu et al., 2021), our method is competitive, trailing by 0.90 AUC points on FF++ and 0.63 AUC points on Celeb-DF. Overall, TAENet outperforms three of the four competing methods on both datasets.

Table 2
Comparison of accuracy (%) with competing methods on FF++ and Celeb-DF.

Method                                  FF++ AUC   FF++ ACC   Celeb-DF AUC   Celeb-DF ACC
Meso4 (Afchar et al., 2018)             82.32      72.12      91.24          83.67
MesoInception4 (Afchar et al., 2018)    86.45      77.30      92.02          84.53
Xception (Rossler et al., 2019)         91.80      82.54      96.20          90.23
SPSL (Liu et al., 2021)                 96.25      89.53      98.26          93.24
Ours                                    95.35      87.20      97.63          91.94

4.3. Interpretability performance

Accuracy and interpretability are two crucial capabilities of our proposed method. The above experiments demonstrate that our method preserves accuracy. Next, we evaluate the interpretability by visualizing the real content and the fake content.

Visualization of Maps. We analyzed the interpretability of our model on the Deepfakes, FaceSwap, Face2Face and NeuralTextures datasets. We visualized the real content and forged content of input images as real-maps and fake-maps. As shown in Fig. 3, in the real-map, the non-black regions represent the real content of the image. In the fake-map, the non-black regions represent the forged content of the image. From Fig. 3, it can be observed that the real-map and the fake-map have a complementary relationship. The forged content of the input image is decoupled into the fake-map and visualized, while the real content is decoupled into the real-map and visualized. For real images, the fake-map is a zero-map, and the real-map is similar to the original input image, which is consistent with reality. This indicates that the model decouples the real and forged content from real images. To verify
Fig. 3. The real content and forged content of input images are visualized as real-maps and fake-maps. The non-black regions represent real content in the real-map. The non-black regions represent forged content in the fake-map. These maps provide interpretability for deepfake detection.
Fig. 4. Fake-maps of fake images from four forgery methods. (a) Deepfakes, (b) FaceSwap, (c) Face2Face, (d) NeuralTextures. Black regions represent forged content.
The forged contents we detected matched what was actually known.
the accuracy of the decoupling capability in fake images, we performed a qualitative visual analysis.

Face2Face is a facial reenactment technique that targets the entire face, so its forged content includes nearly the whole face. NeuralTextures is a facial reenactment technique that targets the mouth area, so its forged content is the mouth. FaceSwap is a face-swapping technique, and its forged content covers most of the facial area. Deepfakes is also a face-swapping technique, with forged content covering most of the facial area, typically in a rectangular region from the eyebrows to the chin.

From Fig. 4, it can be seen that for images forged using the Face2Face method, the extracted forged content is concentrated in the facial area, while the real content includes non-facial regions such as hair and background. For images forged using the NeuralTextures method, the extracted forged content is located in the mouth area, with the area outside the mouth being real content, which aligns closely with the actual forged content. For images forged using the FaceSwap method, the extracted forged content covers most of the facial area, the remaining areas being real content. For images forged using the Deepfakes method, the extracted forged content is mainly concentrated in the region between the eyebrows and the chin, closely matching the actual forged content.

Therefore, for the input images, the real and forged content extracted by our model generally corresponds to the actual content.

Explanations. From the above analysis, it is evident that the model accurately decouples the forged and real content of input images. The real content map and forged content map are generated from the real content features and forged content features, respectively, indicating that the dual encoder in the model successfully extracts these features. Our model makes authenticity predictions based on the features of the forged content, providing interpretability. Furthermore, in addition to the significant differences in the fake content maps of real and fake images, the real content maps (real-maps) also exhibit notable differences. This provides further evidence for deepfake detection. This demonstrates that
Fig. 5. Ablation study on the NeuralTextures dataset, examining how removing different components from the proposed method affects interpretability. (a) without D, (b) without AC, (c) without PCMGL, (d) our proposed method. In the face image, the mouth is manipulated by NeuralTextures.
Table 3
Cross-dataset evaluation accuracy (%) on FF++, Celeb-DF, and DFDC.
Train Set FF++ Celeb-DF DFDC Avg.
models while maintaining interpretability.

Acknowledgment

This work was supported by the National Key R&D Program of China (Grant No. 2021YFF0602104).

References

Afchar, D., Nozick, V., Yamagishi, J., Echizen, I., 2018. Mesonet: a compact facial video forgery detection network. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, pp. 1–7.
Alqaraawi, A., Schuessler, M., Weiß, P., Costanza, E., Berthouze, N., 2020. Evaluating saliency map explanations for convolutional neural networks: a user study. In: Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 275–285.
Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495.
Bai, W., Liu, Y., Zhang, Z., Li, B., Hu, W., 2023. Aunet: learning relations between action units for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24709–24719.
Bappy, J.H., Simons, C., Nataraj, L., Manjunath, B., Roy-Chowdhury, A.K., 2019. Hybrid lstm and encoder–decoder architecture for detection of image forgeries. IEEE Trans. Image Process. 28, 3286–3300.
Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X., 2022. End-to-end reconstruction-

of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6458–6467.
Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B., 2020a. Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5001–5010.
Li, X., Ni, R., Yang, P., Fu, Z., Zhao, Y., 2022. Artifacts-disentangled adversarial learning for deepfake detection. IEEE Trans. Circ. Syst. Video Technol. 33, 1658–1670.
Li, Y., Lyu, S., 2018. Exposing Deepfake Videos by Detecting Face Warping Artifacts. arXiv preprint arXiv:1811.00656.
Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S., 2020b. Celeb-df: a large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3207–3216.
Liu, H., Li, X., Zhou, W., Chen, Y., He, Y., Xue, H., Zhang, W., Yu, N., 2021. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 772–781.
Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J., 2020. Thinking in frequency: face forgery detection by mining frequency-aware clues. In: European Conference on Computer Vision. Springer, pp. 86–103.
Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M., 2019. Faceforensics++: learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1–11.
Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M., 2016. 300 faces in-the-wild challenge: Database and results. Image Vis Comput. 47, 3–18.
Thies, J., Zollhöfer, M., Nießner, M., 2019. Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph. 38, 1–12.
Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, M., 2016. Face2face: real-time face capture and reenactment of rgb videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395.
classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Wang, T., Chow, K.P., 2023. Noise based deepfake detection via multi-head relative-
Conference on Computer Vision and Pattern Recognition, pp. 4113–4122. interaction. In: Proceedings of the AAAI Conference on Artificial Intelligence,
Cheng, K., Wang, N., Li, M., 2020. Interpretability of deep learning: a survey. In: The pp. 14548–14556.
International Conference on Natural Computation, Fuzzy Systems and Knowledge Wang, T., Liao, X., Chow, K.P., Lin, X., Wang, Y., 2022a. Deepfake Detection: A
Discovery. Springer, pp. 475–486. Comprehensive Study from the Reliability Perspective arXiv preprint arXiv:
Chollet, F., 2017. Xception: deep learning with depthwise separable convolutions. In: 2211.10881.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Wang, Y.C., Wang, C.Y., Lai, S.H., 2022b. Disentangled representation with dual-stage
pp. 1251–1258. feature learning for face anti-spoofing. In: Proceedings of the IEEE/CVF Winter
Dolhansky, B., Howes, R., Pflaum, B., Baram, N., Ferrer, C.C., 2019. The Deepfake Conference on Applications of Computer Vision, pp. 1955–1964.
Detection Challenge (Dfdc) Preview Dataset arXiv preprint arXiv:1910.08854. Yang, X., Li, Y., Lyu, S., 2019. Exposing deep fakes using inconsistent head poses. In:
Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T., 2020. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Leveraging frequency analysis for deep fake image recognition. In: International Processing (ICASSP). IEEE, pp. 8261–8265.
Conference on Machine Learning. PMLR, pp. 3247–3258. Zhang, C., Liu, A., Liu, X., Xu, Y., Yu, H., Ma, Y., Li, T., 2020. Interpreting and improving
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., adversarial robustness of deep neural networks with neuron sensitivity. IEEE Trans.
Courville, A., Bengio, Y., 2020. Generative adversarial networks. Commun. ACM 63, Image Process. 30, 1291–1304.
139–144. Zhang, Q., Yang, Y., Ma, H., Wu, Y.N., 2019. Interpreting cnns via decision trees. In:
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Recognition, pp. 6261–6270.
pp. 770–778. Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N., 2021a. Multi-attentional
Hua, Y., Shi, R., Wang, P., Ge, S., 2023. Learning patch-channel correspondence for deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision
interpretable face forgery detection. IEEE Trans. Image Process. 32, 1668–1680. and Pattern Recognition, pp. 2185–2194.
Jung, T., Kim, S., Kim, K., 2020. Deepvision: deepfakes detection using human eye Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., Xia, W., 2021b. Learning self-consistency for
blinking pattern. IEEE Access 8, 83144–83154. deepfake detection. In: Proceedings of the IEEE/CVF International Conference on
Kingma, D.P., Ba, J., 2014. Adam: A Method for Stochastic Optimization arXiv preprint Computer Vision, pp. 15023–15033.
arXiv:1412.6980. Zhou, P., Han, X., Morariu, V.I., Davis, L.S., 2017. Two-stream neural networks for
Korshunov, P., Marcel, S., 2018. Deepfakes: a New Threat to Face Recognition? tampered face detection. In: 2017 IEEE Conference on Computer Vision and Pattern
Assessment and Detection arXiv preprint arXiv:1812.08685. Recognition Workshops (CVPRW). IEEE, pp. 1831–1839.
Li, J., Xie, H., Li, J., Wang, Z., Zhang, Y., 2021. Frequency-aware discriminative feature
learning supervised by single-center loss for face forgery detection. In: Proceedings