
DeepFake video detection: Insights into model generalisation — A systematic review

Ramcharan Ramanaharan, Deepani B. Guruge, Johnson I. Agbinya
Melbourne Institute of Technology, 288 La Trobe Street, Melbourne, VIC 3000, Australia

ARTICLE INFO

Keywords: DeepFake; Detection; Generalisability; Systematic review; Machine learning

ABSTRACT

Deep learning generative models have progressed to a stage where distinguishing fake images and videos has become difficult, posing risks to personal integrity, potentially leading to social instability, and disrupting government functioning. Existing reviews have mainly focused on the approaches used to detect DeepFakes and the datasets used for those approaches. However, challenges persist when attempting to generalise detection techniques to previously unseen datasets. The purpose of this systematic review is to explore state-of-the-art frameworks for DeepFake detection and provide readers with an understanding of the strengths and weaknesses of current approaches, as well as the generalisability of existing detection techniques. The study indicates that generalising DeepFake detection remains a challenge that requires further research. Moreover, 46.3% of the selected publications agreed that DeepFake detection techniques could be generalised to identify various types of DeepFakes. A key limitation in achieving generalisation is the tendency of models to overfit to the available datasets, reducing their effectiveness in adapting to new or unseen types of DeepFakes. This review emphasises the need for the development of extensive and diverse datasets that more accurately reflect the wide range of DeepFake manipulations encountered in real-world applications. Lastly, the paper explores potential advancements that could pave the way to the next generation of solutions against DeepFakes.

1. Introduction

Advancements in the field of artificial intelligence (AI) are progressing at an unprecedented pace, presenting complex challenges that demand careful consideration. A prominent example is the generation of synthetic audiovisual content that is nearly indistinguishable from authentic material (Jada & Mayayise, 2024; Lomnitz, Hampel-Arias, Sandesara, & Hu, 2020). This advancement is largely powered by cutting-edge AI technologies, including auto-encoders and generative adversarial networks (GANs) (Gong, Goh, Kumar, Ye, & Chi, 2020; Ismail, Elpeltagy, Zaki, Eldahshan, Kamal, 2021a). The development and dissemination of DeepFakes (DF) have become a critical area of concern. DeepFakes are fake videos, audio, or images created using deep learning algorithms, often with malicious intent to harm individuals. These falsified media can severely damage someone's reputation, influence public opinion on significant topics, and influence election results (Agarwal, Farid, El-Gaaly, & Lim, 2020; Deng, Suo, Li, et al., 2022), posing substantial threats to personal privacy and public safety (Singh, Saimbhi, Singh, & Mittal, 2020).

Furthermore, widely available and user-friendly apps such as FakeApp and DeepFaceLab can lead to increased abuse, which is quite alarming (Liu et al., 2023). We have reached a point where it is essential to develop generalisable DF detection techniques as well as to establish strict guidelines or regulations to mitigate the creation and use of DeepFake images, audio, or video (Gambini, Fagni, Falchi, & Tesconi, 2022). Researchers and technology experts strive to develop reliable methods to detect DFs to mitigate this threat (Bondi, Cannas, Bestagini, & Tubaro, 2020; Suratkar, Kazi, Sakhalkar, Abhyankar & Kshirsagar, 2020; Zi, Chang, Chen, Ma, & Jiang, 2020). Detecting DFs has become an increasingly challenging task due to the rapid advancements in Generative AI (GenAI) algorithms, which malicious users exploit to produce more authentic DFs (Bansal et al., 2023a; Coccomini, Caldelli, Falchi, Gennaro, & Amato, 2022; Li et al., 2020; Zhao, Zhang, Ding, & Cui, 2021). Most current research relies heavily on well-known datasets, including FF++, the DeepFake Detection Challenge (DFDC), and Celeb-DF (Du, Pentyala, Li, & Hu, 2020). Although these datasets are valuable, they do not cover the full range of DFs encountered in real-world scenarios (Du et al., 2020; Malik, Kuribayashi, Abdullahi, & Khan, 2022). This reliance on a limited number of datasets causes a significant challenge in developing robust detection methods. This systematic review of the literature aims to investigate the generalisability of existing DeepFake detection techniques in published research studies.

∗ Corresponding author.
E-mail addresses: [email protected] (R. Ramanaharan), [email protected] (D.B. Guruge), [email protected] (J.I. Agbinya).

https://doi.org/10.1016/j.dim.2025.100099
Received 15 October 2024; Received in revised form 12 March 2025; Accepted 13 March 2025
2543-9251/© 2025 The Authors. Published by Elsevier Ltd on behalf of School of Information Management Wuhan University. This is an open access article under
the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).




This paper is organised as follows: Section 2 reviews related systematic reviews published between February 2022 and August 2024. Section 3 outlines the methodology used in this systematic literature review (SLR), while Section 4 covers the findings of this SLR. Section 5 is a discussion, and Section 6 concludes and highlights future work.

2. Background and related work

Researchers have conducted comprehensive systematic reviews to assess the current state of DeepFake detection research, while some studies have focused on conducting surveys of DeepFake detection techniques, providing a broader exploration of the methodologies and advancements in this domain. For example, a survey conducted by Masood et al. discusses the primary challenges associated with DeepFake detection and proposes potential countermeasures to address them. The authors categorised visual DF into five distinct categories based on the level of manipulation: face swap, lip syncing, puppet-mastery, complete face synthesis, and facial attribute manipulation. They also provide insights into the challenges that arise from post-processing operations such as noise effects, compression, and lighting variations (Masood et al., 2023). Researchers in Kaur, Noori Hoshyar, Saikrishna, Firmin, and Xia (2024) and Rana, Nobi, Murali, and Sung (2022) organise DeepFake research into four groups: methods based on machine learning, deep learning, statistical measurement, and blockchain technologies. The review conducted by Kaur et al. (2024) concluded by highlighting that hybrid methods hold significant potential to achieve high classification accuracy for fake videos in real time. These categories reflect the diverse techniques used to create and detect DF, emphasising the complexity of addressing this evolving technology. The review in Malik et al. (2022) provides insights into the challenges in generalising existing DF detection techniques. These include a lack of comprehensive DeepFake datasets, unknown types of attacks, inter-frame temporal consistency issues, and the additional effort required for labelling (or adding scores corresponding to the type of forgery in datasets). Mirsky and Lee (2021) proposed unpaired self-supervised training techniques to reduce the need for large training datasets in deep learning algorithms. Addressing these challenges will be crucial to developing more effective and generalisable DeepFake detection techniques.

Table 1 summarises eight previous systematic reviews on DeepFake detection techniques published between February 2022 and August 2024. These reviews focus primarily on the effectiveness of existing DeepFake detection techniques (Passos et al., 2024; Rana et al., 2022; Stroebel, Llewellyn, Hartley, Ip, & Ahmed, 2023), rather than on their generalisability. A systematic review conducted by Stroebel et al. (2023) explored the challenges associated with detecting DF across various modalities, including audio, images, and videos, and did not delve deeply into video-specific DeepFake detection techniques. It highlighted that combining CNN, DNN and LSTM models into ensembled and multi-attentional architectures will strengthen detection techniques. A review by Rana et al. (2022) categorises detection methods into three primary types: naive detectors, spatial detectors, and frequency detectors. In particular, their review reveals that 77% of the studies used deep learning-based approaches for detecting DF. Misirlis and Munawar (2023) reviewed 41 documents and discussed the risks, threats, and ethical considerations associated with DeepFake technology. However, their discussion lacks an analysis of the generalisation of DeepFake models and of the major issues and drawbacks in the existing detection models.

The authors in Sharma, Garg, and Caudron (2024) reviewed the datasets and features used to detect DF and proposed six prominent DeepFake detection techniques: XceptionNet, ResNet-50, VGG16, Capsule Networks, 3D-CNN, and EfficientNet-V2. Passos et al. (2024) evaluated the detection of DF using deep learning-based approaches. The authors emphasised that the combination of supervised and unsupervised learning techniques will support the handling of rapidly evolving and complex DF.

Existing systematic reviews in the field of DeepFake detection (Table 1) exhibit significant limitations. These reviews fail to critically evaluate the main features, strengths, and weaknesses of the techniques proposed in the selected publications. Additionally, most do not investigate the authors' views on the adaptability of their proposed models across diverse datasets, or their strengths and weaknesses with respect to developing them into generalisable models. As a result, the generalisability of the current models and their best practices require further investigation to provide clearer insights for the DeepFake research community.

This systematic review aims to address these gaps by offering a comprehensive evaluation of existing video DeepFake detection models. It analyses their strengths, distinctive features, and limitations while identifying the challenges associated with generalising these models to unseen manipulation types. Such insights will contribute to a more holistic understanding of the field, providing valuable information on the key features and algorithms of the tested models in a single, consolidated table, facilitating the design of more robust and effective algorithms.

Finally, Section 5 offers recommendations for future research, addressing challenges and incorporating the latest technological advancements in video DeepFake detection research. These recommended approaches aim to contribute to the development of more robust and generalisable detection models in the future.

3. Systematic literature review (SLR) methodology

This systematic review aims to explore existing DF detection techniques and provide the reader with an understanding of the strengths and weaknesses of current approaches, as well as the possibility of generalising these existing approaches. In particular, the contribution of this systematic review is threefold:

Investigation of detection techniques: An in-depth examination of the various algorithms used for the detection of DF is conducted to understand how these techniques can be applied in various contexts. These algorithms are evaluated with respect to their strengths, weaknesses, and accuracy in detecting DF, as shown in Table 3.

Analysis and classification: Investigation and categorisation of the approaches identified in the reviewed literature to identify the most widely used DeepFake detection technique, as shown in Fig. 6.

Generalisability of detection algorithms: The algorithms applied in the reviewed literature are analysed in terms of their potential to generalise across different datasets and real-life applications. This assessment is critical to understanding the gaps in existing research and guiding researchers towards solutions that can be applied in a variety of external environments. Finally, the paper discusses potential future advances that could lead to the next generation of solutions against DF.

To conduct a thorough and systematic investigation, we followed established principles and procedures for systematic reviews. The proposed SLR process is illustrated in Fig. 1 and includes four essential steps: formulate research questions, establish a systematic review protocol, review and evaluate the literature, and analyse the data and present the results.

3.1. Research questions

The formulation of research questions plays a vital role in determining the success of any systematic literature review (SLR). These questions are the general framework of the review as they determine how the available studies are selected, appraised, and summarised. Therefore, in the context of the dynamically evolving and highly competitive area of DeepFake detection, it is crucial that the review focus on the core questions that need to be addressed.

As for the questions of this review, we have posed specific questions that aim to identify what has been done in the DeepFake detection field, how well the existing solutions work, and whether the approaches are uniform across different domains.


Table 1
Related systematic reviews conducted between 2022 and 2024.

Rana et al. (2022). Period reviewed: Jan 2018–Dec 2020. Published: Feb 2022. Advantages: discussed widely used DDTs, features, and the datasets utilised. Limitations: no discussion of the strengths, weaknesses, and generalisability of the DF detection models.

Stroebel et al. (2023). Period reviewed: 2021–Aug 2022. Published: Mar 2023. Advantages: discussed current DDTs; proposed a hybrid model (CNN, DNN, LSTM), ensemble and multi-attentional designs as the best architectures, and a uniform rating system to validate these techniques. Limitations: no discussion of the strengths, weaknesses, and generalisability of the DF detection models.

Sharma et al. (2024). Period reviewed: 2017–2023. Published: Aug 2024. Advantages: evaluates DeepFake detection methods by discussing manipulations, optimisations, and enhancements; proposes six prominent DDTs for future research: XceptionNet, ResNet-50, VGG16, Capsule Networks, 3D-CNN, and EfficientNet-V2. Limitations: no discussion of the strengths, weaknesses, and generalisability of the DF detection models.

Whittaker, Mulcahy, Letheren, Kietzmann, and Russell-Bennett (2023). Period reviewed: Jun 2017–2021. Published: May 2023. Advantages: establishes a firmer conceptual understanding of DF and its implications for innovation. Limitations: no discussion of the strengths, weaknesses, and generalisability of the DF detection models.

Heidari, Jafari Navimipour, Dag, and Unal (2024). Period reviewed: 2020–2021. Published: Nov 2023. Advantages: how DF are generated and identified, the latest developments, and areas requiring more investigation. Limitations: no discussion of the generalisability of the DF detection models; only a brief discussion of the challenges and advantages of detection models.

Misirlis and Munawar (2023). Period reviewed: 2018–2019. Published: 2023. Advantages: risks and threats of DF. Limitations: generalisation is not discussed.

Vasist and Krishnan (2022). Period reviewed: Nov 2021–Apr 2022. Published: Aug 2022. Advantages: gaps in DF definitions and theoretical foundations. Limitations: generalisation is not discussed.

Passos et al. (2024). Period reviewed: 2018–2024. Published: Oct 2023. Advantages: current deep learning based DDTs and future directions for further studies. Limitations: no detailed discussion of the generalisability of the DF detection models.

Fig. 1. Systematic literature review process.

These questions are intended to help gain a broad perspective on the subject and to assess the effectiveness of the currently used techniques. In this way, we want to provide recommendations for future advances that may benefit the development of more effective and comprehensive systems for DeepFake detection. Table 2 provides a list of research questions aiming to cover every aspect of DeepFake detection, including methodological approaches and practical applications. These questions confirm the scope of the review and the crucial aspects of the objective of detection strategies.

3.2. Establishing a systematic review protocol

There has been a significant increase in interest in DeepFake technology in recent times, which has resulted in a large number of research papers. This comprehensive review aims to compile the most recent and innovative work published between January 2018 and February 2024. The primary goal of this review is to identify the latest trends in DeepFake creation and detection, which can provide valuable insights for future researchers. The review objectives have been defined and the appropriate search terms and publication selection criteria have been identified, as discussed in the following sections.

3.2.1. Data sources

This systematic review was conducted using publications from three databases of scientific articles covering the period from January 2018 to February 2024: Google Scholar (accessed October 2023), IEEE Xplore (accessed October 2023), and the ACM Digital Library (accessed October 2023). The inclusion of Google Scholar was based on the recognition in the research literature that it is essential to ensure comprehensive and efficient coverage.

The distribution of publication types in the reviewed literature highlights the predominance of journal articles over conference papers. As shown in Fig. 2, 54.2% of the publications are journal articles, while 45.8% are conference articles. This distribution suggests a substantial focus on DeepFake detection research within the academic community, with a significant portion of findings being disseminated through peer-reviewed journals. The balance between journal and conference publications also indicates the dynamic nature of the field, where ongoing developments are frequently shared at conferences before being published in more extensive journal articles.

3.2.2. Search strategy

In order to conduct a thorough search, various sources were consulted instead of relying on one or two sources.


Table 2
Research questions and their purposes.

RQ1: Which state-of-the-art methods are often used in the current literature on detecting DeepFake videos? Purpose: to identify and evaluate the most effective techniques currently being employed for detecting DeepFake videos.

RQ2: What are the main issues and constraints experienced by researchers and technologists when enhancing video DeepFake detection algorithms? Purpose: to understand the challenges and limitations faced in improving the accuracy and efficiency of video DeepFake detection methods.

RQ3: What data sources are utilised to evaluate video DeepFake detection techniques? Purpose: to analyse the datasets and benchmarks used to test and validate the performance of detection algorithms.

RQ4: Can video DeepFake detection be generalised? Purpose: to assess the adaptability and robustness of detection algorithms across different datasets and real-world conditions.

Due to the wide range of digital repositories, the search was limited to three main sources, namely Google Scholar, the ACM Digital Library, and the IEEE Xplore Digital Library. These were chosen because they contain extensive collections on digital forensics and machine learning.

The next step is to choose the appropriate search terms. To gather as much relevant material as possible, a comprehensive approach was taken. The objective was to ensure that no significant research was overlooked. Therefore, a wide range of search terms was used to minimise potential biases, effectively combining keywords using Boolean operators such as 'AND' and 'OR'. The initial search setup was broad, incorporating combinations such as: (DeepFake OR FaceSwap OR Video manipulation OR Fake face/image/video) AND (detection OR detect) OR (Facial Manipulation OR Digital Media Forensics).

Throughout the search process, the terms used to investigate DeepFake research were refined to better align with emerging trends and specific areas of interest. Initially, the search terms included "DeepFake detection AND (digital forensics OR detection algorithms OR detection methods) AND deep learning". As the search progressed, more specific terms were added, such as "DeepFake detection AND (digital forensics OR detection algorithms OR detection methods) AND facial DF learning OR deep learning OR video". Finally, the search concluded with a highly targeted search string: "DeepFake AND (generalisation in video forensics OR detection algorithms OR detection methods) AND Deep learning OR Machine learning", which focused specifically on studies that address the generalisation capabilities within deep learning frameworks for DeepFake detection. To ensure that the reviewed research was the most recent and innovative and to capture significant advancements in the field, the search was limited to the period of January 2018 to February 2024. This strategic approach allowed a thorough analysis of the current state of DeepFake detection, providing a solid foundation to comprehend the progress and ongoing challenges in the field.

Fig. 2. Type of the documents included in SLR.

3.3. Examining and evaluating the literature

3.3.1. Inclusion criteria

The following criteria were applied to include publications in the selected collection.

• Publications from 2018 until February 2024
• Conference proceedings and journals
• Research focused on DeepFake detection techniques

3.3.2. Exclusion criteria

A set of exclusion criteria was also established to omit studies that may not be relevant to this review.

• Studies that did not have precise descriptions of the machine learning or deep learning models used in their research.
• Review and survey papers: a discussion of the review and survey papers is included in the related research section, but they are excluded from the systematic literature review document collection to maintain a clear focus on primary studies and original research contributions to DeepFake detection.
• Research involving machine learning or deep learning solutions to problems not related to DeepFake detection.
• Technical reports and articles without a full paper.
• Master's and PhD research theses.

3.3.3. Study selection

Fig. 3 shows the various phases of the document selection process in this systematic review, developed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Page et al., 2021). In the early stages of the literature review, a wide range of articles was obtained from various academic databases. 200 articles were retrieved by a Google Scholar search. Other sources included the ACM and IEEE Digital Libraries, each of which retrieved 100 articles, for a total of 400 articles. These articles were reviewed with respect to the main algorithm used, the evaluation metrics, and the strengths and weaknesses discussed in the paper. The review identified and removed 36 duplicate entries, reducing the total to 364. A further 256 articles were then excluded because they did not strictly focus on DeepFake detection, leaving 108 articles that were directly related to the subject matter. This careful selection ensured that the literature used was relevant, forming a strong foundation for a detailed analysis.
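The screening arithmetic above is easy to verify. The short Python sketch below simply re-traces the counts reported in this subsection (400 records retrieved, 36 duplicates removed, 256 off-topic exclusions, 108 studies included); it is an illustrative check, not part of the authors' selection tooling.

```python
# Sanity check of the PRISMA screening counts reported in Section 3.3.3.
retrieved = {"Google Scholar": 200, "ACM Digital Library": 100, "IEEE Xplore": 100}

total = sum(retrieved.values())   # 400 records identified
after_dedup = total - 36          # 36 duplicate entries removed -> 364
included = after_dedup - 256      # 256 off-topic records excluded -> 108

print(f"identified={total}, after deduplication={after_dedup}, included={included}")
assert included == 108            # matches the 108 studies analysed in this review
```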


Fig. 3. PRISMA (Page et al., 2021).

4. Background on video DeepFake detection techniques

This section provides an overview of the key techniques and approaches used in video DeepFake detection, specifically focusing on manipulations within video DeepFakes. Detection methods are organised into three categories: spatial methods (frame-level sensing), temporal methods (sequence-level sensing), and multimodal methods (cross-stream integration). While some methods focus on mitigating spatial biases in individual frames, others aim to smooth temporal artifacts across the entire video sequence. Further improvements are achieved by using various advanced architectures, such as transformers and hybrid frameworks.

4.1. Spatial techniques: Frame-level detection

Spatial techniques form the backbone of DeepFake detection, where individual frames are analysed to find visual inconsistencies or artifacts introduced by the manipulation. These methods make use of static anomalies, such as blending errors, inconsistencies in texture, or unnatural lighting, that usually arise due to the limitations of DeepFake generation models. CNNs may be used for spatial analysis since they can extract features at the pixel level effectively. CNN-based techniques detect abnormalities, such as unnatural textures, shading mismatches, or compression artifacts, with strong detection performance for frame-level anomalies (Chang, Wu, Yang, & Feng, 2020; Liu, Boongoen, & Iam-On, 2024; Liu, Zhu, Lu, Luo, & Zhao, 2021; Taeb & Chi, 2022).

For improved spatial detection, edge and texture analysis is increasingly used when analysing manipulated images, helping to highlight the finer details. Methods such as Local Binary Patterns (LBP) investigate local textures to identify oversmoothing or slight textural anomalies associated with manipulated face areas. These techniques are superior for the detection of anomalies that are not visible to the human eye (Abdullah & Ali, 2023; Karanwal & Diwakar, 2023; Sedaghatjoo, Hosseinzadeh, & Bigham, 2024). In a similar vein, edge detection algorithms such as the Sobel and Canny filters are utilised to detect sharp discrepancies or transitions at the boundary between a tampered region and an untampered region. It was discovered that these transitions are strong indicators of frames that have undergone DeepFake manipulation, especially for frames containing blending artifacts (Chang et al., 2020; Khalil & Maged, 2021; Siegel, Kraetzer, Seidlitz, & Dittmann, 2021).

The second most important cue in spatial analysis is to notice discrepancies in colour and lighting. In distorted regions, manipulated frames usually show unnatural illumination gradients, synthetic shadows, or inconsistent reflections. Models trained on real and manipulated datasets learn to recognise these artificial effects while discounting the effects of natural lighting conditions (Bondi et al., 2020; Coccomini, Caldelli, Falchi, & Gennaro, 2023; Siegel et al., 2021). DeepFake content, on the other hand, is often detected near manipulated boundaries by shading mismatches and significant differences in the intensities of the pixels adjacent to the boundaries.

Another significant inconsistency identified by spatial methods is misaligned facial features. When face-swapping algorithms insert manipulated content but are unable to seamlessly integrate it into the original frame, the alignment of the eyes, lips, or facial proportions often shows errors. These spatial inconsistencies are particularly strong in early-generation DeepFakes (Chang et al., 2020; Taeb & Chi, 2022). Furthermore, unrealistic textures, such as overly smooth skin or checkerboard artifacts, are common artifacts produced by generative models, especially GANs, when generating DeepFake images (Karanwal & Diwakar, 2023; Liu et al., 2021).

Although spatial methods are highly successful in detecting frame-level anomalies, they are less effective against high-definition generated DeepFakes. As generation techniques become more sophisticated, artifacts in individual frames become increasingly subtle, making it more challenging for spatial methods to detect manipulations. Furthermore, these methods do not account for motion inconsistencies or sequence-level anomalies, necessitating the use of temporal or multimodal approaches to achieve exhaustive detection (Chang et al., 2020; Myvizhi & Pamila, 2022; Zhang, Wu, Li, Zhu, & Sheng, 2022). Some of these problems can be solved by combining temporal analysis with spatial techniques to identify dynamic inconsistencies throughout video sequences, leading to a firmer detection framework (Lewis et al., 2020; Myvizhi & Pamila, 2022).
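As an illustration of the texture and edge cues discussed in this subsection, the following Python sketch combines a uniform-LBP histogram with simple Sobel edge statistics for a single face crop. It is a minimal, hypothetical feature extractor built on scikit-image (the function name spatial_artifact_features is ours), not the pipeline of any of the cited studies.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.filters import sobel

def spatial_artifact_features(face_gray: np.ndarray, points: int = 8, radius: float = 1.0) -> np.ndarray:
    """Frame-level texture/edge descriptor for a 2-D grayscale face crop (e.g. uint8).

    Combines a uniform-LBP histogram (texture smoothness cues) with simple
    statistics of the Sobel edge magnitude (blending-boundary cues).
    """
    # Uniform LBP: over-smoothed GAN skin tends to concentrate mass in few bins.
    lbp = local_binary_pattern(face_gray, P=points, R=radius, method="uniform")
    n_bins = points + 2
    lbp_hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)

    # Sobel edge magnitude: sharp transitions near blended regions raise the tail statistics.
    edges = sobel(face_gray)
    edge_stats = np.array([edges.mean(), edges.std(), np.percentile(edges, 95)])

    return np.concatenate([lbp_hist, edge_stats])

# Usage sketch: feed the descriptor of each face crop to any binary classifier
# (e.g. logistic regression or an SVM) trained on real vs. manipulated frames.
```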


4.2. Temporal techniques: Sequence-level analysis

Temporal methods are aimed at discovering inconsistencies in motion or continuity between video frames, and generally rely on the temporal ordering of video streams to analyse DeepFake distortions. Unlike spatial methods that focus on specific frames, temporal techniques leverage the continuity and interrelation between adjacent frames to identify anomalies that static analysis may overlook, making them especially powerful in detecting artificial motion, lip synchronisation discrepancies, or sudden frame transitions, phenomena which naturally occur from the constraints of DeepFake generation models.

Motion dynamics is one of the most common methods used for temporal analysis. Modelling temporal dependencies in video sequences is primarily achieved using RNNs, particularly LSTMs, which excel at capturing subtle motion patterns and identifying abnormal patterns such as jittery head movements or unexpected frame-to-frame transitions. Likewise, optical flow-based methods track the displacement of facial features and objects between frames, enabling the detection of inconsistencies in motion patterns. This approach is particularly valuable for attack detection, as it draws inspiration from anomaly detection techniques by identifying poor frame alignments followed by interpolation (Myvizhi & Pamila, 2022; Zhang, Wu, et al., 2022).

Another essential aspect of sequence-level analysis is temporal consistency. DeepFake videos usually contain inconsistencies in expression or facial movement, such as eye blinking, head tilts, or lip movements. Temporal coherence models are trained to identify such subtle differences between frames, which are difficult to reproduce naturally in generated content. Temporal convolutional networks (TCNs) and hybrid CNN-RNN architectures have demonstrated great potential to discover such discrepancies (Lewis et al., 2020; Wang & Dantcheva, 2020).

One of the important aspects of temporal analysis is audio–visual synchronisation, as DeepFake videos do not always synchronise speech and lip movements accurately. Cross-modal evaluation models assess the timeliness and coherence of the audio and visual streams while measuring the differences between them in terms of synchronisation. This method is especially useful for spotting discrepancies, such as lagging lip movements or non-matching speech in manipulated audio–visual content (Lomnitz et al., 2020; Suratkar, Kazi, et al., 2020).

Temporal techniques can capture motion dynamics and sequence-level inconsistencies, but they are also limited. Advanced generation models can produce high-quality DeepFakes that minimise these temporal artifacts, making detection more difficult. Temporal methods are also usually more computationally demanding than frame-parallel techniques, because they must process blocks of frames rather than treating frames independently of each other. In spite of these issues, temporal techniques are often still necessary components in DeepFake video detection pipelines, and most thorough detection frameworks combine spatial or other multimodal methods with a temporal component (Taeb & Chi, 2022; Zhang, Wu, et al., 2022).

4.3. Multimodal techniques: Cross-stream analysis

Multimodal approaches combine diverse modalities, including visual, audio, and physiological signals, improving DeepFake detection accuracy and robustness. This method characterises inter-modal relationships, identifying discrepancies that may be missed in uni-modal processing. These methods are particularly useful for handling complex video DeepFake manipulations that break cross-stream correlations between modalities, such as audio and visual.

Multimodal techniques have been extensively used in audio–visual tasks. This approach addresses the detection of temporal misalignment between speech and lip movements, which is another potential flaw in DeepFake videos. These approaches use discrepancies between the audio and visual data streams to detect differences in timing, facial expression, and behaviour. Additionally, advanced models such as audio–visual transformers and cross-modal attention mechanisms have also shown remarkable performance in recognising lip-sync errors and are very effective in identifying audio–visual mismatches (Lomnitz et al., 2020; Suratkar, Kazi, et al., 2020).

Physiological signal analysis is another key component of multimodal detection. DeepFake videos typically fail to capture the subtle biometric cues that are unconsciously present in original videos, such as heart rate variations, skin tone changes, and micro-expressions. Remote photoplethysmography (rPPG) and ISP are examples of approaches that study facial colour variations to estimate the heart rate, making it possible to assess whether manipulation has occurred. It would be quite challenging to convincingly synthesise these physiological signals, which makes them a good indicator of whether tampering occurred (Coccomini et al., 2023; Taeb & Chi, 2022).

Hybrid feature fusion often merges spatial, temporal, and cross-modal features to offer a comprehensive detection framework. Most action recognition models consist of a spatial CNN to extract spatial features from single frames and an RNN or other temporal model to capture dependencies across frames. In combination with audio features, these approaches yield strong multimodal models that can detect complex anomalies cross-modally. By combining the complementary interactions among visual, audio, and physiological data, multimodal transformers and CNN-RNN hybrid frameworks have achieved considerable advancements in detection accuracy (Lomnitz et al., 2020; Muppalla, Jia, & Lyu, 2023).

Although multimodal methods are powerful, they are also computationally complex, often requiring the streams to be temporally aligned for implementation. Moreover, if they cannot detect discrepancies across the different modalities, they become ineffective against high-quality DeepFakes. Nevertheless, the combination of spatial, temporal and multimodal approaches enhances the robustness of detection systems, making them resilient to increasingly complex manipulations (Muppalla et al., 2023; Suratkar, Kazi, et al., 2020).

4.4. Hybrid frameworks and advanced architectures

As DeepFake technologies continue to evolve, researchers are attempting to combat them by implementing more advanced architectures and hybrid frameworks. These methods enable the fusion of multiple detection paradigms for cooperative analysis of spatio-temporal and multimodal data streams. This section reviews cutting-edge approaches that involve hybrid models and sophisticated architectures that improve detection precision and robustness to attacks.

4.4.1. Hybrid frameworks

Hybrid approaches exploit the strengths of multiple architectures to improve DeepFake detection. These models utilise complementary approaches to achieve strong accuracy in detecting spatio-temporal inconsistencies, which are essential for spotting manipulations in videos. Such frameworks capture both frame-level spatial features and sequence-level temporal anomalies, which contributes to a global detection method for video.

A hybrid framework that fuses CNNs and RNNs is presented in the literature. CNNs extract spatial features from each video frame, including textures and facial artifacts, while RNNs learn temporal dependencies from sequential frames. CNN-RNN hybrids, which combine these architectures, successfully catch spatio-temporal anomalies, such as unnatural lip-sync, abrupt transitions, and jerky movements. These frameworks respond significantly better to inconsistencies that static or sequential models alone could miss (Khormali & Yuan, 2022; Lomnitz et al., 2020).
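To make the CNN-RNN idea concrete, the PyTorch sketch below wires a per-frame ResNet-18 feature extractor into an LSTM and a binary classification head. The class name and hyperparameters are illustrative assumptions; the cited studies use their own, typically larger, backbones and training regimes.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnRnnDetector(nn.Module):
    """Minimal CNN-RNN hybrid: per-frame CNN features -> LSTM over time -> real/fake logit."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # spatial feature extractor
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.cnn = backbone
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)           # binary logit: fake vs. real

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, frames, 3, H, W)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.rnn(feats)              # last hidden state summarises the sequence
        return self.head(h_n[-1])                  # (batch, 1) logit

# Usage sketch: logits = CnnRnnDetector()(torch.randn(2, 8, 3, 224, 224))
```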


Fig. 4. Hybrid framework (Khan & Dang-Nguyen, 2022).

An alternative approach that is widely adopted is a hybrid method combining CNNs with transformers. As depicted in Fig. 4, convolutional neural networks (CNNs) take care of the main extraction of the spatial patterns, whereas transformers analyse the temporal patterns by means of the self-attention mechanism. This multi-grained union learns both short-distance and long-distance dependencies in the video data. Preprocessing steps such as facial landmark detection, normalisation and augmentation all contribute to making these models even more robust. CNN architectures such as XceptionNet and EfficientNet-B4 extract features, which are then fused with positional embeddings and fed through a series of transformer encoder blocks. This design is capable of reliably detecting advanced deceptions such as non-matching facial expressions or slight speed inconsistencies (Bondi et al., 2020; Gani-yusufoglu, Ngô, Savov, Karaoglu, & Gevers, 2020).

Hybrid frameworks are not limited to CNNs and transformers alone. Alternative techniques, such as CNNs with optical flow processing, emphasise motion sequences and temporal continuity for the detection of smaller DeepFake alterations. These frameworks successfully identify artifacts that can be caused by inconsistent movement between frames, ensuring increased accuracy (Bondi et al., 2020; Khormali & Yuan, 2021).

Although hybrid frameworks take advantage of billions of parameters from multiple architectures, they naturally incur higher computational costs. However, hybrid frameworks consistently outperform single-architecture models in terms of accuracy and generalisation on benchmark datasets (Gani-yusufoglu et al., 2020; Khormali & Yuan, 2021).

CNNs and transformers can thus provide spatial and temporal analysis combined into the hybrid framework shown in Fig. 4. In the front-end, a pre-processing module, such as facial cropping and normalisation, is followed by CNN-based spatial feature extraction. The transformer encoder blocks then capture temporal dependencies and long-range interactions among frames. When combined, these methods can successfully expose spatio-temporal inconsistencies, laying the foundation for a unified method of DeepFake detection (Khormali & Yuan, 2022; Lomnitz et al., 2020).

4.4.2. Advanced architectures

DeepFake detection techniques that use complex architectures are cutting-edge advances capable of handling ever more sophisticated manipulations. Deep learning based architectures can extract complex patterns and anomalies in manipulated content which can be used for detection purposes. The major approaches are GANs, transformers, capsule networks, disentangled representation learning, unsupervised domain adaptation, and improved feature extraction.

Generative adversarial networks (GANs)

In the context of DeepFake detection, GANs play a dual role. Originally proposed to generate realistic synthetic data, GANs have recently been adapted to detect the presence of manipulations by leveraging their adversarial training process. GAN-based detection models comprise a generator and a discriminator, with the generator producing the most realistic DeepFakes possible and the discriminator learning to differentiate between genuine and fake content. Running this process iteratively in an adversarial setting further improves the discriminator's ability to identify finer anomalies.

The GAN structure is depicted in Fig. 5 as a multi-generator and multi-discriminator framework (Coccomini et al., 2023; Ding et al., 2021). This multi-task framework trains three generators (H1, H2 and H3) to generate DeepFake photos across different object classes, producing both realistic and logically inconsistent outputs. These condensed and deceptive images force the discriminators to continually adapt, improving their ability to find minor adjustments in the generated images. This design pushes towards the generation of more and more advanced DeepFakes, forcing the discriminator to learn and improve its detection techniques (Ding et al., 2021; Giudice, Guarnera, & Battiato, 2021; Gong, Kumar, Goh, Ye, & Chi, 2021).

The architecture uses six components, designated J1 to J6, to serve as discriminators for an image depending on the layer of the network it passes through. Specifically, odd-numbered discriminators (J1, J3, and J5) are trained to assess real images, and even-numbered ones (J2, J4, and J6) are trained to discriminate fake images. This staged evaluation provides insight into subtle inconsistencies and thus guarantees that both original and re-crafted DeepFakes are examined in detail. The model enforces multiple layers of scrutiny, making it more effective at detecting sophisticated fakes (Coccomini et al., 2023; Ding et al., 2021; Giudice et al., 2021).
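The multi-generator, multi-discriminator framework of Fig. 5 is considerably more elaborate, but the adversarial loop it relies on can be sketched with a single generator and discriminator pair, as below. This is a deliberately reduced, hypothetical PyTorch example: after training, the discriminator's output is read as a "looks real" score, which is the sense in which GAN discriminators are reused for detection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Deliberately tiny generator/discriminator pair; real systems use convolutional nets.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_faces: torch.Tensor) -> None:
    """One adversarial round: D learns real vs. generated, G learns to fool D."""
    b = real_faces.size(0)
    noise = torch.randn(b, 64)

    # Discriminator update: push real towards 1, generated towards 0.
    fake = G(noise).detach()
    loss_d = F.binary_cross_entropy_with_logits(D(real_faces), torch.ones(b, 1)) \
           + F.binary_cross_entropy_with_logits(D(fake), torch.zeros(b, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: make D score generated samples as real.
    loss_g = F.binary_cross_entropy_with_logits(D(G(noise)), torch.ones(b, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# After training, torch.sigmoid(D(x)) can be read as a "looks real" score for a new sample x.
```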


Fig. 5. GAN framework (in Abbas and Taeihagh (2024) based on Ding et al. (2021)).

One of the most powerful aspects of GAN-based models is that they are amenable to adversarial training. This creates a feedback loop where the discriminator must consistently improve its ability to differentiate between real and generated samples, as it is pitted against progressively better generators. Because generation techniques are constantly evolving, this adversarial nature keeps GANs effective against sophisticated DeepFake strategies. This competitive setup generates strong and generalisable detection features (Coccomini et al., 2023; Giudice et al., 2021; Gong et al., 2021).

Transformers

The introduction of Transformers, specifically ViTs, has provided a robust architecture that is already widely used in DeepFake detection. One study proposes an alternative approach based on vision transformers (ViTs) for modelling video and depth sequences, a decision motivated by the use of self-attention mechanisms for modelling spatial and temporal dependencies. Traditional CNNs have a limited scope, being confined to what they can learn by focusing on localised features, while transformers process entire sequences and therefore lend themselves to long-range dependencies.

ViTs segment video frames into distinct patches and consider each patch as a token in a sequence. This method then allows the sensitive detection of minor abnormalities such as waddling or lip-sync discrepancies. Through the addition of positional encodings, transformers capture temporal order, making them more suitable for detecting temporal anomalies, such as unnatural blinking or sudden changes of appearance (Li et al., 2023; Vaswani, 2017).

Multimodal transformers generalise this further as they can deal with several data streams, for example visual and audio input. To improve detection performance, these models analyse the relations of cross-modal elements to identify synchronisation errors between speech and lip movements (Muppalla et al., 2023; Wang et al., 2022). This has made them critical for detecting complex manipulations, despite their computationally heavyweight nature.

Capsule networks

A unique advantage of capsule networks is the information they retain about spatial hierarchies within a video frame. Whereas CNNs tend to lose spatial relationships through pooling, capsule networks preserve the relative positions of features, such as facial landmarks. This property allows capsule networks to identify detailed alterations to textures and structures in facial components, even if the DeepFakes are low quality or compressed (Choudhary, Saurav, Saini, & Singh, 2023; Khalil, Youssef, & Saleh, 2021).

Capsule networks are particularly useful for detecting DeepFakes with subtle distortions, such as changes in skin texture or facial geometry. Their ability to capture part-whole relationships makes them a powerful tool for addressing complex manipulations (Stanciu & Ionescu, 2022).

Disentangled representation learning

Disentangled representation learning involves extracting domain-invariant features so that models can distinguish manipulated content from genuine content. This enables detection models to be trained with greater test performance by learning representations that are not specific to certain manipulation techniques. The disentangled features are more robust across various datasets and manipulation types, which makes them a good candidate for DeepFake detection under unseen conditions.

This method is particularly valuable in scenarios where DeepFakes are generated using novel or unknown techniques. By focusing on fundamental differences between real and fake content, disentangled representation learning ensures consistent detection performance (Ding et al., 2021; Jia, Cheng, Lu, & Zhang, 2022).

Unsupervised domain adaptation

The purpose of unsupervised domain adaptation is to align the feature distributions of the training dataset with the feature distributions of the testing dataset, so that models can generalise across different environments. This is important in practice, where testing data typically differ quite a bit from training data.

Methods such as domain adversarial neural networks (DANNs) are typically employed to align the two. This allows unsupervised domain adaptation to generalise well to manipulations not seen in the training data, as these methods tend to minimise the gap between the source and target domains. This is significant when combating DeepFake techniques which aim to escape detection by exploiting biases in models (Bondi et al., 2020; Chen & Tan, 2021).
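The gradient reversal trick at the heart of DANN-style adaptation can be sketched in a few lines of PyTorch, as below. The layer behaves as the identity in the forward pass and flips gradients in the backward pass, so the shared feature extractor is pushed towards dataset-invariant features. The class names are illustrative, and this is a sketch of the general technique rather than the implementation used in any cited paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DannDetector(nn.Module):
    """Feature extractor shared by a real/fake head and an adversarial domain head."""

    def __init__(self, in_dim: int = 512, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.features = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.fake_head = nn.Linear(256, 1)      # real vs. fake
        self.domain_head = nn.Linear(256, 1)    # source vs. target dataset

    def forward(self, x: torch.Tensor):
        z = self.features(x)
        fake_logit = self.fake_head(z)
        # The reversed gradient pushes `features` towards domain-invariant representations.
        domain_logit = self.domain_head(GradReverse.apply(z, self.lam))
        return fake_logit, domain_logit
```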


Enhanced feature extraction

Some advanced feature extraction methods seek to analyse particular characteristics of video content to detect hidden anomalies. Fourier transforms and spectral analysis are among the methods that focus on anomalies in frequency components and texture characteristics. These approaches perform well on strong manipulations such as high-quality DeepFakes (Giudice et al., 2021; Khormali & Yuan, 2021).

For example, high-frequency details can escape conventional types of detection, while spectral analysis isolates them. Since DeepFake synthesis introduces unique signatures, advanced feature extraction methods integrated into detection frameworks provide robustness (Giudice et al., 2021; Zhao et al., 2021).

5. Discussion

The selected 108 articles were reviewed to evaluate the applied DDTs, the datasets used, and their strengths and weaknesses with respect to the generalisability of the proposed models across different datasets. The analysis also included examining the authors' perspectives on why the proposed methods struggle to generalise effectively. Furthermore, insights on how generalisability might be achieved using the proposed algorithms were gathered. These details have been systematically analysed and summarised in Table 3 to provide a comprehensive overview of the current state of the research in this domain. The heatmap shown in Fig. 6 provides an overview of the different types of DDTs used and the categories of the models applied. The categorisation is based on the algorithms applied in the 108 selected papers, which will be discussed in Section 5.1. This comprehensive evaluation aims to provide a valuable guide for researchers by highlighting the challenges and potential solutions in developing robust and adaptable DeepFake detection algorithms.

Numerous enhancements have been proposed in the reviewed articles, showcasing innovative approaches to overcome challenges in detecting DeepFakes. Hybrid and multimodal models effectively blend the best aspects of various methods to improve accuracy and robustness. GAN-based models have emerged as a double-edged sword, playing a pivotal role in both DeepFake generation and detection. Their ability to generate highly realistic content necessitates the development of more sophisticated detection mechanisms to identify and exploit subtle artifacts in the generated media.

Transformers have been used to handle sequential data and capture long-range dependencies in videos and have become increasingly popular for video DF analysis tasks. Many novel approaches, such as capsule networks or disentangled representations, have been introduced to enhance the re-usability and adaptability of both generation and detection models. Capsule networks aim to better capture spatial hierarchies and relationships in data, while disentangled representations facilitate the isolation of specific features, improving model interpretability and transferability.

In the following sections, we delve into the key research questions that are essential to understanding the state-of-the-art methods often applied in the current literature and their generalisability in detecting DF. We have analysed the various detection techniques identified through our systematic literature review, evaluating their effectiveness and the datasets they utilised. In addition, we have examined recent advancements in detection technologies, highlighted current challenges and limitations, and discussed potential future trends that could influence the development of more generalisable DeepFake detection methods.

5.1. RQ1: What state-of-the-art methods are often used in the current literature on detecting DeepFake videos?

Convolutional Neural Networks (CNNs) have gained significant prominence in recent years due to their ability to capture spatial–temporal dependencies from adjacent frames in video sequences, such as small inconsistencies in facial details or textures that alter the semantics of objects (Bondi et al., 2020; Coccomini et al., 2022; Mcuba, Singh, Ikuesan, & Venter, 2023). However, detecting temporal manipulations in videos requires models capable of reasoning about changes over time. This has led to recent advancements in the use of hybrid models that combine Recurrent Neural Networks (RNNs) (Khormali & Yuan, 2022) and Vision Transformers (ViTs) (Saikia, Dholaria, Yadav, Patel, & Roy, 2022; Zhang, Zhao, & Li, 2020) for video-based DeepFake detection. The combination of CNNs with RNNs and ViTs enables models to overcome spatial–temporal gaps, ultimately improving detection performance.

CNNs, RNNs, ViTs and Generative Adversarial Networks (GANs) are designed to create DF, but they can also be utilised in adversarial training to improve the robustness of detection (Ding et al., 2021; Jung, Kim, & Kim, 2020a). Some of the most commonly used DDT models in the reviewed documents are discussed below.

Hybrid

One of the noteworthy hybrid approaches involves using CNNs together with RNNs, which have proven to work remarkably well in video-based detection tasks, as discussed in Section 4.4.1. Five prominent studies investigating CNN-RNN hybrids were Chintha et al. (2020), Cho et al. (2023), Jiang et al. (2021), Lewis et al. (2020), and Yadav, Bommareddy, and Vishwakarma (2022), as depicted in Fig. 7. Studies consistently show that hybrid CNN-RNN models outperform standalone CNN or RNN architectures, especially in tasks that involve spatial and temporal features. CNNs are well known for capturing spatial features, making them an excellent choice for detecting anomalies in an individual frame. Since DeepFake manipulations alter individual frames of the generated videos, RNNs (particularly Long Short-Term Memory (LSTM) networks) are combined with CNNs to analyse video frame sequences, detecting temporal inconsistencies such as jerky movements or scene-switch markers (Afchar, Nozick, Yamagishi, & Echizen, 2018; Ismail, Elpeltagy, Zaki, & ElDahshan, 2021b; Saikia et al., 2022). Hence, CNN-RNN hybrid architectures can capture higher-level spatio-temporal interactions in video sequences, such as small temporal deformations like facial expression shifts or lip-sync errors that occur across multiple frames. These models have achieved exceptionally high performance, with accuracy as high as 98% and precision rates as high as 81%.

However, more advanced techniques are needed to effectively identify manipulations across multiple frames, such as subtle facial changes or unnatural transitions that occur over time. For example, CNN-RNN hybrid networks combined with auto-encoders have been proposed as a promising approach to address these challenges (Agarwal et al., 2020; Kirn et al., 2022; Wang & Dantcheva, 2020). In addition, CNN-RNN hybrid networks have been extended to other hybrid approaches, such as fusing audio–visual data and utilising person-based techniques. These models incorporate multiple modalities in a fused form, enhancing their detection capabilities. Audio–visual attention models, for instance, detect discrepancies between audio and visual data, such as discordance between speech and lip movements (Ge et al., 2022; Muppalla et al., 2023). Additionally, identity-referenced models focus on assessing the compatibility of observed facial attributes with previously known identity information to improve detection accuracy (Lewis et al., 2020; Wang, Jiang, Jin & Cui, 2022). Since hybrid detection approaches combine visual, auditory, and identity-referenced cues, it becomes nearly impossible for DeepFake developers to manipulate all aspects of the entire media file simultaneously.
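The audio–visual attention models cited above are far richer than what can be shown here, but the basic late-fusion intuition, comparing time-aligned audio and lip-region embeddings and flagging clips where they disagree, can be sketched as follows. The embedding extractors are assumed to exist elsewhere and are not specified by the reviewed papers; the function names and the threshold are hypothetical.

```python
import torch
import torch.nn.functional as F

def av_sync_scores(audio_embed: torch.Tensor, visual_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between time-aligned audio and lip-region embeddings.

    audio_embed, visual_embed: (windows, dim) tensors produced by separate encoders
    (assumed to exist; not specified here). Low similarity over many windows is
    treated as a lip-sync mismatch cue.
    """
    return F.cosine_similarity(audio_embed, visual_embed, dim=-1)   # (windows,)

def flag_lip_sync_mismatch(audio_embed, visual_embed, threshold=0.3) -> bool:
    scores = av_sync_scores(audio_embed, visual_embed)
    # Flag the clip if the median per-window agreement falls below the threshold.
    return scores.median().item() < threshold

# Usage sketch with random stand-in embeddings:
# flag_lip_sync_mismatch(torch.randn(50, 128), torch.randn(50, 128))
```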


Fig. 6. Heatmap of models used across different categories.

For example, the fusion of models like YOLO-CNN-XGBoost (Rana & Sung, 2020) and the integration of models such as XceptionNet, LSTM, and CNN (Mcuba et al., 2023) demonstrate how ensemble approaches can effectively combine the strengths of different models. Ensemble methods address the limitation of single-model approaches in DeepFake detection by combining models specialised in detecting various types of DeepFake manipulations. The reviewed studies indicate that ensemble methods can achieve high accuracy (ranging from 88% to 97%), although performance depends on the models included in the ensemble and on the applied dataset (Lewis et al., 2020; Mcuba et al., 2023; Rana & Sung, 2020). Ensemble models combine several detection mechanisms to capture discrepancies in both spatial and temporal features across various types of DF, leading to improved performance. In addition, they are especially helpful in circumstances where a single detection method performs insufficiently on its own. Efficiently designed ensemble models can recognise a broader range of potential DeepFake alterations and unseen manipulations.
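To make the averaging idea concrete, the following is a minimal, hedged sketch of score-level ensembling over sampled video frames. The TinyFrameClassifier backbone, the three-member ensemble, and the 0.5 decision threshold are illustrative assumptions, not configurations reported in the reviewed studies.

```python
# A minimal sketch of score-level (averaging) ensembling for DeepFake
# classifiers; the backbone and threshold are illustrative placeholders.
import torch
import torch.nn as nn

class TinyFrameClassifier(nn.Module):
    """Stand-in frame-level detector producing a single fake-probability logit."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x):                      # x: (batch, 3, H, W)
        h = self.features(x).flatten(1)
        return self.head(h)                    # logits, shape (batch, 1)

@torch.no_grad()
def ensemble_video_score(models, frames):
    """Average per-frame fake probabilities across all models and frames."""
    probs = [torch.sigmoid(m(frames)) for m in models]   # list of (N, 1)
    return torch.stack(probs).mean().item()              # scalar in [0, 1]

if __name__ == "__main__":
    frames = torch.rand(16, 3, 224, 224)        # 16 sampled frames of one video
    detectors = [TinyFrameClassifier().eval() for _ in range(3)]
    score = ensemble_video_score(detectors, frames)
    print(f"fake probability: {score:.3f}", "-> fake" if score > 0.5 else "-> real")
```

In practice, the averaged members would be heterogeneous detectors (e.g. spatial and temporal specialists), which is what gives score-level fusion its robustness to manipulations that defeat any single model.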


Fig. 7. Variety of hybrid models used by researchers.

CNN-based techniques

CNNs are crucial for detecting spatial abnormalities in the frames of an image or video, recognising contortions in facial components, skin patterns, and suboptimal illumination conditions introduced by facial spoofing attackers (Bondi et al., 2020; Coccomini et al., 2022; Mcuba et al., 2023). Additionally, with temporal information support, these models help to analyse both the spatial and temporal features of videos (Coccomini et al., 2022; Saikia et al., 2022).

CNN-based methods, such as 3D-CNNs, have proven to be effective in capturing motion-related artifacts across multiple frames. As illustrated in Fig. 8, numerous studies have demonstrated the effectiveness of 3D-CNNs for DeepFake detection. 3D-CNNs are capable of learning fine-grained details from interpolated frames, particularly in detecting changes in facial expression and unrealistic interchanges between frames, as emphasised in Bondi et al. (2020), Liu et al. (2021) and Saikia et al. (2022). Other CNN variations besides 3D-CNNs, such as time-distributed CNNs and lightweight 3D-CNNs, are also being explored. These models are particularly useful for handling large-scale datasets where computational efficiency is crucial. Time-distributed CNNs process each video frame independently while preserving temporal connections, making them ideal candidates for detecting both spatial and temporal inconsistencies in video frame sequences (Liu et al., 2021; Zhang et al., 2020). Furthermore, lightweight 3D-CNNs offer competitive computational costs compared to traditional 3D-CNNs, making them well suited for real-time detection tasks. Generally, CNN-based methods achieve accuracy rates ranging from 77% to 88%, positioning them as strong options for DeepFake detection (Bondi et al., 2020; Liu et al., 2021; Wang, Cheng, Chow, & Nie, 2023).

Other variety of models

Although CNNs and hybrid approaches remain dominant in the field, other models, including multimodal approaches, have demonstrated significant potential for DeepFake detection. These models integrate data from various streams, such as audio and video, to enhance detection capabilities. As illustrated in Fig. 9, multi-modal visual and audio models have been studied in four research articles (Chen, Kumar, Nagarsheth, Sivaraman, & Khoury, 2020; Ge et al., 2022; Zhang, Lin, & Xu, 2024), showing their effectiveness in handling complex DeepFake manipulations.

Moreover, Transformers are among the most powerful tools for dealing with sequential data over time, making them highly effective for video-based detection. This review identifies five key papers that showcase the effectiveness of Vision Transformers (ViTs) in detecting DF across video frames by analysing both spatial and temporal features (Coccomini et al., 2022; Khan & Dai, 2021; Khormali & Yuan, 2022; Zhang et al., 2024, 2020). As transformers are capable of learning long dependencies between frames, they can discover slight anomalies that might be overlooked in frame-by-frame detection.

GANs serve a dual purpose in DF, as discussed in Section 4.4.2. The adversarial setup allows models to create new types of DeepFake manipulations, but it also encourages the use of GAN-based methods to detect DF (Ding et al., 2021; Hu, Wang, & Li, 2021; Jung, Kim, & Kim, 2020a). Specifically, these models achieve detection accuracies between 88% and 96%, showcasing their versatility and effectiveness on a variety of datasets and tasks (Ding et al., 2021; Hu et al., 2021; Jung, Kim, & Kim, 2020a).

Hybrid I3D ViViT models

An effective approach to improving generalisation in DFD is the development of hybrid models. One promising combination is the integration of the I3D (Inflated 3D ConvNet) and ViViT (Video Vision Transformer) architectures. I3D is highly effective at capturing spatial–temporal features in video data, enabling it to analyse motion and appearance across frames. ViViT, on the other hand, utilises its intrinsic transformer architecture to model long-range dependencies. By integrating these strengths, such a hybrid model could effectively detect manipulations across different video sequences. Although this combination is not explored in the selected literature, it has the potential to offer significant improvements over current DeepFake detection models, particularly in the generalisability of existing detection techniques across various datasets and DeepFake types (Coccomini et al., 2022; Khormali & Yuan, 2022; Zhang et al., 2020).
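As a rough illustration of this hybrid idea, the sketch below combines a small 3D convolutional feature extractor with a transformer encoder over clip-level features. The layer sizes, clip length, and pooling strategy are assumptions for demonstration only and do not reproduce the I3D or ViViT configurations of the cited works.

```python
# A minimal sketch of a hybrid detector: a 3D-CNN extracts per-clip
# spatio-temporal features, and a transformer encoder attends across clips.
import torch
import torch.nn as nn

class Hybrid3DCNNTransformer(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.cnn3d = nn.Sequential(                    # per-clip feature extractor
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)             # real/fake logit

    def forward(self, clips):                          # clips: (B, N, 3, T, H, W)
        b, n = clips.shape[:2]
        x = clips.flatten(0, 1)                        # (B*N, 3, T, H, W)
        feats = self.cnn3d(x).flatten(1).view(b, n, -1)   # (B, N, feat_dim)
        feats = self.temporal(feats)                   # long-range attention over clips
        return self.head(feats.mean(dim=1))            # (B, 1)

if __name__ == "__main__":
    video = torch.rand(2, 4, 3, 8, 64, 64)             # 2 videos, 4 clips of 8 frames
    logits = Hybrid3DCNNTransformer()(video)
    print(torch.sigmoid(logits).squeeze(1))            # per-video fake probability
```

The design intent is the same as discussed above: the convolutional stage captures short-range motion artifacts, while the transformer stage relates clips that are far apart in time.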


Fig. 8. Variety of CNN models.

Fig. 9. Other variety of models.

5.2. RQ2: What are the main issues and constraints experienced by researchers and technologists when enhancing video DeepFake detection algorithms?

Although significant advances have been made in DeepFake detection, substantial challenges persist.

Generalisation to new datasets The biggest challenge with DeepFake detection models is generalisation. A model may be well trained on a specific dataset but may still fail to produce accurate results, even if the test dataset is similar in category to the training dataset. This issue arises mainly from deficiencies in current datasets, such as FF++, DFDC, and Celeb-DF, which do not encompass the full spectrum of real-world DeepFake manipulations (Ge et al., 2022; Liu et al., 2021). As a result, detection algorithms frequently overfit to the available data distributions, making them less effective at adapting to new or unseen types of DF. Generalising well across datasets remains an open problem.

Availability of datasets & types One of the key limitations is the prevalence of homogeneous datasets. Although datasets such as FF++ and Celeb-DF provide a large variety of DeepFake examples, they do not include every type of manipulation in existence, especially the newer and more complicated techniques (Chen & Tan, 2021; Hu et al., 2021). This makes it much harder to train and test models across the full spectrum of DeepFake manipulations. Therefore, there is a growing demand for more comprehensive real-world datasets that support the development of more generalisable DeepFake detection systems.

Table 3. Summary of DDT used in the reviewed literature.
Title Datasets Model Used Strengths Weaknesses Generalisability
Coccomini et al. (2023) FF++, XceptionNet High detection accuracy for Limited scalability & poor No
DFDC manipulated videos by adaptability to unseen datasets;
leveraging depth-wise separable resource-intensive, require
convolutions in controlled high-end GPUs for inference.
environment.
Gani-yusufoglu et al. Celeb-DF, 3D-CNN Captures spatiotemporal Overfitting risk on small No
(2020) DFDC features, improving detection datasets, computationally
accuracy on manipulated video expensive due to
sequences. 3Dconvolutional operations.
Hu et al. (2021) FF++, Disentangled Enhances generalisation by Complex training, high Yes
Celeb-DF Representation Learning isolating domain-invariant computational overhead,
features, reducing dataset bias. sensitive to hyperparameter
setting.
Ismail, Elpeltagy, Zaki, FF++, YOLO-CNN-XGBoost Efficient feature extraction, Depend on quality of extracted Not Mentioned
Eldahshan, Kamal (2021a) UADFV classification pipeline provide features; struggles with
faster training, inference times. capturing temporal
inconsistencies.
Gong et al. (2020) DFDC, DCGAN Generates adversarial examples High computational cost for No
DeepFake to train detection models more generating adversarial samples
Detection robustly against diverse and convergence issues during
Challenge manipulations. GAN training.
Lewis et al. (2020) DFDC, Hybrid CNN-RNN Leverages multimodal Requires extensive training data Yes
Celeb-DF inconsistencies, improving and struggles with
robustness against high-quality high-resolution videos due to
DeepFakes. computational bottlenecks.
Nguyen, Tran, Nguyen, Celeb-DF, LSTM-CNN Effectively models temporal High inference latency and No
Truong, et al. (2021) FF++ inconsistencies in videos, difficulty handling longer
enhancing detection of sequences.
manipulated motion.
Guarnera et al. (2022) DFDC, Various (Ensemble) Provides a comprehensive Inconsistent results across Not Mentioned
Google benchmark for evaluating models; ensemble methods
DeepFake detection models across require high computational
datasets. resources.
Agarwal et al. (2020) FF++, Bi-LSTM Integrates behavioural and Limited performance on datasets Yes
Celeb-DF visual cues, improving detection with low variability in
accuracy for subtle behaviours or static images.
manipulations.
Rama chandran, UADFV, DeepFace Leverages pre-trained face Overfits to datasets focused on No
Nadimpalli, and Rattani DFDC embeddings, reducing training facial features, limiting general
(2021) time and improving accuracy on applicability to non-facial
facial regions. manipulations.
Rana and Sung (2020) DFDC, DeepStack Ensemble Combines diverse model Computationally expensive Not Mentioned
FF++ outputs, making it more robust ensemble setup, requiring
to adversarial examples. large-scale parallel
infrastructure.
Singh et al. (2020) FF++, Time-Distributed CNN Efficiently captures temporal Performance drops on long No
DFDC relationships in manipulated video sequences due to fixed
videos, enhancing sequential input size constraints.
analysis.
Siegel et al. (2021) FF++, Hand-Crafted Features Lightweight and interpretable, Limited performance against Not Mentioned
UADFV suitable for low-resource high-quality manipulations;
environments. lacks scalability for diverse
datasets.
Bansal et al. (2023a) Celeb-DF, Lightweight CNN Optimised for real-time Lower accuracy on No
DFDC detection with minimal latency. high-resolution fakes; lacks
robustness to sophisticated
manipulations.
Khalil et al. (2021) FF++, Capsule Networks Models hierarchical spatial High computational demand for Yes
DFDC relationships, improving training, making it unsuitable
accuracy on subtle facial for low-resource settings.
manipulations.
Bondi et al. (2020) DFDC, Data-Augmented CNN Improves robustness against Performance heavily dependent Not Mentioned
FF++ unseen manipulations through on quality and variety of
data augmentation techniques. augmentations.
Suratkar, Kazi, et al. Celeb-DF, Transfer Learning CNN Reduces training time by Requires significant fine-tuning Yes
(2020) FF++ leveraging pretrained networks, for new datasets; struggles with
maintaining high accuracy. domain shifts.

Khormali and Yuan (2021) DFDC, Attention Mechanism Enhances focus on critical Computationally expensive on No
Celeb-DF features, reducing false positives larger datasets; not optimised
in detection. for low-resource environments.
Liu et al. (2023) DFDC, Federated Learning Preserves privacy while High communication overhead Yes
Google improving generalisation across during training; requires
DeepFake diverse devices. synchronised device
participation.
Ismail, Elpeltagy, Zaki, FF++, YOLO-Face-RNN Combines accurate face High inference latency for large No
ElDahshan, A (2021b) Celeb-DF localisation with sequential datasets; struggles with
analysis for improved detection. non-facial manipulations.
Ge et al. (2021) FF++, Predictive Learns representations that High training complexity and Yes
Celeb-DF Representation Learning enhance generalisation across dependence on large-scale
datasets. datasets for effectiveness.
Wang, Jiang, et al. (2022) DFDC, Feature Point-Based Detects facial inconsistencies Limited accuracy for subtle No
Celeb-DF Detection efficiently using feature points, manipulations or high-quality
reducing computational cost. fakes.
Vinay et al. (2022) Celeb-DF, AFMB-Net Integrates heart rate analysis, Sensitive to video quality and No
UADFV leveraging physiological fails in cases of occlusion or
differences between real and low frame rates.
fake videos.
Wang and Dantcheva DFDC, 3D-CNN Evaluates multiple 3D-CNN Computationally expensive and No
(2020) FF++ architectures, identifying overfits on smaller datasets.
optimal configurations for
DeepFake detection.
Jung, Kim, and Kim UADFV, Eye Blinking Detection Utilises eye blinking Ineffective against high-quality No
(2020b) Celeb-DF inconsistencies to detect fake DeepFakes that accurately
videos, requiring minimal mimic blinking patterns.
computation.
Zi et al. (2020) Wild- CNN-RNN Introduces a real-world dataset, Dataset diversity limits Not Mentioned
DeepFake improving robustness testing for generalisation for extreme edge
detection models. cases.
Burroughs, Gokaraju, Roy, DFDC, Feature Engineering Employs hand-crafted features Limited scalability for No
and Khoa (2020) UADFV CNN to reduce computational cost high-quality DeepFake datasets.
while maintaining reasonable
accuracy.
Liu et al. (2021) DFDC, Lightweight 3D-CNN Optimised for lower latency and Lower accuracy on Yes
Celeb-DF reduced resource requirements, high-resolution manipulations.
supporting real-time detection.
Chintha et al. (2020) Celeb-DF, RNN-CNN Combines audio and video cues Requires high-quality audio and Yes
UADFV to enhance detection accuracy. video synchronisation, limiting
applicability.
Muppalla et al. (2023) DFDC, Audio-Visual CNN-RNN Multimodal integration Computationally expensive; Yes
FF++ improves robustness against relies on high-quality
diverse manipulations. multimodal data.
Lee, Lee, and Yoo (2023) FF++, Domain-Wise Clue Identifies domain-specific Struggles with unseen domains Yes
Celeb-DF Detection artifacts, increasing detection or datasets with minimal
reliability. artifacts.
Khormali and Yuan (2022) DFDC, Vision Transformer Leverages attention mechanisms High computational cost and Yes
Celeb-DF for robust detection of subtle memory requirements.
manipulations.
Saikia et al. (2022) FF++, CNN-LSTM with Optical Utilises motion inconsistencies, High computational demands Yes
Celeb-DF Flow enhancing temporal detection. for optical flow estimation.
Deng et al. (2022) Celeb-DF, EfficientNet-V2 Achieves high accuracy with Limited performance on No
DFDC lower computational cost low-resolution videos.
compared to traditional CNNs.
Younus and Hasan (2020) DFDC, Haar Wavelet Reduces computational cost by Struggles with temporal No
UADFV Transform extracting efficient spatial inconsistencies or
features. high-resolution manipulations.
Afchar et al. (2018) Celeb-DF, MesoNet Compact architecture suitable Limited robustness against No
UADFV for low-resource environments. high-quality manipulations.
Zhang, Wu, et al. (2022) DFDC, Cascaded-Hop Model Iteratively improves detection High latency due to multiple No
FF++ through cascaded layers, cascaded processing steps.
enhancing precision.

Lai, Wang, Feng, Hu, and DFDC, Multi-Feature Fusion Combines spatial, temporal, and Computationally expensive; Yes
Xu (2022) Celeb-DF frequency features for robust requires extensive preprocessing.
detection.
Gani-yusufoglu et al. FF++, Spatio-Temporal CNN Captures temporal artifacts High memory requirements; No
(2020) DFDC effectively in video sequences. struggles with high-resolution
videos.
Zhao et al. (2021) DFDC, MFF-Net Combines multimodal features, Requires significant Yes
Celeb-DF improving robustness against computational resources for
diverse manipulations. training.
Agarwal, Agarwal, Sinha, FF++, MD-CSDNetwork Cross-domain stitching improves High complexity due to Yes
Vatsa, and Singh (2021) Celeb-DF feature sharing across domains, multi-domain processing;
enhancing detection robustness. requires large datasets for
training.
Zhang et al. (2024) DFDC, Audio-Visual Attention Combines multimodal data with Resource-intensive; performance Yes
Celeb-DF with Contrastive contrastive learning, improving depends on the quality of
Learning generalisation and robustness. synchronised audio–visual data.
Lin et al. (2023) DFDC, Spatiotemporal Trident Efficiently captures spatial and Requires extensive No
FF++ Network temporal features with computational resources due to
multi-branch processing. multi-branch architecture.
Tang et al. (2024) Celeb-DF, DeepMark Framework Scalable framework capable of Performance varies with the Yes
UADFV integrating diverse detection integrated models; requires
methods. tuning for specific datasets.
Wang et al. (2023) DFDC, Convolutional Pooling Combines convolutional pooling High computational cost and Yes
Celeb-DF Transformer and transformer attention, memory requirements.
improving detection of subtle
manipulations.
Zhang et al. (2022) ADD Score Fusion Model Aggregates multiple model Overfitting risk due to reliance No
Challenge outputs, improving overall on specific dataset
Dataset accuracy for the challenge. characteristics.
Khan and Dang-Nguyen FF++, Hybrid Transformer Combines CNN and Transformer High resource requirements for Yes
(2022) Celeb-DF Network features for enhanced training; sensitive to
robustness. hyperparameter selection.
Xiao, Zhang, Yang, Wen, DFDC, Transformer Detects forgery by focusing on Limited performance on No
and Li (2023) Celeb-DF invariant features and enhanced low-resolution videos.
details.
Nadimpalli and Rattani FF++, GAN-based Adds visible watermarks, Watermarking can be removed No
(2023) Celeb-DF Watermarking making DeepFakes easier to by advanced forgery methods.
detect.
Zhang et al. (2022) DFDC, Spatiotemporal Dropout Efficiently captures Requires large-scale datasets to Yes
Celeb-DF Transformer spatiotemporal inconsistencies prevent overfitting.
with dropout regularisation.
Lugstein, Baier, Bachinger, DFDC, PRNU-Based Detection Leverages Photo-Response Limited to detecting No
and Uhl (2021) FF++ Non-Uniformity (PRNU) for manipulations visible in camera
robust detection of camera artifacts.
inconsistencies.
Jaleel and Hadi (2022) Celeb-DF, Action Unit Detection Exploits inconsistencies in facial Struggles with high-quality fakes No
DFDC action units for detection, mimicking natural expressions.
improving robustness.
İlhan, Balı, and Karaköse FF++, NASNetLarge CNN High accuracy with reduced Resource-intensive for training No
(2022) UADFV computational cost due to and fine-tuning.
efficient neural architecture
search.
Joseph and Nyirenda DFDC, Two-Stream Capsule Combines spatial and temporal High computational demands Yes
(2021) Celeb-DF Network streams for improved forgery and complex architecture.
detection.
Yadav et al. (2022) FF++, Generalised Model Enhances robustness by Limited by dataset diversity; Yes
Celeb-DF addressing overfitting with struggles with extreme edge
domain adaptation techniques. cases.
Garde, Suratkar, and Kazi DFDC, 3D-CNN Utilises advanced AI algorithms Lacks transparency in No
(2022) Celeb-DF for forgery detection. methodology and
generalisability metrics.
Ranjan, Patil, and Kazi FF++, Transfer Learning CNN Reduces training time while Requires fine-tuning for diverse Yes
(2020) UADFV improving performance across datasets.
datasets.
Zhao, Wang, and Lu DFDC, Two-Stream CNN Effectively captures global and High computational cost and Yes
(2020) Celeb-DF local inconsistencies in video sensitivity to noise in the data.
sequences.

Stephen and Mantoro Celeb-DF, lightweight 3D-CNN Specialises in detecting Limited to specific forgery No
(2022) DFDC face-swapping manipulations. types; lacks flexibility for other
manipulations.
Cho et al. (2023) Wild- Multimodal Provides insights into real-world Model performance varies Not Mentioned
DeepFake DeepFake detection challenges significantly depending on
datasets used
Pasupuleti, Tathireddy, DFDC, Custom Densenet High accuracy with feature Overfits to specific datasets; No
Dontagani, and Rahim FF++ reuse, reducing the struggles with unseen
(2023) computational footprint. manipulations.
Mitra, Mohanty, Corcoran, Social Key Frame CNN Reduces computational Limited performance on No
and Kougianos (2021) Media overhead by analysing key high-quality and temporally
Dataset video frames. inconsistent fakes.
Pryor, Dave, Vanamala, Hybrid CNN-SVM Hybrid Combines feature extraction High dependency on feature No
et al. (2023) Dataset (CNN) and effective selection quality; scalability
classification (SVM). issues.
Stanciu and Ionescu (2023) Celeb-DF, Autoencoder Enhances training data Performance highly dependent Yes
DFDC Augmentation diversity, improving robustness on augmentation quality.
against unseen manipulations.
Nirkin, Wolf, Keller, and DFDC, Context-Based CNN Detects inconsistencies between Requires high-resolution video No
Hassner (2021) FF++ facial features and their context data for effective detection.
in videos.
Jiang et al. (2021) FF++, CNN-RNN Adapts to new domains Struggles with datasets Yes
Celeb-DF effectively using domain containing minimal artifacts.
adaptation techniques.
Abdulhamid and Hashim DFDC, Enhanced Preprocessing Improved feature extraction Relies heavily on preprocessing No
(2023) Celeb-DF tech. to enhance overall techniques; computationally
detection accuracy. expensive.
Mallet, Krueger, Dave, and DFDC, MLP-LSTM Captures sequential features High memory consumption due Yes
Vanamala (2023) Celeb-DF efficiently, improving temporal to sequential processing.
detection accuracy.
Li et al. (2023) FF++, Self-Supervised Enhances detection by Resource-intensive, requiring Yes
Celeb-DF Transformer leveraging spatio-temporal extensive training data.
inconsistencies with
self-supervised learning.
Aduwala, Arigala, Desai, DFDC, GAN Discriminators Identifies manipulation artifacts Susceptible to adversarial No
Quan, and Eirinaki (2021) Celeb-DF effectively by leveraging attacks and requires high
adversarial features. computational resources.
Stanciu and Ionescu (2022) FF++, Capsule Networks Effectively captures hierarchical High computational cost and Yes
Celeb-DF spatial relationships for training complexity.
improved detection.
Heo, Yeo, and Kim (2023) DFDC, Improved Vision Leverages transformer attention Requires large-scale datasets Yes
Celeb-DF Transformer for subtle manipulation and high memory for training.
detection.
Bhaumik and Woo (2023) DFDC, Object Representation Focuses on object-level Limited robustness against No
Celeb-DF Analysis inconsistencies for improved highly realistic manipulations.
detection.
Bomma-reddy, Samyal, and DFDC, CNN with Adversarial Robust against adversarial Resource-intensive; requires Yes
Dahiya (2023) FF++ Training examples; improves detection careful adversarial training
generalisation. setup.
Liu, Li, Duan, and Huang FF++, Dual Transformer Combines spatial and temporal Computationally expensive and Yes
(2022) Celeb-DF transformers for comprehensive challenging to optimise.
detection.
Dong, Wang, Liang, Fan, Celeb-DF, Image Matching CNN Utilises image matching Limited performance on No
and Ji (2022) DFDC techniques to explain detection high-resolution videos.
decisions.
Mehta, Gupta, Video 3D-CNN Designed for real-time video Struggles with high-quality and No
Subramanian, and Dhall Confer- conferencing scenarios with low low-resolution fakes.
(2021) encing latency.
Dataset
Chen, Lin, Li, and Tan FF++, Generalised Learning Enhances generalisation by High sensitivity to noise in Yes
(2022) Celeb-DF Framework capturing intra-consistency and datasets; complex training
inter-diversity. pipeline.
Guefrachi et al. (2023) DFDC, 3D-CNN Achieves reasonable accuracy Limited robustness against No
UADFV with lightweight architecture. advanced manipulations.
Rahman et al. (2022) Low- 3D-CNN Optimised for low-resolution Limited applicability for No
Resolution DeepFake detection, reducing high-resolution and complex
Dataset computational cost. manipulations.

Tariq, Lee, and Woo DFDC, ConvLSTM Residual Effectively models temporal High memory requirements and Yes
(2020) Celeb-DF Network inconsistencies with residual computational demands.
learning.
Guefrechi, Jabra, and FF++, InceptionResnetV2 High accuracy due to advanced Computationally expensive and No
Hamam (2022) Celeb-DF architecture and residual overfits on small datasets.
connections.
Yang, Chen, and Zhong Celeb-DF, Multi-Attentional Model Combines spatial and channel High computational cost and Yes
(2023) DFDC information for robust detection. complex model tuning.
Hongmeng, Zhiqiang, Lei, Com- Super-Resolution CNN Detects artifacts in Struggles with high-quality No
Xiuqing, and Yuehan pressed hard-compressed videos with uncompressed videos.
(2020) Video super-resolution techniques.
Dataset
Arini, Bahaweres, and FF++ Xception + ResNet-50 Combines pretrained models for Limited by preprocessing steps No
Al Haq (2022) fast and efficient classification. and binary pattern extraction.
Beuve, Hamidouche, and DFDC, Dummy Triplet Loss Introduces triplet loss to Computationally intensive and Yes
Deforges (2021) Celeb-DF Model improve robustness against sensitive to hyperparameter
adversarial examples. tuning.
Chugh, Gupta, Dhall, and DFDC, Audio-Visual Focuses on audio–visual Requires high-quality and Yes
Subramanian (2020) Celeb-DF Dissonance inconsistencies, enhancing synchronised multimodal data.
multimodal detection.
Bansal et al. (2023b) DFDC, Artifact-Based CNN Targets manipulation artifacts Limited to artifacts visible in No
FF++ for accurate detection. specific datasets.
Maksutov, Morozov, Celeb-DF, Various ML Models Provides insights into the Model performance varies Not Mentioned
Lavrenov, and Smirnov DFDC performance of machine significantly by dataset.
(2020) learning techniques for
detection.
Khan and Dai (2021) FF++, Video Transformer Leverages incremental learning Requires large-scale resources Yes
DFDC to adapt to new datasets and for effective training.
manipulations.
Adnan and Abdulbaqi Celeb-DF, 3D-CNN Simplifies detection by focusing Struggles with detecting subtle No
(2022) DFDC on core convolutional features, manipulations in high-quality
improving processing speed. fakes.
Mira (2023) FF++, RNN Achieves reasonable detection Limited robustness to diverse No
Celeb-DF accuracy with minimal manipulations across datasets.
computational requirements.
Ganguly, Mohiuddin, Celeb-DF, Visual Attention Model Enhances detection by focusing Computationally intensive and Yes
Malakar, Cuevas and DFDC on forgery-critical regions using sensitive to attention weights.
Sarkar (2022) attention mechanisms.
Masud, Sadiq, Masood, DFDC, LW-DeepFakeNet Optimised for real-time Reduced accuracy for complex No
Ahmad, and Abd El-Latif FF++ detection with low manipulations in high-resolution
(2023) computational overhead. videos.
Baxevana-kis, Wild- MeVer Framework Practical insights into real-world Limited generalisability due to No
Kordopatis-Zilos, DeepFake deployment, improving dataset-specific optimisations.
Galopoulos, Apostolidis, scalability.
Levacher, Baris Schlicht,
Teyssou, Kompatsiaris, and
Papadopoulos (2022)
Humidan, Abdullah, and Com- Compression-Aware Handles artifacts in compressed Struggles with uncompressed No
Halin (2022) pressed Model videos, improving detection high-quality fakes.
Video accuracy.
Dataset
Kaddar, Fezza, DFDC, HCiT Model Combines CNN and Transformer Requires extensive tuning and Yes
Hamidouche, Akhtar, and Celeb-DF features for robust detection. high computational resources.
Hadid (2021)
Mitra, Mohanty, Corcoran, Social 3D-CNN Tailored for social media Struggles with high-resolution No
and Kougianos (2020) Media platforms, optimising for manipulations and diverse
Dataset low-resolution videos. datasets.
Gu et al. (2021) FF++, Spatiotemporal CNN Captures temporal Requires extensive training data Yes
DFDC inconsistencies effectively, for accurate predictions.
enhancing sequential detection.
Suratkar, Johnson, Celeb-DF, Transfer Learning CNN Reduces training time while Requires significant fine-tuning Yes
Variyambat, and Panchal UADFV improving generalisation across for new manipulations.
and Kazi (2020) datasets.
Asha, Vinod, and Menon DFDC, Temporal–Spatial Robust against adversarial High computational cost and Yes
(2023) Celeb-DF Defense Model attacks by leveraging both training complexity.
spatial and temporal features.

Jain, Korshunov, and Celeb-DF, Attribution-Based Focuses on model attribution to Requires careful labelling and Yes
Marcel (2021) DFDC Model improve generalisation across large-scale datasets for training.
datasets.
Ahmed and Sonuç (2023) FF++, Rationale-Augmented Enhances detection accuracy by Computationally demanding and No
DFDC CNN integrating rationale-based dependent on quality of
features. rationale extraction.
Ganguly, Ganguly, Celeb-DF, ViXNet Combines Vision Transformer & Requires extensive Yes
Mohiuddin, Malakar and DFDC Xception to enhance computational resources and
Sarkar (2022) spatial–temporal analysis. memory.
Beuve, Hamidouche, and DFDC, Dummy Triplet Loss Improves robustness to Sensitive to hyperparameter Yes
Déforges (2023) Celeb-DF Model adversarial attacks with tuning and training
hierarchical learning. configuration.
Ge et al. (2021) Celeb-DF, Predictive Learns latent patterns effectively High training complexity and Yes
FF++ Representation Learning for robust forgery detection. computational demands.
Zhang, Li, Lin, Zeng, and Celeb-DF, Temporal Dropout Used 3DCNN and 3D Inception High training complexity and Yes
Ge (2021) FF++, 3DCNN (TD-3DCNN) Modules to extract features and tested only on selected datasets.
DFDC Temporal Dropout to leverage
inconsistent cues in video
frames.
Choi, Kim, Jeong, Baek, Celeb-DF, Style latent flow Proposed temporal changes in High data preprocessing time. Yes
and Choi (2024) FF++, (StyleGRU module) style latent vector to generalise
DFD extracted from DF video detection.
consecutive frames of a
video is used as a cue.

Real-time detection and the speed of computation The computational complexity of most state-of-the-art DeepFake detection models is a significant challenge for detection systems. CNN-RNN hybrids, ViTs, and GAN-based approaches require substantial computing resources, particularly for real-time detection of high-resolution videos. This complexity can also limit the scalability of these models, making it difficult to deploy them in real-world applications where both timeliness and efficiency are critical (Ge et al., 2022). The balance between accuracy and computational efficiency remains a key area for future research.

5.3. RQ3: What data sources are utilised to evaluate video DeepFake detection techniques?

The choice of datasets in DeepFake detection benchmarks has a significant impact on the ability and resilience of these techniques. More importantly, it is crucial in determining the effectiveness and generalisability of the detection techniques. The datasets used in the reviewed literature are listed in the second column of Table 3; common datasets include:

FaceForensics++ (FF++) One of the most popular datasets for DeepFake detection is FF++, which contains a large collection of fake videos at different resolutions, covering both high- and low-quality DeepFakes. It forms an important benchmark that is very useful for evaluating detection algorithms (Liu et al., 2021).

DeepFake detection challenge (DFDC) With a wide range of DF, the DFDC dataset is one of the largest and most extensive datasets available. It is frequently used for assessing the detection accuracy of subtle facial manipulations and lip-sync in videos (Ge et al., 2022).

Celeb-DF A high-quality dataset covering facial reenactment techniques. This makes it an excellent testbed for sophisticated manipulation methods as well (Chen & Tan, 2021), particularly when evaluating the capacity of models to detect subtle manipulations.

Cross-dataset evaluation: For generalisability, a common strategy is to evaluate models on separate datasets, which Ming Liu et al. refer to as cross-dataset evaluation. This method is used to determine whether a detection model can adapt effectively across distribution changes and transformation techniques (Hu et al., 2021). The approach is valid, as DeepFake mechanisms keep improving and systems that generalise to the real world are needed, so that future models built on those data will be more widely applicable.

5.4. RQ4: Can video DeepFake detection be generalised?

Generalisation is probably one of the major challenges in DeepFake detection models.

Generalisation potential Most of the systematic review papers, as highlighted in Section 2, failed to discuss the strengths and weaknesses of existing DeepFake detection methods concerning their ability to generalise to new or unseen manipulations. The lack of such analysis makes it difficult to assess real-world applicability, especially when confronting DF different from the training datasets. These gaps in the literature limit how deeply one can understand the performance and robustness of these detection techniques in practice. As shown in Fig. 10, around 46.3% of researchers believe that DeepFake detection is generalisable across different types of DF and datasets. However, state-of-the-art models still degrade in performance when tested with novel or previously unobserved manipulations (Chen & Tan, 2021). CNN-RNN hybrids and ViTs perform quite well, but because they are trained on specific datasets, they remain somewhat limited in generalisation.

Fig. 10. Researchers discuss the potential for generalisability in DeepFake.

5.5. Recommendations

Importance of standardised datasets with a proper scoring system After reviewing these studies, we emphasise the need for a standardised dataset with an appropriate scoring system. This scoring system should reflect the types of DeepFakes in the dataset and their complexities.

Support cross-study analysis These scoring systems need a proper framework to support the analysis of detection models' performance across different datasets, ensuring that variations in DeepFake types and complexities are accounted for. By standardising the evaluation process, these scoring systems will enable a consistent and reliable evaluation of the performance of new DeepFake detection models.

In addition, these systems will provide a common platform for cross-study comparisons, facilitate the identification of best practices, and enhance reproducibility in DeepFake research; a minimal cross-dataset evaluation sketch is given below. Ultimately, this approach will contribute to the development of more robust and generalisable detection models.
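The sketch below illustrates the cross-dataset protocol and per-dataset reporting recommended above: a detector is fitted on one benchmark and its AUC is then reported on benchmarks it never saw. The synthetic_benchmark features, the logistic-regression detector, and the domain-shift parameter are synthetic stand-ins for illustration, not results obtained on FF++, DFDC, or Celeb-DF.

```python
# A minimal sketch of cross-dataset evaluation with per-benchmark AUC
# reporting; all data here are synthetic stand-ins for real benchmarks.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def synthetic_benchmark(n_videos, shift):
    """Stand-in for per-video detector features; `shift` mimics a domain gap."""
    labels = rng.integers(0, 2, n_videos)                       # 0 = real, 1 = fake
    feats = rng.normal(loc=labels[:, None] * (1.0 - shift),
                       scale=1.0, size=(n_videos, 8))
    return feats, labels

benchmarks = {
    "FF++ (train)": synthetic_benchmark(600, shift=0.0),
    "Celeb-DF (unseen)": synthetic_benchmark(300, shift=0.6),
    "DFDC (unseen)": synthetic_benchmark(300, shift=0.8),
}

X_train, y_train = benchmarks["FF++ (train)"]
detector = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, (X, y) in benchmarks.items():
    auc = roc_auc_score(y, detector.predict_proba(X)[:, 1])
    # In-domain AUC typically exceeds the cross-dataset AUCs, which is the
    # generalisation gap that a standardised scoring system should expose.
    print(f"{name:>20}: AUC = {auc:.3f}")
```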
6. Conclusion

This research provides a comprehensive analysis of DeepFake detection techniques, highlighting their strengths, weaknesses, and generalisability, aspects that have not been adequately addressed in the existing literature. The heatmap and Table 3 offer an overview of the primary detection models recently applied in the video DeepFake detection domain, emphasising their strengths, limitations, and the challenges associated with model generalisation. These insights serve as a valuable foundation for future research, providing a clear understanding of the current state of the field. Researchers can find the key features and algorithms of tested models in a single compiled table, facilitating the design of more robust and effective algorithms.

The findings indicate that detection models designed for specific datasets face significant limitations, particularly in generalising their findings to unseen or real-world situations. CNN-based models have shown strong performance, especially in identifying spatial inconsistencies in images and videos, yet their effectiveness tends to deteriorate when confronted with a DeepFake from a different environment. Furthermore, the tendency of these models to overfit to specific subsets of data has been identified as a critical weakness, severely limiting their applicability and effectiveness in real-world applications. It is important to note that this review has certain limitations and may not include studies published before 2018 or after February 2024.

Furthermore, current research supports multimodal and mixed approaches, where different detection methods are integrated to improve performance. Examples include combining CNN-RNN models or using audiovisual models that take advantage of both visual and auditory cues. These models have demonstrated significant improvements in robustness due to their ability to capture the spatial and temporal anomalies that are crucial for identifying complex manipulations in video DeepFakes. However, as with other advanced models, they come with substantial computational costs that can hinder their usability in real-time applications, especially in environments with restricted resources. For example, deploying such models on edge devices or in latency-sensitive systems remains a significant challenge.

The current literature on DeepFake detection exhibits limitations, primarily due to the reliance on a limited set of benchmark datasets, including FF++, DFDC, and Celeb-DF. These datasets fail to encompass the full spectrum of DF manipulations. Hence, models trained on these datasets are often biased toward specific types of DF manipulations and often fail to recognise new forms of manipulation. This lack of diversity in training data poses a critical challenge and highlights the need to develop extensive and diverse datasets that mimic a greater variety of the DeepFake manipulations present in real-world applications.

Moreover, sophisticated detection methods, such as transformers, GAN-based approaches, and capsule networks, are available but are computationally intensive and difficult to deploy. This limitation is especially significant in cases where real-time detection or response is necessary, making large-scale deployment difficult at present.

The challenge of generalising DeepFake detection models remains significant and requires further research. The survey revealed that 46.3% of the selected publications recognised that the proposed DeepFake detection techniques could be generalised across various types of DF and datasets. This emphasises the need for future research to focus on developing more adaptable detection models that maintain high accuracy rates across different testing environments and various datasets. Researchers must seek more sustainable and generalisable solutions that are not susceptible to overfitting. This includes enhancing data augmentation methods and developing more accurate datasets.

To overcome these challenges, it is imperative to direct new research efforts toward developing standardised datasets accompanied by an appropriate scoring system. These systems should accurately reflect the types of DeepFakes within the dataset and their associated complexities. Furthermore, a well-defined framework is essential to support cross-study comparisons, assist in identifying best practices, and improve reproducibility in DeepFake research. Ultimately, these approaches will contribute towards the development of more robust and generalisable detection models.

Abbreviations

The following abbreviations are used in this manuscript:

DF DeepFake
DDT DeepFake Detection Techniques
SLR Systematic Literature Review
ASARM Adaptive-Support ARM
DL Deep Learning
PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses
FF++ FaceForensics++
NM Not Mentioned
Y Yes
N No
DFDC DeepFake Detection Challenge
CNNs Convolutional Neural Networks
VGG Visual Geometry Group
GAN Generative Adversarial Networks
ViTs Vision Transformers
DCGAN Deep Convolutional Generative Adversarial Networks
CL Contrastive Learning
LSTM Long Short-Term Memory


CRediT authorship contribution statement Bomma-reddy, S., Samyal, T., & Dahiya, S. (2023). Implementation of a deepfake
detection system using convolutional neural networks and adversarial training. In
2023 3rd international conference on intelligent technologies (pp. 1–6). IEEE.
Ramcharan Ramanaharan: Writing – original draft, Visualiza-
Bondi, L., Cannas, E. D., Bestagini, P., & Tubaro, S. (2020). Training strategies and data
tion, Validation, Software, Methodology, Investigation, Formal analy- augmentations in cnn-based deepfake video detection. In 2020 IEEE international
sis, Data curation. Deepani B. Guruge: Writing – review & editing, workshop on information forensics and security (pp. 1–6). IEEE.
Supervision, Resources, Methodology, Formal analysis, Conceptualiza- Burroughs, S. J., Gokaraju, B., Roy, K., & Khoa, L. (2020). Deepfakes detection in videos
using feature engineering techniques in deep learning convolution neural network
tion. Johnson I. Agbinya: Writing – review & editing, Supervision,
frameworks. In 2020 IEEE applied imagery pattern recognition workshop (pp. 1–4).
Resources, Project administration, Methodology. IEEE.
Chang, X., Wu, J., Yang, T., & Feng, G. (2020). Deepfake face image detection based
Declaration of competing interest on improved VGG convolutional neural network. In 2020 39th Chinese control
conference (pp. 7252–7256). IEEE.
Chen, T., Kumar, A., Nagarsheth, P., Sivaraman, G., & Khoury, E. (2020). Generalization
The authors declare that they have no known competing finan- of audio deepfake detection. In Odyssey (pp. 132–137).
cial interests or personal relationships that could have appeared to Chen, H., Lin, Y., Li, B., & Tan, S. (2022). Learning features of intra-consistency and
influence the work reported in this paper. inter-diversity: Keys toward generalizable deepfake detection. IEEE Transactions on
Circuits and Systems for Video Technology, 33(3), 1468–1480.
Chen, B., & Tan, S. (2021). Featuretransfer: Unsupervised domain adaptation for
Acknowledgements cross-domain deepfake detection. Security and Communication Networks, 2021, 1–8.
Chintha, A., Thai, B., Sohrawardi, S. J., Bhatt, K., Hickerson, A., Wright, M., et al.
The authors acknowledge the MIT administration for providing (2020). Recurrent convolutional structures for audio spoof and video deepfake
detection. IEEE Journal of Selected Topics in Signal Processing, 14(5), 1024–1037.
equipment, work space, and other administrative supports.
Cho, B., Le, B. M., Kim, J., Woo, S., Tariq, S., Abuadbba, A., et al. (2023). Towards
understanding of deepfake videos in the wild. In Proceedings of the 32nd ACM
References international conference on information and knowledge management (pp. 4530–4537).
Choi, J., Kim, T., Jeong, Y., Baek, S., & Choi, J. (2024). Exploiting style latent flows
Abbas, F., & Taeihagh, A. (2024). Unmasking deepfakes: A systematic review of for generalizing deepfake video detection. In Proceedings of the IEEE/CVF conference
deepfake detection and generation techniques using artificial intelligence. Expert on computer vision and pattern recognition (pp. 1133–1143).
Systems with Applications, Article 124260. Choudhary, S., Saurav, S., Saini, R., & Singh, S. (2023). Capsule networks for computer
Abdulhamid, M. A., & Hashim, A. N. (2023). Enhanced preprocessing stage for feature vision applications: a comprehensive review. Applied Intelligence: The International
extraction of deepfake detection based on deep learning methods. In 2023 7th Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving
international symposium on innovative approaches in smart technologies (pp. 1–6). Technologies, 53(19), 21799–21826.
IEEE. Chugh, K., Gupta, P., Dhall, A., & Subramanian, R. (2020). Not made for each other-
Abdullah, M. T., & Ali, N. H. M. (2023). DeepFake detection improvement for images audio-visual dissonance-based deepfake detection and localization. In Proceedings
based on a proposed method for local binary pattern of the multiple-channel color of the 28th ACM international conference on multimedia (pp. 439–447).
space. International Journal of Intelligent Engineering & Systems, 16(3). Coccomini, D. A., Caldelli, R., Falchi, F., & Gennaro, C. (2023). On the generalization
of deep learning models in video deepfake detection. Journal of Imaging, 9(5), 89.
Adnan, S. R., & Abdulbaqi, H. A. (2022). Deepfake video detection based on convolu-
Coccomini, D. A., Caldelli, R., Falchi, F., Gennaro, C., & Amato, G. (2022). Cross-
tional neural networks. In 2022 international conference on data science and intelligent
forgery analysis of vision transformers and cnns for deepfake image detection. In
computing (pp. 65–69). IEEE.
Proceedings of the 1st international workshop on multimedia AI against disinformation
Aduwala, S. A., Arigala, M., Desai, S., Quan, H. J., & Eirinaki, M. (2021). Deepfake
(pp. 52–58).
detection using GAN discriminators. In 2021 IEEE seventh international conference
Deng, L., Suo, H., Li, D., et al. (2022). Deepfake video detection based on
on big data computing service and applications (bigDataService) (pp. 69–77). IEEE.
EfficientNet-V2 network. Computational Intelligence and Neuroscience, 2022.
Afchar, D., Nozick, V., Yamagishi, J., & Echizen, I. (2018). Mesonet: a compact facial
Ding, F., Zhu, G., Li, Y., Zhang, X., Atrey, P. K., & Lyu, S. (2021). Anti-forensics for
video forgery detection network. In 2018 IEEE international workshop on information
face swapping videos via adversarial training. IEEE Transactions on Multimedia, 24,
forensics and security (pp. 1–7). IEEE.
3429–3441.
Agarwal, A., Agarwal, A., Sinha, S., Vatsa, M., & Singh, R. (2021). MD-csdnetwork:
Dong, S., Wang, J., Liang, J., Fan, H., & Ji, R. (2022). Explaining deepfake detection by
Multi-domain cross stitched network for deepfake detection. In 2021 16th IEEE
analysing image matching. In European conference on computer vision (pp. 18–35).
international conference on automatic face and gesture recognition (FG 2021) (pp. 1–8).
Springer.
IEEE.
Du, M., Pentyala, S., Li, Y., & Hu, X. (2020). Towards generalizable deepfake detection
Agarwal, S., Farid, H., El-Gaaly, T., & Lim, S.-N. (2020). Detecting deep-fake videos
with locality-aware AutoEncoder. In Proceedings of the 29th ACM international
from appearance and behavior. In 2020 IEEE international workshop on information
conference on information & knowledge management (pp. 325–334). New York, NY,
forensics and security (pp. 1–6). IEEE.
USA: Association for Computing Machinery, http://dx.doi.org/10.1145/3340531.
Ahmed, S. R. A., & Sonuç, E. (2023). RETRACTED ARTICLE: Deepfake detection using
3411892.
rationale-augmented convolutional neural network. Applied Nanoscience, 13(2), Gambini, M., Fagni, T., Falchi, F., & Tesconi, M. (2022). On pushing DeepFake tweet
1485–1493. detection capabilities to the limits. In Proceedings of the 14th ACM web science
Arini, A., Bahaweres, R. B., & Al Haq, J. (2022). Quick classification of xception and conference 2022 (pp. 154–163). New York, NY, USA: Association for Computing
resnet-50 models on deepfake video using local binary pattern. In 2021 international Machinery, http://dx.doi.org/10.1145/3501247.3531560.
seminar on machine learning, optimization, and data science (pp. 254–259). IEEE. Ganguly, S., Ganguly, A., Mohiuddin, S., Malakar, S., & Sarkar, R. (2022). ViXNet:
Asha, S., Vinod, P., & Menon, V. G. (2023). A defensive framework for deepfake de- Vision transformer with xception network for deepfakes based video and image
tection under adversarial settings using temporal and spatial features. International forgery detection. Expert Systems with Applications, 210, Article 118423.
Journal of Information Security, 22(5), 1371–1382. Ganguly, S., Mohiuddin, S., Malakar, S., Cuevas, E., & Sarkar, R. (2022). Visual
Bansal, N., Aljrees, T., Yadav, D. P., Singh, K. U., Kumar, A., Verma, G. K., et attention-based deepfake video forgery detection. Pattern Analysis and Applications,
al. (2023a). Real-time advanced computational intelligence for deep fake video 25(4), 981–992.
detection. Applied Sciences, 13(5), 3095. Gani-yusufoglu, I., Ngô, L. M., Savov, N., Karaoglu, S., & Gevers, T. (2020). Spatio-
Bansal, S., et al. (2023b). Artifact based deepfake detection methods. In 2023 second temporal features for generalized detection of deepfake videos. arXiv preprint
international conference on informatics (pp. 1–6). IEEE. arXiv:2010.11844.
Baxevana-kis, S., Kordopatis-Zilos, G., Galopoulos, P., Apostolidis, L., Levacher, K., Garde, A., Suratkar, S., & Kazi, F. (2022). AI based deepfake detection. In 2022 IEEE
Baris Schlicht, I., et al. (2022). The mever deepfake detection service: Lessons learnt 1st international conference on data, decision and systems (pp. 1–6). IEEE.
from developing and deploying in the wild. In Proceedings of the 1st international Ge, S., Lin, F., Li, C., Zhang, D., Tan, J., Wang, W., et al. (2021). Latent pattern sensing:
workshop on multimedia AI against disinformation (pp. 59–68). Deepfake video detection via predictive representation learning. In Proceedings of
Beuve, N., Hamidouche, W., & Deforges, O. (2021). Dmyt: Dummy triplet loss for deep- the 3rd ACM international conference on multimedia in Asia (pp. 1–7).
fake detection. In Proceedings of the 1st workshop on synthetic multimedia-audiovisual Ge, S., Lin, F., Li, C., Zhang, D., Wang, W., & Zeng, D. (2022). Deepfake video detection
deepfake generation and detection (pp. 17–24). via predictive representation learning. ACM Trans. Multimedia Comput. Commun.
Beuve, N., Hamidouche, W., & Déforges, O. (2023). Hierarchical learning and dummy Appl., 18(2s), http://dx.doi.org/10.1145/3536426.
triplet loss for efficient deepfake detection. ACM Transactions on Multimedia Giudice, O., Guarnera, L., & Battiato, S. (2021). Fighting deepfakes by detecting gan
Computing, Communications and Applications, 20(3), 1–18. dct anomalies. Journal of Imaging, 7(8), 128.
Bhaumik, K. K., & Woo, S. S. (2023). Exploiting inconsistencies in object representations Gong, D., Goh, O. S., Kumar, Y. J., Ye, Z., & Chi, W. (2020). Deepfake forensics, an
for deepfake video detection. In Proceedings of the 2nd workshop on security ai-synthesized detection with deep convolutional generative adversarial networks.
implications of deepfakes and cheapfakes (pp. 11–15). International Journal, 9(3).


Gong, D., Kumar, Y. J., Goh, O. S., Ye, Z., & Chi, W. (2021). DeepfakeNet, an efficient Khan, S. A., & Dang-Nguyen, D.-T. (2022). Hybrid transformer network for deepfake
deepfake detection method. International Journal of Advanced Computer Science and detection. In Proceedings of the 19th international conference on content-based multime-
Applications, 12(6), 201–207. dia indexing (pp. 8–14). New York, NY, USA: Association for Computing Machinery,
Gu, Z., Chen, Y., Yao, T., Ding, S., Li, J., Huang, F., et al. (2021). Spatiotemporal http://dx.doi.org/10.1145/3549555.3549588.
inconsistency learning for deepfake video detection. In Proceedings of the 29th ACM Khormali, A., & Yuan, J.-S. (2021). Add: Attention-based deepfake detection approach.
international conference on multimedia (pp. 3473–3481). Big Data and Cognitive Computing, 5(4), 49.
Guarnera, L., Giudice, O., Guarnera, F., Ortis, A., Puglisi, G., Paratore, A., et al. (2022). Khormali, A., & Yuan, J.-S. (2022). DFDT: an end-to-end deepfake detection framework
The face deepfake detection challenge. Journal of Imaging, 8(10), 263. using vision transformer. Applied Sciences, 12(6), 2953.
Guefrachi, S., Jabra, M. B., Alsharabi, N. A., Othman, M. T. B., Alharabi, Y. O., Kirn, H., Anwar, M., Sadiq, A., Zeeshan, H. M., Mehmood, I., & Butt, R. A. (2022).
Alkholidi, A., et al. (2023). Deep learning based DeepFake video detection. In 2023 Deepfake tweets detection using deep learning algorithms. Engineering Proceedings,
international conference on smart computing and application (pp. 1–8). IEEE. 20(1), 2.
Guefrechi, S., Jabra, M. B., & Hamam, H. (2022). DeepFake video detection using Lai, Z., Wang, Y., Feng, R., Hu, X., & Xu, H. (2022). Multi-feature fusion based deepfake
InceptionResnetV2. In 2022 6th international conference on advanced technologies for face forgery video detection. Systems, 10(2), 31.