
MINOR PROJECT SYNOPSIS

Deep Learning-Based Detection of AI-Generated Videos

Submitted by:

Basharat Kaif (22BCS020)

Arsalaan Ahmed (22BCS012)

7th Semester, B.Tech

Department of Computer Science and Engineering

Jamia Millia Islamia

Under the guidance of Mr. Hannan Mansoor


Deep Learning-Based Detection of AI-Generated Videos

1. Title
Deep Learning-Based Detection of AI-Generated Videos

2. Abstract
The proliferation of sophisticated AI-based video generation and manipulation techniques,
commonly known as deepfakes, presents a significant threat to information integrity,
personal security, and public trust. While numerous deep learning-based detection methods
have been proposed, they often struggle with generalization to unseen manipulation
techniques and "in-the-wild" videos. Our method integrates spatiotemporal analysis with a
generative adversarial probing mechanism. It employs a multi-stream architecture,
including a Convolutional Neural Network (CNN) for high-frequency artifact detection, a
Vision Transformer (ViT) for global contextual inconsistencies, a Temporal Convolutional
Network (TCN) for identifying inter-frame anomalies, and a novel component that utilizes a
diffusion model to "probe" the authenticity of the video content. An attention-based fusion
module will intelligently combine the outputs from these streams to produce a final, robust
classification. This multi-faceted approach is designed to be more resilient to the evolving
nature of AI-generated media and to improve upon the generalization capabilities of current
state-of-the-art (SOTA) models.

3. Introduction
Recent advances in generative artificial intelligence (AI)—driven by powerful models such
as GANs and diffusion networks—have made it possible to synthesize photorealistic videos
that are almost indistinguishable from those captured by real cameras. These synthetic
videos, or deepfakes, are being weaponized in the spread of misinformation, impersonation
fraud, and digital manipulation. As a result, the demand for robust, scalable video
authenticity detection frameworks has become increasingly urgent.

Although early detection systems relied heavily on Convolutional Neural Networks (CNNs)
to detect spatial anomalies, their effectiveness is often limited to the specific types of
manipulations and datasets on which they are trained. More recent developments involve
the use of Vision Transformers (ViTs), which offer improved global context modeling and
have demonstrated superior performance in architectures like Swin-Fake and DFDT. Hybrid
CNN-ViT solutions have also been proposed to combine localized and contextual signals
more effectively.

Temporal modeling techniques such as LSTMs and Temporal Convolutional Networks
(TCNs) have further improved performance by capturing inter-frame inconsistencies like
flickering or unnatural motion. In parallel, frequency-domain features, extracted via
Discrete Cosine Transform (DCT), have been leveraged to identify GAN-specific generation
artifacts and enhance model robustness against adversarial attacks.

A new and promising direction in deepfake detection is the use of generative frameworks
themselves as part of the detection pipeline. For example, DiffusionFake employs a diffusion
model to reverse the generative process and expose subtle manipulation traces. Visual-
language models (VLMs) are also being explored to reformulate detection as a reasoning
problem, enabling capabilities such as zero-shot classification.

4. Proposed Method / Algorithm


We propose a multi-stream, hybrid deep learning framework for AI video detection. The
architecture, as illustrated in the conceptual flowchart below, is designed to capture a wide
range of artifacts and inconsistencies.

Figure 1: Conceptual Flowchart of the Proposed Hybrid Detection Framework.


The video input is processed through four parallel streams:
1. Spatial Inconsistency Stream: This stream uses a hybrid CNN-ViT model. A CNN
backbone (e.g., EfficientNet) will extract low-level features and potential
compression artifacts. These features will then be fed into a Vision Transformer to
model long-range dependencies and global inconsistencies in the frame.
2. Temporal Anomaly Stream: A Temporal Convolutional Network (TCN) will be used
to analyze the sequence of frames. This stream will focus on detecting temporal
artifacts such as flickering, unnatural facial movements, and inconsistent heart rate
signals that can be inferred from video.
3. Frequency Domain Stream: This stream will apply a Discrete Cosine Transform
(DCT) to the video frames to extract frequency domain features. This is particularly
effective for detecting certain types of GAN-based artifacts and can improve
robustness against adversarial attacks, as shown by Hooda et al. [8] (a minimal sketch of
this stream follows the list).
4. Generative Adversarial Probing Stream: This novel stream is inspired by recent
work on using generative models for detection, particularly the DiffusionFake
framework [9]. We will use a pre-trained latent diffusion model. The video frames
will be passed through the diffusion model's encoder, and we will analyze the
reconstruction error and the latent space representation. The intuition is that
authentic videos will have lower reconstruction error and a more "natural" latent
representation compared to manipulated videos (see the sketch after this list).
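To make streams 3 and 4 more concrete, the following is a minimal sketch under stated assumptions rather than the final implementation: the frequency features are computed with SciPy's 2D DCT, and the diffusion probe is approximated by the reconstruction error of a pre-trained latent-diffusion VAE loaded through the diffusers library (e.g. stabilityai/sd-vae-ft-mse). Function names, input sizes, and the choice of VAE are illustrative and not fixed by this synopsis.

```python
# Illustrative sketch of streams 3 and 4; names and settings are assumptions,
# not part of the synopsis itself.
import cv2
import numpy as np
import torch
from scipy.fftpack import dct

def dct_features(frame_bgr: np.ndarray, size: int = 224) -> np.ndarray:
    """Stream 3: log-magnitude of the 2D DCT of a grayscale, resized frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (size, size)).astype(np.float32) / 255.0
    # 2D DCT = 1D DCT along rows, then along columns
    coeffs = dct(dct(gray, axis=0, norm="ortho"), axis=1, norm="ortho")
    return np.log1p(np.abs(coeffs))  # compress the dynamic range of the spectrum

@torch.no_grad()
def reconstruction_error(frames: torch.Tensor, vae) -> torch.Tensor:
    """Stream 4 (simplified): per-frame reconstruction error under the VAE of a
    pre-trained latent diffusion model. `frames` is (N, 3, H, W) scaled to [-1, 1];
    `vae` could be diffusers.AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse").eval() -- an assumed stand-in for the full probe."""
    latents = vae.encode(frames).latent_dist.mode()   # latent representation
    recon = vae.decode(latents).sample                # decoded reconstruction
    return torch.mean((frames - recon) ** 2, dim=(1, 2, 3))
```

Both outputs would still need to be embedded (for instance, a small CNN over the DCT map and an MLP over per-frame error statistics) before entering the fusion module described next.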
The outputs from these four streams, which represent different "views" of the video's
authenticity, will be fed into an Attention-based Fusion Module. This module will learn to
dynamically weigh the importance of each stream's output for a given video, allowing the
model to adapt to different types of manipulations. The final output will be a classification
score indicating the probability of the video being real or AI-generated/manipulated.
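As a rough sketch of this step, one simple realization of the attention-based fusion is a learned softmax weighting over the four per-stream embeddings; the embedding size and layer widths below are illustrative assumptions, not design decisions from the synopsis.

```python
# Minimal sketch of an attention-based fusion module (dimensions are assumptions).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Learns a per-video weighting over the four stream embeddings, then
    classifies the weighted combination."""
    def __init__(self, embed_dim: int = 256, num_streams: int = 4):
        super().__init__()
        # One scalar "relevance" score per stream embedding
        self.score = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.classifier = nn.Linear(embed_dim, 1)  # real vs. AI-generated logit

    def forward(self, stream_embs: torch.Tensor) -> torch.Tensor:
        # stream_embs: (batch, num_streams, embed_dim), one embedding per stream
        weights = torch.softmax(self.score(stream_embs), dim=1)   # (B, S, 1)
        fused = (weights * stream_embs).sum(dim=1)                # (B, embed_dim)
        return self.classifier(fused).squeeze(-1)                 # (B,) logits

# Example: four 256-d embeddings (spatial, temporal, frequency, probing) per video
# logits = AttentionFusion()(torch.randn(8, 4, 256))
```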

5. Programming Environment & Tools Used


- Framework: PyTorch 2.1
- Backbones: ResNet-50 (spatial), Video Swin Transformer (temporal)
- Optical Flow Estimation: RAFT (Recurrent All-Pairs Field Transforms)
- Dataset Format: MP4 videos, preprocessed to extract 16–32 frames
- Training Tools: Google Colab, Weights & Biases for experiment tracking
- Libraries: OpenCV, TorchVision, timm, NumPy, Matplotlib
- Evaluation: Accuracy, AUC, F1-score, Confusion Matrix on validation/test splits
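A minimal sketch of the preprocessing and evaluation steps listed above, assuming OpenCV for uniform frame sampling and scikit-learn (not listed above, but a common choice) for the metrics; the frame count and decision threshold are illustrative.

```python
# Sketch only: assumes each MP4 opens cleanly and yields at least one frame.
import cv2
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

def sample_frames(video_path: str, num_frames: int = 16, size: int = 224) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from an MP4 and resize them."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(cv2.resize(frame, (size, size)), cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    return np.stack(frames)  # (num_frames, size, size, 3), uint8

def report(y_true, y_prob, threshold: float = 0.5) -> dict:
    """Validation metrics named in the synopsis: accuracy, AUC, F1, confusion matrix."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```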

6. References
[1] G. Tsaloli, E. Bampis, B. Moser, and A. C. Bovik, "A Multi-Modal In-the-Wild Benchmark of
Deepfakes Circulated in 2024," arXiv preprint arXiv:2503.02857, 2025.

[2] T. Le and S. Woo, "iFakeDetector: Real Time Integrated Web-based Deepfake Detection
System," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI),
2024.

[3] R. Pandey and A. K. S. Kushwaha, "Detecting deepfake videos: an enhanced hybrid deep
learning model," Multimedia Tools and Applications, 2024.

[4] K. Mishra, "Can You Spot the AI? A Journey into Video Detection Challenges," Medium, May
26, 2025.

[5] Y. Wang et al., "Swin-Fake: A Consistency Learning Transformer-Based Deepfake Video
Detector," Electronics, vol. 13, no. 15, p. 3045, 2024.

[6] A. Kumar et al., "Lightweight and hybrid transformer-based solution for quick and reliable
deepfake detection," Scientific Reports, vol. 15, no. 1, 2025.

[7] H. Heo et al., "DFDT: An End-to-End DeepFake Detection Framework Using Vision
Transformer," IEEE Transactions on Circuits and Systems for Video Technology, 2022.
[8] R. Hooda, M. Gupta, and N. Chand, "D4: Detection of Adversarial Diffusion Deepfakes Using
Disjoint Ensembles," in Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), 2024, pp. 642-651.

[9] S. Kim, J. Lee, and J. Kim, "DiffusionFake: Enhancing Generalization in Deepfake Detection via
Guided Stable Diffusion," arXiv preprint arXiv:2410.04372, 2024.

[10] Z. Zhang et al., "Visual Language Models as Zero-Shot Deepfake Detectors," arXiv preprint
arXiv:2507.22469, 2025.

[11] S. Li et al., "DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection," GitHub
repository, 2022.

[12] R. Sha et al., "An Analysis of Recent Advances in Deepfake Image Detection in an Evolving
Threat Landscape," arXiv preprint arXiv:2404.16212, 2024.
