Soumyya Kanti Datta, Tanvi Ranga, Chengzhe Sun, Siwei Lyu
Accepted by APAI ICCV 2025
The rise of manipulated media has made deepfakes a particularly insidious threat, spanning generative manipulations such as lip-sync modification, face swapping, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme–viseme alignment thresholds, basic frame-level consistency checks, or unimodal detection strategies, fail to reliably identify modern deepfakes produced by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These techniques generate nearly flawless individual frames yet inadvertently introduce subtle temporal discrepancies that traditional detectors frequently overlook. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), that incorporates language, dynamic face motion, and facial identity cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated approach significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities.
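For intuition only, the sketch below shows one way the three streams described above (phoneme tokens, lip-geometry trajectories, and per-frame identity embeddings) could be fused for real/fake classification. It is a minimal PyTorch illustration: the layer choices, dimensions, and late-fusion head are assumptions for exposition and do not reproduce the released PIA model.

    # Minimal sketch of a multimodal deepfake classifier (illustrative only).
    # All module names, sizes, and the late-fusion strategy are assumptions;
    # they do NOT reproduce the released PIA architecture.
    import torch
    import torch.nn as nn

    class MultimodalDeepfakeClassifier(nn.Module):
        def __init__(self, num_phonemes=70, id_dim=512, lip_dim=40, hidden=128):
            super().__init__()
            # Phoneme stream: embed phoneme tokens, summarize with a GRU.
            self.phoneme_embed = nn.Embedding(num_phonemes, hidden)
            self.phoneme_gru = nn.GRU(hidden, hidden, batch_first=True)
            # Lip-geometry stream: per-frame landmark features over time.
            self.lip_gru = nn.GRU(lip_dim, hidden, batch_first=True)
            # Identity-dynamics stream: per-frame face-identity embeddings over time.
            self.id_gru = nn.GRU(id_dim, hidden, batch_first=True)
            # Late fusion: concatenate the three summaries and classify real vs. fake.
            self.head = nn.Sequential(
                nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
            )

        def forward(self, phonemes, lips, ids):
            # phonemes: (B, T_p) int64, lips: (B, T_v, lip_dim), ids: (B, T_v, id_dim)
            _, h_p = self.phoneme_gru(self.phoneme_embed(phonemes))
            _, h_l = self.lip_gru(lips)
            _, h_i = self.id_gru(ids)
            fused = torch.cat([h_p[-1], h_l[-1], h_i[-1]], dim=-1)
            return self.head(fused)  # logits over {real, fake}

    # Toy forward pass with random tensors, just to show the expected shapes.
    model = MultimodalDeepfakeClassifier()
    logits = model(torch.randint(0, 70, (2, 30)),
                   torch.randn(2, 90, 40),
                   torch.randn(2, 90, 512))
    print(logits.shape)  # torch.Size([2, 2])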
- Python 3.12
- Install the required packages: pip install -r requirements.txt
- Install espeak-ng: apt-get update -y && apt-get install -y espeak-ng
- Model weights can be found in the ./checkpoints folder.
- The input video should contain the face of only one subject throughout the entire video.
- Each frame should contain exactly one face (a sketch of such a check appears after this list).
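The snippet below is an optional pre-check illustrating how the single-face requirement could be verified before running the pipeline. It is not part of this repository; it uses OpenCV's Haar cascade purely as an example detector, and the file path is a placeholder.

    # Optional pre-check (not part of the released pipeline): verify that every
    # frame of the input video contains exactly one detectable face.
    import cv2

    def check_single_face(video_path: str) -> bool:
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
        )
        cap = cv2.VideoCapture(video_path)
        frame_idx, ok = 0, True
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces) != 1:
                print(f"Frame {frame_idx}: found {len(faces)} faces")
                ok = False
            frame_idx += 1
        cap.release()
        return ok

    if __name__ == "__main__":
        print(check_single_face("input.mp4"))  # hypothetical path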
The input video should be in MP4 format:
python main.py --video {input_video_path} --outdir {output_path}
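For example, with placeholder paths: python main.py --video ./samples/demo.mp4 --outdir ./results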
@misc{datta2025piadeepfakedetectionusing,
title={PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis},
author={Soumyya Kanti Datta and Tanvi Ranga and Chengzhe Sun and Siwei Lyu},
year={2025},
eprint={2510.14241},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.14241},
}
This work is supported by the Center for Identification Technology Research (CITeR) and the National Science Foundation under Grant No. 1822190.