CAFA (Controllable Automatic Foley Artist) is a controllable text-video-to-audio model for Foley sound generation. Given a short video and a textual prompt, CAFA generates a synchronized audio waveform that matches both the visual content and the desired semantics described in the prompt. This allows users to modify or override the natural sound of the video by changing the prompt, enabling fine-grained control over the generated audio.
This repository provides the inference tools, pretrained weights, and test results to reproduce our results or build upon our work.
*(Demo video: `demo_video_compressed.mp4`)*
For more examples, visit our demo page.
- We recommend working in a fresh environment:
```bash
git clone https://github.com/finmickey/CAFA.git
cd CAFA

# create env
python -m venv env
source env/bin/activate
```
- Install the prerequisites if not installed yet:
```bash
pip install torch torchvision torchaudio wheel

# Used for downloading the ckpt
git lfs install
```
- Install the requirements (using the legacy resolver speeds up installation of the CUDA dependencies, but it is optional):
```bash
pip install -r requirements.txt --use-deprecated=legacy-resolver
```
- Download the checkpoints and config files. Note that we use the avclip model from Synchformer, but with a different config:
```bash
mkdir ckpts
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/MichaelFinkelson/CAFA-avclip ckpts/
cd ckpts
git lfs pull
cd ..
wget -O ckpts/avclip.pt https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/sync/sync_models/24-01-04T16-39-21/24-01-04T16-39-21.pt
```
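As an optional sanity check, you can confirm that PyTorch was installed correctly and that a CUDA device is visible before running inference (assuming a CUDA-capable machine):
```bash
# optional: print the installed torch version and whether a CUDA GPU is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```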
An inference script is provided at `demo.py`:
```bash
python demo.py --video_path <path_to_video_file.mp4> --prompt "your_prompt"
```
The output is saved to `./output` by default, as both a `.wav` file and a combined `.mp4` file.
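Since generation is conditioned on both the video and the prompt, the same clip can be re-rendered with different sounds by changing only the prompt. A minimal sketch, using a hypothetical file name and prompts:
```bash
# same video, two different prompts -> two different Foley tracks
python demo.py --video_path dog_park.mp4 --prompt "a dog barking"
python demo.py --video_path dog_park.mp4 --prompt "footsteps on gravel"
```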
The computed video embeddings are cached at `./embeds` and can be reused for faster generation:
```bash
python demo.py --video_path <path_to_video_file.mp4> --embed_path <path_to_embedding_file.npy> --prompt "your_prompt"
```
The model supports generation of up to 10 seconds of audio; longer videos are trimmed to 10 seconds.
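For example, a first run computes and caches the embedding, and later runs can point `--embed_path` at the cached file. The file names below are hypothetical; check `./embeds` for the actual cached file name:
```bash
# first run: computes the video embedding and caches it under ./embeds
python demo.py --video_path dog_park.mp4 --prompt "a dog barking"

# later run: reuse the cached embedding (hypothetical file name) for faster generation
python demo.py --video_path dog_park.mp4 --embed_path embeds/dog_park.npy --prompt "rain on a tin roof"
```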
Common parameters:
- `--cfg`: Classifier-free guidance scale (default: 7.0)
- `--steps`: Number of diffusion steps (default: 50)
- `--seed`: Random seed (default: 42)
- `--asym_cfg`: Asymmetric CFG scale (default: 0.5)
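For example (illustrative values and a hypothetical file name), the sampling behavior can be adjusted as follows:
```bash
# stronger guidance and more diffusion steps, with a fixed seed for reproducibility
python demo.py --video_path dog_park.mp4 --prompt "a dog barking" \
    --cfg 9.0 --steps 100 --seed 0 --asym_cfg 0.5
```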
For all available options:
```bash
python demo.py --help
```
We provide the model's generations on the VGGSound test set at this huggingface dataset.
```bibtex
@inproceedings{benita2025controllableautomaticfoleyartist,
  title={Controllable Automatic Foley Artist},
  author={Roi Benita and Michael Finkelson and Tavi Halperin and Gleb Sterkin and Yossi Adi},
  year={2025},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  url={https://arxiv.org/abs/2504.06778},
}
```
The code is primarily based on stable-audio-tools and Synchformer.