Jonathan Huang3 · Jinwoo Shin1* · Saining Xie4*
1 KAIST 2Korea University 3Scaled Foundations 4New York University
*Equal Advising
[project page] [arXiv]
Summary: We propose REPresentation Alignment (REPA), a method that aligns noisy input states in diffusion models with representations from pretrained visual encoders. This significantly improves training efficiency and generation quality. REPA speeds up SiT training by 17.5x and achieves state-of-the-art FID=1.42.
conda create -n repa python=3.9 -y
conda activate repa
pip install -r requirements.txtCurrently, we provide experiments for ImageNet. You can place the data that you want and can specifiy it via --data-dir arguments in training scripts. Please refer to our preprocessing guide.
accelerate launch train.py \
--report-to="wandb" \
--allow-tf32 \
--mixed-precision="fp16" \
--seed=0 \
--path-type="linear" \
--prediction="v" \
--weighting="uniform" \
--model="SiT-XL/2" \
--enc-type="dinov2-vit-b" \
--proj-coeff=0.5 \
--encoder-depth=8 \
--output-dir="exps" \
--exp-name="linear-dinov2-b-enc8" \
--data-dir=[YOUR_DATA_PATH]Then this script will automatically create the folder in exps to save logs and checkpoints. You can adjust the following options:
--models:[SiT-B/2, SiT-L/2, SiT-XL/2]--enc-type:[dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, dinov1-vit-b, mocov3-vit-b, , mocov3-vit-l, clip-vit-L, jepa-vit-h, mae-vit-l]--proj-coeff: Any values larger than 0--encoder-depth: Any values between 1 to the depth of the model--output-dir: Any directory that you want to save checkpoints and logs--exp-name: Any string name (the folder will be created underoutput-dir)
For DINOv2 models, it will be automatically downloaded from torch.hub. For CLIP models, it will be also automatically downloaded from the CLIP repository. For other pretrained visual encoders, please download the model weights from the below links and place into the following directories with these names:
dinov1: Download the ViT-B/16 model from theDINOrepository and place it as./ckpts/dinov1_vitb.pthmocov3: Download the ViT-B/16 or ViT-L/16 model from theRCGrepository and place them as./ckpts/mocov3_vitb.pthor./ckpts/mocov3_vitl.pthjepa: Download the ViT-H/14 model (ImageNet-1K) from theI-JEPArepository and place it as./ckpts/ijepa_vith.pthmae: Download the ViT-L model fromMAErepository and place it as./ckpts/mae_vitl.pth
[12/17/2024]: We also support training on 512x512 resolution (ImageNet) and a text-to-image generation on MS-COCO.
For ImageNet 512x512, please use the following script:
accelerate launch train.py \
--report-to="wandb" \
--allow-tf32 \
--mixed-precision="fp16" \
--seed=0 \
--path-type="linear" \
--prediction="v" \
--weighting="uniform" \
--model="SiT-XL/2" \
--enc-type="dinov2-vit-b" \
--proj-coeff=0.5 \
--encoder-depth=8 \
--output-dir="exps" \
--exp-name="linear-dinov2-b-enc8-in512" \
--resolution=512 \
--data-dir=[YOUR_DATA_PATH]You also need a new data preprocessing that resizes each image to 512x512 resolution and encodes each image as 64x64 resolution latent vectors (using stable-diffusion VAE). This script is also provided in our preprocessing guide.
For text-to-image generation, please follow the data preprocessing protocol in U-ViT before lanuching experiments. After that, you should be able to lanuch an experiment through the following script:
accelerate launch train_t2i.py \
--report-to="wandb" \
--allow-tf32 \
--mixed-precision="fp16" \
--seed=0 \
--path-type="linear" \
--prediction="v" \
--weighting="uniform" \
--enc-type="dinov2-vit-b" \
--proj-coeff=0.5 \
--encoder-depth=8 \
--output-dir="exps" \
--exp-name="t2i_repa" \
--data-dir=[YOUR_DATA_PATH]You can generate images (and the .npz file can be used for ADM evaluation suite) through the following script:
torchrun --nnodes=1 --nproc_per_node=8 generate.py \
--model SiT-XL/2 \
--num-fid-samples 50000 \
--ckpt YOUR_CHECKPOINT_PATH \
--path-type=linear \
--encoder-depth=8 \
--projector-embed-dims=768 \
--per-proc-batch-size=64 \
--mode=sde \
--num-steps=250 \
--cfg-scale=1.8 \
--guidance-high=0.7We also provide the SiT-XL/2 checkpoint (trained for 4M iterations) used in the final evaluation. It will be automatically downloaded if you do not specify --ckpt.
It's possible that this code may not accurately replicate the results outlined in the paper due to potential human errors during the preparation and cleaning of the code for release. If you encounter any difficulties in reproducing our findings, please don't hesitate to inform us. Additionally, we'll make an effort to carry out sanity-check experiments in the near future.
This code is mainly built upon DiT, SiT, edm2, and RCG repositories.
We also appreciate Kyungmin Lee for providing the initial version of the implementation.
@inproceedings{yu2025repa,
title={Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think},
author={Sihyun Yu and Sangkyung Kwak and Huiwon Jang and Jongheon Jeong and Jonathan Huang and Jinwoo Shin and Saining Xie},
year={2025},
booktitle={International Conference on Learning Representations},
}