Representation Alignment for Generation:
Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu¹ · Sangkyung Kwak¹ · Huiwon Jang¹ · Jongheon Jeong²
Jonathan Huang³ · Jinwoo Shin^1* · Saining Xie^4*
¹ KAIST ²Korea University ³Scaled Foundations ⁴New York University
^*Equal Advising

[project page] [arXiv]

Summary: We propose REPresentation Alignment (REPA), a method that aligns noisy input states in diffusion models with representations from pretrained visual encoders. This significantly improves training efficiency and generation quality. REPA speeds up SiT training by 17.5x and achieves state-of-the-art FID=1.42.

1. Environment setup

conda create -n repa python=3.9 -y
conda activate repa
pip install -r requirements.txt

2. Dataset

Dataset download

Currently, we provide experiments for ImageNet. You can place the data that you want and can specifiy it via --data-dir arguments in training scripts. Please refer to our preprocessing guide.

3. Training

accelerate launch train.py \
  --report-to="wandb" \
  --allow-tf32 \
  --mixed-precision="fp16" \
  --seed=0 \
  --path-type="linear" \
  --prediction="v" \
  --weighting="uniform" \
  --model="SiT-XL/2" \
  --enc-type="dinov2-vit-b" \
  --proj-coeff=0.5 \
  --encoder-depth=8 \
  --output-dir="exps" \
  --exp-name="linear-dinov2-b-enc8" \
  --data-dir=[YOUR_DATA_PATH]

Then this script will automatically create the folder in exps to save logs and checkpoints. You can adjust the following options:

--models: [SiT-B/2, SiT-L/2, SiT-XL/2]
--enc-type: [dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, dinov1-vit-b, mocov3-vit-b, , mocov3-vit-l, clip-vit-L, jepa-vit-h, mae-vit-l]
--proj-coeff: Any values larger than 0
--encoder-depth: Any values between 1 to the depth of the model
--output-dir: Any directory that you want to save checkpoints and logs
--exp-name: Any string name (the folder will be created under output-dir)

For DINOv2 models, it will be automatically downloaded from torch.hub. For CLIP models, it will be also automatically downloaded from the CLIP repository. For other pretrained visual encoders, please download the model weights from the below links and place into the following directories with these names:

dinov1: Download the ViT-B/16 model from the DINO repository and place it as ./ckpts/dinov1_vitb.pth
mocov3: Download the ViT-B/16 or ViT-L/16 model from the RCG repository and place them as ./ckpts/mocov3_vitb.pth or ./ckpts/mocov3_vitl.pth
jepa: Download the ViT-H/14 model (ImageNet-1K) from the I-JEPA repository and place it as ./ckpts/ijepa_vith.pth
mae: Download the ViT-L model from MAE repository and place it as ./ckpts/mae_vitl.pth

[12/17/2024]: We also support training on 512x512 resolution (ImageNet) and a text-to-image generation on MS-COCO.

For ImageNet 512x512, please use the following script:

accelerate launch train.py \
  --report-to="wandb" \
  --allow-tf32 \
  --mixed-precision="fp16" \
  --seed=0 \
  --path-type="linear" \
  --prediction="v" \
  --weighting="uniform" \
  --model="SiT-XL/2" \
  --enc-type="dinov2-vit-b" \
  --proj-coeff=0.5 \
  --encoder-depth=8 \
  --output-dir="exps" \
  --exp-name="linear-dinov2-b-enc8-in512" \
  --resolution=512 \
  --data-dir=[YOUR_DATA_PATH]

You also need a new data preprocessing that resizes each image to 512x512 resolution and encodes each image as 64x64 resolution latent vectors (using stable-diffusion VAE). This script is also provided in our preprocessing guide.

For text-to-image generation, please follow the data preprocessing protocol in U-ViT before lanuching experiments. After that, you should be able to lanuch an experiment through the following script:

accelerate launch train_t2i.py \
  --report-to="wandb" \
  --allow-tf32 \
  --mixed-precision="fp16" \
  --seed=0 \
  --path-type="linear" \
  --prediction="v" \
  --weighting="uniform" \
  --enc-type="dinov2-vit-b" \
  --proj-coeff=0.5 \
  --encoder-depth=8 \
  --output-dir="exps" \
  --exp-name="t2i_repa" \
  --data-dir=[YOUR_DATA_PATH]

4. Evaluation

You can generate images (and the .npz file can be used for ADM evaluation suite) through the following script:

torchrun --nnodes=1 --nproc_per_node=8 generate.py \
  --model SiT-XL/2 \
  --num-fid-samples 50000 \
  --ckpt YOUR_CHECKPOINT_PATH \
  --path-type=linear \
  --encoder-depth=8 \
  --projector-embed-dims=768 \
  --per-proc-batch-size=64 \
  --mode=sde \
  --num-steps=250 \
  --cfg-scale=1.8 \
  --guidance-high=0.7

We also provide the SiT-XL/2 checkpoint (trained for 4M iterations) used in the final evaluation. It will be automatically downloaded if you do not specify --ckpt.

Note

It's possible that this code may not accurately replicate the results outlined in the paper due to potential human errors during the preparation and cleaning of the code for release. If you encounter any difficulties in reproducing our findings, please don't hesitate to inform us. Additionally, we'll make an effort to carry out sanity-check experiments in the near future.

Acknowledgement

This code is mainly built upon DiT, SiT, edm2, and RCG repositories.
We also appreciate Kyungmin Lee for providing the initial version of the implementation.

BibTeX

@inproceedings{yu2025repa,
  title={Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think},
  author={Sihyun Yu and Sangkyung Kwak and Huiwon Jang and Jongheon Jeong and Jonathan Huang and Jinwoo Shin and Saining Xie},
  year={2025},
  booktitle={International Conference on Learning Representations},
}

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
models		models
preprocessing		preprocessing
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dataset.py		dataset.py
generate.py		generate.py
generate_t2i.py		generate_t2i.py
loss.py		loss.py
requirements.txt		requirements.txt
samplers.py		samplers.py
samplers_t2i.py		samplers_t2i.py
train.py		train.py
train_t2i.py		train_t2i.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

Representation Alignment for Generation:
Training Diffusion Transformers Is Easier Than You Think

[project page] [arXiv]

1. Environment setup

2. Dataset

Dataset download

3. Training

4. Evaluation

Note

Acknowledgement

BibTeX

About

Uh oh!

Releases

Packages

Languages

Uh oh!

License

Uh oh!

ascndia/REPA

Folders and files

Latest commit

History

Repository files navigation

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

[project page] [arXiv]

1. Environment setup

2. Dataset

Dataset download

3. Training

4. Evaluation

Note

Acknowledgement

BibTeX

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Representation Alignment for Generation:
Training Diffusion Transformers Is Easier Than You Think

Packages