[project page] [paper]
- [2025.07.11] We updated the PCA visualization code used in our paper!
- [2025.06.14] We updated the results and checkpoint of SiT+SRA on ImageNet 512x512!
- [2025.05.06] We have released the paper and code of SRA!
- **The diffusion transformer itself provides representation guidance:** We assume that the discriminative process unique to diffusion transformers makes it possible to provide this guidance without introducing any extraneous representation component.
- **Self-Representation Alignment (SRA):** SRA aligns the output latent representation of the diffusion transformer at an earlier layer with higher noise to that at a later layer with lower noise, thereby achieving self-representation alignment (see the sketch after this list).
- **Improved performance:** SRA accelerates training and improves generation performance for both DiTs and SiTs.
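To make this concrete, below is a minimal, runnable sketch of the training signal with toy stand-ins. Everything here (the linear blocks, the frozen EMA-style teacher, the projection head, and the cosine objective) is an illustrative assumption rather than the exact recipe; the real model and loss live in the training code.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a diffusion transformer: a stack of blocks whose
# intermediate latents we can read off. All names are illustrative.
student = nn.ModuleList([nn.Linear(64, 64) for _ in range(20)])
proj = nn.Linear(64, 64)                                # projection head on the student side
teacher = copy.deepcopy(student).requires_grad_(False)  # e.g. an EMA copy of the student

def feats(blocks, x, upto):
    """Return the latent representation after the first `upto` blocks."""
    for blk in blocks[:upto]:
        x = blk(x)
    return x

def add_noise(x0, eps, t):
    """Linear path used by SiT: x_t = (1 - t) * x0 + t * eps (t = 1 is pure noise)."""
    t = t.view(-1, 1, 1)
    return (1 - t) * x0 + t * eps

x0 = torch.randn(4, 256, 64)          # (batch, tokens, dim) latent tokens
eps = torch.randn_like(x0)
t = torch.rand(4) * 0.8 + 0.2         # student timestep (noisier)
dt = torch.rand(4) * 0.2              # dynamic interval, capped by --t-max
t_teacher = (t - dt).clamp(min=0.0)   # teacher timestep (less noisy)

h_s = feats(student, add_noise(x0, eps, t), upto=8)       # earlier block: --block-out-s
with torch.no_grad():                                     # teacher only supplies targets
    h_t = feats(teacher, add_noise(x0, eps, t_teacher), upto=20)  # later block: --block-out-t
loss = -F.cosine_similarity(proj(h_s), h_t, dim=-1).mean()        # align student to teacher
```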
```bash
conda create -n sra python=3.12 -y
conda activate sra
pip install -r requirements.txt
```

Currently, we provide experiments on ImageNet. You can place the data anywhere you want and specify its location via the `--data-dir` argument in the training scripts.
Note that we preprocess the data for faster training; please refer to the preprocessing guide for detailed instructions.
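As a rough illustration of what such preprocessing typically involves, here is a hypothetical latent-caching sketch. The VAE checkpoint and file layout below are assumptions, and the preprocessing guide remains the authoritative recipe.

```python
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

# Cache VAE latents once so training never has to re-encode raw pixels.
# The specific VAE is an assumption here, not necessarily the one we use.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval().cuda()

@torch.no_grad()
def cache_latent(img_path: str, out_path: str, size: int = 256) -> None:
    img = Image.open(img_path).convert("RGB").resize((size, size))
    x = torch.from_numpy(np.array(img)).float().div(127.5).sub(1.0)  # scale to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0).cuda()                       # (1, 3, H, W)
    z = vae.encode(x).latent_dist.sample().mul(0.18215)              # standard SD scaling
    np.save(out_path, z.squeeze(0).cpu().numpy())                    # (4, size/8, size/8)
```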
Here we provide the training code for SiTs and DiTs.
```bash
cd SiT-SRA
accelerate launch --config_file configs/default.yaml train.py \
--mixed-precision="fp16" \
--seed=0 \
--path-type="linear" \
--prediction="v" \
--resolution=256 \
--batch-size=32 \
--weighting="uniform" \
--model="SiT-XL/2" \
--block-out-s=8 \
--block-out-t=20 \
--t-max=0.2 \
--output-dir="exps" \
--exp-name="sitxl-ab820-t0.2-res256" \
--data-dir=[YOUR_DATA_PATH]
```

Then this script will automatically create a folder under `exps` to save logs, samples, and checkpoints. You can adjust the following options:
- `--model`: Choose from [SiT-B/2, SiT-L/2, SiT-XL/2]
- `--block-out-s`: Student's output block (layer) used for alignment (see the sketch after this list)
- `--block-out-t`: Teacher's output block (layer) used for alignment
- `--t-max`: Maximum time interval for alignment (we only use the dynamic interval here)
- `--output-dir`: Any directory in which to save checkpoints, samples, and logs
- `--exp-name`: Any string name (the folder will be created under `output-dir`)
- `--batch-size`: The local (per-GPU) batch size; by default we use 1 node with 8 GPUs, so adjust this value according to your GPU count to keep the total batch size at 256
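As referenced in the `--block-out-s` item above, here is a hypothetical sketch of what the two block indices select. Forward hooks are one generic way to tap those hidden states; the released train.py may read them directly instead, and whether the CLI indices are 0- or 1-based is an assumption here.

```python
import torch.nn as nn

# Record the hidden states emitted by the student/teacher alignment blocks.
feats = {}

def grab(name):
    def hook(module, inputs, output):
        feats[name] = output          # cache the block's output latent
    return hook

def register_alignment_taps(model: nn.Module, block_out_s: int = 8, block_out_t: int = 20):
    # DiT/SiT store their transformer blocks in `model.blocks` (an nn.ModuleList).
    model.blocks[block_out_s].register_forward_hook(grab("student"))
    model.blocks[block_out_t].register_forward_hook(grab("teacher"))
```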
```bash
cd DiT-SRA
accelerate launch --config_file configs/default.yaml train.py \
--mixed-precision="fp16" \
--seed=0 \
--resolution=256 \
--batch-size=32 \
--model="DiT-XL/2" \
--block-out-s=8 \
--block-out-t=16 \
--t-max=0.2 \
--output-dir="exps" \
--exp-name="ditxl-ab816-t0.2-res256" \
--data-dir=[YOUR_DATA_PATH]
```

Then this script will automatically create a folder under `exps` to save logs and checkpoints. You can adjust the following options (the others are the same as for the SiTs above):
- `--model`: Choose from [DiT-B/2, DiT-L/2, DiT-XL/2]
Here we provide the generation code for SiTs and DiTs; the scripts below produce samples for evaluation, and the resulting .npz file can be used with the ADM evaluation suite. You can download our pretrained models here:
| Model | Image Resolution | Epochs | FID-50K | Inception Score |
|---|---|---|---|---|
| SiT-XL/2 + SRA | 512x512 | 400 | 2.07 | 302.2 |
| SiT-XL/2 + SRA | 256x256 | 800 | 1.58 | 311.4 |
```bash
cd SiT-SRA
bash gen.sh
```

Note that there are several options in the gen.sh file that you need to fill in:

- `SAMPLE_DIR`: Base directory in which to save the generated images and the .npz file
- `CKPT`: Checkpoint path (this can also be the local path of a checkpoint you downloaded from the table above)
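gen.sh already writes the .npz for you, but if you ever need to rebuild it from a folder of images, a helper along the following lines matches our understanding of the layout the ADM evaluation suite reads (a uint8 array of shape (N, H, W, 3) under the "arr_0" key); double-check against that suite before relying on it.

```python
import glob

import numpy as np
from PIL import Image

def pack_samples(sample_dir: str, out_path: str = "samples.npz") -> None:
    """Stack generated PNGs into a single uint8 array and save as .npz."""
    paths = sorted(glob.glob(f"{sample_dir}/*.png"))
    arr = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in paths])
    np.savez(out_path, arr_0=arr)  # the ADM evaluator loads the "arr_0" key
```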
For ImageNet 512x512 with CFG, we use a guidance scale of 2.5 together with a guidance interval (sketched below), which differs slightly from the hyperparameters used for ImageNet 256x256.
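For readers unfamiliar with the guidance-interval trick, the idea is to apply CFG only within a band of timesteps. A hedged sketch follows; `t_lo`/`t_hi` are placeholders rather than the values in gen.sh.

```python
def guided_pred(model, x, t, y, y_null, scale=2.5, t_lo=0.0, t_hi=0.7):
    """Classifier-free guidance applied only inside the interval [t_lo, t_hi]."""
    pred_c = model(x, t, y)            # conditional prediction
    if t_lo <= float(t) <= t_hi:       # inside the guidance interval
        pred_u = model(x, t, y_null)   # null-class (unconditional) prediction
        return pred_u + scale * (pred_c - pred_u)
    return pred_c                      # outside the interval: no guidance
```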
```bash
cd DiT-SRA
bash gen.sh
```

We provide the PCA visualization code for SiTs (256x256) to help you obtain visualization results similar to those shown in our paper.
```bash
cd pca-vis
python main_pca.py \
--ckpt=[YOUR_CKPT_PATH] \
--baseline=False
```

You need to complete the following options (others in main_pca.py can also be changed):
- `--ckpt`: Checkpoint path (this can also be the local path of a checkpoint you downloaded from the table above)
- `--baseline`: Whether to use the baseline model; set it to 'False' if you are not using the checkpoint provided in the SiT repo
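If you want to adapt the visualization to your own features, the core PCA-to-RGB step generally looks like the sketch below (a generic recipe; main_pca.py implements the exact one used for the paper's figures).

```python
import torch

def pca_rgb(feats: torch.Tensor) -> torch.Tensor:
    """Project token features (num_tokens, dim) onto the top-3 PCs and map to RGB."""
    feats = feats - feats.mean(dim=0)                # center the features
    _, _, v = torch.pca_lowrank(feats, q=3)          # columns of v span the top PCs
    rgb = feats @ v                                  # (num_tokens, 3)
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-8)
    h = w = int(feats.shape[0] ** 0.5)               # assumes a square token grid
    return rgb.reshape(h, w, 3)
```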
Note that this code may not exactly reproduce the results reported in the paper, due to potential human error during the preparation and cleaning of the code for release, as well as differences in hardware. If you have trouble reproducing our findings, please don't hesitate to let us know.
This code is mainly built upon the REPA, DiT, and SiT repositories. Thanks for their solid work!
If you find SRA useful, please kindly cite our paper:
@article{jiang2025sra,
title={No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves},
author={Jiang, Dengyang and Wang, Mengmeng and Li, Liuzhuozheng and Zhang, Lei and Wang, Haoyu and Wei, Wei and Dai, Guang and Zhang, Yanning and Wang, Jingdong},
journal={arXiv preprint arXiv:2505.02831},
year={2025}
}