Official PyTorch Implementation
[arXiv] | [Project Page]
Minglei Shi1*, Haolin Wang1*, Wenzhao Zheng1†, Ziyang Yuan2, Xiaoshi Wu2, Xintao Wang2, Pengfei Wan2, Jie Zhou1, Jiwen Lu1
(*equal contribution, listed in alphabetical order; †project lead)
1Department of Automation, Tsinghua University 2Kling Team, Kuaishou Technology
We introduce SVG, a novel latent diffusion model without variational autoencoders, which unleashes Self-supervised representations for Visual Generation.
Key Components:
- SVG Autoencoder - Combines a frozen representation encoder with a residual branch that compensates for information loss, plus a learned convolutional decoder that maps the SVG latent space back to pixel space (see the sketch after this list).
- Latent Diffusion Transformer - Performs diffusion modeling directly in the SVG latent space.
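For orientation, here is a minimal sketch of the autoencoder's forward pass. The module names (`frozen_encoder`, `residual_branch`, `conv_decoder`) and the concatenation used to fuse the two feature streams are illustrative assumptions, not the repository's exact implementation; see the code under `autoencoder/` for the real architecture.

```python
# Hedged sketch of the SVG autoencoder forward pass; all module names are illustrative.
import torch
import torch.nn as nn

class SVGAutoencoderSketch(nn.Module):
    def __init__(self, frozen_encoder: nn.Module, residual_branch: nn.Module, conv_decoder: nn.Module):
        super().__init__()
        self.frozen_encoder = frozen_encoder        # e.g. a pre-trained DINOv3 ViT, kept frozen
        self.residual_branch = residual_branch      # lightweight branch trained to recover lost details
        self.conv_decoder = conv_decoder            # learned convolutional decoder: latent -> pixels
        for p in self.frozen_encoder.parameters():  # the representation encoder is never updated
            p.requires_grad_(False)

    def encode(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            semantic = self.frozen_encoder(images)   # self-supervised representation
        residual = self.residual_branch(images)      # compensates for information the encoder discards
        return torch.cat([semantic, residual], dim=-1)  # assumed fusion: channel concatenation

    def decode(self, latent: torch.Tensor) -> torch.Tensor:
        return self.conv_decoder(latent)             # map the SVG latent back to pixel space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(images))
```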
Repository Features:
- ✅ PyTorch implementation of SVG Autoencoder
- ✅ PyTorch implementation of Latent Diffusion Transformer
- ✅ End-to-end training and sampling scripts
- ✅ Multi-GPU distributed training support
conda create -n svg python=3.10 -y
conda activate svg
pip install -r requirements.txt
git clone https://github.com/facebookresearch/dinov3.git

Follow the official DINOv3 repository instructions to download pre-trained checkpoints.
- Download ImageNet-1k
- Update dataset paths in the configuration files
Before running the pipeline, update the placeholder paths in the following configuration files to match your local file/directory structure.
File path: autoencoder/configs/example_svg_autoencoder_vitsp.yaml
Modify the paths under dinoconfig (for DINOv3 dependencies) and train/validation (for dataset) as shown below:
dinoconfig:
  dinov3_location: /path/to/your/dinov3                  # Path to the directory storing the DINOv3 codebase
  model_name: dinov3_vits16plus                          # Fixed DINOv3 model variant (no need to change)
  weights: /path/to/your/dinov3_vits16plus_pretrain.pth  # Path to the pre-trained DINOv3 weights file
...
train:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for training)
validation:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for validation)

File path: configs/example_SVG_XL.yaml
Update the data_path (for training data) and encoder_config (path to the Autoencoder config above) as follows:
basic:
  data_path: /path/to/your/ImageNet-1k/train_images  # Path to the "train_images" subfolder in ImageNet-1k
  encoder_config: ../autoencoder/svg/configs/example_svg_autoencoder_vitsp.yaml  # Relative/absolute path to your edited Autoencoder config

Note: Ensure the encoder_config path is valid (use an absolute path if the relative ../ path does not match your project's folder hierarchy). Additionally, the ckpt parameter must be set to the full path of your trained decoder checkpoint file.
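Because both configs rely on placeholder paths, it can be convenient to verify that every placeholder has been replaced and that the referenced files exist before launching a multi-GPU run. The script below is a hedged helper sketch, not part of the repository; it only assumes the configs are plain YAML and that PyYAML is installed (the file name `check_configs.py` is hypothetical).

```python
# check_configs.py -- hedged helper sketch (not part of the repository):
# flag config values that still contain the "/path/to/your" placeholder
# or that point to a nonexistent file/directory.
import sys
from pathlib import Path

import yaml  # provided by the PyYAML package


def walk(node, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf in a nested YAML structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from walk(value, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), node


def check(config_file: str) -> int:
    cfg = yaml.safe_load(Path(config_file).read_text())
    problems = 0
    for key, value in walk(cfg):
        if not isinstance(value, str):
            continue
        if "/path/to/your" in value:
            print(f"[placeholder] {config_file}: {key} = {value}")
            problems += 1
        elif value.startswith("/") and not Path(value).exists():
            print(f"[missing]     {config_file}: {key} = {value}")
            problems += 1
    return problems


if __name__ == "__main__":
    sys.exit(1 if sum(check(f) for f in sys.argv[1:]) else 0)
```

For example: `python check_configs.py autoencoder/configs/example_svg_autoencoder_vitsp.yaml configs/example_SVG_XL.yaml`.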
Train the SVG autoencoder:

cd autoencoder/svg
bash run_train.sh configs/example_svg_autoencoder_vitsp.yaml

Then train the latent diffusion transformer:

torchrun --nnodes=1 --nproc_per_node=8 train_svg.py --config ./configs/example_SVG_XL.yaml

Generate images using a trained model:
# Update ckpt_path in sample_svg.py with your checkpoint
python sample_svg.py

Generated images will be saved to the current directory.
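Conceptually, sampling runs the trained diffusion transformer (SiT-style, per the acknowledgments) in the SVG latent space and then maps the final latent to pixels with the learned convolutional decoder. The sketch below illustrates that flow with a plain Euler integrator; the `transformer` and `decoder` call signatures, the latent shape, and the time convention are placeholder assumptions, and `sample_svg.py` remains the actual entry point.

```python
# Hedged sketch of the sampling flow; names, shapes, and conventions are illustrative only.
import torch

@torch.no_grad()
def sample(transformer, decoder, num_steps=50, latent_shape=(1, 256, 384), device="cuda"):
    """Integrate a learned velocity field from noise to data in the SVG latent space,
    then decode the final latent to pixel space."""
    x = torch.randn(latent_shape, device=device)              # start from Gaussian noise
    timesteps = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        v = transformer(x, t.expand(x.shape[0]))              # predicted velocity at time t
        x = x + (t_next - t) * v                              # plain Euler step
    return decoder(x)                                         # map SVG latents to pixels
```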
- autoencoder/configs/ - SVG autoencoder training configurations
- configs/ - Diffusion transformer training configurations
Adjust --nproc_per_node based on your available GPUs. The example uses 8 GPUs.
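If you are unsure how many GPUs are visible in the current environment, a quick check (just a convenience, not part of the repository) is:

```python
import torch
print(torch.cuda.device_count())  # pass this value to --nproc_per_node
```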
If you use this work in your research, please cite our paper:
@misc{shi2025latentdiffusionmodelvariational,
title={Latent Diffusion Model without Variational Autoencoder},
author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
year={2025},
eprint={2510.15301},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.15301},
}

This implementation builds upon several excellent open-source projects:
- dinov3 - DINOv3 official architecture
- SigLIP2 - SigLIP2 official architecture
- MAE - MAE baseline architecture
- SiT - Diffusion framework and training codebase
- VAVAE - PyTorch convolutional decoder implementation
For questions and issues, please open an issue on GitHub or contact the authors.