SVG: Latent Diffusion Model without Variational Autoencoder

Official PyTorch Implementation


Minglei Shi¹*, Haolin Wang¹*, Wenzhao Zheng¹†, Ziyang Yuan², Xiaoshi Wu², Xintao Wang², Pengfei Wan², Jie Zhou¹, Jiwen Lu¹
(*equal contribution, listed in alphabetical order; †project lead)
¹Department of Automation, Tsinghua University  ²Kling Team, Kuaishou Technology


🧠 Overview

We introduce SVG, a novel latent diffusion model without variational autoencoders, which unleashes Self-supervised representations for Visual Generation.

Key Components:

  1. SVG Autoencoder - Uses a frozen representation encoder with a residual branch to compensate for the information loss, plus a learned convolutional decoder that maps the SVG latent space back to pixel space (see the sketch after this list).
  2. Latent Diffusion Transformer - Performs diffusion modeling directly on SVG latent space.
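
A minimal, hypothetical sketch of the data flow described above. Module names, shapes, and layer choices are illustrative only and do not mirror the repository's actual classes; see autoencoder/svg/ for the real implementation.

import torch
import torch.nn as nn

class ToySVGAutoencoder(nn.Module):
    """Illustrative only: frozen encoder + residual branch + conv decoder."""

    def __init__(self, frozen_encoder: nn.Module, latent_dim: int = 384):
        super().__init__()
        self.encoder = frozen_encoder          # frozen self-supervised encoder (e.g. DINOv3)
        for p in self.encoder.parameters():
            p.requires_grad = False
        # residual branch: compensates for information the frozen encoder discards
        self.residual = nn.Conv2d(3, latent_dim, kernel_size=16, stride=16)
        # learned convolutional decoder: maps SVG latents back to pixels
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, kernel_size=8, stride=8),
            nn.GELU(),
            nn.ConvTranspose2d(128, 3, kernel_size=2, stride=2),
        )

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)            # assumed spatial feature map (B, C, H/16, W/16)
        z = feats + self.residual(x)           # SVG latent = frozen features + residual
        return self.decoder(z), z

# Stand-in encoder purely so the sketch runs end to end; in the real model
# this would be the frozen DINOv3 backbone.
dummy_encoder = nn.Conv2d(3, 384, kernel_size=16, stride=16)
model = ToySVGAutoencoder(dummy_encoder)
recon, z = model(torch.randn(1, 3, 256, 256))
print(recon.shape, z.shape)  # torch.Size([1, 3, 256, 256]) torch.Size([1, 384, 16, 16])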

Repository Features:

  • ✅ PyTorch implementation of SVG Autoencoder
  • ✅ PyTorch implementation of Latent Diffusion Transformer
  • ✅ End-to-end training and sampling scripts
  • ✅ Multi-GPU distributed training support

⚙️ Installation

1. Create Environment

conda create -n svg python=3.10 -y
conda activate svg

2. Install Dependencies

pip install -r requirements.txt
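
Optionally, run a quick sanity check that PyTorch can see your GPUs (this assumes requirements.txt installs torch; skip it if you manage PyTorch separately):

import torch

# Print the installed version and the GPUs visible to this environment.
print(torch.__version__)
print(torch.cuda.is_available(), torch.cuda.device_count())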

📦 Data Preparation

1. Download DINOv3

git clone https://github.com/facebookresearch/dinov3.git

Follow the official DINOv3 repository instructions to download pre-trained checkpoints.
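
To confirm the checkpoint loads before training, you can try loading it through torch.hub from the local clone. The entrypoint name and the weights keyword below simply mirror the dinoconfig fields used later in this README; verify them against hubconf.py in your dinov3 clone:

import torch

# Paths are placeholders; the entrypoint and keyword names are assumptions
# taken from the dinoconfig section shown in the Quick Start below.
encoder = torch.hub.load(
    "/path/to/your/dinov3",                                  # dinov3_location
    "dinov3_vits16plus",                                     # model_name
    source="local",
    weights="/path/to/your/dinov3_vits16plus_pretrain.pth",  # weights
)
print(f"{sum(p.numel() for p in encoder.parameters()) / 1e6:.1f}M parameters")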

2. Prepare Dataset

  • Download ImageNet-1k
  • Update dataset paths in the configuration files (an illustrative directory layout is shown below)
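
For reference, a standard ImageFolder-style layout such as the one below is assumed here. The train_images name matches the data_path used in the Quick Start; the validation folder name is a placeholder, so adjust it to whatever your configs expect:

/path/to/your/ImageNet-1k/
├── train_images/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   └── ...
│   └── ...
└── val_images/          # placeholder name for the validation split
    └── ...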

🚀 Quick Start

1. Configure Paths

Before running the pipeline, update the placeholder paths in the following configuration files to match your local file/directory structure.

1.1 Autoencoder Config

File path: autoencoder/configs/example_svg_autoencoder_vitsp.yaml

Modify the paths under dinoconfig (for the DINOv3 dependencies) and train/validation (for the dataset) as shown below:

dinoconfig:
  dinov3_location: /path/to/your/dinov3  # Path to the directory storing the DINOv3 codebase
  model_name: dinov3_vits16plus          # Fixed DINOv3 model variant (no need to change)
  weights: /path/to/your/dinov3_vits16plus_pretrain.pth  # Path to the pre-trained DINOv3 weights file
...
train:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for training)
validation:
  params:
    data_root: /path/to/your/ImageNet-1k/  # Root directory of the ImageNet-1k dataset (for validation)

1.2 Diffusion Config

File path: configs/example_SVG_XL.yaml

Update the data_path (for training data) and encoder_config (path to the Autoencoder config above) as follows:

basic:
  data_path: /path/to/your/ImageNet-1k/train_images  # Path to the "train_images" subfolder in ImageNet-1k
  encoder_config: ../autoencoder/svg/configs/example_svg_autoencoder_vitsp.yaml  # Relative/absolute path to your edited Autoencoder config

Note: Ensure the encoder_config path is valid (use an absolute path if the relative ../ path does not match your project’s folder hierarchy). In addition, the ckpt parameter must be set to the full path of your trained decoder checkpoint file, as sketched below.
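
For reference, the relevant lines in example_SVG_XL.yaml end up looking roughly like the sketch below. The ckpt value is a placeholder, and the exact section the key lives under may differ in your copy of the config:

basic:
  data_path: /path/to/your/ImageNet-1k/train_images
  encoder_config: /absolute/path/to/autoencoder/svg/configs/example_svg_autoencoder_vitsp.yaml
  ckpt: /path/to/your/trained_svg_decoder_checkpoint.pt  # trained decoder checkpoint (key placement may differ)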

2. Train SVG Autoencoder

cd autoencoder/svg
bash run_train.sh configs/example_svg_autoencoder_vitsp.yaml

3. Train Latent Diffusion Transformer

torchrun --nnodes=1 --nproc_per_node=8 train_svg.py --config ./configs/example_SVG_XL.yaml

🎨 Image Generation

Generate images using a trained model:

# Update ckpt_path in sample_svg.py with your checkpoint
python sample_svg.py

Generated images will be saved to the current directory.
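
For a quick side-by-side view of the samples, a small grid helper like the sketch below works. The glob pattern is an assumption about the output filenames, so adjust it to whatever sample_svg.py actually writes:

import glob
import torch
from torchvision.io import read_image
from torchvision.utils import make_grid, save_image

# Collect the generated samples (filename pattern is a placeholder) and
# tile them into a single grid image for quick inspection.
paths = sorted(glob.glob("*.png"))
imgs = [read_image(p).float() / 255.0 for p in paths]
save_image(make_grid(torch.stack(imgs), nrow=4), "sample_grid.png")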


🛠️ Configuration

Key Configuration Files:

  • autoencoder/configs/ - SVG autoencoder training configurations
  • configs/ - Diffusion transformer training configurations

Multi-GPU Training:

Adjust --nproc_per_node based on your available GPUs. The example uses 8 GPUs.
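
For example, on a 4-GPU machine the diffusion training command becomes:

torchrun --nnodes=1 --nproc_per_node=4 train_svg.py --config ./configs/example_SVG_XL.yaml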


📄 Citation

If you use this work in your research, please cite our paper:

@misc{shi2025latentdiffusionmodelvariational,
      title={Latent Diffusion Model without Variational Autoencoder}, 
      author={Minglei Shi and Haolin Wang and Wenzhao Zheng and Ziyang Yuan and Xiaoshi Wu and Xintao Wang and Pengfei Wan and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2510.15301},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.15301}, 
}

🙏 Acknowledgments

This implementation builds upon several excellent open-source projects:

  • dinov3 - DINOv3 official architecture
  • SigLIP2 - SigLIP2 official architecture
  • MAE - MAE baseline architecture
  • SiT - Diffusion framework and training codebase
  • VAVAE - PyTorch convolutional decoder implementation

📧 Contact

For questions and issues, please open an issue on GitHub or contact the authors.


Made with ❤️ by the SVG Team
