This repository provides a PyTorch implementation of Vector-Quantization (VQ)-based autoregressive vision generative models. Built upon the foundational VQ-GAN architecture, it integrates several improved techniques for training visual tokenizers, with the aim of facilitating replication and research in autoregressive vision generative models.
Results and pretrained checkpoints for class-conditioned image synthesizers trained on StanfordDogs, a dataset of 120 dog breeds, are provided below.
```bash
conda create -n vgpt python=3.9
cd visual-gpt
pip install -r requirements.txt
```

To use the pretrained model, download it directly from this link and run:
```bash
bash scripts/sample_vgpt.sh \
    --from_pretrained path/to/vqgan-stfdogs \
    --cls_name maltese samoyed australian_terrier \
    --accept_n 10 \
    --temperature 1.3 --top_k 100 --top_p 0.9
# --cls_name: dog classes to generate
# --accept_n: keep the best 1/accept_n samples via classifier-rejection
# --temperature / --top_k / --top_p: generation parameters
```

This will generate a GIF visualizing the autoregressive decoding process, similar to the one shown at the top of this page.
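Under the hood, `--accept_n` amounts to best-of-N selection: draw several candidates, then keep the one a pretrained classifier rates highest for the requested class. Below is a minimal sketch of this idea, where `sample_fn` and `classifier` are hypothetical placeholders rather than the repo's actual API:

```python
import torch

def classifier_rejection(sample_fn, classifier, cls_id, accept_n=10):
    """Draw accept_n candidate images and keep the one the classifier
    scores highest for the target class (best 1/accept_n)."""
    candidates = torch.stack([sample_fn() for _ in range(accept_n)])  # (N, C, H, W)
    with torch.no_grad():
        probs = classifier(candidates).softmax(dim=-1)  # (N, num_classes)
    return candidates[probs[:, cls_id].argmax()]
```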
The provided visual tokenizers are based on convolutional autoencoders. Implementations include the basic structure from the original VQ-VAE paper and the more advanced VQ-GAN architecture adapted from Taming Transformers. All configuration parameters are managed under conf/exp.yaml.
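To make the overall shape concrete, a toy version of such an encoder/decoder pair might look like the following (illustrative only; the actual modules and their hyperparameters come from the YAML configs):

```python
import torch.nn as nn

class ToyConvAutoencoder(nn.Module):
    """VQ-VAE-style convolutional autoencoder, stripped to its skeleton."""
    def __init__(self, in_ch=3, hidden=128, embed_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(  # 4x spatial downsampling
            nn.Conv2d(in_ch, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, embed_dim, 3, padding=1),  # z_e: (B, d, H/4, W/4)
        )
        self.decoder = nn.Sequential(  # mirror the encoder back to pixels
            nn.ConvTranspose2d(embed_dim, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z_e = self.encoder(x)      # continuous latents, to be vector-quantized
        return self.decoder(z_e)   # reconstruction (quantization omitted here)
```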
To train a visual tokenizer:
```bash
python train_tokenizer.py --exp_name vqgan-stfdogs --output_path ../outputs --conf conf/stfdogs.yaml --wandb visual-gpt
```

The implemented `VectorQuantize` layer supports several enhanced dictionary-learning techniques. Example programmatic usage:
```python
from vgpt import VectorQuantize

vq_layer = VectorQuantize(
    num_codewords = 1024,           # dictionary size K
    embedding_dim = 256,            # codeword dim d
    cos_dist = False,               # use cosine distance
    proj_dim = None,                # low-dimensional factorization
    random_proj = False,            # random-projection search
    penalty_weight = 0.25,          # penalize non-uniform code distributions
    pretrain_steps = 5000,          # warm-start autoencoders
    init_method = "latent_random"   # initialize codebook with pre-trained latents
)

# tokenizing & decoding
code = vq_layer.quantize(z_e)
z_q = vq_layer.dequantize(code)
```
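Conceptually, `quantize` maps each continuous latent to the index of its nearest codeword, and `dequantize` looks the codeword back up; gradients flow through the non-differentiable lookup via the straight-through estimator. A bare-bones sketch of that round trip (not the repo's exact implementation, which also covers the cosine-distance and projection options above):

```python
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z_e: (..., d), codebook: (K, d) -> nearest-codeword indices (...,)."""
    dists = torch.cdist(z_e.reshape(-1, codebook.shape[1]), codebook)  # (N, K)
    return dists.argmin(dim=-1).reshape(z_e.shape[:-1])

def dequantize(code: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Indices (...,) -> codewords (..., d)."""
    return codebook[code]

# Straight-through estimator used during training: the forward pass uses z_q,
# the backward pass copies gradients to z_e as if quantization were identity:
# z_q = z_e + (dequantize(quantize(z_e, cb), cb) - z_e).detach()
```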
After obtaining a fine-grained visual tokenizer, a visual language model can be trained on top of it for a variety of generative tasks. To train a class-conditioned image synthesizer based on flattened image tokens:

```bash
python train_gpt.py --exp_name vqgan-stfdogs --output_path ../outputs --tokenizer_path ../outputs/vqgan-stfdogs --conf conf/stfdogs.yaml --wandb visual-gpt
```

The visual autoregressive model is implemented as CondVisualGPT, which relies heavily on HuggingFace Transformers 🤗 and makes language model training and sampling very convenient.
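To illustrate the conditioning scheme, here is a rough sketch of class-conditioned next-token training over flattened image tokens, built on a stock GPT-2 backbone (the vocabulary layout and shapes are illustrative assumptions, not CondVisualGPT's actual interface):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Assumed layout: ids [0, K) are image tokens, ids [K, K + num_classes)
# are class labels prepended as a one-token conditioning prefix.
K, num_classes, seq_len = 1024, 120, 256
model = GPT2LMHeadModel(GPT2Config(vocab_size=K + num_classes,
                                   n_positions=seq_len + 1))

image_tokens = torch.randint(0, K, (8, seq_len))          # flattened VQ codes
class_prefix = torch.randint(K, K + num_classes, (8, 1))  # class conditioning
input_ids = torch.cat([class_prefix, image_tokens], dim=1)
loss = model(input_ids=input_ids, labels=input_ids).loss  # standard causal LM loss
```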
Here are additional generated samples:
This repo is heavily inspired by these papers:
```bibtex
@inproceedings{esser2021taming,
  title={Taming Transformers for High-Resolution Image Synthesis},
  author={Esser, Patrick and Rombach, Robin and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={12873--12883},
  year={2021}
}

@inproceedings{huh2023straightening,
  title={Straightening Out the Straight-Through Estimator: Overcoming Optimization Challenges in Vector Quantized Networks},
  author={Huh, Minyoung and Cheung, Brian and Agrawal, Pulkit and Isola, Phillip},
  booktitle={International Conference on Machine Learning},
  pages={14096--14113},
  year={2023}
}
```