This repository implements a Swin Transformer-based Vector Quantized Variational Autoencoder (VQ-VAE) for automatic image colorization in the CIE Lab color space.
Given a grayscale (L-channel) input, the model predicts the chrominance channels (`ab_pred`) using a learned discrete latent representation, producing rich and perceptually realistic colorizations.
Traditional CNN-based colorization models often produce blurry and desaturated results due to regression to the mean.
To overcome this, we combine:
- Swin Transformer Encoder → captures hierarchical structure and context.
- VQ-VAE with EMA Codebook → learns a discrete, compressed color representation.
- Perceptual and Color Losses → ensure sharp, realistic, and color-faithful reconstructions.
This design pairs the representational power of transformers with the stability and interpretability of vector quantization, as sketched below.
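The end-to-end data flow can be summarized at the shape level. The modules below are throwaway stand-ins (a strided conv and a plain upsampler) chosen only to make the tensor shapes concrete; they are not this repo's actual encoder or decoder:

```python
import torch
import torch.nn as nn

# Stand-in modules that illustrate the pipeline's tensor shapes only.
B, H, W = 4, 224, 224
L = torch.randn(B, 1, H, W)               # grayscale input (L channel)

encoder = nn.Conv2d(1, 256, 8, stride=8)  # stand-in for the Swin encoder
decoder = nn.Sequential(                  # stand-in for the real decoder
    nn.Upsample(scale_factor=8, mode="nearest"),
    nn.Conv2d(256, 2, 3, padding=1),
)

z_enc = encoder(L)                        # (B, 256, H/8, W/8) latent features
# ... z_enc would be vector-quantized here (see the VQ-VAE section below) ...
ab_pred = decoder(z_enc)                  # (B, 2, H, W) predicted chrominance
lab = torch.cat([L, ab_pred], dim=1)      # full Lab image (3×H×W)
print(lab.shape)                          # torch.Size([4, 3, 224, 224])
```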
| Module | Description |
|---|---|
| Swin Encoder | A hierarchical Vision Transformer backbone that extracts multi-scale spatial features from the grayscale input. |
| VQ-VAE (EMA) | Quantizes the encoder’s latent features into a discrete latent space using Exponential Moving Average (EMA) codebook updates. |
| Decoder | A lightweight CNN/Transformer-based upsampler that reconstructs the chrominance (ab) channels from quantized embeddings. |
| Loss Functions | Combines pixel-wise, perceptual, vector-quantization, and color consistency terms to guide training. |
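As a rough illustration of how these terms might be combined (the weights, function names, and the simple mean-chrominance color term below are assumptions, not this repo's exact formulation):

```python
import torch
import torch.nn.functional as F

def colorization_loss(ab_pred, ab_gt, vq_loss, feat_pred=None, feat_gt=None,
                      w_perc=0.1, w_color=0.5):
    """Hypothetical weighting of the four loss terms; coefficients are illustrative."""
    pixel = F.l1_loss(ab_pred, ab_gt)  # pixel-wise reconstruction term
    # perceptual term: distance between features from a pretrained network (e.g. VGG)
    perc = F.l1_loss(feat_pred, feat_gt) if feat_pred is not None else ab_pred.new_zeros(())
    # toy color-consistency term: match per-image mean chrominance
    color = F.mse_loss(ab_pred.mean(dim=(2, 3)), ab_gt.mean(dim=(2, 3)))
    return pixel + w_perc * perc + vq_loss + w_color * color
```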
### Swin Encoder
- Input: single-channel L (1×H×W)
- Backbone: Swin-T/S/B variant
- Output: hierarchical feature map `z_enc` (latent embedding)
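One way to instantiate such an encoder is via timm, which can adapt a Swin backbone to a 1-channel input; this is an assumption about tooling, not necessarily what this repo does, and the feature layout can vary with the timm version:

```python
import torch
import timm  # assumes a recent timm release where Swin supports features_only

encoder = timm.create_model(
    "swin_tiny_patch4_window7_224",
    pretrained=False,
    in_chans=1,          # single L channel instead of RGB
    features_only=True,  # return the hierarchical feature pyramid
)

L = torch.randn(1, 1, 224, 224)  # grayscale L channel
feats = encoder(L)               # list of multi-scale feature maps
z_enc = feats[-1]                # deepest map serves as the latent embedding
print(z_enc.shape)               # e.g. (1, 7, 7, 768), NHWC in recent timm versions
```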
### VQ-VAE (EMA)
- Discretizes the latent space using a codebook of learned embeddings.
- Codebook updated via Exponential Moving Average for stability.
- Outputs:
  - `z_q`: quantized latent vectors
  - `vq_loss`: commitment + embedding loss
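A minimal sketch of an EMA-updated quantizer in the spirit of van den Oord et al. (2017); hyperparameters are illustrative, and note that with EMA updates the codebook (embedding) term is maintained by the moving averages rather than by a gradient loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQEmbeddingEMA(nn.Module):
    """Minimal EMA codebook; hyperparameters below are not this repo's settings."""

    def __init__(self, num_codes=512, dim=256, decay=0.99, beta=0.25, eps=1e-5):
        super().__init__()
        self.decay, self.beta, self.eps = decay, beta, eps
        embed = torch.randn(num_codes, dim)
        self.register_buffer("embed", embed)                  # codebook vectors
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("embed_avg", embed.clone())

    def forward(self, z_enc):                                 # z_enc: (B, C, H, W)
        B, C, H, W = z_enc.shape
        flat = z_enc.permute(0, 2, 3, 1).reshape(-1, C)       # (B*H*W, C)

        # nearest codebook entry for every spatial latent vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.embed.t()
                + self.embed.pow(2).sum(1))
        idx = dist.argmin(1)
        z_q = self.embed[idx].view(B, H, W, C).permute(0, 3, 1, 2)

        if self.training:                                     # EMA codebook update
            with torch.no_grad():
                onehot = F.one_hot(idx, self.embed.size(0)).type(flat.dtype)
                self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
                self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat, alpha=1 - self.decay)
                n = self.cluster_size.sum()                   # Laplace-smoothed counts
                size = (self.cluster_size + self.eps) / (n + self.embed.size(0) * self.eps) * n
                self.embed.copy_(self.embed_avg / size.unsqueeze(1))

        # with EMA updates, only the commitment term needs a gradient loss
        vq_loss = self.beta * F.mse_loss(z_enc, z_q.detach())
        z_q = z_enc + (z_q - z_enc).detach()                  # straight-through estimator
        return z_q, vq_loss

vq = VQEmbeddingEMA()
z_q, vq_loss = vq(torch.randn(2, 256, 28, 28))  # quantized latents + commitment loss
```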
### Decoder
- Maps the quantized latent `z_q` back to chrominance space (`ab_pred`).
- Uses upsampling, residual blocks, and attention.
- Output: predicted ab channels (2×H×W)
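A hypothetical decoder along these lines, omitting the attention blocks for brevity (the channel widths and the ×8 total upsampling factor are assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block; a stand-in for whatever this repo's decoder uses."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

def make_decoder(in_ch=256):
    """Three ×2 upsampling stages (×8 total) ending in 2 ab channels."""
    layers, ch = [], in_ch
    for _ in range(3):
        layers += [ResidualBlock(ch),
                   nn.Upsample(scale_factor=2, mode="nearest"),
                   nn.Conv2d(ch, ch // 2, 3, padding=1), nn.ReLU(inplace=True)]
        ch //= 2
    layers += [nn.Conv2d(ch, 2, 3, padding=1), nn.Tanh()]  # assumes ab scaled to [-1, 1]
    return nn.Sequential(*layers)

decoder = make_decoder()
z_q = torch.randn(1, 256, 28, 28)
print(decoder(z_q).shape)  # torch.Size([1, 2, 224, 224])
```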
## 🚀 Training

```bash
python train.py --config configs/swinvq_color.yaml
```
## 📘 References
- VQ-VAE: Neural Discrete Representation Learning (van den Oord et al., 2017)
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., 2021)
- VQ-VAE-2: Generating Diverse High-Fidelity Images with VQ-VAE-2 (Razavi et al., 2019)
- Colorful Image Colorization (Zhang et al., 2016)
## 🧩 License
This project is licensed under the [MIT License](LICENSE) — see the LICENSE file for details.