This repository implements a Swin Transformer-based Vector Quantized Variational Autoencoder (VQ-VAE) for automatic image colorization in the CIE Lab color space.
Given a grayscale (L-channel) input, the model predicts the chrominance channels (`ab_pred`) using a learned discrete latent representation, producing rich and perceptually realistic colorizations.
Traditional CNN-based colorization models often produce blurry and desaturated results due to regression to the mean.
To overcome this, we combine:
- Swin Transformer Encoder → captures hierarchical structure and context.
- VQ-VAE with EMA Codebook → learns a discrete, compressed color representation.
- Perceptual and Color Losses → ensure sharp, realistic, and color-faithful reconstructions.
This design pairs the representational power of transformers with the stability and interpretability of vector quantization, as sketched below.
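The end-to-end data flow can be summarized at the shape level. The modules below are throwaway stand-ins (a strided conv and a plain upsampler) chosen only to make the tensor shapes concrete; they are not this repo's actual encoder or decoder:

```python
import torch
import torch.nn as nn

# Stand-in modules that illustrate the pipeline's tensor shapes only.
B, H, W = 4, 224, 224
L = torch.randn(B, 1, H, W)               # grayscale input (L channel)

encoder = nn.Conv2d(1, 256, 8, stride=8)  # stand-in for the Swin encoder
decoder = nn.Sequential(                  # stand-in for the real decoder
    nn.Upsample(scale_factor=8, mode="nearest"),
    nn.Conv2d(256, 2, 3, padding=1),
)

z_enc = encoder(L)                        # (B, 256, H/8, W/8) latent features
# ... z_enc would be vector-quantized here (see the VQ-VAE section below) ...
ab_pred = decoder(z_enc)                  # (B, 2, H, W) predicted chrominance
lab = torch.cat([L, ab_pred], dim=1)      # full Lab image (3×H×W)
print(lab.shape)                          # torch.Size([4, 3, 224, 224])
```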
| Module | Description |
|---|---|
| Swin Encoder | A hierarchical Vision Transformer backbone that extracts multi-scale spatial features from the grayscale input. |
| VQ-VAE (EMA) | Quantizes the encoder’s latent features into a discrete latent space using Exponential Moving Average (EMA) codebook updates. |
| Decoder | A lightweight CNN/Transformer-based upsampler that reconstructs the chrominance (ab) channels from quantized embeddings. |
| Loss Functions | Combines pixel-wise, perceptual, vector-quantization, and color consistency terms to guide training. |
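As a rough illustration of how these terms might be combined (the weights, function names, and the simple mean-chrominance color term below are assumptions, not this repo's exact formulation):

```python
import torch
import torch.nn.functional as F

def colorization_loss(ab_pred, ab_gt, vq_loss, feat_pred=None, feat_gt=None,
                      w_perc=0.1, w_color=0.5):
    """Hypothetical weighting of the four loss terms; coefficients are illustrative."""
    pixel = F.l1_loss(ab_pred, ab_gt)  # pixel-wise reconstruction term
    # perceptual term: distance between features from a pretrained network (e.g. VGG)
    perc = F.l1_loss(feat_pred, feat_gt) if feat_pred is not None else ab_pred.new_zeros(())
    # toy color-consistency term: match per-image mean chrominance
    color = F.mse_loss(ab_pred.mean(dim=(2, 3)), ab_gt.mean(dim=(2, 3)))
    return pixel + w_perc * perc + vq_loss + w_color * color
```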
### Swin Encoder
- Input: single-channel L (1×H×W)
- Backbone: Swin-T/S/B variant
- Output: hierarchical feature map `z_enc` (latent embedding)
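One way to instantiate such an encoder is via timm, which can adapt a Swin backbone to a 1-channel input; this is an assumption about tooling, not necessarily what this repo does, and the feature layout can vary with the timm version:

```python
import torch
import timm  # assumes a recent timm release where Swin supports features_only

encoder = timm.create_model(
    "swin_tiny_patch4_window7_224",
    pretrained=False,
    in_chans=1,          # single L channel instead of RGB
    features_only=True,  # return the hierarchical feature pyramid
)

L = torch.randn(1, 1, 224, 224)  # grayscale L channel
feats = encoder(L)               # list of multi-scale feature maps
z_enc = feats[-1]                # deepest map serves as the latent embedding
print(z_enc.shape)               # e.g. (1, 7, 7, 768), NHWC in recent timm versions
```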
### VQ-VAE (EMA)
- Discretizes the latent space using a codebook of learned embeddings.
- Codebook updated via Exponential Moving Average for stability.
- Outputs:
  - `z_q`: quantized latent vectors
  - `vq_loss`: commitment + embedding loss
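A minimal sketch of an EMA-updated quantizer in the spirit of van den Oord et al. (2017); hyperparameters are illustrative, and note that with EMA updates the codebook (embedding) term is maintained by the moving averages rather than by a gradient loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQEmbeddingEMA(nn.Module):
    """Minimal EMA codebook; hyperparameters below are not this repo's settings."""

    def __init__(self, num_codes=512, dim=256, decay=0.99, beta=0.25, eps=1e-5):
        super().__init__()
        self.decay, self.beta, self.eps = decay, beta, eps
        embed = torch.randn(num_codes, dim)
        self.register_buffer("embed", embed)                  # codebook vectors
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("embed_avg", embed.clone())

    def forward(self, z_enc):                                 # z_enc: (B, C, H, W)
        B, C, H, W = z_enc.shape
        flat = z_enc.permute(0, 2, 3, 1).reshape(-1, C)       # (B*H*W, C)

        # nearest codebook entry for every spatial latent vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.embed.t()
                + self.embed.pow(2).sum(1))
        idx = dist.argmin(1)
        z_q = self.embed[idx].view(B, H, W, C).permute(0, 3, 1, 2)

        if self.training:                                     # EMA codebook update
            with torch.no_grad():
                onehot = F.one_hot(idx, self.embed.size(0)).type(flat.dtype)
                self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
                self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat, alpha=1 - self.decay)
                n = self.cluster_size.sum()                   # Laplace-smoothed counts
                size = (self.cluster_size + self.eps) / (n + self.embed.size(0) * self.eps) * n
                self.embed.copy_(self.embed_avg / size.unsqueeze(1))

        # with EMA updates, only the commitment term needs a gradient loss
        vq_loss = self.beta * F.mse_loss(z_enc, z_q.detach())
        z_q = z_enc + (z_q - z_enc).detach()                  # straight-through estimator
        return z_q, vq_loss

vq = VQEmbeddingEMA()
z_q, vq_loss = vq(torch.randn(2, 256, 28, 28))  # quantized latents + commitment loss
```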
### Decoder
- Maps the quantized latent `z_q` back to chrominance space (`ab_pred`).
- Uses upsampling, residual blocks, and attention.
- Output: predicted ab channels (2×H×W)
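A hypothetical decoder along these lines, omitting the attention blocks for brevity (the channel widths and the ×8 total upsampling factor are assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block; a stand-in for whatever this repo's decoder uses."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

def make_decoder(in_ch=256):
    """Three ×2 upsampling stages (×8 total) ending in 2 ab channels."""
    layers, ch = [], in_ch
    for _ in range(3):
        layers += [ResidualBlock(ch),
                   nn.Upsample(scale_factor=2, mode="nearest"),
                   nn.Conv2d(ch, ch // 2, 3, padding=1), nn.ReLU(inplace=True)]
        ch //= 2
    layers += [nn.Conv2d(ch, 2, 3, padding=1), nn.Tanh()]  # assumes ab scaled to [-1, 1]
    return nn.Sequential(*layers)

decoder = make_decoder()
z_q = torch.randn(1, 256, 28, 28)
print(decoder(z_q).shape)  # torch.Size([1, 2, 224, 224])
```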
## 🚀 Training

```bash
python train.py --config configs/swinvq_color.yaml
```
## 📘 References
- VQ-VAE: Neural Discrete Representation Learning (van den Oord et al., 2017)
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., 2021)
- VQ-VAE-2: Generating Diverse High-Fidelity Images with VQ-VAE-2 (Razavi et al., 2019)
- Colorful Image Colorization (Zhang et al., 2016)
## 🧩 License
This project is licensed under the [MIT License](LICENSE) — see the LICENSE file for details.