
🌈 SwinVQColor: Hierarchical VQ-VAE with Swin Transformer for Image Colorization

This repository implements a Swin Transformer-based Vector Quantized Variational Autoencoder (VQ-VAE) for automatic image colorization in the CIE Lab color space.
Given a grayscale (L-channel) input, the model predicts the chrominance channels (ab_pred) using a learned discrete latent representation, producing rich and perceptually realistic colorizations.
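For orientation, here is a minimal end-to-end inference sketch in PyTorch, using scikit-image for Lab conversion. The model interface, the input scaling, and the ab de-normalization factor of 110 are illustrative assumptions, not the repository's actual API:

```python
import numpy as np
import torch
from skimage import color, io

# Stand-in for the trained SwinVQColor network (assumed to map a normalized
# L channel of shape (1, 1, H, W) to ab channels of shape (1, 2, H, W)).
model = lambda x: torch.zeros(1, 2, *x.shape[-2:])

rgb = io.imread("photo.jpg")
lab = color.rgb2lab(rgb)                       # L in [0, 100], ab roughly in [-128, 127]
L = torch.from_numpy(lab[..., 0]).float()[None, None] / 50.0 - 1.0  # scale L to [-1, 1]

with torch.no_grad():
    ab_pred = model(L)

ab = ab_pred[0].permute(1, 2, 0).numpy() * 110.0   # undo the assumed [-1, 1] scaling
rgb_pred = color.lab2rgb(np.concatenate([lab[..., :1], ab], axis=-1))
```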


🧠 Motivation

Traditional CNN-based colorization models often produce blurry, desaturated results because pixel-wise regression averages over the many plausible colors for a region (regression to the mean).
To overcome this, we combine:

  • Swin Transformer Encoder → captures hierarchical structure and context.
  • VQ-VAE with EMA Codebook → learns a discrete, compressed color representation.
  • Perceptual and Color Losses → ensure sharp, realistic, and color-faithful reconstructions.

This fusion leverages the representational power of transformers with the stability and interpretability of vector quantization.


🧩 Model Overview

Pipeline

L input (1×H×W) → Swin Encoder → z_enc → VQ-VAE (EMA) → z_q → Decoder → ab_pred (2×H×W)

Key Components

| Module | Description |
| --- | --- |
| Swin Encoder | A hierarchical Vision Transformer backbone that extracts multi-scale spatial features from the grayscale input. |
| VQ-VAE (EMA) | Quantizes the encoder's latent features into a discrete latent space using Exponential Moving Average (EMA) codebook updates. |
| Decoder | A lightweight CNN/Transformer-based upsampler that reconstructs the chrominance (ab) channels from quantized embeddings. |
| Loss Functions | Combines pixel-wise, perceptual, vector-quantization, and color-consistency terms to guide training. |
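As a concrete sketch, one way such a composite objective could be assembled in PyTorch; the weights and the global-mean interpretation of "color consistency" below are placeholder assumptions, and `perceptual_fn` stands in for a feature-space metric such as a VGG distance:

```python
import torch.nn.functional as F

def total_loss(ab_pred, ab_gt, vq_loss, perceptual_fn,
               w_pix=1.0, w_perc=0.1, w_vq=1.0, w_col=0.5):  # weights are assumptions
    pix = F.l1_loss(ab_pred, ab_gt)                      # pixel-wise reconstruction
    perc = perceptual_fn(ab_pred, ab_gt)                 # feature-space (perceptual) distance
    # Illustrative "color consistency": match global mean chrominance per image
    col = F.mse_loss(ab_pred.mean(dim=(2, 3)), ab_gt.mean(dim=(2, 3)))
    return w_pix * pix + w_perc * perc + w_vq * vq_loss + w_col * col
```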

⚙️ Architecture Details

1️⃣ Swin Encoder

  • Input: single-channel L (1×H×W)
  • Backbone: Swin-T/S/B variant
  • Output: hierarchical feature map z_enc (latent embedding)
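A hedged sketch of how such an encoder could be instantiated with timm (the specific variant, feature stage, and the use of timm itself are assumptions; timm >= 0.9 supports `features_only` for Swin models):

```python
import torch
import timm

# in_chans=1 adapts the patch embedding to the single-channel L input.
encoder = timm.create_model(
    "swin_tiny_patch4_window7_224",  # Swin-T; the repo's exact variant is an assumption
    pretrained=False,
    in_chans=1,
    features_only=True,              # return multi-scale feature maps, one per stage
)

L = torch.randn(1, 1, 224, 224)
feats = encoder(L)                   # list of hierarchical feature maps
z_enc = feats[-1]                    # deepest stage as the latent embedding z_enc
# Depending on the timm version, Swin features may come out NHWC; permute if NCHW is needed:
# z_enc = z_enc.permute(0, 3, 1, 2)
```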

2️⃣ Vector Quantization (VQ-VAE, EMA)

  • Discretizes latent space using a codebook of learned embeddings.
  • Codebook updated via Exponential Moving Average for stability.
  • Outputs:
    • z_q: quantized latent vectors
    • vq_loss: the commitment loss (with EMA updates, the codebook itself is moved toward the encoder outputs by moving averages rather than by a gradient-based embedding loss)
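Below is a self-contained PyTorch sketch of an EMA vector quantizer in the style of van den Oord et al. (2017); the codebook size, embedding dimension, decay, and commitment weight are placeholder assumptions rather than the repository's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMAVectorQuantizer(nn.Module):
    """Minimal EMA codebook sketch; hyperparameters are illustrative assumptions."""
    def __init__(self, num_codes=512, dim=256, decay=0.99, beta=0.25, eps=1e-5):
        super().__init__()
        self.decay, self.beta, self.eps = decay, beta, eps
        embed = torch.randn(num_codes, dim)
        self.register_buffer("embed", embed)                 # codebook vectors
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("embed_avg", embed.clone())

    def forward(self, z):                                    # z: (B, C, H, W)
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)
        # Nearest codebook entry by squared Euclidean distance
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.embed.t()
             + self.embed.pow(2).sum(1))
        idx = d.argmin(1)
        z_q = self.embed[idx].view(B, H, W, C).permute(0, 3, 1, 2)

        if self.training:                                    # EMA codebook update, no gradients
            with torch.no_grad():
                onehot = F.one_hot(idx, self.embed.size(0)).type(flat.dtype)
                self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
                self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat, alpha=1 - self.decay)
                n = self.cluster_size.sum()                  # Laplace-smoothed cluster sizes
                size = (self.cluster_size + self.eps) / (n + self.embed.size(0) * self.eps) * n
                self.embed.copy_(self.embed_avg / size.unsqueeze(1))

        vq_loss = self.beta * F.mse_loss(z, z_q.detach())    # commitment term only under EMA
        z_q = z + (z_q - z).detach()                         # straight-through estimator
        return z_q, vq_loss
```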

3️⃣ Decoder

  • Maps quantized latent z_q back to chrominance space (ab_pred).
  • Uses upsampling, residual blocks, and attention.
  • Output: predicted ab channels (2×H×W)
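An illustrative decoder sketch in PyTorch (attention blocks omitted for brevity); the channel widths, number of upsampling stages, and Tanh output range are assumptions:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)                # residual connection

class ABDecoder(nn.Module):
    """Residual upsampling decoder mapping z_q to ab channels; sizes are assumptions."""
    def __init__(self, in_ch=256, ch=128, num_up=4):
        super().__init__()
        layers = [nn.Conv2d(in_ch, ch, 3, padding=1)]
        for _ in range(num_up):                # each stage doubles spatial resolution
            layers += [ResBlock(ch),
                       nn.Upsample(scale_factor=2, mode="nearest"),
                       nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(ch, 2, 3, padding=1), nn.Tanh()]  # ab scaled to [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, z_q):
        return self.net(z_q)

ab_pred = ABDecoder()(torch.randn(1, 256, 14, 14))  # -> (1, 2, 224, 224)
```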

Training Command

```bash
python train.py --config configs/swinvq_color.yaml
```


📘 References

VQ-VAE: Neural Discrete Representation Learning (Oord et al., 2017)

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., 2021)

VQ-VAE-2: Generating Diverse High-Fidelity Images (Razavi et al., 2019)

Colorful Image Colorization (Zhang et al., 2016)


## 🧩 License
This project is licensed under the [MIT License](LICENSE) — see the LICENSE file for details.

