Text-to-Image Generation Using CLIP and U-Net

Author: Jay Shah

Date: 04-26-2024

This project implements a text-to-image generation model that uses a pre-trained CLIP model for text encoding and a simple U-Net architecture for image generation. The model learns to generate images from textual descriptions through diffusion-based denoising.

Introduction

This project utilizes:

CLIP Model: To encode text into feature vectors.
U-Net Architecture: To generate images by first down-sampling and then up-sampling the input.
Diffusion Model: To progressively refine the generated image over multiple iterations.

Dataset

The dataset consists of images of geometric shapes in various arrangements, each labeled with a text description. The images are stored in DATA_DIR, and each label describes a particular spatial configuration of shapes.

To access the dataset, email me at [email protected].

Preprocessing

The dataset is preprocessed by:

Loading images and filtering non-image files.
Applying transformations such as resizing, flipping, and normalizing pixel values.
Converting images to tensors for PyTorch processing.
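
The filtering and normalization steps above can be sketched with the standard library alone. The helper names `list_image_files` and `normalize_pixel`, the set of accepted extensions, and the [-1, 1] target range are assumptions for illustration; the project's actual transforms may differ.

```python
from pathlib import Path

# Assumed set of image extensions; the project's actual filter may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def list_image_files(data_dir):
    """Return the sorted image paths in data_dir, skipping non-image files."""
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def normalize_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the range [-1.0, 1.0]."""
    return value / 255.0 * 2.0 - 1.0
```

In a PyTorch pipeline the same normalization is usually applied to whole tensors rather than to single values.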

Model Architecture

The model consists of:

Text Encoder: A linear layer projecting CLIP-generated text features.
U-Net Architecture: A simplified U-Net with residual connections to preserve image details.
Time Embedding: To incorporate timestep information into the model.
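
One common way to feed the timestep to a denoising network is a sinusoidal embedding. The source does not say which embedding this project uses, so the sketch below is an assumption, and `timestep_embedding` and `dim` are illustrative names.

```python
import math


def timestep_embedding(t, dim):
    """Return a sinusoidal embedding of length dim for integer timestep t.

    The first half holds sines and the second half cosines, with
    frequencies spaced geometrically from 1 down to roughly 1/10000.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

Nearby timesteps get similar vectors and distant ones get distinct vectors, which gives the network a smooth signal for how much noise to expect.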

Training

Optimizer: Adam with a learning rate of 0.001.
Loss Function: L1 loss between predicted and actual noise in diffusion.
Training Duration: 100 epochs with periodic loss logging.
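
The training objective above can be illustrated in plain Python: noise is added to a clean image at a chosen timestep, and the L1 loss compares the true noise with the model's prediction. The linear beta schedule, its endpoints, the number of timesteps `T`, and all helper names are assumptions; the source does not specify them.

```python
import math
import random

T = 300  # assumed number of diffusion steps


def linear_betas(num_steps, start=1e-4, end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (end - start) / (num_steps - 1)
    return [start + step * i for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(x0, t, alpha_bars, rng=random):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + s * n for x, n in zip(x0, noise)], noise


def l1_loss(predicted, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(predicted, target)) / len(target)
```

A perfect noise prediction gives a loss of zero. In training, the prediction would come from the U-Net given the noisy image, the timestep, and the text features.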

Generating Images

To generate images from text:

text = "square on top of circle"
generate_text_to_image_samples(text)

This runs the model to iteratively refine an image based on the input text prompt.
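
The iterative refinement described above follows the general shape of a DDPM reverse loop. The sketch below is an assumption about that shape, not the project's `generate_text_to_image_samples` implementation: `predict_noise` stands in for the trained U-Net with the text prompt already bound, and the schedule values are illustrative.

```python
import math
import random


def make_schedule(num_steps, start=1e-4, end=0.02):
    """Return (betas, alpha_bars) for an assumed linear schedule."""
    betas = [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def sample(predict_noise, size, num_steps=300, rng=random):
    """Run a DDPM-style reverse loop from pure noise down to step 0.

    predict_noise(x, t) is a stand-in for the trained, text-conditioned
    U-Net; it must return a list of the same length as x.
    """
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(1.0 - betas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # simple variance choice: sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    # Clamp to the [-1, 1] range assumed for normalized pixels.
    return [max(-1.0, min(1.0, v)) for v in x]
```

Each pass removes the predicted noise and, except at the last step, re-injects a smaller amount of fresh noise, so the sample moves gradually from random values toward an image.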

Installation

Prerequisites:

Python 3.x
PyTorch
NumPy
Matplotlib
Pillow (PIL)
torchvision
OpenAI CLIP

Installation Steps:

pip install torch torchvision numpy matplotlib pillow clip-by-openai

Usage

Clone the repository:

git clone <repo_link>
cd <repo_folder>

Update DATA_DIR with the path to your dataset.
Run the training script:

python train.py

Generate images:

python generate.py

Results

The generated images improve progressively over the 100 training epochs. Sample results can be visualized using:

sample_plot_image()

Contact

For dataset access or inquiries, email me at:
📧 [email protected]
