SDXL Diffusion Model Training - Style & Objects Fine-Tuning
Prepared By: Adam Łucek
Video Walkthrough
On Style On Objects
https://towardsdatascience.com/the-arrival-of-sdxl-1-0-4e739d5cc6c7
Prompts: Inputs to the model consisting of a "Positive String" and a "Negative String", which
guide the model on what to include or avoid in the generated image.
Text Encoder I & II: These are two different encoders (OpenCLIP-ViT/G and CLIP-ViT/L)
used to transform the text inputs into embeddings. The embeddings from these encoders are
then concatenated to form a rich representation of the text input.
Refiner Text Embeddings: A refinement step that processes the concatenated embeddings
to optimize or enhance the information they contain, making them more suitable for generating
conditioned latents.
Seed & Gaussian Noise: Initial random inputs that, along with text embeddings, help in
generating the initial latent representations of the image.
Base Model Text Conditioned UNet: A UNet-based model (a type of convolutional neural
network that follows a U-shaped architecture, with an encoder for downsampling to capture
context and a decoder for upsampling to precisely localize features) that takes the initial latents
and the refined text embeddings to generate an initial version of the conditioned latents. This
step includes a "Reconstruct Scheduler" that determines how the latent space is iteratively
refined across several steps.
Refiner Model Text Conditioned UNet: An additional refinement stage using a UNet
model that further processes the conditioned latents to enhance the final image output, involving
multiple iterations as governed by another "Reconstruct Scheduler".
VAE Decoder: A variational autoencoder (VAE) decoder that converts the final conditioned
latents into the pixel space, resulting in the generation of the final output image, typically at a
high resolution like 1024x1024.
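To make these pieces concrete, here is a minimal, hedged sketch of how they fit together at inference time using the HuggingFace Diffusers library. The checkpoint names are the public SDXL Base and Refiner 1.0 releases; the prompt, the 80/20 denoising split, and the exact arguments are illustrative and may vary across Diffusers versions.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Base pipeline: text encoders I & II, the text-conditioned UNet, scheduler, and VAE
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# Refiner pipeline: shares the second text encoder and the VAE with the base model
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2, vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "an astronaut riding a horse, detailed oil painting"   # positive string
negative = "blurry, low quality"                                 # negative string

# The base UNet denoises the first ~80% of the steps and hands off latents...
latents = base(
    prompt=prompt, negative_prompt=negative,
    num_inference_steps=40, denoising_end=0.8, output_type="latent",
).images

# ...the refiner UNet finishes denoising, and its VAE decodes the latents to pixels
image = refiner(
    prompt=prompt, negative_prompt=negative,
    num_inference_steps=40, denoising_start=0.8, image=latents,
).images[0]
image.save("astronaut.png")
```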
For our fine-tuning efforts, we will focus on modifying three parts of this architecture to get
our desired results.
By modifying just these parts, we’ll be able to get noticeable results when training on our own
dataset!
https://towardsdatascience.com/you-cant-spell-diffusion-without-u-60635f569579
The fine-tuning process is mostly focused on training the UNet, which in our case will be the
base model shown in the diagram above. Training the entire UNet generally follows a process
similar to this:
1. Loading Pre-Trained Model: As we're fine-tuning the model, we want to re-use what
components we can. In the case of SDXL, we want to load the main base UNet,
tokenizers, text encoders I & II, and the variational autoencoder.
2. Pre-Process Data: The training data consists of two parts: an image and a corresponding
caption. To ensure the data is ready for training, it first undergoes some pre-processing.
This includes:
a. Resizing, cropping, and flipping the images to the desired resolution to ensure
uniformity of the training data.
b. Pre-computing the embeddings of both the text and the images.
i. Text data is first tokenized (broken down into pieces, typically words or
subwords), and each token is then transformed into a dense vector
embedding that represents semantic information about the text.
ii. The VAE takes an image and encodes it into a latent space—a
compressed representation that retains the critical features of the image
but in a more abstract form.
iii. Pre-computing these embeddings before training speeds up the overall
training process instead of calculating embeddings on the fly during each
training step.
3. Setting & Loading Hyperparameters: Instantiating all of your training hyperparameters,
like epochs, batch size, learning rate, etc. (or tuning them from prior runs). The training
loop then repeats the following for each batch:
a. Sample Noise: Random Gaussian noise is sampled for each training example; this
is the noise that will be added to the image latents and that the model learns to
predict and remove.
b. Sample Timestep: A timestep is sampled randomly for each image, simulating the
various stages of adding noise, which the model learns to reverse.
c. Add Noise to Image Input: Noise is added to the latents following the forward
diffusion process, with its magnitude determined by the sampled timestep.
d. Predict Noise Residual: The model predicts the noise that was added to the
latents (the noise residual), conditioned on the text embeddings.
e. Calculate Loss: Measures how far the model's prediction of noise is from the
actual noise that was added.
f. Backpropagate: The gradients of the loss are calculated with respect to each
parameter of the model. Using an optimizer like Adam, the model updates its
weights based on the gradients to minimize the loss.
g. Repeat & Validate: Repeating the training process with regular validations helps
monitor the model’s performance on unseen data, ensuring it generalizes well
beyond the training set.
Following these general model training steps, we're able to perform text-to-image fine-tuning
with our own dataset on the entire base UNet of the overall diffusion model. A minimal code
sketch of a single training step is shown below.
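The sketch below is an assumption-laden toy example built from Diffusers components rather than the actual training script: the image tensor and the text-conditioning tensors are random placeholders (a real run would use pre-computed embeddings from the two text encoders), and everything runs in full precision on the default device.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

# Load the reusable SDXL base components (step 1 above)
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

# Placeholder "pre-processed" image and text conditioning; a real run would use
# real pixel values and embeddings pre-computed by the two text encoders (step 2)
pixel_values = torch.randn(1, 3, 1024, 1024)
prompt_embeds = torch.randn(1, 77, 2048)              # concatenated text embeddings
added_cond = {"text_embeds": torch.randn(1, 1280),    # pooled text embedding
              "time_ids": torch.randn(1, 6)}          # SDXL size/crop conditioning

with torch.no_grad():
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor     # scale into the UNet's latent range

noise = torch.randn_like(latents)                      # (a) sample noise
timesteps = torch.randint(                             # (b) sample a timestep per image
    0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],)
)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)  # (c) forward diffusion

noise_pred = unet(noisy_latents, timesteps,            # (d) predict the noise residual
                  encoder_hidden_states=prompt_embeds,
                  added_cond_kwargs=added_cond).sample

loss = F.mse_loss(noise_pred, noise)                   # (e) loss vs. the actual noise
loss.backward()                                        # (f) backpropagate and update
optimizer.step()
optimizer.zero_grad()
```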
Note that this is the process for training the ENTIRE model itself; in practice this can be
expensive, tricky, and compute intensive. For more consumer-accessible methods, we take
advantage of "parameter-efficient fine-tuning" methods, defined below.
- Overfitting: Training your weights too specifically on the training data, causing the
model to reproduce or simply copy what it's been trained on. The goal for generative
models is to “learn” from the data to generate new data on its own, not to copy the data
itself.
- Catastrophic Forgetting: The tendency for neural networks to abruptly and drastically
forget previously learned information upon learning new information.
To address these issues, researchers at Microsoft developed the Low Rank Adaptation, or
LoRA, fine-tuning method.
a. Frozen Weights: The core architecture of LoRA involves keeping the main
pre-trained model weights unchanged (frozen), ensuring that the fundamental
knowledge the model has learned remains intact.
b. Injectable Modules: Rather than retraining all parameters, LoRA introduces
trainable low-rank matrices into each layer of the Transformer architecture.
These matrices are significantly smaller, reducing the computational load and
simplifying the adaptation process.
c. Full-Rank vs. Low-Rank: Normally, neural networks utilize dense layers with
full-rank weight matrices, which can be computationally intensive to update.
LoRA, however, employs low-rank matrices that capture the essential
transformations required for new tasks, effectively reducing the dimensionality
and complexity of the updates.
d. Reduced Resource Demands: This targeted approach cuts down on the
number of trainable parameters drastically, which lowers memory requirements
and computational costs, making it feasible to adapt large models on less
powerful hardware.
1. LoRA is specifically applied to the projection matrices within the Transformer’s
self-attention mechanism—namely the query, key, value, and output matrices. These
matrices play crucial roles:
a. Query: Determines how much attention to pay to each element in the input
sequence.
b. Key: Represents each element of the input so that queries can be matched
against it when computing attention scores.
c. Value: Contains the actual information from the input data that is retrieved after
computing the attention.
d. Output: Aggregates the weighted contributions from the 'value' matrix to form the
output of the attention layer.
2. Focus on Attention Weights: The adaptation strategy primarily modifies the attention
weights, which are critical for tailoring the model’s responses to specific tasks. The MLP
(multi-layer perceptron) modules remain unchanged to maintain the model's general
capabilities while enhancing task-specific performance.
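As a rough sketch of what this looks like in practice, the snippet below injects low-rank adapters into the SDXL UNet's attention projections using the PEFT library. It assumes a recent Diffusers release with the PEFT integration (the add_adapter method); the rank and alpha values are illustrative.

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.requires_grad_(False)  # freeze the pretrained weights

# Inject trainable low-rank A/B matrices into the attention projections
lora_config = LoraConfig(
    r=8,                      # rank of the update matrices
    lora_alpha=8,             # scaling applied to the low-rank update
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # query/key/value/output
)
unet.add_adapter(lora_config)

# Only the injected adapter parameters remain trainable
trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"{trainable:,} trainable / {total:,} total parameters")
```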
Supporting Context
1. Transformer:
a. Overview: The Transformer is a type of neural network architecture that has
become the foundation for many state-of-the-art models in natural language
processing (NLP). It was introduced by Vaswani et al. in the paper "Attention is
All You Need" in 2017.
b. Key Features: Unlike previous models that relied heavily on sequence-based
processing (e.g., RNNs and LSTMs), the Transformer uses a mechanism called
self-attention to process all parts of the input data simultaneously. This allows for
significantly improved parallelization during training and leads to better handling
of long-range dependencies in data.
c. Impact: Transformers have revolutionized the field of NLP, enabling the
development of highly effective models like BERT, GPT series, and others that
perform exceptionally well on a wide range of NLP tasks.
2. Attention Mechanism:
a. Purpose: The attention mechanism allows models to focus on different parts of
the input sequence when performing tasks like translation, summarization, or text
generation. This mechanism is integral to the Transformer architecture.
b. How It Works: In the context of Transformers, attention weights are computed
between all pairs of input and output positions. The weights determine how much
each part of the input should be considered for each output, allowing the model
to dynamically prioritize which parts of the input are most relevant.
3. Matrix Rank:
a. Mathematical Definition: In linear algebra, the rank of a matrix is the maximum
number of linearly independent column vectors in the matrix or, equivalently, the
maximum number of linearly independent row vectors. Rank provides a measure
of the information content of the matrix.
i. Simplified: In the context of a matrix, which you can think of as a grid
filled with numbers, the rank tells you how many different rows (or
horizontal lines of numbers) or columns (vertical lines of numbers) you
really need to keep the essential information in that grid. Each row or
column must add new, unique information that isn't already provided by
the others. So, the rank gives you the smallest number of rows or
columns that are needed to maintain all the information in the matrix
without any redundancy.
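A quick numerical illustration of why rank matters for LoRA: compare training a full d x d update matrix against training a rank-r pair of factors. The sizes below are chosen arbitrarily for the example.

```python
import numpy as np

d, r = 768, 8
print("dense update parameters:   ", d * d)           # 589,824
print("low-rank update parameters:", d * r + r * d)   # 12,288

# The product B @ A can never have rank greater than r,
# so the learned update lives in a much smaller subspace.
B = np.random.randn(d, r)
A = np.random.randn(r, d)
print("rank of B @ A:", np.linalg.matrix_rank(B @ A))  # 8
```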
Our examples and applications for fine-tuning will all take advantage of LoRA fine-tuning!
DreamBooth Fine-Tuning
While Text-to-Image training on the UNet can be good for initially training a model to generate
an image, or tuning it closer towards a style/type/genre of image, it lacks one specific capability:
reliable generation and recreation of specific subjects or objects. The DreamBooth method aims
to take text-to-image training one step further to tackle this problem.
Key Concepts of DreamBooth
Overview: DreamBooth is a method from Google Research that takes text-to-image generation
one step further by allowing personalized image synthesis with only minimal reference images.
It extends traditional large-scale text-to-image models, which, while capable of generating
diverse images from text prompts, generally fail to accurately recreate and recontextualize
specific subjects, or objects.
1. Binding a Unique Identifier to the Subject:
a. Image-Text Pairing: The model is fine-tuned with text prompts that include the
unique identifier along with a class noun that describes the subject (e.g., "a
[unique_token] dog"; commonly used tokens include: sks, [V], T0K). This pairs
the visual features of the subject with the text-based identifier.
b. Impact: This process embeds the subject into the model’s output domain (set of
possible outputs that the model can produce), enabling it to generate this subject
in various scenarios when prompted with the identifier.
2. Expanding the Model's Vocabulary:
a. Expansion Technique: By training the model on image-text pairs that include
the unique identifiers, DreamBooth effectively "teaches" the model to associate
these new tokens with specific visual characteristics of the subjects.
b. Generation: When generating new images, the model uses these learned
associations to accurately recreate the subject in response to prompts that
include the unique identifier.
3. Implementing Class-Specific Prior Preservation Loss:
a. Implementation: The CS-PPL loss function is designed to balance the model’s
ability to generate the specific subject with its general capability to produce
diverse images from the same class. It operates by comparing the outputs
generated from prompts with and without the unique identifiers. It ensures that
while the model becomes better at generating the specific subject, it does not
forget how to generate other plausible instances from the same class.
b. Purpose: To prevent the model from losing its ability to generate other instances
of the same class/subject (language drift) and to maintain high diversity in the
outputs (no overfitting). The loss adjusts the model's training process to ensure
that the introduction of a unique identifier does not skew the model’s overall
output diversity or its understanding of the class.
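Conceptually, the combined objective can be sketched in a few lines of PyTorch. The tensors below are random stand-ins for the UNet's noise predictions and targets on an instance batch ("a photo of sks chair") and a class batch ("a photo of a chair"); the shapes and the weighting are illustrative only.

```python
import torch
import torch.nn.functional as F

# Random stand-ins for the UNet's noise predictions and the true noise on:
#   - an instance batch (images of the specific subject), and
#   - a class batch (generic class images generated by the frozen model beforehand)
unet_out_instance = torch.randn(1, 4, 64, 64, requires_grad=True)
unet_out_class = torch.randn(1, 4, 64, 64, requires_grad=True)
noise_instance = torch.randn(1, 4, 64, 64)
noise_class = torch.randn(1, 4, 64, 64)

prior_loss_weight = 1.0  # how strongly to preserve the class prior

# Subject term: learn to denoise the specific subject's images
instance_loss = F.mse_loss(unet_out_instance, noise_instance)
# Prior-preservation term: keep denoising generic class images well,
# so the model does not drift away from what "a chair" means
prior_loss = F.mse_loss(unet_out_class, noise_class)

loss = instance_loss + prior_loss_weight * prior_loss
loss.backward()
```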
Applications of DreamBooth
Art Renditions: Creating artistic interpretations of subjects in the styles of famous painters or
sculptors, offering meaningful variations while preserving subject identity.
Novel View Synthesis: Rendering subjects from new perspectives, extrapolating from limited
views to generate images from unobserved angles.
Property Modification: Altering subject properties such as color or species, blending unique
features with new characteristics in a contextually appropriate manner.
Accessorization: Outfitting subjects with a variety of accessories tailored to specific scenarios,
preserving the subject's core identity.
DreamBooth now allows us to maintain a subject's likeness during image generation across
multiple applications. Using these tweaks to the Text-to-Image training method, we can combine
DreamBooth with existing LoRA techniques for subject permanence across scenes during generation!
Unconditional Image Generation
Description: Training models to generate images that are representative of the distribution
seen in the training data, without being steered by any specific textual or image-based
conditions.
Textual Inversion
Description: Technique to personalize and refine the process of creating specific visual
concepts, such as unique objects or artistic styles. It operates by introducing new "words" into
the embedding space of a pre-trained text-to-image model. These words are learned from a
small set of images (typically 3-5) that depict the concept the user wants to generate. Once
learned, these new embeddings function like any other word in the model's vocabulary, allowing
them to be used in sentences to guide the generation of images that feature the learned
concepts, similar to the DreamBooth method.
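For context, using a learned textual-inversion embedding at inference time with Diffusers looks roughly like this; the example uses a Stable Diffusion 1.5 pipeline and a public concept repository purely for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Loads the learned embedding and registers its placeholder token, "<cat-toy>"
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The new "word" can now be used like any other token in a prompt
image = pipe("a <cat-toy> sitting on a bookshelf, watercolor").images[0]
image.save("cat_toy.png")
```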
Custom Diffusion
Description: Method that enables diffusion models to quickly adapt to and synthesize new
user-defined concepts like personal items, pets, or family members. It utilizes a retrieval system
to gather real images with captions closely matching the new concepts provided by the user.
This collection forms the basis of a specialized training dataset for fine-tuning the model. A
modifier token is prefixed to general category names to denote personal concepts during the
training process.
Denoising Diffusion Policy Optimization (DDPO)
Description: Denoising Diffusion Policy Optimization (DDPO) treats the denoising process in
diffusion models as a multi-step decision-making problem. DDPO employs policy gradient
algorithms, demonstrating greater efficacy than conventional reward-weighted likelihood
methods. The application of DDPO allows for the refinement of text-to-image diffusion models to
meet specific objectives like enhancing image compressibility and improving image aesthetics
based on human feedback, without the direct reliance on explicit prompting methods. Moreover,
DDPO can utilize feedback from vision-language models, circumventing the need for extensive
data collection or manual human annotation.
T2I-Adapters
Description: Method allowing for more precise adjustments in attributes like color and structure.
It introduces the concept of T2I-Adapters, which are simple, lightweight modules that learn to
align the model's internal knowledge with external control signals. These adapters enable
fine-grained control over the image generation process by allowing the original Text-to-Image
models to remain unchanged while various adapters are trained to cater to different control
conditions. This achieves a more tailored and controlled generation of images.
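A hedged sketch of using one of these adapters with the SDXL pipeline in Diffusers; the adapter checkpoint name and the edge-map path are assumptions for illustration and may differ in your setup.

```python
import torch
from diffusers import T2IAdapter, StableDiffusionXLAdapterPipeline
from diffusers.utils import load_image

# A lightweight adapter trained to align SDXL with canny-edge control signals
adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    adapter=adapter, torch_dtype=torch.float16, variant="fp16",
).to("cuda")

edge_map = load_image("canny_edge_map.png")  # precomputed control image (assumed path)
image = pipe("a cozy cabin in the woods", image=edge_map).images[0]
image.save("adapter_output.png")
```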
ControlNet
Description: This architecture allows for the application of various conditioning controls, such
as edges, depth, segmentation, and human poses, which can be used singly or in combination,
with or without accompanying text prompts.
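For comparison, a minimal ControlNet sketch with Diffusers, conditioning a Stable Diffusion 1.5 pipeline on a canny edge map; the checkpoint names follow the common public releases, and the edge-map path is a placeholder.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# A ControlNet trained on canny edges, attached to a Stable Diffusion 1.5 pipeline
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_edges = load_image("canny_edge_map.png")  # precomputed edge image (assumed path)
image = pipe("a futuristic living room", image=canny_edges).images[0]
image.save("controlnet_output.png")
```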
InstructPix2Pix
Description: Method for editing images based on human instructions. The model operates by
receiving an input image alongside written instructions that detail the desired edits. Leveraging
the capabilities of two large pretrained models—a language model (GPT-3) and a text-to-image
model (Stable Diffusion)—InstructPix2Pix uses these resources to generate a substantial
dataset of image-editing examples for training. This training allows the model to generalize
effectively to new, real images and user-provided instructions during inference. Unlike many
other image editing models, InstructPix2Pix executes edits in a single forward pass without the
need for per-example fine-tuning or inversion, resulting in rapid processing times.
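A short, hedged usage sketch with the public InstructPix2Pix checkpoint in Diffusers; the input image path is a placeholder.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

original = load_image("photo.png")  # the image to edit (assumed path)
edited = pipe(
    "make it look like a watercolor painting",
    image=original, num_inference_steps=20,
).images[0]
edited.save("edited.png")
```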
Many other approaches and concepts are out there, but these are all documented well for
individuals using the HuggingFace Diffusers package, which we will dig into next!
1. Pipelines: A simple way to run state-of-the-art diffusion models in inference by bundling
all of the necessary components (multiple independently-trained models, schedulers,
and processors) into a single end-to-end class. Pipelines are flexible and they can be
adapted to use different schedulers or even model components.
2. Noise Schedulers: Interchangeable noise schedulers for balancing trade-offs between
generation speed and quality.
a. A scheduler takes a model’s output (the sample which the diffusion process is
iterating on) and a timestep to return a denoised sample. The timestep is
important because it dictates where in the diffusion process the step is; data is
generated by iterating forward n timesteps and inference occurs by propagating
backward through the timesteps. Based on the timestep, a scheduler may be
discrete in which case the timestep is an int or continuous in which case the
timestep is a float.
3. Pretrained Models: Pretrained models that can be used as building blocks, and
combined with schedulers, for creating your own end-to-end diffusion systems.
These components allow for great flexibility in building, optimizing, and running inference with
diffusion models of all modalities, either automatically with bundled pipelines or piece by piece
in a building-block style.
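As a small example of this building-block flexibility, a pipeline's noise scheduler can be swapped without touching the model weights. The sketch below replaces SDXL's default scheduler with a DPM-Solver multistep scheduler built from the same configuration; the prompt and step count are arbitrary.

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Interchangeable schedulers: build a new one from the existing scheduler's config
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a lighthouse at dusk", num_inference_steps=25).images[0]
image.save("lighthouse.png")
```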
They also provide great tools, and premade scripts, for training diffusion models, as we have
discussed in the prior informational sections.
Below, we will break down the setup and execution of two premade scripts, Text-to-Image LoRA
Training, and DreamBooth LoRA training for SDXL Base 1.0.
NOTE: The following tables show the differences between the DreamBooth LoRA script and the
prior Text-to-Image LoRA script. Existing arguments from the Text-to-Image script still apply,
along with these new arguments, unless specified as removed in the Removed Arguments table.
New Arguments

Script Argument: --instance_data_dir
Value Type: String
Default Value: None
Description: A folder in the filesystem containing the training data.
Breakdown of Arguments (Text-to-Image LoRA Training):
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0
Uses the Stable Diffusion XL Base 1.0 model from Stability AI as the starting point for training.
This pretrained model provides a solid foundation, leveraging previously learned features to
improve training efficiency and outcomes.
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix"
Uses a specific VAE (Variational Autoencoder) model for image encoding/decoding, which helps
in improving image quality and reducing artifacts during training, especially with mixed precision
(fp16).
--dataset_name="AdamLucek/oldbookillustrations-small"
Specifies a dataset of old book illustrations for training. This dataset defines the visual style and
content that the model will learn to generate.
--num_validation_images=4
Generates 4 images during each validation phase. This provides a small sample to assess the
model’s performance without significantly impacting training time.
--validation_epochs=1
Runs the validation process after every epoch. This frequent validation helps in monitoring the
model's progress and identifying potential issues early.
--output_dir="output/sdxl-base-1.0-oldbookillustrations-lora"
Specifies the directory where all training outputs, including models and logs, will be saved. This
keeps training results organized and easily accessible.
--resolution=1024
Sets the resolution of the training and validation images to 1024x1024 pixels. Higher resolution
improves image detail but requires more computational resources.
--center_crop
Crops images to the center before training, ensuring uniformity in image size and focusing on the
central part of the images, which is often the most relevant.
--random_flip
Applies random horizontal flips to training images, augmenting the dataset and helping the model
generalize better by learning from more varied image orientations.
--train_text_encoder
Enables training of the text encoder along with the image generator. This improves the model’s
ability to understand and generate images based on textual descriptions.
--train_batch_size=1
Processes one image per batch per device during training. Smaller batch sizes can reduce
memory usage but may slow down the training process.
--num_train_epochs=10
Trains the model for 10 epochs, meaning the entire dataset will be passed through the model 10
times. More epochs can improve learning but require more time.
--checkpointing_steps=500
Saves a checkpoint of the model every 500 steps. This allows for saving intermediate states,
making it possible to resume training or rollback if needed.
--gradient_accumulation_steps=4
Accumulates gradients over 4 steps before performing a backward pass. This effectively
simulates a larger batch size, helping to stabilize training without requiring more memory.
--learning_rate=1e-04
Sets the initial learning rate to 0.0001. The learning rate controls how much to adjust the model’s
weights with respect to the loss gradient. A smaller learning rate can lead to more stable but
slower convergence.
--lr_warmup_steps=0
No warm-up period for the learning rate. Warm-up steps can help prevent large updates early in
training, which can destabilize the model.
--report_to="wandb"
Uses Weights & Biases for logging and monitoring the training process. This provides
visualization tools and tracking capabilities for better experiment management.
--dataloader_num_workers=8
Uses 8 subprocesses for data loading. More workers can speed up data loading and
preprocessing, reducing bottlenecks and improving training efficiency.
--allow_tf32
Enables TF32 computations on Ampere GPUs, potentially speeding up training by allowing lower
precision calculations while maintaining acceptable accuracy.
--mixed_precision="fp16"
Uses 16-bit floating point precision for training, which reduces memory usage and can speed up
training on compatible hardware without significantly affecting model accuracy.
--push_to_hub
Pushes the trained model to the Hugging Face Hub, making it accessible for sharing and
deployment. This integrates the model into a broader ecosystem for collaboration.
--hub_model_id="sdxl-base-1.0-oldbookillustrations-lora"
Sets the identifier for the model on the Hugging Face Hub, organizing it under a specific name
and version for easy access and management.
Breakdown of Arguments (DreamBooth LoRA Training):
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0"
Utilizes the Stable Diffusion XL Base 1.0 model from Stability AI as the starting point for training.
This pretrained model offers a robust foundation, leveraging prior knowledge to enhance training
efficiency and outcomes.
--dataset_name="AdamLucek/green-chair"
Specifies the dataset of green chair images for training. This dataset will determine the specific
visual features and style that the model will learn to replicate.
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix"
Uses a particular VAE (Variational Autoencoder) model for encoding and decoding images, which
helps improve image quality and reduce artifacts, especially when using mixed precision (fp16).
--output_dir="lora-trained-xl"
Sets the directory where all training outputs, including models and logs, will be saved. This keeps
the training results organized and easily accessible.
--train_text_encoder
Enables the training of the text encoder alongside the image generator. This improves the
model’s ability to understand and generate images based on textual descriptions.
--train_batch_size=1
Processes one image per batch per device during training. Smaller batch sizes can reduce
memory usage but may slow down the training process.
--gradient_accumulation_steps=4
Accumulates gradients over 4 steps before performing a backward pass. This effectively
simulates a larger batch size, helping to stabilize training without requiring more memory.
--learning_rate=1e-4
Sets the initial learning rate to 0.0001. The learning rate controls the magnitude of updates to the
model’s weights. A smaller learning rate can lead to more stable but slower convergence.
--lr_scheduler="constant"
Uses a constant learning rate throughout the training process, which helps maintain consistent
updates to the model's weights without adjustments.
--lr_warmup_steps=0
No warm-up period for the learning rate. Warm-up steps can help prevent large updates early in
training, which can destabilize the model.
--max_train_steps=500
Limits the training process to a maximum of 500 steps. This cap helps to prevent overfitting and
manage computational resources effectively.
--validation_epochs=5
Runs the validation process every 5 epochs. Frequent validation helps monitor the model's
progress and identify potential issues early.
--seed="0"
Sets the random seed to 0 for reproducibility. This ensures that the training process can be
replicated exactly, which is important for debugging and verifying results.
--hub_model_id="sdxl-base-1.0-greenchair-dreambooth-lora"
Assigns an identifier for the model on the Hugging Face Hub. This organizes the model under a
specific name and version for easy access and management.
--push_to_hub
Pushes the trained model to the Hugging Face Hub, making it accessible for sharing and
deployment. This integrates the model into a broader ecosystem for collaboration.
Running A Script Example
The Setup
Pick your favorite cloud computing service, and snag a GPU. Here are some popular spots, in no
particular order or endorsement (someone please sponsor me):
Runpod - https://www.runpod.io/
Lambda Labs - https://lambdalabs.com/
Tensordock - https://www.tensordock.com/
Hyperstack - https://www.hyperstack.cloud/
Vast - https://vast.ai/
Coreweave - https://www.coreweave.com/
Paperspace - https://www.paperspace.com/
Google Cloud Platform - https://cloud.google.com/?hl=en
Amazon Web Service - https://aws.amazon.com/
Microsoft Azure - https://azure.microsoft.com/en-us
This is a nonexhaustive list; find the option that works for your budget and GPU requirements.
For example, the two scripts were each trained on 1x A100 with 40GB of VRAM. You can skip
validation inference runs to free up memory, and other parameters (such as gradient
checkpointing or an 8-bit optimizer) can be added or modified for lesser systems.
Step 1:
Connect to your environment remotely and navigate to the command line interface.
Step 2:
Navigate to the training script's directory inside the diffusers repository:
cd examples/text_to_image
Step 3:
Log into HuggingFace (and Weights & Biases, if using it for logging/tracking):
huggingface-cli login
Step 4:
Step 5:
Run the script! Make sure to update the output directory and model ID if copying the script
below to your own.
As always, tuning your hyperparameters and figuring out what training method works best on
your system or setup is nothing short of an art. Tinker around, see what error codes get thrown,
monitor your training progress, and eventually you’ll have a model training the way you want.
Step 6:
Load and use your model! The above training script took roughly 3 and a half hours to fully train
on my dataset. In the next section, we have notebooks outlining the code used to load and run
inference with the trained model.
HuggingFace Hub:
AdamLucek/sdxl-base-1.0-oldbookillustrations-lora
Dataset:
AdamLucek/oldbookillustrations-small
Example Results:
“A dachshund walks confidently down a dirt path”
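As a preview of those notebooks, loading the Text-to-Image LoRA adapter listed above for inference might look roughly like this (a hedged sketch using the Hub ID from the listing above, not the exact notebook code):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Attach the LoRA weights trained above, pulled from the Hugging Face Hub
pipe.load_lora_weights("AdamLucek/sdxl-base-1.0-oldbookillustrations-lora")

image = pipe("A dachshund walks confidently down a dirt path").images[0]
image.save("dachshund_illustration.png")
```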
HuggingFace Hub:
AdamLucek/sdxl-base-1.0-greenchair-dreambooth-lora
Dataset:
AdamLucek/green-chair
Reference Image:
Example Results:
“A photo of sks chair in front of the great pyramids”
“A photo of sks chair in the middle of a new england fall field, with autumn leaves all around and
pumpkins”