Add ddim inversion pix2pix #2397

patrickvonplaten · 2023-02-17T12:11:01Z

Pix2Pix0: Generate Caption -> Invert -> Generate Image:

import torch
from transformers import BlipForConditionalGeneration, BlipProcessor
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline
import requests
from PIL import Image

captioner_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(captioner_id)
model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)
sd_model_ckpt = "CompVis/stable-diffusion-v1-4"
pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    sd_model_ckpt,
    caption_generator=model,
    caption_processor=processor,
    torch_dtype=torch.float16,
    safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()

img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB").resize((512, 512))
caption = pipeline.generate_caption(raw_image)

generator = torch.manual_seed(0)
inv_latents = pipeline.invert(caption, image=raw_image, generator=generator).latents

# See the "Generating source and target embeddings" section below to
# automate the generation of these captions with a pre-trained model like Flan-T5.
source_prompts = ["a cat sitting on the street", "a cat playing in the field", "a face of a cat"]
target_prompts = ["a dog sitting on the street", "a dog playing in the field", "a face of a dog"]
source_embeds = pipeline.get_embeds(source_prompts, batch_size=2)
target_embeds = pipeline.get_embeds(target_prompts, batch_size=2)
image = pipeline(
    caption,
    source_embeds=source_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
    generator=generator,
    latents=inv_latents,
    negative_prompt=caption,
).images[0]
image.save("edited_image.png")

Source image:

Generated image:

sayakpaul · 2023-02-17T12:27:43Z

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py

+        device = torch.device(f"cuda:{gpu_id}")
+
+        hook = None
+        for cpu_offloaded_model in [self.vae, self.text_encoder, self.unet, self.vae]:


Repetition in the self.vae. You probably meant self.captioner?

That's actually on purpose so that the first self.vae is offloaded when the text encoder is called 😅

cc @pcuenca this should work no? Since inversion is img2img

The hook set up for the first self.vae will be replaced by the one added last, which unloads the unet (which is not loaded by the time the first self.vae is called, so it should be ok). It's a bit confusing though 😅.

Suggested change

-        for cpu_offloaded_model in [self.vae, self.text_encoder, self.unet, self.vae]:
+        # `vae` added twice to ensure it unloads when the `text_encoder` is used
+        for cpu_offloaded_model in [self.vae, self.text_encoder, self.unet, self.vae]:

Another option is to offload the vae manually whenever we use it.

Could you elaborate on this a bit? I didn't really understand it.

Hmm, no worries. I will sync up with you offline on this next week.
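
The hook chaining discussed in this thread can be pictured with a toy model. The sketch below is not the accelerate or diffusers implementation: `ToyModel`, `ToyHook`, and `attach` are hypothetical names, and it assumes, as described above, that attaching a second hook to `vae` replaces the first while the first hook object stays usable as the text encoder's "previous" hook.

```python
class ToyHook:
    """Handle that can push its model back to the CPU (hypothetical stand-in)."""

    def __init__(self, model):
        self.model = model

    def offload(self):
        self.model.device = "cpu"


class ToyModel:
    """Minimal stand-in for a pipeline component with a pre-forward offload hook."""

    def __init__(self, name):
        self.name = name
        self.device = "cpu"
        self.prev_hook = None  # hook of the model to unload before this one runs

    def __call__(self):
        if self.prev_hook is not None:
            self.prev_hook.offload()  # unload the previously used model
        self.device = "cuda"  # then "load" this model


def attach(model, prev_hook):
    # Assumption: a later call for the same model replaces the earlier setting.
    model.prev_hook = prev_hook
    return ToyHook(model)


vae, text_encoder, unet = ToyModel("vae"), ToyModel("text_encoder"), ToyModel("unet")

# Same ordering as in the diff: `vae` appears first and last.
hook = None
for model in [vae, text_encoder, unet, vae]:
    hook = attach(model, hook)

# Inversion runs img2img style: encode with vae, then text encoder, unet, vae again.
trace = []
for model in [vae, text_encoder, unet, vae]:
    model()
    trace.append([m.name for m in (vae, text_encoder, unet) if m.device == "cuda"])

print(trace)
```

In this toy setup only one component sits on the GPU after each call: the first `vae` hook is what lets the text encoder unload the `vae`, and the hook that `vae` itself ends up with unloads the `unet`.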

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py

src/diffusers/schedulers/scheduling_ddim_inverse.py

sayakpaul

LOVE the design 🔥

Let's make sure the docs cover this in full.

…add_ddim_inversion_pix2pix

src/diffusers/schedulers/scheduling_ddim_inverse.py

…ngface/diffusers into add_ddim_inversion_pix2pix

patrickvonplaten · 2023-02-17T14:22:52Z

No time for DDIM inversion tests right now; will add them later: #2399

patil-suraj

Looks great. Thanks a lot for adding this so quickly! I mostly left some nits. My main comment: I'm not sure we need to add a completely new scheduler just for the inverse step. Another option would be to add an inverse_step method to the scheduler.

docs/source/en/_toctree.yml

docs/source/en/api/schedulers/ddim_inverse.mdx

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py

src/diffusers/schedulers/scheduling_ddim_inverse.py

patil-suraj · 2023-02-17T14:40:22Z

src/diffusers/schedulers/scheduling_ddim_inverse.py

+class DDIMInverseScheduler(SchedulerMixin, ConfigMixin):
+    """
+    DDIMInverseScheduler is the reverse scheduler of [`DDIMScheduler`].
+
+    [`~ConfigMixin`] takes care of storing all config attributes that are passed in the scheduler's `__init__`
+    function, such as `num_train_timesteps`. They can be accessed via `scheduler.config.num_train_timesteps`.
+    [`SchedulerMixin`] provides general loading and saving functionality via the [`SchedulerMixin.save_pretrained`] and
+    [`~SchedulerMixin.from_pretrained`] functions.
+
+    For more details, see the original paper: https://arxiv.org/abs/2010.02502
+
+    Args:
+        num_train_timesteps (`int`): number of diffusion steps used to train the model.
+        beta_start (`float`): the starting `beta` value of inference.
+        beta_end (`float`): the final `beta` value.
+        beta_schedule (`str`):
+            the beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from
+            `linear`, `scaled_linear`, or `squaredcos_cap_v2`.
+        trained_betas (`np.ndarray`, optional):
+            option to pass an array of betas directly to the constructor to bypass `beta_start`, `beta_end` etc.
+        clip_sample (`bool`, default `True`):
+            option to clip predicted sample between -1 and 1 for numerical stability.
+        set_alpha_to_one (`bool`, default `True`):
+            each diffusion step uses the value of alphas product at that step and at the previous one. For the final
+            step there is no previous alpha. When this option is `True` the previous alpha product is fixed to `1`,
+            otherwise it uses the value of alpha at step 0.
+        steps_offset (`int`, default `0`):
+            an offset added to the inference steps. You can use a combination of `offset=1` and
+            `set_alpha_to_one=False`, to make the last step use step 0 for the previous alpha product, as done in
+            stable diffusion.
+        prediction_type (`str`, default `epsilon`, optional):
+            prediction type of the scheduler function, one of `epsilon` (predicting the noise of the diffusion
+            process), `sample` (directly predicting the noisy sample) or `v_prediction` (see section 2.4
+            https://imagen.research.google/video/paper.pdf)
+    """
+
+    order = 1
+
+    @register_to_config
+    def __init__(
+        self,


This looks good, but to be honest I don't think we should have a new scheduler for this. This is not really a new scheduler, just that the step is inverted. What do we think about adding a method called inverse_step?
#2328 (comment)

So, @patrickvonplaten and I talked a bit about it yesterday and we both agreed that having it in a separate scheduler is helpful in terms of a simpler API.

If we add an inverse_step() method, there is a slight disconnect from the original DDIM paper, which didn't have anything for inversion. Since we try to stay close to the paper literature, I think it makes sense to have a separate scheduler for this as well.
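
For readers wondering what "the step is inverted" amounts to, here is a minimal sketch, assuming epsilon prediction and deterministic DDIM (eta = 0). It is not the code of `DDIMInverseScheduler`; `ddim_move` and the numeric values are made up for illustration. The same update formula is applied in both directions, only the source and target noise levels are swapped.

```python
import math


def ddim_move(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update from cumulative alpha `alpha_from` to `alpha_to`.

    x is the current sample, eps the predicted noise (scalars here for brevity).
    """
    # Predict the clean sample from the current one.
    x0_pred = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    # Re-noise the prediction to the target noise level.
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * eps


# Made-up values: alpha_t is the noisier level, alpha_prev the cleaner one.
alpha_t, alpha_prev = 0.5, 0.8
x_t, eps = 0.3, -1.2

x_prev = ddim_move(x_t, eps, alpha_t, alpha_prev)     # regular (denoising) step
x_back = ddim_move(x_prev, eps, alpha_prev, alpha_t)  # inverse step, towards noise

print(x_t, x_prev, x_back)
```

With an identical `eps` in both directions the round trip recovers `x_t` up to floating-point error. In an actual inversion the noise is predicted by the model at the current latent, so the two predictions differ and the inversion is only approximate.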

Co-authored-by: Suraj Patil <[email protected]>

…ngface/diffusers into add_ddim_inversion_pix2pix

docs/source/en/api/schedulers/ddim_inverse.mdx

src/diffusers/schedulers/scheduling_ddim_inverse.py

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py

pcuenca · 2023-02-17T14:51:12Z

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py

src/diffusers/schedulers/scheduling_ddim_inverse.py

Co-authored-by: Pedro Cuenca <[email protected]>

…ngface/diffusers into add_ddim_inversion_pix2pix

neverix · 2023-02-28T19:41:42Z

Nice, basically a modern version of #702

* add
* finish
* add tests
* add tests
* up
* up
* pull from main
* uP
* Apply suggestions from code review
* finish
* Update docs/source/en/_toctree.yml
  Co-authored-by: Suraj Patil <[email protected]>
* finish
* clean docs
* next
* next
* Apply suggestions from code review
  Co-authored-by: Pedro Cuenca <[email protected]>
* up
* up

---------

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>

patrickvonplaten added 2 commits February 17, 2023 11:06

add

9270da7

finish

3e807c8