Add GLIGEN Text Image implementation #4777
Conversation
@patrickvonplaten @sayakpaul @stevhliu

Model checkpoint available on Hugging Face.

Can you check, @nikhil-masterful?

I'm a little under the weather right now, but will take a look over the weekend.
Hey @tuanh123789, thanks a lot for your PR! Can we maybe put this pipeline in the community folder: https://github.com/huggingface/diffusers/tree/main/examples/community? We're sadly a bit overwhelmed by maintenance and aren't able to keep our slow tests clean. If we put this pipeline in the community folder, you can be the official author and the burden of maintaining the pipeline is loosened. Would that be ok for you?

@patrickvonplaten

Which use cases of GLIGEN are we missing?

@sayakpaul
This is very interesting. Could you provide a couple of samples for us to see and appreciate this advantage? @patrickvonplaten, considering this feature offers a sort of zero-shot subject-driven generation, I'd be in favor of graduating this pipeline into the core.

The documentation is not available anymore as the PR was closed or merged.

Of course, here are some samples with reference images: https://drive.google.com/drive/folders/1hIO-wvEoiTFWqCckdOgO7k0ilUniz9pQ?usp=sharing

In my experience, using high-resolution reference images helps improve the quality of the generated images.

So far downloads have not been going up that much IMO: if you feel strongly about it @sayakpaul, let's add it to core (not 100% convinced at the moment, but happy to be proven wrong!)

@sayakpaul

Wow! The results indeed seem quite amazing here. Thanks for your contributions!
src/diffusers/models/attention.py
Outdated
# 4. Fuser
-if attention_type == "gated":
+if attention_type == "gated" or attention_type == "GatedTextImage":
"gated_text_image" is a better name here IMO.
-if attention_type == "gated" or attention_type == "GatedTextImage":
+if attention_type == "gated" or attention_type == "gated-text-image":
Can we stick to lowercase here? :-)
self.position_net = PositionNet(positive_len=positive_len, out_dim=cross_attention_dim)

elif attention_type == "GatedTextImage":
Same as above.
>>> boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
>>> phrases = None
>>> gligen_image = load_image(
...     "https://www.southernliving.com/thmb/6jANEFrMvwSWlRlxCDCzulxXQZY=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/2641101_Funfetti_Cake_702-2000-a2d8f835fd8f4a928fa17222e71241c3.jpg"
Could we please use a shorter image URL here? Feel free to submit a PR to https://huggingface.co/datasets/huggingface/documentation-images/tree/main/diffusers and use the links from there.
>>> # Insert objects described by image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Inpainting_Text_Image", torch_dtype=torch.float16
These checkpoints might need regeneration if we modify the attention type value as suggested above.
Also, let's ensure the checkpoints uploaded have thorough model cards.
Let's add some visual examples to the model cards as mentioned earlier :-)
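For context, a minimal end-to-end sketch of the inpainting variant discussed in this thread. The checkpoint name, bounding box, and cake image URL come from the docstring fragments above; the input-image URL and the remaining argument names (e.g. gligen_inpaint_image, gligen_scheduled_sampling_beta) are assumptions about the final API, not the merged docstring:

    import torch
    from diffusers import StableDiffusionGLIGENTextImagePipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
        "anhnct/Gligen_Inpainting_Text_Image", torch_dtype=torch.float16
    ).to("cuda")

    # Scene to inpaint (hypothetical URL on the documentation-images dataset
    # suggested above) and a reference image describing the object to insert.
    input_image = load_image(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
    )
    gligen_image = load_image(
        "https://www.southernliving.com/thmb/6jANEFrMvwSWlRlxCDCzulxXQZY=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/2641101_Funfetti_Cake_702-2000-a2d8f835fd8f4a928fa17222e71241c3.jpg"
    )

    image = pipe(
        prompt="a birthday cake",
        gligen_phrases=None,               # grounded on the reference image, not on text
        gligen_images=[gligen_image],
        gligen_boxes=[[0.2676, 0.6088, 0.4773, 0.7183]],
        gligen_inpaint_image=input_image,  # only the boxed region is regenerated
        gligen_scheduled_sampling_beta=1.0,
        num_inference_steps=50,
    ).images[0]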
sayakpaul left a comment:
Looks quite amazing to me.
- How did you convert the checkpoints to make them compatible with diffusers? If you had to perform any conversion, then the conversion script should also be in this PR.
- Let's add a doc entry to https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/gligen so that our users are aware of these capabilities.
Also, we won't be able to merge until and unless the CI is green.

@sayakpaul

I'll fix this too

"Inpainting conditioned on image" - this is great! Thanks for adding it, @tuanh123789. @sayakpaul I'm happy to review it if it's helpful.

Hi @sayakpaul,
src/diffusers/models/embeddings.py
Outdated
class PositionNet(nn.Module):
-    def __init__(self, positive_len, out_dim, fourier_freqs=8):
+    def __init__(self, positive_len, out_dim, feature_type, fourier_freqs=8):
-def __init__(self, positive_len, out_dim, feature_type, fourier_freqs=8):
+def __init__(self, positive_len, out_dim, feature_type="text-only", fourier_freqs=8):
for backwards compatibility
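With the default in place, call sites that predate the new argument keep working unchanged. A quick illustration (the constructor values and the "text-image" feature-type string are assumptions for the sake of the example):

    # Old call site: still valid, since feature_type now defaults to "text-only".
    position_net = PositionNet(positive_len=768, out_dim=768)

    # New text+image variant opts in explicitly (value string assumed).
    position_net_ti = PositionNet(positive_len=768, out_dim=768, feature_type="text-image")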
src/diffusers/models/embeddings.py
Outdated
    return objs


class CLIPImageProjection(ModelMixin, ConfigMixin):
Can we move this into a new clip_image_project_model.py file inside the stable_diffusion folder?
We should not have model mixin classes in the src/diffusers/models/embeddings.py file
Ok, I'll fix this
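For reference, such a standalone module can stay quite small. A sketch of what the new clip_image_project_model.py could contain, assuming the projection is a single linear layer over the CLIP image embedding (the hidden size and exact signature in the merged file may differ):

    import torch.nn as nn

    from ...configuration_utils import ConfigMixin, register_to_config
    from ...models.modeling_utils import ModelMixin


    class CLIPImageProjection(ModelMixin, ConfigMixin):
        @register_to_config
        def __init__(self, hidden_size: int = 768):
            super().__init__()
            # single linear map from the CLIP image embedding space
            self.project = nn.Linear(hidden_size, hidden_size, bias=False)

        def forward(self, x):
            return self.project(x)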
except OptionalDependencyNotAvailable:
    from ...utils.dummy_torch_and_transformers_objects import *  # noqa F403
else:
    from .clip_image_project_model import CLIPImageProjection
ok!
patrickvonplaten left a comment:
Looks good to me now! Ok to merge once @sayakpaul gives the 🟢 light!
Great job @tuanh123789 !
Thank you for the support.
src/diffusers/models/embeddings.py
Outdated
-# embedding position (it may includes padding as placeholder)
-xyxy_embedding = self.fourier_embedder(boxes)  # B*N*4 -> B*N*C
-
-# learnable null embedding
-positive_null = self.null_positive_feature.view(1, 1, -1)
+xyxy_embedding = self.fourier_embedder(boxes)
Why are the comments going away?
-objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
+if positive_embeddings is not None:
+    positive_null = self.null_positive_feature.view(1, 1, -1)
+    positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null
+
+    objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
Cool! I would still prefer to have the comments as this part of the code is a bit involved to navigate through.
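One way to address this: keep the inline comments next to the new branching logic. A sketch of the text branch, assuming the attribute names visible in the diffs above (null_position_feature is inferred by analogy with null_positive_feature):

    # embedding position (it may include padding as a placeholder)
    xyxy_embedding = self.fourier_embedder(boxes)  # B*N*4 -> B*N*C

    # learnable null embedding
    xyxy_null = self.null_position_feature.view(1, 1, -1)

    # replace padding with the learnable null embedding
    xyxy_embedding = xyxy_embedding * masks + (1 - masks) * xyxy_null

    # position net with text-only information
    if positive_embeddings is not None:
        # learnable null embedding
        positive_null = self.null_positive_feature.view(1, 1, -1)

        # replace padding with the learnable null embedding
        positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null

        objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))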
)

-if attention_type == "gated":
+if attention_type in ["gated", "gated-text-image"]:
Cool!
    )
    return image, has_nsfw_concept

def prepare_extra_step_kwargs(self, generator, eta):
Is it not copied from?
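(For context: diffusers marks methods duplicated from another pipeline with a # Copied from comment so that make fix-copies keeps them in sync. If this method is identical to the StableDiffusionPipeline one, it could carry the marker; a sketch of how that usually looks:)

    import inspect  # normally at the top of the pipeline module

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
    def prepare_extra_step_kwargs(self, generator, eta):
        # prepare extra kwargs for the scheduler step, since not all schedulers share a signature;
        # eta is only used with the DDIMScheduler and is ignored by other schedulers
        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check whether the scheduler accepts a generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs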
    extra_step_kwargs["generator"] = generator
    return extra_step_kwargs

def check_inputs(
No copy? If not, ignore the comment.
    return out

def get_cross_attention_kwargs_without_grounded(self, hidden_size, repeat_batch, max_objs, device):
Much better name this one! Thank you.
inputs = self.processor(images=[input], return_tensors="pt").to(device)
outputs = self.image_encoder(**inputs)
feature = outputs.image_embeds
feature = self.image_project(feature).squeeze(0)
Sorry for iterating here a bit.
We do have a projection class for the CLIP vision tower in transformers: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModelWithProjection.
Is it possible to leverage the pre-trained projection matrix, populate the projection layer of that class with it, and serialize that? That way we won't need a separate class for it.
@patrickvonplaten WDYT?
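(For concreteness, a sketch of the suggested alternative. The transformers calls are the standard API; the encoder checkpoint is illustrative, and loading GLIGEN's pre-trained projection matrix into visual_projection is the step that would need verifying:)

    from diffusers.utils import load_image
    from transformers import CLIPProcessor, CLIPVisionModelWithProjection

    # CLIPVisionModelWithProjection already ends in a `visual_projection` layer;
    # GLIGEN's pre-trained projection matrix could be loaded into that layer and
    # the result serialized, removing the need for a separate CLIPImageProjection.
    image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    image = load_image(
        "https://www.southernliving.com/thmb/6jANEFrMvwSWlRlxCDCzulxXQZY=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/2641101_Funfetti_Cake_702-2000-a2d8f835fd8f4a928fa17222e71241c3.jpg"
    )
    inputs = processor(images=[image], return_tensors="pt")
    outputs = image_encoder(**inputs)
    feature = outputs.image_embeds  # already projected; no extra self.image_project step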
sayakpaul left a comment:
Thanks a lot for working on this one. Just left a final round of comments.
Hi @sayakpaul

Thanks for your amazing contributions ❤️

Thank you so much for the big support @sayakpaul @patrickvonplaten. Without everyone's help I could not have completed it. Extremely grateful!

Great work @tuanh123789!
* Add GLIGEN Text Image implementation
* add style transfer from image
* fix check_repository_consistency
* add convert script GLIGEN model to Diffusers
* rename attention type
* fix style code
* remove PositionNetTextImage
* Revert "fix check_repository_consistency" (this reverts commit 15f098c)
* change attention type name
* update docs for GLIGEN
* change examples with hf-document-image
* fix style
* add CLIPImageProjection for GLIGEN
* Add new encode_prompt, load project matrix in pipe init
* move CLIPImageProjection to stable_diffusion
* add comment

What does this PR do?
In #4441, there is a mention of the GLIGEN model, but I noticed that it doesn't include all the models referenced in the paper. In this Pull Request, I propose adding two additional models: Generation and Inpainting, with inputs including Text, Box, and particularly Image. With these models, users can input any object without needing to use Textual Inversion, DreamBooth, or LoRA.
Both checkpoints are converted from the official weights.
Fixes # (issue)
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.