
Conversation

@tuanh123789
Contributor

@tuanh123789 tuanh123789 commented Aug 25, 2023

What does this PR do?

#4441 adds the GLIGEN model, but I noticed that it doesn't include all the models referenced in the paper. In this pull request, I propose adding two additional models, Generation and Inpainting, whose inputs include text, boxes, and in particular images. With these models, users can insert any object without needing Textual Inversion, DreamBooth, or LoRA.
Both checkpoints are converted from the official weights.

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@tuanh123789
Contributor Author

@patrickvonplaten @sayakpaul @stevhliu
Could anyone please review this? Thank you!

@tuanh123789
Contributor Author

Model checkpoints are available on the Hugging Face Hub:
Inpainting: anhnct/Gligen_Inpainting_Text_Image
Generation: anhnct/Gligen_Text_Image
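
For anyone who wants to try them quickly, here is a minimal loading sketch. It assumes the StableDiffusionGLIGENTextImagePipeline class added in this PR and a CUDA device; treat it as illustrative rather than the final documented usage.

```python
# Minimal loading sketch (assumes the StableDiffusionGLIGENTextImagePipeline
# class added in this PR and a CUDA-capable machine).
import torch
from diffusers import StableDiffusionGLIGENTextImagePipeline

# Generation checkpoint (text + box + reference-image grounding)
pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
    "anhnct/Gligen_Text_Image", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The inpainting variant loads the same way from "anhnct/Gligen_Inpainting_Text_Image".
```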

@sayakpaul
Member

Cc: @nikhil-masterful

@tuanh123789
Contributor Author

Can you take a look, @nikhil-masterful?

@nikhil-masterful
Contributor

I'm a little under the weather right now, but will take a look over the weekend

@patrickvonplaten
Contributor

Hey @tuanh123789 ,

Thanks a lot for your PR! Can we maybe put this pipeline in the community folder: https://github.com/huggingface/diffusers/tree/main/examples/community?

We're sadly a bit overwhelmed by maintenance and aren't able to keep our slow tests clean - if we put this pipeline in the community folder, you can be the official author and the maintenance burden on us is reduced. Would that be OK for you?

@tuanh123789
Contributor Author

tuanh123789 commented Aug 25, 2023

@patrickvonplaten
Thank you for responding. I understand everyone is busy, and it's okay to place this pipeline in the community folder. However, there's an issue: I'm not the author of this idea. I merely adapted the code and model weights from the original repository into a pipeline within Diffusers. Since this idea is based on the GLIGEN paper (https://arxiv.org/abs/2301.07093), similar to the approach in #4441, I believe it would be more accurate to put it in https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion. What are your thoughts on this?

@sayakpaul
Member

Which use cases of GLIGEN are we missing?

@tuanh123789
Contributor Author

tuanh123789 commented Aug 26, 2023

@sayakpaul
In the GLIGEN paper, three main method types are mentioned:

1. Box locations, a prompt, and phrases for the objects we want to appear inside the boxes.

2. Box locations, a prompt, phrases, and an image of the specific object we want to appear inside the box.

3. A canny map, semantic map, or depth map, plus a prompt (quite similar to ControlNet).

In PR Add GLIGEN implementation #4441, @nikhil-masterful implemented type 1 above. In this PR, I implement type 2. The advantage of this method is that you can place a real-world object or any image style into an image without needing methods like Textual Inversion, DreamBooth, or LoRA; you only need a reference image for each object or style. Additionally, you can introduce multiple objects into the image simultaneously and control where they appear by providing box locations. It's important to note that types 1 and 2 are separate pipelines, so PR Add GLIGEN implementation #4441 isn't missing anything. A minimal usage sketch is shown below.
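
A minimal sketch of the type 2 use case, assuming the call-argument names used in this PR's doc examples (gligen_phrases, gligen_images, gligen_boxes, gligen_scheduled_sampling_beta); the reference-image path, prompt, and box values are placeholders:

```python
import torch
from diffusers import StableDiffusionGLIGENTextImagePipeline
from diffusers.utils import load_image

pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
    "anhnct/Gligen_Text_Image", torch_dtype=torch.float16
).to("cuda")

prompt = "a flower sitting on the beach"
# Normalized xyxy box where the grounded object should appear
boxes = [[0.0, 0.09, 0.53, 0.76]]
# Any reference image of the object you want to insert (placeholder path)
reference_image = load_image("path/to/reference_flower.png")

image = pipe(
    prompt=prompt,
    gligen_phrases=["a flower"],
    gligen_images=[reference_image],
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
image.save("gligen_text_image_generation.png")
```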


@sayakpaul
Member

The advantage of this method is that you can place a real-world object or any image style into an image without needing methods like Textual Inversion, DreamBooth, or LoRA; you only need a reference image for each object or style. Additionally, you can introduce multiple objects into the image simultaneously and control where they appear by providing box locations.

This is very interesting. Could you provide a couple of samples so we can see this advantage in action? @patrickvonplaten considering this feature is a sort of zero-shot subject-driven generation, I'd be in favor of graduating this pipeline into the core.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 26, 2023

The documentation is not available anymore as the PR was closed or merged.

@tuanh123789
Contributor Author

The advantage of this method is that you can place a real-world object or any image style into an image without needing methods like Textual Inversion, DreamBooth, or LoRA; you only need a reference image for each object or style. Additionally, you can introduce multiple objects into the image simultaneously and control where they appear by providing box locations.

This is very interesting. Could you provide a couple of samples so we can see this advantage in action? @patrickvonplaten considering this feature is a sort of zero-shot subject-driven generation, I'd be in favor of graduating this pipeline into the core.

Of course, here are some samples with reference images: https://drive.google.com/drive/folders/1hIO-wvEoiTFWqCckdOgO7k0ilUniz9pQ?usp=sharing
Regarding reference styles, it seems I have missed implementing some components. I think it's not too difficult, and I'm quite confident I will complete it tomorrow.

@tuanh123789
Contributor Author

In my experience, using high-resolution reference images helps improve the quality of the generated images.

@patrickvonplaten
Contributor

So far downloads have not been going up that much, IMO:
https://huggingface.co/masterful/gligen-1-4-generation-text-box
(but the trend is upwards)

If you feel strongly about it, @sayakpaul, let's add it to the core (not 100% convinced at the moment, but happy to be proven wrong!)

@tuanh123789
Contributor Author

@sayakpaul
I have updated the samples of the pipeline in the directory https://drive.google.com/drive/folders/1hIO-wvEoiTFWqCckdOgO7k0ilUniz9pQ?usp=sharing. Additionally, I have implemented the missing features for style transfer and addressed the remaining errors. Could you please take another look? Thank you.

@sayakpaul
Member

Wow! The results indeed seem quite amazing here. Thanks for your contributions!


# 4. Fuser
- if attention_type == "gated":
+ if attention_type == "gated" or attention_type == "GatedTextImage":
Member

"gated_text_image" is a better name here IMO.

Contributor

Suggested change
- if attention_type == "gated" or attention_type == "GatedTextImage":
+ if attention_type == "gated" or attention_type == "gated-text-image":

Can we stick to lowercase here? :-)
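
For reference, a small sketch of how the lowercase value would surface in the UNet config, assuming attention_type stays a plain config string as introduced in #4441 (the tiny dimensions below are arbitrary, just to keep the model small):

```python
from diffusers import UNet2DConditionModel

# Tiny, arbitrary config just to illustrate the renamed attention type value.
unet = UNet2DConditionModel(
    sample_size=32,
    block_out_channels=(32, 64),
    down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D"),
    cross_attention_dim=32,
    attention_type="gated-text-image",
)
print(unet.config.attention_type)  # -> "gated-text-image"
```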


self.position_net = PositionNet(positive_len=positive_len, out_dim=cross_attention_dim)

elif attention_type == "GatedTextImage":
Member

Same as above.

>>> boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
>>> phrases = None
>>> gligen_image = load_image(
... "https://www.southernliving.com/thmb/6jANEFrMvwSWlRlxCDCzulxXQZY=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/2641101_Funfetti_Cake_702-2000-a2d8f835fd8f4a928fa17222e71241c3.jpg"
Member

Could we please use a shorter image URL here? Feel free to submit a PR to https://huggingface.co/datasets/huggingface/documentation-images/tree/main/diffusers and use the links from there.

>>> # Insert objects described by image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
... "anhnct/Gligen_Inpainting_Text_Image", torch_dtype=torch.float16
Member

These checkpoints might need regeneration if we modify the attention type value as suggested above.

Also, let's ensure the checkpoints uploaded have thorough model cards.

Member

Let's add some visual examples to the model cards as mentioned earlier :-)

@sayakpaul sayakpaul left a comment

Looks quite amazing to me.

@sayakpaul
Member

Also, we won't be able to merge until the CI is green.

@tuanh123789 tuanh123789 reopened this Aug 28, 2023
@tuanh123789
Contributor Author

@sayakpaul
To convert the official checkpoints to the Diffusers format, I've written a conversion script. I will add it to this PR along with a doc entry and a model card. I will let you know as soon as it's done. Thank you very much.

@tuanh123789
Contributor Author

Also, we won't be able to merge until the CI is green.

I'll fix this too

@nikhil-masterful
Contributor

"Inpainting conditioned on image" - This is great ! thanks for adding @tuanh123789
Results look great !

@sayakpaul I'm happy to review it if it's helpful.

@tuanh123789
Contributor Author

Hi @sayakpaul,
I have made the projection_matrix a part of the pipeline so that it can be loaded from a checkpoint using the from_pretrained method. Additionally, I have made the last changes to the encode_prompt() function, added brief docstrings, and included visual examples of the results in the model cards. Could you please review these? Thank you.
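
For context, a rough sketch of what such a ModelMixin-based projection wrapper might look like; the real class in this PR may differ in naming and defaults, and hidden_size=768 is only used here for illustration:

```python
import torch.nn as nn
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin


class CLIPImageProjection(ModelMixin, ConfigMixin):
    """Wraps the pretrained GLIGEN projection matrix so it can be saved and loaded
    with save_pretrained / from_pretrained like any other pipeline component."""

    @register_to_config
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.project = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x):
        return self.project(x)
```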


class PositionNet(nn.Module):
- def __init__(self, positive_len, out_dim, fourier_freqs=8):
+ def __init__(self, positive_len, out_dim, feature_type, fourier_freqs=8):
Contributor

Suggested change
- def __init__(self, positive_len, out_dim, feature_type, fourier_freqs=8):
+ def __init__(self, positive_len, out_dim, feature_type="text-only", fourier_freqs=8):

for backwards compatibility

return objs


class CLIPImageProjection(ModelMixin, ConfigMixin):
Contributor

Can we move this into a new clip_image_project_model.py file inside the stable_diffusion folder?

Contributor

We should not have model mixin classes in the src/diffusers/models/embeddings.py file

Contributor Author

Ok, I'll fix this

except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import * # noqa F403
else:
from .clip_image_project_model import CLIPImageProjection
Contributor

ok!

@patrickvonplaten patrickvonplaten left a comment

Looks good to me now! Ok to merge once @sayakpaul gives the 🟢 light!

Great job @tuanh123789 !

@tuanh123789
Contributor Author

Looks good to me now! Ok to merge once @sayakpaul gives the 🟢 light!

Great job @tuanh123789 !

Thank you for the support.

Comment on lines 590 to 619
- # embedding position (it may includes padding as placeholder)
- xyxy_embedding = self.fourier_embedder(boxes)  # B*N*4 -> B*N*C
-
- # learnable null embedding
- positive_null = self.null_positive_feature.view(1, 1, -1)
+ xyxy_embedding = self.fourier_embedder(boxes)
Member

Why are the comments going away?

Comment on lines 601 to 627
- objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
+ if positive_embeddings is not None:
+     positive_null = self.null_positive_feature.view(1, 1, -1)
+     positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null
+
+     objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
Member

Cool! I would still prefer to have the comments as this part of the code is a bit involved to navigate through.
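
To make the request concrete, here is a sketch of how the snippet above might read with the comments restored; it is a fragment of PositionNet.forward based on the hunks quoted in this thread, not the exact final code:

```python
# embedding position (it may include padding as a placeholder)
xyxy_embedding = self.fourier_embedder(boxes)  # B*N*4 -> B*N*C

if positive_embeddings is not None:
    # learnable null embedding used for padded (masked-out) phrases
    positive_null = self.null_positive_feature.view(1, 1, -1)

    # replace padding entries with the learnable null embedding
    positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null

    objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
```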

)

- if attention_type == "gated":
+ if attention_type in ["gated", "gated-text-image"]:
Member

Cool!

)
return image, has_nsfw_concept

def prepare_extra_step_kwargs(self, generator, eta):
Member

Is this not copied from another pipeline? (i.e., shouldn't it carry a # Copied from annotation?)

extra_step_kwargs["generator"] = generator
return extra_step_kwargs

def check_inputs(
Member

No # Copied from here either? If it doesn't apply, ignore the comment.


return out

def get_cross_attention_kwargs_without_grounded(self, hidden_size, repeat_batch, max_objs, device):
Member

This is a much better name! Thank you.

inputs = self.processor(images=[input], return_tensors="pt").to(device)
outputs = self.image_encoder(**inputs)
feature = outputs.image_embeds
feature = self.image_project(feature).squeeze(0)
Member

Sorry for iterating here a bit.

We do have a projection class for the CLIP vision tower in transformers: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModelWithProjection.

Is it possible to leverage the pre-trained projection matrix to populate the projection layer of that class and serialize it? That way we won't need a separate class for this.
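
A possible sketch of that suggestion, assuming the pretrained GLIGEN projection matrix has the same shape as CLIP ViT-L/14's visual_projection weight; the gligen_projection_matrix placeholder below stands in for weights extracted from the official checkpoint:

```python
import torch
from transformers import CLIPVisionModelWithProjection

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
)

# Placeholder for the pretrained GLIGEN projection matrix; in practice this would
# be read from the official checkpoint, with shape [projection_dim, hidden_size].
gligen_projection_matrix = torch.randn_like(image_encoder.visual_projection.weight)

# Copy the GLIGEN weights into the existing projection layer and serialize the
# whole image encoder, so no separate projection class is needed.
with torch.no_grad():
    image_encoder.visual_projection.weight.copy_(gligen_projection_matrix)

image_encoder.save_pretrained("gligen-image-encoder")
```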

@sayakpaul sayakpaul left a comment

Thanks a lot for working on this one. Just left a final round of comments.

@tuanh123789
Contributor Author

Hi @sayakpaul
Can you review the changes so we can merge? Thank you.

@sayakpaul sayakpaul merged commit 38466c3 into huggingface:main Sep 1, 2023
@sayakpaul
Member

Thanks for your amazing contributions ❤️

@tuanh123789
Contributor Author

Thanks for your amazing contributions ❤️

Thank you so much for the big support, @sayakpaul @patrickvonplaten. I could not have completed it without everyone's help. Extremely grateful!

@patrickvonplaten
Contributor

Great work @tuanh123789 !

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* Add GLIGEN Text Image implementation

* add style transfer from image

* fix check_repository_consistency

* add convert script GLIGEN model to Diffusers

* rename attention type

* fix style code

* remove PositionNetTextImage

* Revert "fix check_repository_consistency"

This reverts commit 15f098c.

* change attention type name

* update docs for GLIGEN

* change examples with hf-document-image

* fix style

* add CLIPImageProjection for GLIGEN

* Add new encode_prompt, load project matrix in pipe init

* move CLIPImageProjection to stable_diffusion

* add comment
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024