Add GLIGEN Text Image implementation #4777
Conversation
@patrickvonplaten @sayakpaul @stevhliu

Model checkpoint available on Hugging Face.

Can you check, @nikhil-masterful?

I'm a little under the weather right now, but will take a look over the weekend.
Hey @tuanh123789, thanks a lot for your PR! Can we maybe put this pipeline in the community folder: https://github.com/huggingface/diffusers/tree/main/examples/community? We're sadly a bit overwhelmed by maintenance and aren't able to keep our slow tests clean. If we put this pipeline in the community folder, you can be the official author and the burden of maintaining the pipeline is loosened. Would that be ok for you?

@patrickvonplaten

Which use cases of GLIGEN are we missing?

@sayakpaul
This is very interesting. Could you provide a couple of samples for us to see and appreciate this advantage? @patrickvonplaten, considering this feature offers a sort of zero-shot subject-driven generation, I'd be in favor of graduating this pipeline into the core.

The documentation is not available anymore as the PR was closed or merged.

Of course, here are some samples with reference images: https://drive.google.com/drive/folders/1hIO-wvEoiTFWqCckdOgO7k0ilUniz9pQ?usp=sharing

In my experience, using high-resolution reference images helps improve the quality of the generated images.

So far downloads have not been going up that much IMO: if you feel strongly about it @sayakpaul, let's add it to core (not 100% convinced at the moment, but happy to be proven wrong!)

@sayakpaul

Wow! The results indeed seem quite amazing here. Thanks for your contributions!
src/diffusers/models/attention.py
Outdated
# 4. Fuser
-if attention_type == "gated":
+if attention_type == "gated" or attention_type == "GatedTextImage":
"gated_text_image" is a better name here IMO.
-if attention_type == "gated" or attention_type == "GatedTextImage":
+if attention_type == "gated" or attention_type == "gated-text-image":
Can we stick to lowercase here? :-)
self.position_net = PositionNet(positive_len=positive_len, out_dim=cross_attention_dim)

elif attention_type == "GatedTextImage":
Same as above.
>>> boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
>>> phrases = None
>>> gligen_image = load_image(
...     "https://www.southernliving.com/thmb/6jANEFrMvwSWlRlxCDCzulxXQZY=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/2641101_Funfetti_Cake_702-2000-a2d8f835fd8f4a928fa17222e71241c3.jpg"
Could we please use a shorter image URL here? Feel free to submit a PR to https://huggingface.co/datasets/huggingface/documentation-images/tree/main/diffusers and use the links from there.
>>> # Insert objects described by image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Inpainting_Text_Image", torch_dtype=torch.float16
These checkpoints might need regeneration if we modify the attention type value as suggested above.
Also, let's ensure the checkpoints uploaded have thorough model cards.
Let's add some visual examples to the model cards as mentioned earlier :-)
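For context, a minimal end-to-end sketch of the inpainting variant discussed in this thread. The checkpoint name, bounding box, and cake image URL come from the docstring fragments above; the input-image URL and the remaining argument names (e.g. gligen_inpaint_image, gligen_scheduled_sampling_beta) are assumptions about the final API, not the merged docstring:

    import torch
    from diffusers import StableDiffusionGLIGENTextImagePipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
        "anhnct/Gligen_Inpainting_Text_Image", torch_dtype=torch.float16
    ).to("cuda")

    # Scene to inpaint (hypothetical URL on the documentation-images dataset
    # suggested above) and a reference image describing the object to insert.
    input_image = load_image(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
    )
    gligen_image = load_image(
        "https://www.southernliving.com/thmb/6jANEFrMvwSWlRlxCDCzulxXQZY=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/2641101_Funfetti_Cake_702-2000-a2d8f835fd8f4a928fa17222e71241c3.jpg"
    )

    image = pipe(
        prompt="a birthday cake",
        gligen_phrases=None,               # grounded on the reference image, not on text
        gligen_images=[gligen_image],
        gligen_boxes=[[0.2676, 0.6088, 0.4773, 0.7183]],
        gligen_inpaint_image=input_image,  # only the boxed region is regenerated
        gligen_scheduled_sampling_beta=1.0,
        num_inference_steps=50,
    ).images[0]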
sayakpaul left a comment:
Looks quite amazing to me.
- How did you convert the checkpoints to make them compatible with diffusers? If you had to perform any conversion, then the conversion script should also be in this PR.
- Let's add a doc entry to https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/gligen so that our users are aware of these capabilities.
Also, we won't be able to merge until and unless the CI is green.

@sayakpaul

I'll fix this too

"Inpainting conditioned on image" - this is great! Thanks for adding it, @tuanh123789. @sayakpaul I'm happy to review it if it's helpful.

Hi @sayakpaul,
src/diffusers/models/embeddings.py
Outdated
class PositionNet(nn.Module):
-    def __init__(self, positive_len, out_dim, fourier_freqs=8):
+    def __init__(self, positive_len, out_dim, feature_type, fourier_freqs=8):
-def __init__(self, positive_len, out_dim, feature_type, fourier_freqs=8):
+def __init__(self, positive_len, out_dim, feature_type="text-only", fourier_freqs=8):
for backwards compatibility
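With the default in place, call sites that predate the new argument keep working unchanged. A quick illustration (the constructor values and the "text-image" feature-type string are assumptions for the sake of the example):

    # Old call site: still valid, since feature_type now defaults to "text-only".
    position_net = PositionNet(positive_len=768, out_dim=768)

    # New text+image variant opts in explicitly (value string assumed).
    position_net_ti = PositionNet(positive_len=768, out_dim=768, feature_type="text-image")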
src/diffusers/models/embeddings.py
Outdated
    return objs


class CLIPImageProjection(ModelMixin, ConfigMixin):
Can we move this into a new clip_image_project_model.py file inside the stable_diffusion folder?
We should not have model mixin classes in the src/diffusers/models/embeddings.py file
Ok, I'll fix this
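For reference, such a standalone module can stay quite small. A sketch of what the new clip_image_project_model.py could contain, assuming the projection is a single linear layer over the CLIP image embedding (the hidden size and exact signature in the merged file may differ):

    import torch.nn as nn

    from ...configuration_utils import ConfigMixin, register_to_config
    from ...models.modeling_utils import ModelMixin


    class CLIPImageProjection(ModelMixin, ConfigMixin):
        @register_to_config
        def __init__(self, hidden_size: int = 768):
            super().__init__()
            # single linear map from the CLIP image embedding space
            self.project = nn.Linear(hidden_size, hidden_size, bias=False)

        def forward(self, x):
            return self.project(x)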
except OptionalDependencyNotAvailable:
    from ...utils.dummy_torch_and_transformers_objects import *  # noqa F403
else:
    from .clip_image_project_model import CLIPImageProjection
ok!
patrickvonplaten left a comment:
Looks good to me now! Ok to merge once @sayakpaul gives the 🟢 light!
Great job @tuanh123789 !
Thank you for the support.
src/diffusers/models/embeddings.py
Outdated
-# embedding position (it may includes padding as placeholder)
-xyxy_embedding = self.fourier_embedder(boxes)  # B*N*4 -> B*N*C
-
-# learnable null embedding
-positive_null = self.null_positive_feature.view(1, 1, -1)
+xyxy_embedding = self.fourier_embedder(boxes)
Why are the comments going away?
-objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
+if positive_embeddings is not None:
+    positive_null = self.null_positive_feature.view(1, 1, -1)
+    positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null
+
+    objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
Cool! I would still prefer to have the comments as this part of the code is a bit involved to navigate through.
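One way to address this: keep the inline comments next to the new branching logic. A sketch of the text branch, assuming the attribute names visible in the diffs above (null_position_feature is inferred by analogy with null_positive_feature):

    # embedding position (it may include padding as a placeholder)
    xyxy_embedding = self.fourier_embedder(boxes)  # B*N*4 -> B*N*C

    # learnable null embedding
    xyxy_null = self.null_position_feature.view(1, 1, -1)

    # replace padding with the learnable null embedding
    xyxy_embedding = xyxy_embedding * masks + (1 - masks) * xyxy_null

    # position net with text-only information
    if positive_embeddings is not None:
        # learnable null embedding
        positive_null = self.null_positive_feature.view(1, 1, -1)

        # replace padding with the learnable null embedding
        positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null

        objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))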
)

-if attention_type == "gated":
+if attention_type in ["gated", "gated-text-image"]:
Cool!
    )
    return image, has_nsfw_concept

def prepare_extra_step_kwargs(self, generator, eta):
Is it not copied from?
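(For context: diffusers marks methods duplicated from another pipeline with a # Copied from comment so that make fix-copies keeps them in sync. If this method is identical to the StableDiffusionPipeline one, it could carry the marker; a sketch of how that usually looks:)

    import inspect  # normally at the top of the pipeline module

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
    def prepare_extra_step_kwargs(self, generator, eta):
        # prepare extra kwargs for the scheduler step, since not all schedulers share a signature;
        # eta is only used with the DDIMScheduler and is ignored by other schedulers
        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check whether the scheduler accepts a generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs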
    extra_step_kwargs["generator"] = generator
    return extra_step_kwargs

def check_inputs(
No copy? If not, ignore the comment.
    return out

def get_cross_attention_kwargs_without_grounded(self, hidden_size, repeat_batch, max_objs, device):
Much better name this one! Thank you.
inputs = self.processor(images=[input], return_tensors="pt").to(device)
outputs = self.image_encoder(**inputs)
feature = outputs.image_embeds
feature = self.image_project(feature).squeeze(0)
Sorry for iterating here a bit.
We do have a projection class for the CLIP vision tower in transformers: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModelWithProjection.
Is it possible to leverage the pre-trained projection matrix, populate the projection layer of that class with it, and serialize that? That way we won't need a separate class for it.
@patrickvonplaten WDYT?
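(For concreteness, a sketch of the suggested alternative. The transformers calls are the standard API; the encoder checkpoint is illustrative, and loading GLIGEN's pre-trained projection matrix into visual_projection is the step that would need verifying:)

    from diffusers.utils import load_image
    from transformers import CLIPProcessor, CLIPVisionModelWithProjection

    # CLIPVisionModelWithProjection already ends in a `visual_projection` layer;
    # GLIGEN's pre-trained projection matrix could be loaded into that layer and
    # the result serialized, removing the need for a separate CLIPImageProjection.
    image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    image = load_image(
        "https://www.southernliving.com/thmb/6jANEFrMvwSWlRlxCDCzulxXQZY=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/2641101_Funfetti_Cake_702-2000-a2d8f835fd8f4a928fa17222e71241c3.jpg"
    )
    inputs = processor(images=[image], return_tensors="pt")
    outputs = image_encoder(**inputs)
    feature = outputs.image_embeds  # already projected; no extra self.image_project step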
sayakpaul left a comment:
Thanks a lot for working on this one. Just left a final round of comments.
Hi @sayakpaul

Thanks for your amazing contributions ❤️

Thank you so much for the big support @sayakpaul @patrickvonplaten. Without everyone's help I could not have completed it. Extremely grateful!

Great work @tuanh123789!
* Add GLIGEN Text Image implementation
* add style transfer from image
* fix check_repository_consistency
* add convert script GLIGEN model to Diffusers
* rename attention type
* fix style code
* remove PositionNetTextImage
* Revert "fix check_repository_consistency" (this reverts commit 15f098c)
* change attention type name
* update docs for GLIGEN
* change examples with hf-document-image
* fix style
* add CLIPImageProjection for GLIGEN
* Add new encode_prompt, load project matrix in pipe init
* move CLIPImageProjection to stable_diffusion
* add comment

What does this PR do?
In #4441, there is a mention of the GLIGEN model, but I noticed that it doesn't include all the models referenced in the paper. In this Pull Request, I propose adding two additional models: Generation and Inpainting, with inputs including Text, Box, and particularly Image. With these models, users can input any object without needing to use Textual Inversion, DreamBooth, or LoRA.
Both checkpoints are converted from the official weights.
Fixes # (issue)
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.