
Is there a way to generate a single image using multiple GPUs? #11108


Closed
suzukimain opened this issue Mar 18, 2025 · 12 comments
Labels: stale (Issues that haven't received updates)

Comments

@suzukimain
Contributor

This is related to #2977 and #3392, but I would like to know how to generate a single image using multiple GPUs. If such a method does not exist, I would also like to know if Accelerate's Memory-efficient pipeline parallelism can be applied to this.

@asomoza
Member

asomoza commented Mar 18, 2025

Hi, there are multiple ways to interpret "generate a single image using multiple GPUs", so maybe you can be more specific. For example, the most basic way of doing this is splitting the different steps and models across separate GPUs: the text encoders and VAE on GPU 0 and the UNet/transformer model on GPU 1. You can do this manually without much trouble, or you can do it with Accelerate, which is also covered in the docs under device placement.

But I'm guessing you're referring to model sharding, which you can read about in the docs.

In the same section you can also read about Accelerate and parallelism.
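
For reference, a minimal sketch of what the automatic ("balanced") device placement looks like with diffusers, assuming Accelerate is installed and two CUDA GPUs are visible (the checkpoint name is just an example):

```python
# A minimal sketch, assuming diffusers + accelerate are installed and two CUDA GPUs
# are visible. device_map="balanced" lets Accelerate spread the pipeline's sub-models
# (text encoders, VAE, UNet) across the available GPUs.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example checkpoint
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```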

@suzukimain
Contributor Author


Hello,
Thank you for your response.
To add for the record, I would like to generate one image from one prompt faster by using several different GPUs, as in this comment.
For example, I want to generate an image faster using two different GPUs with the StableDiffusionPipeline.

@asomoza
Member

asomoza commented Mar 19, 2025

AFAIK you won't get faster inference times from multiple GPUs than from a single one; the only case where multiple GPUs give you faster inference is when the model won't fit on a single GPU.

Do you have a reference where this is true?

@a-r-r-o-w
Member

@suzukimain Pipeline parallelism will not be ideal for your use case. It is typically better for really large models when you want to generate more than one image faster than you could by generating them individually on multiple GPUs (as in data parallelism/sharding). It's also better suited to training than to inference.

What you're looking for, with single-image multi-GPU, is tensor and context parallelism. These methods allow you to significantly speed up generation. Two good starting points are xDiT and ParaAttention.

We have plans for natively supporting tensor parallelism soon. It's not very hard to implement yourself though, and PyTorch's DTensor API has a small learning curve -- you can give this a look: https://pytorch.org/tutorials/intermediate/TP_tutorial.html

Context parallelism can give you the fastest way to do single-image multi-GPU, but it is conceptually harder to understand. The simplest variant you could look at is Ring Attention. It involves cleverly splitting the attention query/key/value tensors across the sequence dimension, performing partial computations on each GPU, and combining the partials to get the attention output. Here's a write-up if you're interested: https://coconut-mode.com/posts/ring-attention/. If you'd like to just use something that works out of the box without diving into the theory too much, PyTorch has experimental support that you can look into.
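
As a rough illustration of the DTensor approach mentioned above, here is a hedged sketch of tensor parallelism on a toy feed-forward block (not a real diffusion model), following the ColwiseParallel/RowwiseParallel pattern from the PyTorch TP tutorial; launch it with torchrun across 2 GPUs:

```python
# Rough tensor-parallel sketch with PyTorch DTensor on a toy feed-forward block.
# Launch with: torchrun --nproc_per_node=2 tp_sketch.py
# This only illustrates the ColwiseParallel/RowwiseParallel pattern, not a full diffusion model.
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class FeedForward(nn.Module):
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (2,))  # one mesh dimension spanning 2 GPUs
model = FeedForward().cuda()

# Shard the first linear column-wise and the second row-wise so the hidden
# activation stays sharded and only one all-reduce is needed per block.
parallelize_module(model, mesh, {"up": ColwiseParallel(), "down": RowwiseParallel()})

out = model(torch.randn(2, 77, 1024, device="cuda"))  # replicated input, replicated output
```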

@asomoza
Member

asomoza commented Mar 19, 2025

Oh, I must add: xDiT and ParaAttention are for transformer models. I was fixated on Stable Diffusion 1.5 and UNets in my response because of the issues OP linked for context.

@a-r-r-o-w
Member

Thanks for clarifying! xDiT and ParaAttention support tensor/context parallelism, which can be used with any model that contains feed-forward and attention layers. So SD1.5 can work with them too (possibly with some modifications), even though it's a UNet architecture.

That said, it's a very small model, and the GPU communication overhead may outweigh the benefits of applying these techniques.

@asomoza
Member

asomoza commented Mar 19, 2025

That's mostly what I was thinking, hence my answer: for a small model I'm almost sure the GPU communication will be a lot slower than just using a single GPU, or, in the best-case scenario, the performance gain probably won't be enough to justify multiple GPUs for single-image inference.

Also, for ParaAttention I was mostly going by what I've read in a response from the author, and for xDiT by its name and the models it supports.

But as always, this would need testing. @suzukimain, if you decide to test this, please let us know if you succeed, or share your experience with these tools.

@a-r-r-o-w
Member

I just remembered another thing. Since most models use CFG to generate high-quality images, you can parallelize across the batch dimension (essentially data parallelism, but for a single image). This can roughly speed up generation by 30-50% (if the two GPUs you're using are the same). The downside is that it requires each GPU to fit the model entirely. Of course, you can apply clever offloading, but for SD1.5 this should work great even on low VRAM!

This can serve as an easy starting point: #10879 (though it's not going to be merged, since it's an example for a blog post about custom hooks).
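
Not the hook-based approach from #10879, but as a rough, hedged sketch of the idea: keep one copy of the UNet per GPU and run the conditional and unconditional CFG passes concurrently. The checkpoint name is just an example, and the surrounding denoising loop (which lives in the pipeline's __call__) is omitted:

```python
# Hedged sketch: run the conditional and unconditional CFG passes on different GPUs.
# unet_0/unet_1 and cfg_noise_pred are illustrative names, not diffusers API.
import copy
import torch
from diffusers import UNet2DConditionModel

unet_0 = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example checkpoint
    subfolder="unet",
    torch_dtype=torch.float16,
).to("cuda:0")
unet_1 = copy.deepcopy(unet_0).to("cuda:1")  # second copy, so each GPU holds the full model

@torch.no_grad()
def cfg_noise_pred(latents, t, cond_emb, uncond_emb, guidance_scale=7.5):
    # Kernel launches are asynchronous, so the two forward passes can overlap across GPUs.
    noise_cond = unet_0(latents.to("cuda:0"), t, encoder_hidden_states=cond_emb.to("cuda:0")).sample
    noise_uncond = unet_1(latents.to("cuda:1"), t, encoder_hidden_states=uncond_emb.to("cuda:1")).sample
    noise_uncond = noise_uncond.to("cuda:0")  # bring the unconditional prediction back
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

Whether this actually wins depends on the per-step cost of moving latents and predictions between GPUs, which ties into the communication-overhead caveat discussed above.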

@asomoza
Member

asomoza commented Mar 20, 2025

Also, I was thinking of consumer-grade infrastructure. If you are planning a commercial solution, the new NVLink switches that NVIDIA presented yesterday have more bandwidth than a 5090 (as an example), so communication lag between GPUs is not an issue if you have the money for it.

@Eamymao

Eamymao commented Apr 8, 2025

If you just want to accelerate image generation, I would like to recommend lyraDiff, a speedup tool for diffusers.
As someone who runs SD and FLUX models daily, I find this framework for speeding up image generation very impressive!

As described, it takes only half the time to generate a 1024x1024 image compared to the original diffusers, and the image quality is not lost due to the acceleration. Moreover, the code is very similar to diffusers and is easy to use.

Here is the GitHub link: https://github.com/TMElyralab/lyraDiff
With this tool, you may be able to achieve the acceleration you want on just one GPU!

@github-actions
Contributor

github-actions bot commented May 2, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label (Issues that haven't received updates) on May 2, 2025
@asomoza
Member

asomoza commented May 2, 2025

I'm converting this to a discussion since it's not really an issue, and it's an interesting topic that can be discussed further.

huggingface locked and limited conversation to collaborators on May 2, 2025
asomoza converted this issue into discussion #11483 on May 2, 2025

This issue was moved to a discussion.

You can continue the conversation there.
