
Is there a way to generate a single image using multiple GPUs? #11108


Closed
suzukimain opened this issue Mar 18, 2025 · 12 comments
Labels: stale (Issues that haven't received updates)

Comments

@suzukimain
Contributor

This is related to #2977 and #3392, but I would like to know how to generate a single image using multiple GPUs. If such a method does not exist, I would also like to know if Accelerate's Memory-efficient pipeline parallelism can be applied to this.

@asomoza
Member

asomoza commented Mar 18, 2025

Hi, there are multiple ways to interpret "generate a single image using multiple GPUs", so maybe you can be more specific. For example, the most basic way of doing this is splitting the different steps and models across separate GPUs: the text encoders and VAE on GPU 0 and the UNet/transformer model on GPU 1. You can do this manually without much trouble, or you can do it with Accelerate, which is also covered in the docs under device placement.

But I'm guessing you're referring to model sharding, which you can read about in the docs.

In the same section you can also read about Accelerate and parallelism.
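
For reference, a minimal sketch of what the automatic ("balanced") device placement looks like with diffusers, assuming Accelerate is installed and two CUDA GPUs are visible (the checkpoint name is just an example):

```python
# A minimal sketch, assuming diffusers + accelerate are installed and two CUDA GPUs
# are visible. device_map="balanced" lets Accelerate spread the pipeline's sub-models
# (text encoders, VAE, UNet) across the available GPUs.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example checkpoint
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```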

@suzukimain
Contributor Author


Hello,
Thank you for your response.
To add for the record, I would like to generate one image from one prompt faster by using several different GPUs, as in this comment.
For example, I want to generate an image faster using two different GPUs with the StableDiffusionPipeline.

@asomoza
Member

asomoza commented Mar 19, 2025

AFAIK you won't get faster inference times from multiple GPUs than from a single one; the only case where multiple GPUs give you faster inference is when the model won't fit on a single GPU.

Do you have a reference where this is true?

@a-r-r-o-w
Member

@suzukimain Pipeline parallelism will not be ideal for your use case. It is typically better for really large models when you want to generate more than one image faster than you could by generating them individually on multiple GPUs (as in data parallelism/sharding). It's also better suited to training than to inference.

What you're looking for, with single-image multi-GPU, is tensor and context parallelism. These methods allow you to significantly speed up generation. Two good starting points are xDiT and ParaAttention.

We have plans for natively supporting tensor parallelism soon. It's not very hard to implement yourself though, and PyTorch's DTensor API has a small learning curve -- you can give this a look: https://pytorch.org/tutorials/intermediate/TP_tutorial.html

Context parallelism can give you the fastest way to do single-image multi-GPU, but it is conceptually harder to understand. The simplest variant you could look at is Ring Attention. It involves cleverly splitting the attention query/key/value tensors across the sequence dimension, performing partial computations on each GPU, and combining the partials to get the attention output. Here's a write-up if you're interested: https://coconut-mode.com/posts/ring-attention/. If you'd like to just use something that works out of the box without diving into the theory too much, PyTorch has experimental support that you can look into.
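
As a rough illustration of the DTensor approach mentioned above, here is a hedged sketch of tensor parallelism on a toy feed-forward block (not a real diffusion model), following the ColwiseParallel/RowwiseParallel pattern from the PyTorch TP tutorial; launch it with torchrun across 2 GPUs:

```python
# Rough tensor-parallel sketch with PyTorch DTensor on a toy feed-forward block.
# Launch with: torchrun --nproc_per_node=2 tp_sketch.py
# This only illustrates the ColwiseParallel/RowwiseParallel pattern, not a full diffusion model.
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class FeedForward(nn.Module):
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (2,))  # one mesh dimension spanning 2 GPUs
model = FeedForward().cuda()

# Shard the first linear column-wise and the second row-wise so the hidden
# activation stays sharded and only one all-reduce is needed per block.
parallelize_module(model, mesh, {"up": ColwiseParallel(), "down": RowwiseParallel()})

out = model(torch.randn(2, 77, 1024, device="cuda"))  # replicated input, replicated output
```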

@asomoza
Member

asomoza commented Mar 19, 2025

Oh, I must add: xDiT and ParaAttention are for transformer models. I was fixated on Stable Diffusion 1.5 and UNets in my response because of the issues OP linked for context.

@a-r-r-o-w
Member

Thanks for clarifying! xDiT and ParaAttention support tensor/context parallelism, which can be used with any model that contains feed-forward and attention layers. So SD1.5 can work with them too (possibly with some modifications), even though it's a UNet architecture.

That said, it's a very small model, and the GPU communication overhead may outweigh the benefits of applying these techniques.

@asomoza
Member

asomoza commented Mar 19, 2025

That's mostly what I was thinking, hence my answer: for a small model I'm almost sure the GPU communication will be a lot slower than just using a single GPU, or, in the best-case scenario, the performance gain probably won't be enough to justify multiple GPUs for single-image inference.

Also, for ParaAttention I was mostly going by what I've read in a response from the author, and for xDiT by its name and the models it supports.

But as always, this would need testing. @suzukimain, if you decide to test this, please let us know if you succeed, or share your experience with these tools.

@a-r-r-o-w
Member

I just remembered another thing. Since most models use CFG to generate high-quality images, you can parallelize across the batch dimension (essentially data parallelism, but for a single image). This can roughly speed up generation by 30-50% (if the two GPUs you're using are the same). The downside is that it requires each GPU to fit the model entirely. Of course, you can apply clever offloading, but for SD1.5 this should work great even on low VRAM!

This can serve as an easy starting point: #10879 (though it's not going to be merged, since it's an example for a blog post about custom hooks).
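
Not the hook-based approach from #10879, but as a rough, hedged sketch of the idea: keep one copy of the UNet per GPU and run the conditional and unconditional CFG passes concurrently. The checkpoint name is just an example, and the surrounding denoising loop (which lives in the pipeline's __call__) is omitted:

```python
# Hedged sketch: run the conditional and unconditional CFG passes on different GPUs.
# unet_0/unet_1 and cfg_noise_pred are illustrative names, not diffusers API.
import copy
import torch
from diffusers import UNet2DConditionModel

unet_0 = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example checkpoint
    subfolder="unet",
    torch_dtype=torch.float16,
).to("cuda:0")
unet_1 = copy.deepcopy(unet_0).to("cuda:1")  # second copy, so each GPU holds the full model

@torch.no_grad()
def cfg_noise_pred(latents, t, cond_emb, uncond_emb, guidance_scale=7.5):
    # Kernel launches are asynchronous, so the two forward passes can overlap across GPUs.
    noise_cond = unet_0(latents.to("cuda:0"), t, encoder_hidden_states=cond_emb.to("cuda:0")).sample
    noise_uncond = unet_1(latents.to("cuda:1"), t, encoder_hidden_states=uncond_emb.to("cuda:1")).sample
    noise_uncond = noise_uncond.to("cuda:0")  # bring the unconditional prediction back
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

Whether this actually wins depends on the per-step cost of moving latents and predictions between GPUs, which ties into the communication-overhead caveat discussed above.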

@asomoza
Member

asomoza commented Mar 20, 2025

Also, I was thinking of consumer-grade infrastructure. If you are planning a commercial solution, the new NVLink switches that NVIDIA presented yesterday have more bandwidth than a 5090 (as an example), so communication lag between GPUs is not an issue if you have the money for it.

@Eamymao

Eamymao commented Apr 8, 2025

If you just want to accelerate image generation, I would like to recommend lyraDiff, a speedup tool for diffusers.
As someone who runs SD and FLUX models daily, I find this framework for speeding up image generation very impressive!

As described, it takes only half the time to generate a 1024x1024 image compared to the original diffusers, and the image quality is not lost due to the acceleration. Moreover, the code is very similar to diffusers and is easy to use.

Here is the GitHub link: https://github.com/TMElyralab/lyraDiff
With this tool, you may be able to achieve the acceleration you want on just one GPU!

@github-actions
Contributor

github-actions bot commented May 2, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label (Issues that haven't received updates) on May 2, 2025
@asomoza
Member

asomoza commented May 2, 2025

I'm converting this to a discussion since it's not really an issue, and it's an interesting topic that can be discussed further.

huggingface locked and limited conversation to collaborators on May 2, 2025
asomoza converted this issue into discussion #11483 on May 2, 2025

This issue was moved to a discussion.

You can continue the conversation there.
