Is it possible to run inference using multiple GPUs? #2977


Closed
DevJunghun opened this issue Apr 5, 2023 · 18 comments
Labels
stale Issues that haven't received updates

Comments

@DevJunghun

Hi, thanks for sharing this library for using Stable Diffusion.
There is one question I want to ask.

As the title says: is it possible to run inference using multiple GPUs? If so, how?
Is there any documentation about running inference on multiple GPUs?

OS: Linux Ubuntu 20.04
GPU: RTX 4090 (24GB) * n
RAM: 72GB
Python: 3.9.16

Assume I have two Stable Diffusion models (model 1, model 2),
e.g. GPU 1 uses model 1, GPU 2 uses model 2.

or

Assume I have two requests and I want to process both in parallel (prompt 1, prompt 2),
e.g. GPU 1 processes prompt 1, GPU 2 processes prompt 2.

I think this question could be solved by using threads and two pipelines, like below, right?

p_01 = StableDiffusionPipeline.from_pretrained(model_01).to("cuda:0")  
p_02 = StableDiffusionPipeline.from_pretrained(model_02).to("cuda:1")  

Thread(target=generate_pipe01, args=(prompt, negative_prompt)).start()  
Thread(target=generate_pipe02, args=(prompt, negative_prompt)).start()
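
In full, I imagine something roughly like this (the model paths, prompts and the generate helper are just placeholders):

#!/usr/bin/env python3
import torch
from threading import Thread

from diffusers import StableDiffusionPipeline

# load one pipeline per GPU (model paths are placeholders)
p_01 = StableDiffusionPipeline.from_pretrained("path/to/model_01", torch_dtype=torch.float16).to("cuda:0")
p_02 = StableDiffusionPipeline.from_pretrained("path/to/model_02", torch_dtype=torch.float16).to("cuda:1")


def generate(pipe, prompt, negative_prompt, out_path):
    # each thread drives its own pipeline on its own GPU
    image = pipe(prompt, negative_prompt=negative_prompt).images[0]
    image.save(out_path)


t_01 = Thread(target=generate, args=(p_01, "a dog", "blurry", "out_01.png"))
t_02 = Thread(target=generate, args=(p_02, "a cat", "blurry", "out_02.png"))
t_01.start()
t_02.start()
t_01.join()
t_02.join()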

I look forward to your suggestions. Thank you.

@patrickvonplaten
Contributor

Hey @DevJunghun,

here is what I would recommend:

1.) Create a Python file run_distributed.py that works in distributed mode. Note that we set world_size to 2 here, assuming that you want to run your code in parallel over 2 GPUs.

#!/usr/bin/env python3
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

from diffusers import DiffusionPipeline

sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)


def run_inference(rank, world_size):
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # move to rank
    sd.to(rank)

    if torch.distributed.get_rank() == 0:
        prompt = "a dog"
    elif torch.distributed.get_rank() == 1:
        prompt = "a cat"

    image = sd(prompt).images[0]
    image.save(f"./{prompt.replace(' ', '_')}.png")  # e.g. ./a_dog.png


def main():
    world_size = 2
    mp.spawn(
        run_inference,
        args=(world_size,),
        nprocs=world_size,
        join=True
    )


if __name__ == "__main__":
    main()
2.) Having defined the script, you can start it by just running:
torchrun run_distributed.py
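
(Side note: if you run the script with plain python instead of torchrun, you will likely need to set the rendezvous environment variables yourself, since init_process_group reads them from the environment by default, e.g.:)

export MASTER_ADDR=localhost
export MASTER_PORT=29500  # any free port
python run_distributed.py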

@patrickvonplaten
Contributor

Note that when using PyTorch's distributed data loaders you have much more control over what data goes to which GPU:
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
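
A minimal sketch of that idea, reusing sd, rank and world_size from the run_inference function above (the prompt list is just an illustration):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

prompts = ["a dog", "a cat", "a horse", "a bird"]  # illustrative dataset

# inside run_inference(rank, world_size), after init_process_group:
sampler = DistributedSampler(prompts, num_replicas=world_size, rank=rank, shuffle=False)
loader = DataLoader(prompts, batch_size=1, sampler=sampler)

for (prompt,) in loader:
    # each rank only sees its own shard of the prompt list
    image = sd(prompt).images[0]
    image.save(f"./{prompt.replace(' ', '_')}.png")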

@patrickvonplaten
Contributor

@sayakpaul @williamberman @pcuenca - it might be worth actually creating a quick doc page for this

@DevJunghun
Author

@patrickvonplaten Thanks for your kindness! Have a nice day :)

@sayakpaul sayakpaul reopened this Apr 7, 2023
@sayakpaul
Member

Reopening this issue to keep better track of the doc @patrickvonplaten mentioned in #2977 (comment).

@muellerzr do you have any recommendations for this? Anything in particular we need to know on the accelerate side to run distributed inference? Any relevant pointers would be very useful to ensure the doc we're putting together sheds light on the best practices :)

@muellerzr
Contributor

muellerzr commented Apr 7, 2023

@sayakpaul using accelerate launch removes the CLI specifics and process spawning that Patrick showed, and you can use PartialState for everything else @patrickvonplaten showed (such as the new PartialState().process_index, which is better suited for this) to specify which GPU something should run on. And instead of .to(rank) you can use state.device.

You can use AcceleratorState or Accelerator as well, but PartialState was designed for this more utility-focused approach.

So in full as code:

accelerate launch file.py
#!/usr/bin/env python3
import torch

from accelerate import PartialState
from diffusers import DiffusionPipeline

sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)


def main():
    # Initialize the distributed environment
    state = PartialState()

    # move to rank
    sd.to(state.device)

    if state.process_index == 0:
        prompt = "a dog"
    elif state.process_index == 1:
        prompt = "a cat"

    image = sd(prompt).images[0]
    image.save(f"./{prompt.replace(' ', '_')}.png")  # e.g. ./a_dog.png

if __name__ == "__main__":
    main()
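
If accelerate has not been configured on the machine yet (via accelerate config), I believe you can also pass the process count to the launcher explicitly, e.g.:

accelerate launch --num_processes 2 file.py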

@chikiuso

Is there any way I could run inference with multiple GPUs on one single image with a text prompt? Thanks.

@sayakpaul
Member

@chikiuso could you describe your use case?

@chikiuso

Hi @sayakpaul, I have 4 RTX 3090 GPUs installed on an Ubuntu server. I would like to run text-to-image inference for a prompt as fast as possible (not each GPU processing its own prompt), i.e. use all 4 GPUs to process one single image at a time. Is that possible? Thanks.

@sayakpaul
Member

Still not clear to me.

Are you trying to generate four images for a given (single in this case) prompt?

@chikiuso

Hi @sayakpaul, sorry for my bad English. I am trying to generate one single image with one single prompt at the same time. Thanks.

@sayakpaul
Member

No problem. I am just trying to understand better to get your issue resolved.

> I am trying to generate one single image with one single prompt at the same time. Thanks.

Then, doesn't #2977 (comment) work?

@pcuenca
Member

pcuenca commented Apr 13, 2023

> No problem. I am just trying to understand better to get your issue resolved.
>
> > I am trying to generate one single image with one single prompt at the same time. Thanks.
>
> Then, doesn't #2977 (comment) work?

@sayakpaul I think they want to generate a single image across 4 different GPUs. I don't think that's possible, as the process is iterative in nature.

@zetyquickly
Contributor

@pcuenca can one parallelize such an iterative process with an MPI-style method, splitting one image across several workers?

@pcuenca
Member

pcuenca commented Apr 25, 2023

@zetyquickly I don't know how to do it, unfortunately. Happy to listen to suggestions from developers in the community.

@Enderfga

Enderfga commented May 8, 2023

I need to use 4 GPUs to run inference while iterating over a dataloader, and then use the generated images for subsequent processing. In other words, I cannot use accelerate launch to call a single .py file. Do you have any suggestions on how to solve this? @muellerzr @patrickvonplaten

@github-actions
Contributor

github-actions bot commented Jun 1, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jun 1, 2023
@github-actions github-actions bot closed this as completed Jun 9, 2023
@laelhalawani

Stable Diffusion XL seems to be using something analogous to MoE. Maybe MoE-based architectures can be offloaded to multiple GPUs more effectively?
