
How do we use multiple GPUs to generate a single image? #3392

Closed
JemiloII opened this issue May 11, 2023 · 13 comments
Labels
stale Issues that haven't received updates

Comments

@JemiloII

I am trying to use multiple GPUs to generate a single image, not batch images in parallel like #2977.
I want my images to generate fast and not be bottlenecked by memory constraints when I generate larger images or attempt inpainting/outpainting. I tried to use DeepSpeed; however, WSL2 is insanely slow and DeepSpeed just doesn't install the inference module I wanted to use to achieve multi-GPU.

(Note: I can reinstall whatever is needed; I've uninstalled and reinstalled so many versions trying to get DeepSpeed working that I gave up at this point.)

Current Environment:

  • Windows 10
  • Python 3.10
  • CUDA Toolkit 12.1
  • Torch 2.1.0
  • 2x NVIDIA RTX 4090
  • AMD Ryzen 9 5950X
  • 128 GB RAM
  • 2x 4 TB NVMe SN850X drives
@patrickvonplaten
Contributor

Why would one need multiple GPUs to generate 1 image?

@JemiloII
Author

To generate larger images that would otherwise run into memory constraints, to speed up generation of those larger images, to use float64 at a reasonable speed, and to speed up inpainting, outpainting, and the use of additional LoRAs or ControlNet. I'm focused on a single image, not a batch.

@alexisrolland
Contributor

Actually, I'm also interested in knowing how to run diffusers on multiple GPUs. Right now the Stable Diffusion x4 Upscaler is quite memory intensive and does not run on a single 12 GB VRAM GPU for images larger than 512x512.

@JemiloII
Author

@patrickvonplaten, any ideas on how we could achieve this?

@patrickvonplaten
Contributor

You can manually move components to different GPUs if you want, e.g.:
text encoder -> gpu 0
unet -> gpu 1
vae -> gpu 0

But overall, with an RTX 4090 you normally won't be bottlenecked by GPU memory.
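
For illustration, a rough sketch of what that manual placement could look like (hypothetical; a stock SD 1.5 checkpoint stands in for whatever model you actually use):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )

    # Manual placement: heavy submodules split across two GPUs.
    pipe.text_encoder.to("cuda:0")
    pipe.vae.to("cuda:0")
    pipe.unet.to("cuda:1")

    # Caveat: calling pipe(...) assumes a single device, so with a split like this
    # you would have to run the stages yourself and move embeddings/latents
    # between cuda:0 and cuda:1 by hand.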

@JemiloII
Author

JemiloII commented Jun 30, 2023

@patrickvonplaten, well, that's the thing: I will be at some point. Having multiple requests come in and generate images at the same time will start to stack up. Or do you think it won't? I don't want to batch images. I want 20+ different people to be able to hit my instance and generate images without having to wait for the previous image to complete.

So, loading the application twice on a single GPU eats 16 GB of VRAM, whereas loading it once eats 8 GB.

Oddly, though, @patrickvonplaten (and maybe I'm loading my instance onto the other GPU incorrectly), when I use the second GPU by calling pipeline.to('cuda:1'), it consumes half the VRAM, only 4 GB, but then takes a little more than twice as long. It makes me wonder whether my normal loading path is using both GPUs to render, since I have the sequential setting enabled, and if so, how to make it prefer my second GPU so its memory gets used while both GPUs are still in play. The fact that it only uses 4 GB of VRAM makes me think it loaded the other 4 GB onto my CPU.

So I just confirmed 4-5 GB is being loaded into CPU/system RAM. This makes things a bit interesting, since my system RAM isn't running at its rated speed yet: it's at 3600 MHz when it can go up to 5600 MHz. I wonder if that would cut the time down to something more reasonable. I'm not opposed to using some system RAM if it means more instances can spawn, but I still feel a better solution is having one instance load itself on each GPU and allowing multiple images to be generated at once, like having multiple instances loaded on the same GPUs (roughly the setup sketched below). It generates images at the same speed this way as if it were only generating one.
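
A sketch of what I mean by "one instance per GPU", using the same AOM3 folder as in my settings further down and placeholder prompts:

    import threading
    import torch
    from diffusers import StableDiffusionPipeline

    # One independent pipeline per GPU; each GPU keeps its own copy of the weights.
    pipes = {
        device: StableDiffusionPipeline.from_pretrained("AOM3", torch_dtype=torch.float16).to(device)
        for device in ("cuda:0", "cuda:1")
    }

    def generate(device, prompt):
        # Runs entirely on its own GPU, so two requests can generate concurrently.
        return pipes[device](prompt).images[0]

    t0 = threading.Thread(target=generate, args=("cuda:0", "prompt for user A"))
    t1 = threading.Thread(target=generate, args=("cuda:1", "prompt for user B"))
    t0.start(); t1.start(); t0.join(); t1.join()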

@JemiloII
Author

JemiloII commented Jun 30, 2023

Okay, an update: I think there was some kind of memory leak on GPU 0. I restarted my setup and both GPUs (one instance each) only consume 3 GB each at load, go up to roughly 5.3-6 GB while generating, and then idle at 4 GB each.

@JemiloII
Author

# Testing
So here is a high-level breakdown of my setup.

7950X3D, 128 GB RAM, 2x 4 TB NVMe at 7100 MB/s (one dedicated to AI), 2x 24 GB RTX 4090s

## Configuration 1
Uses 1 GPU, 1 instance. Allows 1 generation at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 3-4 second generation
  • device = 'cuda'
  • max generations at a time: 1

## Configuration 2
Uses 2 GPUs, 2 instances. Allows 2 generations at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 3-4 second generation
  • device = 'cuda'
  • max generations at a time: 1

GPU 1:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 3-4 second generation
  • device = 'cuda:1'
  • max generations at a time: 1

## Configuration 3
Uses 1 GPU, 2 instances. Allows 2 generations at a time.

GPU 0:

  • 6 GB load
  • 11-12 GB generating
  • 8 GB idle after generation
  • 8 second generation
  • device = 'cuda'
  • max generations at a time: 2

## Configuration 4
Uses 2 GPUs, 4 instances. Allows 4 generations at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 6-9 second generation
  • device = 'cuda'
  • max generations at a time: 2

GPU 1:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 19-21 second generation
  • device = 'cuda:1'
  • max generations at a time: 2

## Configuration 5
Uses 2 GPUs, 4 instances. Allows 4 generations at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 6-9 second generation
  • device = 'cuda:0'
  • max generations at a time: 2

GPU 1:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 19-21 second generation
  • device = 'cuda:1'
  • max generations at a time: 2

## Notes
After running configuration 5, I'm going to have to check my setup and make sure GPU 1 is in the right PCIe lane. I believe it is, but seeing this 4090 run so much slower as cuda:1 is concerning. Running multiple instances per GPU does slow generation down, which is a bit concerning but probably to be expected. The amount of VRAM being used isn't ideal either, but that's also expected since I'm loading the model with each instance. Again, it would be nice not to have to load the model more than once per GPU.

## Additional Test Information
These are the settings used for each test run unless noted otherwise:

  • Model is a folder-based version of Abyss Orange Mix 3
  • Textual inversions are .pt pickle files
  • LoRAs are safetensors

import torch
from diffusers import StableDiffusionPipeline

with torch.inference_mode(mode=True):
    with torch.no_grad():  # redundant inside inference_mode, but kept as tested
        pipeline = StableDiffusionPipeline.from_pretrained(...)

Settings:

{
  "image": {
    "seed": 3844651158489267,
    "prompt": "2d, masterpiece, absurdres, best quality, anime, highly detailed, detailed eyes, detailed face, detailed background, perfect lighting, 1girl, solo, full body, (orange hair:1.2), orange pigtails, orange eyes, thigh gap, (beige vest:1.2), vest buttons, (red skirt:1.2), box skirt, red neck bow, blushing, small shadow, shy, breast, open smile, ambient light, hand on hips, looking up, symmetrical arms, (red hairbows:1.3), short sleeves, highres, high quality, beautiful, school rooftop, orange sunset, landscape",
    "negative_prompt": "<easynegative>, <bad_prompt>, <badhandv4>, (low quality, worst quality:1.4), 3d, realistic, photorealistic, (signature:2), (loli, child, teen, baby face), bad anatomy, bad hands, mutated hands and fingers, bad feet, bad face, anatomical nonsense, lowres, monochrome, monotone, greyscale, doujinshi, simple background, zombie, futanari, femboy, furry, animal, peeing, pee, scat, fat, censored, artist name, artist logo, watermark, different color thighhighs, extra hands, deformed, deformed hands, hands, drool, name tag, lipstick, nsfw, sex, blur, blurry",
    "width": 512,
    "height": 768,
    "guidance_scale": 8,
    "num_inference_steps": 35,
    "clip_skip": 2,
    "pil_filters": {
      "blur": false,
      "sharpen": true,
      "smooth": false,
      "smooth_more": false
    }
  },
  "textual_inversions": ["easynegative", "bad_prompt", "badhandv4"],
  "scheduler": {
    "name": "euler_a"
    "beta_start": 0.001775,
    "beta_end": 0.01,
    "beta_schedule": "linear",
    "num_train_timesteps": 935,
    "prediction_type": "epsilon"
  },
  "pipeline": {
    "pretrained_model_name_or_path": "AOM3",
    "torch_dtype": "float16",
    "device": "cuda",
    "device_map": "not used, will crash instance",
    "use_safetensors": true,
    "safety_checker": null,
    "requires_safety_checker": false,
    "local_files_only": true,
    "force_download": false
  },
  // Loaded with Safetensors using kohya_lora_loader by @takuma104 
  "loras": [
    {
      "display_name": "Detail Tweaker LoRA",
      "file_name": "add_detail",
      "weight": 0.05
    },
    {
      "display_name": "Sumiyao StyleG Lora",
      "file_name": "Sumiyao_StyleG",
      "weight": 0.25
    },
    {
      "display_name": "CUTE Flat Color + Lineart LORA",
      "file_name": "cutelineart",
      "weight": 0.25
    },
    {
      "display_name": "Thicker Lines Anime Style LoRA",
      "file_name": "thickline_fp16",
      "weight": 0.3
    },
    {
      "display_name": "beautiful detailed eyes, v1.0",
      "file_name": "beautiful_detailed_eyes",
      "weight": 0.5
    },
    {
      "display_name": "Squeezer LoRA",
      "file_name": "Squeezer2",
      "weight": -0.25
    },
    {
      "display_name": "School rooftop v0.1",
      "file_name": "school_rooftop_v0.1",
      "weight": 0.5
    }
  ]
}
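
For context, my assumption of how the scheduler block above maps onto diffusers; a sketch only, attached to an already-constructed pipeline:

    from diffusers import EulerAncestralDiscreteScheduler

    # "euler_a" with the beta settings from the JSON above ("pipeline" assumed to exist).
    pipeline.scheduler = EulerAncestralDiscreteScheduler(
        num_train_timesteps=935,
        beta_start=0.001775,
        beta_end=0.01,
        beta_schedule="linear",
        prediction_type="epsilon",
    )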

@JemiloII
Author

JemiloII commented Jun 30, 2023

So the issue with GPU 1 being slow is my motherboard; I'll have to get a new one to fix it. Apparently GPU 0 is in a PCIe 5.0 slot running at PCIe 4.0 (expected), while GPU 1 is in a PCIe 3.0 slot (wasn't expecting that).

@elcolie

elcolie commented Jul 3, 2023

You can manually move components to different GPUs if you want, e.g.: text encoder -> gpu 0, unet -> gpu 1, vae -> gpu 0

But overall with a RTX4090 you won't be bottlenecked by GPU memory normally

@patrickvonplaten Suppose I have this snippet.

    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # control_model_name and device are defined elsewhere in my script
    controlnet = ControlNetModel.from_pretrained(f"lllyasviel/{control_model_name}").to(device)

    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "../flat2DAnimerge",
        safety_checker=None,
        controlnet=controlnet,
        local_files_only=True,
        low_cpu_mem_usage=False
        # cache_dir="./flat2DAnimerge"
    ).to(device)

I need to add .to(device) to the vae, text_encoder, tokenizer, unet, controlnet, scheduler. Am I correct?

        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            controlnet=controlnet,
            scheduler=scheduler,
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
        )

@patrickvonplaten
Contributor

Yes. We should maybe see if we can build something cool with https://huggingface.co/docs/accelerate/usage_guides/big_modeling going forward. For now, if you want to run different components on different devices, they need to be placed manually.
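
To make that concrete for the snippet above, a rough sketch: only the components that are torch.nn.Modules carry weights, so only they need (or have) a .to(); the tokenizer and scheduler are plain Python objects.

    # Hypothetical split across two GPUs for the ControlNet pipeline above.
    pipe.vae.to("cuda:0")
    pipe.text_encoder.to("cuda:0")
    pipe.unet.to("cuda:1")
    pipe.controlnet.to("cuda:1")
    # pipe.to(device) already does this for you when everything stays on one device;
    # with a real split, the intermediate tensors must be moved between devices
    # manually inside the denoising loop.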

@JemiloII
Author

JemiloII commented Jul 5, 2023

I don't really want to move components around, @patrickvonplaten. I want them to utilize the same resources. Ideally, I'd load a pipeline only once on a single GPU and have all GPUs use that pipeline and work in tandem to create an image. In the case of sharding, I'd still ideally like a single GPU to load the single pipeline; then every GPU uses that pipeline to generate images, each with its own prompt, staying resident and awaiting the next prompt request rather than batching. Each time a prompt is run, it runs in isolation, so it doesn't affect any other running process and can be cleaned up after execution while still leaving the pipeline in memory. Batching is not useful for this; it's better to have the pipeline ready to accept prompts as they arrive rather than bunching them together for a single run. I really want a DRY approach here, because right now I'm just spinning up the same instance, i.e. the entire pipeline, multiple times.
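
What I'm after is essentially this serving pattern, sketched roughly below (the weights still load once per GPU, but each pipeline stays resident and picks up prompts as they arrive instead of batching):

    import queue
    import threading
    import torch
    from diffusers import StableDiffusionPipeline

    requests = queue.Queue()  # (prompt, callback) pairs arriving from clients

    def gpu_worker(device):
        # Load the pipeline once per GPU and keep it resident in VRAM.
        pipe = StableDiffusionPipeline.from_pretrained("AOM3", torch_dtype=torch.float16).to(device)
        while True:
            prompt, callback = requests.get()   # wait for the next request
            image = pipe(prompt).images[0]      # each prompt runs in isolation, no batching
            callback(image)                     # return the result; the pipeline stays loaded

    for device in ("cuda:0", "cuda:1"):
        threading.Thread(target=gpu_worker, args=(device,), daemon=True).start()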

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Jul 30, 2023
github-actions bot closed this as completed on Aug 8, 2023