
How do we use multiple GPUs to generate a single image? #3392

Closed
JemiloII opened this issue May 11, 2023 · 13 comments
Labels
stale Issues that haven't received updates

Comments

@JemiloII

I am trying to use multiple GPUs to generate a single image, not batch images in parallel like #2977.
I want my images to generate fast and not be bottlenecked by memory constraints when I generate larger images or attempt inpainting/outpainting. I tried to use DeepSpeed; however, WSL2 is insanely slow and DeepSpeed just doesn't install the inference module I wanted to use to achieve multi-GPU.

(Note: I can reinstall whatever is needed; I've uninstalled and reinstalled so many versions trying to get DeepSpeed working that I gave up at this point.)

Current Environment:

  • Windows 10
  • Python 3.10
  • CUDA Toolkit 12.1
  • Torch 2.1.0
  • 2x NVIDIA RTX 4090
  • AMD Ryzen 9 5950X
  • 128 GB RAM
  • 2x 4 TB NVMe SN850X drives
@patrickvonplaten
Contributor

Why would one need multiple GPUs to generate 1 image?

@JemiloII
Author

To generate larger images that would otherwise run into memory constraints, to speed up generation of those larger images, to use float64 at a reasonable speed, and to speed up inpainting, outpainting, and the use of additional LoRAs or ControlNet. I'm focused on a single image, not a batch.

@alexisrolland
Contributor

Actually, I'm also interested in knowing how to run diffusers on multiple GPUs. Right now the Stable Diffusion x4 Upscaler is quite memory intensive and does not run on a single 12 GB VRAM GPU for images larger than 512x512.

@JemiloII
Author

@patrickvonplaten, any ideas on how we could achieve this?

@patrickvonplaten
Contributor

You can manually move components to different GPUs if you want, e.g.:
text encoder -> gpu 0
unet -> gpu 1
vae -> gpu 0

But overall, with an RTX 4090 you normally won't be bottlenecked by GPU memory.
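
For illustration, a rough sketch of what that manual placement could look like (hypothetical; a stock SD 1.5 checkpoint stands in for whatever model you actually use):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )

    # Manual placement: heavy submodules split across two GPUs.
    pipe.text_encoder.to("cuda:0")
    pipe.vae.to("cuda:0")
    pipe.unet.to("cuda:1")

    # Caveat: calling pipe(...) assumes a single device, so with a split like this
    # you would have to run the stages yourself and move embeddings/latents
    # between cuda:0 and cuda:1 by hand.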

@JemiloII
Author

JemiloII commented Jun 30, 2023

@patrickvonplaten, well, that's the thing: I will be at some point. Having multiple requests come in and generate images at the same time will start to stack up. Or do you think it won't? I don't want to batch images. I want 20+ different people to be able to hit my instance and generate images without having to wait for the previous image to complete.

So, loading the application twice on a single GPU eats 16 GB of VRAM, whereas loading it once eats 8 GB.

Oddly, though, @patrickvonplaten (and maybe I'm loading my instance onto the other GPU incorrectly), when I use the second GPU by calling pipeline.to('cuda:1'), it consumes half the VRAM, only 4 GB, but then takes a little more than twice as long. It makes me wonder whether my normal loading path is using both GPUs to render, since I have the sequential setting enabled, and if so, how to make it prefer my second GPU so its memory gets used while both GPUs are still in play. The fact that it only uses 4 GB of VRAM makes me think it loaded the other 4 GB onto my CPU.

So I just confirmed 4-5 GB is being loaded into CPU/system RAM. This makes things a bit interesting, since my system RAM isn't running at its rated speed yet: it's at 3600 MHz when it can go up to 5600 MHz. I wonder if that would cut the time down to something more reasonable. I'm not opposed to using some system RAM if it means more instances can spawn, but I still feel a better solution is having one instance load itself on each GPU and allowing multiple images to be generated at once, like having multiple instances loaded on the same GPUs (roughly the setup sketched below). It generates images at the same speed this way as if it were only generating one.
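
A sketch of what I mean by "one instance per GPU", using the same AOM3 folder as in my settings further down and placeholder prompts:

    import threading
    import torch
    from diffusers import StableDiffusionPipeline

    # One independent pipeline per GPU; each GPU keeps its own copy of the weights.
    pipes = {
        device: StableDiffusionPipeline.from_pretrained("AOM3", torch_dtype=torch.float16).to(device)
        for device in ("cuda:0", "cuda:1")
    }

    def generate(device, prompt):
        # Runs entirely on its own GPU, so two requests can generate concurrently.
        return pipes[device](prompt).images[0]

    t0 = threading.Thread(target=generate, args=("cuda:0", "prompt for user A"))
    t1 = threading.Thread(target=generate, args=("cuda:1", "prompt for user B"))
    t0.start(); t1.start(); t0.join(); t1.join()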

@JemiloII
Author

JemiloII commented Jun 30, 2023

Okay, an update: I think there was some kind of memory leak on GPU 0. I restarted my setup and both GPUs (one instance each) only consume 3 GB each at load, go up to roughly 5.3-6 GB while generating, and then idle at 4 GB each.

@JemiloII
Author

# Testing
So here is a high-level breakdown of my setup.

7950X3D, 128 GB RAM, 2x 4 TB NVMe at 7100 MB/s (one dedicated to AI), 2x 24 GB RTX 4090s

## Configuration 1
Uses 1 GPU, 1 instance. Allows 1 generation at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 3-4 second generation
  • device = 'cuda'
  • max generations at a time: 1

## Configuration 2
Uses 2 GPUs, 2 instances. Allows 2 generations at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 3-4 second generation
  • device = 'cuda'
  • max generations at a time: 1

GPU 1:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 3-4 second generation
  • device = 'cuda:1'
  • max generations at a time: 1

## Configuration 3
Uses 1 GPU, 2 instances. Allows 2 generations at a time.

GPU 0:

  • 6 GB load
  • 11-12 GB generating
  • 8 GB idle after generation
  • 8 second generation
  • device = 'cuda'
  • max generations at a time: 2

## Configuration 4
Uses 2 GPUs, 4 instances. Allows 4 generations at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 6-9 second generation
  • device = 'cuda'
  • max generations at a time: 2

GPU 1:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 19-21 second generation
  • device = 'cuda:1'
  • max generations at a time: 2

## Configuration 5
Uses 2 GPUs, 4 instances. Allows 4 generations at a time.

GPU 0:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 6-9 second generation
  • device = 'cuda:0'
  • max generations at a time: 2

GPU 1:

  • 3 GB load
  • 5-6 GB generating
  • 4 GB idle after generation
  • 19-21 second generation
  • device = 'cuda:1'
  • max generations at a time: 2

## Notes
After running configuration 5, I'm going to have to check my setup and make sure GPU 1 is in the right PCIe lane. I believe it is, but seeing this 4090 run so much slower as cuda:1 is concerning. Running multiple instances per GPU does slow generation down, which is a bit concerning but probably to be expected. The amount of VRAM being used isn't ideal either, but that's also expected since I'm loading the model with each instance. Again, it would be nice not to have to load the model more than once per GPU.

## Additional Test Information
These are the settings used for each test run unless noted otherwise:

  • Model is a folder-based version of Abyss Orange Mix 3
  • Textual inversions are .pt pickle files
  • LoRAs are safetensors

import torch
from diffusers import StableDiffusionPipeline

with torch.inference_mode(mode=True):
    with torch.no_grad():  # redundant inside inference_mode, but kept as tested
        pipeline = StableDiffusionPipeline.from_pretrained(...)

Settings:

{
  "image": {
    "seed": 3844651158489267,
    "prompt": "2d, masterpiece, absurdres, best quality, anime, highly detailed, detailed eyes, detailed face, detailed background, perfect lighting, 1girl, solo, full body, (orange hair:1.2), orange pigtails, orange eyes, thigh gap, (beige vest:1.2), vest buttons, (red skirt:1.2), box skirt, red neck bow, blushing, small shadow, shy, breast, open smile, ambient light, hand on hips, looking up, symmetrical arms, (red hairbows:1.3), short sleeves, highres, high quality, beautiful, school rooftop, orange sunset, landscape",
    "negative_prompt": "<easynegative>, <bad_prompt>, <badhandv4>, (low quality, worst quality:1.4), 3d, realistic, photorealistic, (signature:2), (loli, child, teen, baby face), bad anatomy, bad hands, mutated hands and fingers, bad feet, bad face, anatomical nonsense, lowres, monochrome, monotone, greyscale, doujinshi, simple background, zombie, futanari, femboy, furry, animal, peeing, pee, scat, fat, censored, artist name, artist logo, watermark, different color thighhighs, extra hands, deformed, deformed hands, hands, drool, name tag, lipstick, nsfw, sex, blur, blurry",
    "width": 512,
    "height": 768,
    "guidance_scale": 8,
    "num_inference_steps": 35,
    "clip_skip": 2,
    "pil_filters": {
      "blur": false,
      "sharpen": true,
      "smooth": false,
      "smooth_more": false
    }
  },
  "textual_inversions": ["easynegative", "bad_prompt", "badhandv4"],
  "scheduler": {
    "name": "euler_a"
    "beta_start": 0.001775,
    "beta_end": 0.01,
    "beta_schedule": "linear",
    "num_train_timesteps": 935,
    "prediction_type": "epsilon"
  },
  "pipeline": {
    "pretrained_model_name_or_path": "AOM3",
    "torch_dtype": "float16",
    "device": "cuda",
    "device_map": "not used, will crash instance",
    "use_safetensors": true,
    "safety_checker": null,
    "requires_safety_checker": false,
    "local_files_only": true,
    "force_download": false
  },
  // Loaded with Safetensors using kohya_lora_loader by @takuma104 
  "loras": [
    {
      "display_name": "Detail Tweaker LoRA",
      "file_name": "add_detail",
      "weight": 0.05
    },
    {
      "display_name": "Sumiyao StyleG Lora",
      "file_name": "Sumiyao_StyleG",
      "weight": 0.25
    },
    {
      "display_name": "CUTE Flat Color + Lineart LORA",
      "file_name": "cutelineart",
      "weight": 0.25
    },
    {
      "display_name": "Thicker Lines Anime Style LoRA",
      "file_name": "thickline_fp16",
      "weight": 0.3
    },
    {
      "display_name": "beautiful detailed eyes, v1.0",
      "file_name": "beautiful_detailed_eyes",
      "weight": 0.5
    },
    {
      "display_name": "Squeezer LoRA",
      "file_name": "Squeezer2",
      "weight": -0.25
    },
    {
      "display_name": "School rooftop v0.1",
      "file_name": "school_rooftop_v0.1",
      "weight": 0.5
    }
  ]
}
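
For context, my assumption of how the scheduler block above maps onto diffusers; a sketch only, attached to an already-constructed pipeline:

    from diffusers import EulerAncestralDiscreteScheduler

    # "euler_a" with the beta settings from the JSON above ("pipeline" assumed to exist).
    pipeline.scheduler = EulerAncestralDiscreteScheduler(
        num_train_timesteps=935,
        beta_start=0.001775,
        beta_end=0.01,
        beta_schedule="linear",
        prediction_type="epsilon",
    )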

@JemiloII
Author

JemiloII commented Jun 30, 2023

So the issue with GPU 1 being slow is my motherboard; I'll have to get a new one to fix it. Apparently GPU 0 is in a PCIe 5.0 slot running at PCIe 4.0 (expected), while GPU 1 is in a PCIe 3.0 slot (wasn't expecting that).

@elcolie

elcolie commented Jul 3, 2023

You can manually move components to different GPUs if you want, e.g.: text encoder -> gpu 0, unet -> gpu 1, vae -> gpu 0

But overall with a RTX4090 you won't be bottlenecked by GPU memory normally

@patrickvonplaten Suppose I have this snippet.

    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # control_model_name and device are defined elsewhere in my script
    controlnet = ControlNetModel.from_pretrained(f"lllyasviel/{control_model_name}").to(device)

    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "../flat2DAnimerge",
        safety_checker=None,
        controlnet=controlnet,
        local_files_only=True,
        low_cpu_mem_usage=False
        # cache_dir="./flat2DAnimerge"
    ).to(device)

I need to add .to(device) to the vae, text_encoder, tokenizer, unet, controlnet, scheduler. Am I correct?

        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            controlnet=controlnet,
            scheduler=scheduler,
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
        )

@patrickvonplaten
Contributor

Yes. We should maybe see if we can build something cool with https://huggingface.co/docs/accelerate/usage_guides/big_modeling going forward. For now, if you want to run different components on different devices, they need to be placed manually.
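
To make that concrete for the snippet above, a rough sketch: only the components that are torch.nn.Modules carry weights, so only they need (or have) a .to(); the tokenizer and scheduler are plain Python objects.

    # Hypothetical split across two GPUs for the ControlNet pipeline above.
    pipe.vae.to("cuda:0")
    pipe.text_encoder.to("cuda:0")
    pipe.unet.to("cuda:1")
    pipe.controlnet.to("cuda:1")
    # pipe.to(device) already does this for you when everything stays on one device;
    # with a real split, the intermediate tensors must be moved between devices
    # manually inside the denoising loop.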

@JemiloII
Author

JemiloII commented Jul 5, 2023

I don't really want to move components around, @patrickvonplaten. I want them to utilize the same resources. Ideally, I'd load a pipeline only once on a single GPU and have all GPUs use that pipeline and work in tandem to create an image. In the case of sharding, I'd still ideally like a single GPU to load the single pipeline; then every GPU uses that pipeline to generate images, each with its own prompt, staying resident and awaiting the next prompt request rather than batching. Each time a prompt is run, it runs in isolation, so it doesn't affect any other running process and can be cleaned up after execution while still leaving the pipeline in memory. Batching is not useful for this; it's better to have the pipeline ready to accept prompts as they arrive rather than bunching them together for a single run. I really want a DRY approach here, because right now I'm just spinning up the same instance, i.e. the entire pipeline, multiple times.
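
What I'm after is essentially this serving pattern, sketched roughly below (the weights still load once per GPU, but each pipeline stays resident and picks up prompts as they arrive instead of batching):

    import queue
    import threading
    import torch
    from diffusers import StableDiffusionPipeline

    requests = queue.Queue()  # (prompt, callback) pairs arriving from clients

    def gpu_worker(device):
        # Load the pipeline once per GPU and keep it resident in VRAM.
        pipe = StableDiffusionPipeline.from_pretrained("AOM3", torch_dtype=torch.float16).to(device)
        while True:
            prompt, callback = requests.get()   # wait for the next request
            image = pipe(prompt).images[0]      # each prompt runs in isolation, no batching
            callback(image)                     # return the result; the pipeline stays loaded

    for device in ("cuda:0", "cuda:1"):
        threading.Thread(target=gpu_worker, args=(device,), daemon=True).start()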

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Jul 30, 2023
github-actions bot closed this as completed on Aug 8, 2023