Support for Flux #323

Open
extremeheat opened this issue Aug 1, 2024 · 32 comments

Comments

@extremeheat

New diffusion model - https://blackforestlabs.ai/announcing-black-forest-labs/

Reference implementation: comfyanonymous/ComfyUI@1589b58

@65a

65a commented Aug 2, 2024

There's a reference diffusers config in a PR as well: https://huggingface.co/black-forest-labs/FLUX.1-dev/discussions/3/files

@stduhpf
Contributor

stduhpf commented Aug 2, 2024

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

@Green-Sky
Contributor

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

The "schnell" distillation seems like a good candidate though.

@diimdeep

diimdeep commented Aug 4, 2024

12B is massive; quants below f16 could become more popular here, like f8 or even q5.
(upd: corrected)

@extremeheat
Author

extremeheat commented Aug 4, 2024

No, 12B parameters at fp32 is 12B * 4 bytes = 48GB of memory, not including clip/t5/vae/etc.
fp16 ~= 12B * 2 bytes = 24GB
8-bit quant ~= 12B * 1 byte = 12GB
5-bit quant ~= 12B * 5/8 byte = 7.5GB
4-bit quant ~= 12B * 0.5 byte = 6GB
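As a rough sanity check, here is that arithmetic as a quick script (weights only, decimal GB; real quantized files carry some extra per-block scale data, so on-disk sizes will differ a bit):

```sh
# Back-of-envelope weight sizes for a 12B-parameter model: bytes = params * bits / 8.
# Weights only; clip/t5/vae and per-block quantization overhead are ignored.
for bits in 32 16 8 5 4; do
  awk -v b="$bits" 'BEGIN { printf "%2d-bit: %4.1f GB\n", b, 12e9 * b / 8 / 1e9 }'
done
```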

@red-scorp

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

Can SD.cpp partially offload a model to GPU? I was unable to do this. Can you give a hint on how to do this?

  • One more upvote for FLUX support.

@red-scorp

No, 12B parameters at fp32 is 12B * 4 bytes = 48GB of memory, not including clip/t5/vae/etc. fp16 ~= 12B * 2 bytes = 24GB, 8-bit quant ~= 12B * 1 byte = 12GB, 5-bit quant ~= 12B * 5/8 byte = 7.5GB, 4-bit quant ~= 12B * 0.5 byte = 6GB

Can image generation models be quantized down to 4-5 bits? I saw it done to LLMs with mixed results, but never saw it working with SD and Co.

@SkutteOleg
Contributor

SkutteOleg commented Aug 6, 2024

Can image generation models be quantized down to 4-5 bits? I saw it done to LLMs with mixed results, but never saw it working with SD and Co.

stable-diffusion.cpp can quantize models using the --type command line argument.
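For example, a quantization run looks roughly like this (a sketch only; flag names can vary between builds, so check ./sd --help):

```sh
# Sketch: convert/quantize a model to q8_0 with stable-diffusion.cpp's convert mode.
# The input file name is just an example; verify the exact flags with ./sd --help.
./sd -M convert -m flux1-schnell.safetensors -o flux1-schnell-q8_0.gguf --type q8_0
```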

@DGdev91

DGdev91 commented Aug 6, 2024

The "schnell" distillation seems like a good candidate though.

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

Different stuff. It's more like the LCM models for SD: it takes fewer steps to make a picture, with minor quality loss compared to the standard version.

12B is massive; quants below f16 could become more popular here, like f8 or even q5.

There's already an unofficial FP8 version which works in ComfyUI: https://huggingface.co/Kijai/flux-fp8

But I don't know if Comfy is able to do inference with more aggressive quantizations.

Anyway, +1 for Flux support, or for any new big image model.

As these models become bigger and bigger, running a quantized version could be the only way to run them on consumer-grade GPUs, and stable-diffusion.cpp could become a good solution.

@Amin456789

Indeed, Flux is amazing. After working with it I want to use it with cpp too, if possible.

@FSSRepo if I remember correctly you were working on q2 and the k variants; please work on it if you can, we could maybe quantize it down to something like 3-4 GB.
@leejet please add LoRA support for quantized models if possible; we really need it for Flux so we don't have to download and quantize new models all the time.

zhentaoyu mentioned this issue Aug 12, 2024
@teddybear082

This repo has q4 Flux schnell: https://huggingface.co/city96/FLUX.1-schnell-gguf/tree/main. Agreed, it would be great to get support; my computer is not even close to being able to run the non-quantized version.

@PierreSnell-Appox

PierreSnell-Appox commented Aug 19, 2024

There are quantized files (GGUF) already, claimed to have been made with stable-diffusion.cpp.
The smallest is only 4 GB but does not seem to work (gguf_init_from_file: tensor 'double_blocks.0.img_attn.norm.key_norm.scale' of type 10 (q2_K) number of elements (128) is not a multiple of block size (256)).
The Q4_0 model (second smallest) is not working either: get sd version from file failed: '../../Downloads/flux1-schnell-Q4_0.gguf'

https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main
https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main

However, the ability to have a small REST API like llama.cpp would be amazing for hosting this kind of model!

@DGdev91

DGdev91 commented Aug 20, 2024

Indeed, in the last weeks lots of developers have made attempts at using a quantized version of Flux in both ComfyUI and Stable Diffusion WebUI Forge.
Currently the most popular methods seem to be fp8 and nf4, but I've seen many experiments with GGUF too.

However, the ability to have a small REST API like llama.cpp would be amazing for hosting this kind of model!

Both ComfyUI and Stable Diffusion WebUI (and of course Forge too) already have an API, and apart from some plugins which try to integrate them with programs like Photoshop or Krita, I haven't seen many projects using them.

It could still be an interesting feature, but I don't think it should be the priority.
Also, it could be developed as a separate project.

@stduhpf
Contributor

stduhpf commented Aug 20, 2024

The GGUF experiments aren't using proper ggml though. They are just using GGUF as a way to compress the weights, and they are dequantizing on the fly during inference, which is very inefficient.

@robolamp

Hi all!

I am the author of this repo:
https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main
Right now I'm trying to add FLUX support in this forked repo: https://github.com/robolamp/stable-diffusion-flux.cpp and if it works, I'd like to try to merge it back into stable-diffusion.cpp.

Since SD3 is already supported and FLUX has a similar architecture (as far as I know, at least), I hope it won't be too complicated.

@leejet
Owner

leejet commented Aug 21, 2024

Flux support has been added. #356

@leejet
Owner

leejet commented Aug 21, 2024

Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer.

@MGTRIDER

MGTRIDER commented Aug 21, 2024

Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer.

Hi there, thanks for the Flux support. Has there been a noticeable difference in speed in your tests, compared with the compressed GGUF versions for other UIs?

@nonetrix

Nice! Finally able to run it. I don't have enough VRAM otherwise, so I really appreciate it. I imagine it will probably take forever to generate one image, but it's something at least.

@Green-Sky
Contributor

Green-Sky commented Aug 24, 2024

Support has been merged to master, so grab the latest release and give it a spin.
See docs/flux.md for how to run it.
Also check out https://huggingface.co/Green-Sky/flux.1-schnell-GGUF/ for some prequantized parts. I can recommend q8_0 as basically lossless for the unet as well as t5xxl. The f16 vae (ae) seems to be perceptually lossless too.
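A run with those parts looks roughly like this (a sketch only; docs/flux.md has the authoritative command, and the file names here are just examples for the prequantized parts above):

```sh
# Sketch of a flux1-schnell run, following docs/flux.md.
# File names are examples; verify flag names against docs/flux.md and ./sd --help.
./sd --diffusion-model flux1-schnell-q8_0.gguf \
     --vae ae-f16.gguf \
     --clip_l clip_l-q8_0.gguf \
     --t5xxl t5xxl-q8_0.gguf \
     -p "a lovely cat holding a sign that says flux.cpp" \
     --cfg-scale 1.0 --sampling-method euler --steps 4 -v
```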

@teddybear082

teddybear082 commented Aug 24, 2024

Is anyone else getting "error: unknown argument: --clip_1"? I'm using sd.exe for CUDA 12 and Windows x64.

EDIT: LOL, it's --clip_l (lowercase L, not a 1)

@teddybear082

This is awesome, thank you!!! Used Green-Sky's schnell q4_k, ae f16, clip_l q8, and t5xxl q4_k, with an RTX 3070.

(output image attached)

@teddybear082

Ok, sorry, this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? It seems like a lot of the steps don't pertain to processing the specific prompt, so I was wondering if there is a way to keep the models "in memory", so to speak, between prompts, so that after the first generation, other generations during the same session go faster because they aren't redoing all the steps from scratch each time. I don't know if I'm making any sense. But I'm comparing to, say, koboldcpp, where all the loading of a llama model happens up front and takes some extra time, and once it's loaded, all generations after that are pretty quick.

@stduhpf
Contributor

stduhpf commented Aug 24, 2024

Ok, sorry, this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? [...] is there a way to keep the models "in memory", so to speak, between prompts, so that generations after the first go faster?

I don't think so, sadly. You can do multiple renders with the same prompt by adding the -b [n] argument (replace [n] with the number of images you want). But if you want to use another prompt, you'd have to reload everything.
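For example (a sketch, reusing the example file names from the flux.md-style command above):

```sh
# Sketch: -b sets the batch count, so 4 images come out of a single model load.
./sd --diffusion-model flux1-schnell-q8_0.gguf --vae ae-f16.gguf \
     --clip_l clip_l-q8_0.gguf --t5xxl t5xxl-q8_0.gguf \
     -p "a lovely cat holding a sign that says flux.cpp" \
     --cfg-scale 1.0 --sampling-method euler --steps 4 -b 4
```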

@0cc4m

0cc4m commented Aug 25, 2024

Ok, sorry, this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? [...] is there a way to keep the models "in memory", so to speak, between prompts, so that generations after the first go faster?

koboldcpp is a user program built on top of the library llama.cpp. You are looking for a user program built on top of this library, stable-diffusion.cpp. That's not the point of this repo, but someone could build this (or maybe already has?).

@yggdrasil75

kobold.cpp is a user program built on top of this library.
It includes a Stable Diffusion UI; not as many features as ComfyUI or similar, but it's usable.
In addition, you can generate directly from the normal koboldcpp Lite main UI, or you can do it through most interfaces like SillyTavern.

@ghost

ghost commented Aug 28, 2024

I'm very sorry for my stupid question, but can someone explain why it's so slow when q8_0 or q4_k is used? About 18-19 sec per iteration, while the fp8 model in ComfyUI was giving about 6-7 sec/it on the same GPU (RX 7600 XT).
Edit: 1024x1024 resolution

@stduhpf
Contributor

stduhpf commented Aug 28, 2024

I'm very sorry for my stupid question, but can someone explain why it's so slow when q8_0 or q4_k is used? About 18-19 sec per iteration, while the fp8 model in ComfyUI was giving about 6-7 sec/it on the same GPU (RX 7600 XT). Edit: 1024x1024 resolution

#323 (comment)

Using stable-diffusion.cpp should be much faster than ComfyUI when it comes to GGUF.

@ghost

ghost commented Aug 28, 2024

I thought that comment was about Comfy+GGUF, which I didn't try; I tried Comfy with the fp8 model.

@stduhpf
Contributor

stduhpf commented Aug 28, 2024

Ah, I misunderstood what you meant. You're getting worse performance with stable-diffusion.cpp+GGUF compared to Comfy+fp8? Both using ROCm?

@ghost

ghost commented Aug 28, 2024

Yes, on the same GPU.

@stduhpf
Contributor

stduhpf commented Aug 28, 2024

Weird. For me it's faster, but I'm comparing Vulkan to DirectML, not ROCm to ROCm.
