Support for Flux #323

Open
extremeheat opened this issue Aug 1, 2024 · 32 comments

Comments

@extremeheat

New diffusion model - https://blackforestlabs.ai/announcing-black-forest-labs/

Reference implementation: comfyanonymous/ComfyUI@1589b58

@65a

65a commented Aug 2, 2024

There's a reference diffusers config in a PR as well: https://huggingface.co/black-forest-labs/FLUX.1-dev/discussions/3/files

@stduhpf
Contributor

stduhpf commented Aug 2, 2024

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

@Green-Sky
Contributor

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

The "schnell" distillation seems like a good candidate though.

@diimdeep

diimdeep commented Aug 4, 2024

12B is massive; quants below f16 could become more popular here, like f8 or even q5.
(upd: corrected)

@extremeheat
Author

extremeheat commented Aug 4, 2024

No, 12B parameters at fp32 is 12B * 4 bytes = 48GB of memory, not including clip/t5/vae/etc.
fp16 ~= 12B * 2 bytes = 24GB
8-bit quant ~= 12B * 1 byte = 12GB
5-bit quant ~= 12B * 5/8 byte = 7.5GB
4-bit quant ~= 12B * 0.5 byte = 6GB
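As a rough sanity check, here is that arithmetic as a quick script (weights only, decimal GB; real quantized files carry some extra per-block scale data, so on-disk sizes will differ a bit):

```sh
# Back-of-envelope weight sizes for a 12B-parameter model: bytes = params * bits / 8.
# Weights only; clip/t5/vae and per-block quantization overhead are ignored.
for bits in 32 16 8 5 4; do
  awk -v b="$bits" 'BEGIN { printf "%2d-bit: %4.1f GB\n", b, 12e9 * b / 8 / 1e9 }'
done
```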

@red-scorp

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

Can SD.cpp partially offload a model to GPU? I was unable to do this. Can you give a hint on how to do this?

  • One more upvote for FLUX support.

@red-scorp

No, 12B parameters at fp32 is 12B * 4 bytes = 48GB of memory, not including clip/t5/vae/etc. fp16 ~= 12B * 2 bytes = 24GB, 8-bit quant ~= 12B * 1 byte = 12GB, 5-bit quant ~= 12B * 5/8 byte = 7.5GB, 4-bit quant ~= 12B * 0.5 byte = 6GB

Can image generation models be quantized down to 4-5 bits? I saw it done to LLMs with mixed results, but never saw it working with SD and Co.

@SkutteOleg
Contributor

SkutteOleg commented Aug 6, 2024

Can image generation models be quantized down to 4-5 bits? I saw it done to LLMs with mixed results, but never saw it working with SD and Co.

stable-diffusion.cpp can quantize models using the --type command line argument.
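For example, a quantization run looks roughly like this (a sketch only; flag names can vary between builds, so check ./sd --help):

```sh
# Sketch: convert/quantize a model to q8_0 with stable-diffusion.cpp's convert mode.
# The input file name is just an example; verify the exact flags with ./sd --help.
./sd -M convert -m flux1-schnell.safetensors -o flux1-schnell-q8_0.gguf --type q8_0
```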

@DGdev91

DGdev91 commented Aug 6, 2024

The "schnell" distillation seems like a good candidate though.

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

Different stuff. It's more like the LCM models for SD: it takes fewer steps to make a picture, with minor quality loss compared to the standard version.

12B is massive; quants below f16 could become more popular here, like f8 or even q5.

There's already an unofficial FP8 version which works in ComfyUI: https://huggingface.co/Kijai/flux-fp8

But I don't know if Comfy is able to do inference with more aggressive quantizations.

Anyway, +1 for Flux support, or for any new big image model.

As these models become bigger and bigger, running a quantized version could be the only way to run them on consumer-grade GPUs, and stable-diffusion.cpp could become a good solution.

@Amin456789

Indeed, Flux is amazing. After working with it I want to use it with cpp too, if possible.

@FSSRepo if I remember correctly you were working on q2 and the k variants; please work on it if you can, we could maybe quantize it down to something like 3-4 GB.
@leejet please add LoRA support for quantized models if possible; we really need it for Flux so we don't have to download and quantize new models all the time.

zhentaoyu mentioned this issue Aug 12, 2024
@teddybear082

This repo has q4 Flux schnell: https://huggingface.co/city96/FLUX.1-schnell-gguf/tree/main. Agreed, it would be great to get support; my computer is not even close to being able to run the non-quantized version.

@PierreSnell-Appox

PierreSnell-Appox commented Aug 19, 2024

There are quantized files (GGUF) already, claimed to have been made with stable-diffusion.cpp.
The smallest is only 4 GB but does not seem to work (gguf_init_from_file: tensor 'double_blocks.0.img_attn.norm.key_norm.scale' of type 10 (q2_K) number of elements (128) is not a multiple of block size (256)).
The Q4_0 model (second smallest) is not working either: get sd version from file failed: '../../Downloads/flux1-schnell-Q4_0.gguf'

https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main
https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main

However, the ability to have a small REST API like llama.cpp would be amazing for hosting this kind of model!

@DGdev91

DGdev91 commented Aug 20, 2024

Indeed, in the last weeks lots of developers have made attempts at using a quantized version of Flux in both ComfyUI and Stable Diffusion WebUI Forge.
Currently the most popular methods seem to be fp8 and nf4, but I've seen many experiments with GGUF too.

However, the ability to have a small REST API like llama.cpp would be amazing for hosting this kind of model!

Both ComfyUI and Stable Diffusion WebUI (and of course Forge too) already have an API, and apart from some plugins which try to integrate them with programs like Photoshop or Krita, I haven't seen many projects using them.

It could still be an interesting feature, but I don't think it should be the priority.
Also, it could be developed as a separate project.

@stduhpf
Contributor

stduhpf commented Aug 20, 2024

The GGUF experiments aren't using proper ggml though. They are just using GGUF as a way to compress the weights, and they are dequantizing on the fly during inference, which is very inefficient.

@robolamp

Hi all!

I am the author of this repo:
https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main
Right now I'm trying to add FLUX support in this forked repo: https://github.com/robolamp/stable-diffusion-flux.cpp and if it works, I'd like to try to merge it back into stable-diffusion.cpp.

Since SD3 is already supported and FLUX has a similar architecture (as far as I know, at least), I hope it won't be too complicated.

@leejet
Owner

leejet commented Aug 21, 2024

Flux support has been added. #356

@leejet
Owner

leejet commented Aug 21, 2024

Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer.

@MGTRIDER

MGTRIDER commented Aug 21, 2024

Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer.

Hi there, thanks for the Flux support. Has there been a noticeable difference in speed in your tests, compared with the compressed GGUF versions for other UIs?

@nonetrix

Nice! Finally able to run it. I don't have enough VRAM otherwise, so I really appreciate it. I imagine it will probably take forever to generate one image, but it's something at least.

@Green-Sky
Contributor

Green-Sky commented Aug 24, 2024

Support has been merged to master, so grab the latest release and give it a spin.
See docs/flux.md for how to run it.
Also check out https://huggingface.co/Green-Sky/flux.1-schnell-GGUF/ for some prequantized parts. I can recommend q8_0 as basically lossless for the unet as well as t5xxl. The f16 vae (ae) seems to be perceptually lossless too.
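A run with those parts looks roughly like this (a sketch only; docs/flux.md has the authoritative command, and the file names here are just examples for the prequantized parts above):

```sh
# Sketch of a flux1-schnell run, following docs/flux.md.
# File names are examples; verify flag names against docs/flux.md and ./sd --help.
./sd --diffusion-model flux1-schnell-q8_0.gguf \
     --vae ae-f16.gguf \
     --clip_l clip_l-q8_0.gguf \
     --t5xxl t5xxl-q8_0.gguf \
     -p "a lovely cat holding a sign that says flux.cpp" \
     --cfg-scale 1.0 --sampling-method euler --steps 4 -v
```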

@teddybear082

teddybear082 commented Aug 24, 2024

Is anyone else getting "error: unknown argument: --clip_1"? I'm using sd.exe for CUDA 12 and Windows x64.

EDIT: LOL, it's --clip_l (lowercase L, not a 1)

@teddybear082

This is awesome, thank you!!! Used Green-Sky's schnell q4_k, ae f16, clip_l q8, and t5xxl q4_k, with an RTX 3070.

(output image attached)

@teddybear082

Ok, sorry, this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? It seems like a lot of the steps don't pertain to processing the specific prompt, so I was wondering if there is a way to keep the models "in memory", so to speak, between prompts, so that after the first generation, other generations during the same session go faster because they aren't redoing all the steps from scratch each time. I don't know if I'm making any sense. But I'm comparing to, say, koboldcpp, where all the loading of a llama model happens up front and takes some extra time, and once it's loaded, all generations after that are pretty quick.

@stduhpf
Contributor

stduhpf commented Aug 24, 2024

Ok, sorry, this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? [...] is there a way to keep the models "in memory", so to speak, between prompts, so that generations after the first go faster?

I don't think so, sadly. You can do multiple renders with the same prompt by adding the -b [n] argument (replace [n] with the number of images you want). But if you want to use another prompt, you'd have to reload everything.
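For example (a sketch, reusing the example file names from the flux.md-style command above):

```sh
# Sketch: -b sets the batch count, so 4 images come out of a single model load.
./sd --diffusion-model flux1-schnell-q8_0.gguf --vae ae-f16.gguf \
     --clip_l clip_l-q8_0.gguf --t5xxl t5xxl-q8_0.gguf \
     -p "a lovely cat holding a sign that says flux.cpp" \
     --cfg-scale 1.0 --sampling-method euler --steps 4 -b 4
```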

@0cc4m

0cc4m commented Aug 25, 2024

Ok, sorry, this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? [...] is there a way to keep the models "in memory", so to speak, between prompts, so that generations after the first go faster?

koboldcpp is a user program built on top of the library llama.cpp. You are looking for a user program built on top of this library, stable-diffusion.cpp. That's not the point of this repo, but someone could build this (or maybe already has?).

@yggdrasil75

kobold.cpp is a user program built on top of this library.
It includes a Stable Diffusion UI; not as many features as ComfyUI or similar, but it's usable.
In addition, you can generate directly from the normal koboldcpp Lite main UI, or you can do it through most interfaces like SillyTavern.

@ghost

ghost commented Aug 28, 2024

I'm very sorry for my stupid question, but can someone explain why it's so slow when q8_0 or q4_k is used? About 18-19 sec per iteration, while the fp8 model in ComfyUI was giving about 6-7 sec/it on the same GPU (RX 7600 XT).
Edit: 1024x1024 resolution

@stduhpf
Contributor

stduhpf commented Aug 28, 2024

I'm very sorry for my stupid question, but can someone explain why it's so slow when q8_0 or q4_k is used? About 18-19 sec per iteration, while the fp8 model in ComfyUI was giving about 6-7 sec/it on the same GPU (RX 7600 XT). Edit: 1024x1024 resolution

#323 (comment)

Using stable-diffusion.cpp should be much faster than ComfyUI when it comes to GGUF.

@ghost

ghost commented Aug 28, 2024

I thought that comment was about Comfy+GGUF, which I didn't try; I tried Comfy with the fp8 model.

@stduhpf
Contributor

stduhpf commented Aug 28, 2024

Ah, I misunderstood what you meant. You're getting worse performance with stable-diffusion.cpp+GGUF compared to Comfy+fp8? Both using ROCm?

@ghost

ghost commented Aug 28, 2024

Yes, on the same GPU.

@stduhpf
Contributor

stduhpf commented Aug 28, 2024

Weird. For me it's faster, but I'm comparing Vulkan to DirectML, not ROCm to ROCm.
