Optimize tensor loading #790
Conversation
The build is failing because I'm using
ggml moved to C++17, so it would be reasonable to move sd.cpp too @leejet
Cool, I was about to implement the same thing myself, as I was also thinking the loading is quite slow.
Sure, there's no problem with this.
I have updated sd.cpp to C++17.
I have added statistics for tensor loading time in #793.
It seems that the process time is not significant and is not a bottleneck. Therefore, I think there is no need to multi-thread this processing section, which keeps the code simpler.
Thanks! I've reverted the workaround
Can you try to mount a ramdisk and load the model from there?
Will share some numbers soon.
Master:
[INFO ] stable-diffusion.cpp:641 - total params memory size = 4145.07MB (VRAM 0.00MB, RAM 4145.07MB): text_encoders 1118.92MB(RAM), diffusion_model 2931.68MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660 - loading model from '/mnt/ramdisk/Diff-InstructStar_q8_0.gguf' completed, taking 2.34s
PR:
[INFO ] stable-diffusion.cpp:641 - total params memory size = 4145.07MB (VRAM 0.00MB, RAM 4145.07MB): text_encoders 1118.92MB(RAM), diffusion_model 2931.68MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660 - loading model from '/mnt/ramdisk/Diff-InstructStar_q8_0.gguf' completed, taking 1.01s
Ramdisk result:
Using a ramdisk didn't make much of a difference in speed. I'm not sure if it's because of a limitation of my memory bandwidth.
You can measure your memory bandwidth with:
AVG Method: MEMCPY Elapsed: 0.07953 MiB: 1024.00000 Copy: 12875.013 MiB/s
If you're already capped by your bandwidth, then this PR won't make much difference.
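For reference, a quick memcpy-based bandwidth check can be done with a small standalone C++ snippet like the one below (a rough sketch, not the tool whose output is quoted above; the buffer size and reporting format are arbitrary):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const size_t mib   = 1024;               // copies 1 GiB; needs ~2 GiB of free RAM
    const size_t bytes = mib * 1024 * 1024;
    std::vector<char> src(bytes, 1), dst(bytes, 0);  // filling both buffers pre-faults the pages

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("copied %zu MiB in %.5f s -> %.1f MiB/s\n", mib, sec, mib / sec);
    return 0;
}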
It doesn't seem to be a problem with the memory bandwidth. Perhaps it's a problem with my ramdisk software.
On my Ryzen 5 3400G, RX 7600 XT, SSD storage, Linux 6.12:
Vulkan, cold cache
Vulkan, hot cache
PR, Vulkan, cold cache
PR, Vulkan, hot cache
For comparison: ROCm, cold cache
ROCm, hot cache
(I can also test the PR on ROCm, but it takes a long time to build here 😅 )
Looking good so far. Could we also get a clock reading between the preparation and the loading phase?
More numbers:
Master:
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 0.00MB, RAM 6751.89MB): text_encoders 1757.36MB(RAM), diffusion_model 4900.07MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660 - loading model from '/ramdisk/RealVisXL_V5.0_fp16.safetensors' completed, taking 4.61s
PR:
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 0.00MB, RAM 6751.89MB): text_encoders 1757.36MB(RAM), diffusion_model 4900.07MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660 - loading model from '/ramdisk/RealVisXL_V5.0_fp16.safetensors' completed, taking 1.34s
I have merged the changes from the master branch into your branch https://github.com/leejet/stable-diffusion.cpp/commits/ref-tensor-loading/. If you don't mind, I can push it directly to your branch. Here is some of my test data.
Master:
PR:
Cool. Sure, you can go ahead and push it!
Based on the above test results, I think multithreading has very little effect on the speed of preprocess_tensor and dedup. I prefer to keep this part single-threaded, and only use multithreading in the other parts.
Have you compared dedicated model conversion numbers? Or just supplying
preprocess_tensor/dedup do not involve type conversion; they only handle tensor names and deduplication. Here is data that includes type conversion:
Added.
After 5, we are hitting the memory bandwidth limit.
@leejet the tensor display stat seems completely broken when it comes to lora:
[INFO ] model.cpp:2281 - loading tensors completed, taking 0.61s (process: 0.41s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size = 375.37 MB(VRAM) (2364 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/dmd2_sdxl_4step_lora_fp16.safetensors
|==================================================| 2364/2364 - 17.24it/s
[INFO ] model.cpp:2281 - loading tensors completed, taking 0.09s (process: 0.04s, read: 0.02s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.05s)
Cold cache:
Hot cache:
Tmpfs:
As a baseline, this is 1 thread, cold/hot cache:
Speed peaks at 4 threads here (4-core CPU). So, it looks like
model.cpp (outdated)
     return res;
 }

-bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
+bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_threads_p) {
     int64_t process_time_ms = 0;
     int64_t read_time_ms = 0;
     int64_t memcpy_time_ms = 0;
These are being incremented by each thread, so... should at least be atomic?
And thinking about it, it may make more sense to have the time counters per-thread, and average the results.
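For illustration, the "at least atomic" variant could look roughly like this (a sketch using the counter names from the diff above; the relaxed memory ordering is an assumption):

#include <atomic>
#include <cstdint>

// Shared between worker threads; each thread adds the milliseconds it spent in a phase.
std::atomic<int64_t> process_time_ms{0};
std::atomic<int64_t> read_time_ms{0};
std::atomic<int64_t> memcpy_time_ms{0};

void add_read_time(int64_t elapsed_ms) {
    // Relaxed ordering is enough here: only the final sum after join() is read.
    read_time_ms.fetch_add(elapsed_ms, std::memory_order_relaxed);
}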
On paper it seems like a good idea, but in practice it can be misleading: with 36 threads, dividing by that count biases the average. I'll push it anyway, and if it's not desired I'll revert.
The problem with a simple sum is that the total would end up bigger than the loading time. Correct, but very confusing :-)
A more meaningful measure could be weighting by time, regardless of the number of threads: something like (total read time across all threads) / (total time across all threads), then multiplying that by the time measured by the main thread, to get the "total read time".
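That weighting could look something like this (a hypothetical helper, with the per-thread sums collected after the workers are joined):

#include <cstdint>

// Estimate the "total read time": the fraction of worker time spent reading,
// scaled by the wall-clock time measured by the main thread.
int64_t estimate_read_time_ms(int64_t read_ms_all_threads,
                              int64_t total_ms_all_threads,
                              int64_t main_thread_wall_ms) {
    if (total_ms_all_threads == 0) return 0;
    double read_fraction = (double)read_ms_all_threads / (double)total_ms_all_threads;
    return (int64_t)(read_fraction * (double)main_thread_wall_ms);
}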
I tested the condition variable + mutex approach: f9a2adb
The reading speed got a little bit slower:
[INFO ] model.cpp:2324 - loading tensors completed, taking 2.13s (process: 0.05s, read: 0.74s, memcpy: 0.00s, convert: 0.12s, copy_to_backend: 1.04s)
But using serialized reads made it a little bit faster:
[INFO ] model.cpp:2324 - loading tensors completed, taking 1.94s (process: 0.04s, read: 0.69s, memcpy: 0.00s, convert: 0.03s, copy_to_backend: 1.03s)
I didn't fix the time counters for the multi-threaded updates, though, so I wouldn't put too much trust in those values.
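For context, "serialized reads" can be approximated by a mutex around just the file read, so only one thread touches the file at a time while conversion and backend copies still overlap (a hedged sketch with hypothetical names, not the actual f9a2adb change):

#include <cstdio>
#include <mutex>
#include <vector>

std::mutex read_mutex;  // only one thread reads from the file at a time

// Hypothetical per-tensor job: read the bytes under the lock, then do the
// (potentially expensive) convert / copy-to-backend work outside the lock.
void load_one_tensor(std::FILE* f, long offset, size_t nbytes, std::vector<char>& buf) {
    buf.resize(nbytes);
    {
        std::lock_guard<std::mutex> lock(read_mutex);
        std::fseek(f, offset, SEEK_SET);
        if (std::fread(buf.data(), 1, nbytes, f) != nbytes) {
            return;  // real code would report the read error
        }
    }
    // ... convert dtype and copy buf to the backend, in parallel with other threads ...
}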
For lora it does:
Before:
[INFO ] stable-diffusion.cpp:848 - attempting to apply 1 LoRAs
[INFO ] model.cpp:1043 - load /ramdisk/NatsukiAoi ag4o.safetensors using safetensors format
[DEBUG] model.cpp:1150 - init from '/ramdisk/NatsukiAoi ag4o.safetensors', prefix = ''
[INFO ] lora.hpp:119 - loading LoRA from '/ramdisk/NatsukiAoi ag4o.safetensors'
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
|==================================================| 2166/2166 - 5.00it/s
[INFO ] model.cpp:2281 - loading tensors completed, taking 1.38s (process: 1.17s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size = 324.78 MB(VRAM) (2166 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
|==================================================| 2166/2166 - 4.15it/s
[INFO ] model.cpp:2281 - loading tensors completed, taking 0.30s (process: 0.06s, read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.02s)
[DEBUG] lora.hpp:161 - lora type: ".lora_down"/".lora_up"
[DEBUG] lora.hpp:163 - finished loaded lora
[DEBUG] lora.hpp:860 - (2166 / 2166) LoRA tensors will be applied
[DEBUG] ggml_extend.hpp:1425 - lora compute buffer size: 101.56 MB(VRAM)
[DEBUG] lora.hpp:860 - (2166 / 2166) LoRA tensors will be applied
[INFO ] stable-diffusion.cpp:825 - lora 'NatsukiAoi ag4o' applied, taking 3.16s
[INFO ] stable-diffusion.cpp:868 - apply_loras completed, taking 3.16s
After:
[INFO ] stable-diffusion.cpp:848 - attempting to apply 1 LoRAs
[INFO ] model.cpp:1043 - load /ramdisk/NatsukiAoi ag4o.safetensors using safetensors format
[DEBUG] model.cpp:1150 - init from '/ramdisk/NatsukiAoi ag4o.safetensors', prefix = ''
[INFO ] lora.hpp:120 - loading LoRA from '/ramdisk/NatsukiAoi ag4o.safetensors'
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
|==================================================| 2166/2166 - 17.86it/s
[INFO ] model.cpp:2281 - loading tensors completed, taking 0.10s (process: 0.05s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size = 324.78 MB(VRAM) (2166 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
|==================================================| 2166/2166 - 16.13it/s
[INFO ] model.cpp:2281 - loading tensors completed, taking 0.10s (process: 0.04s, read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.08s)
[DEBUG] lora.hpp:174 - lora type: ".lora_down"/".lora_up"
[DEBUG] lora.hpp:176 - finished loaded lora
[DEBUG] lora.hpp:873 - (2166 / 2166) LoRA tensors will be applied
[DEBUG] ggml_extend.hpp:1425 - lora compute buffer size: 101.56 MB(VRAM)
[DEBUG] lora.hpp:873 - (2166 / 2166) LoRA tensors will be applied
[INFO ] stable-diffusion.cpp:825 - lora 'NatsukiAoi ag4o' applied, taking 1.58s
[INFO ] stable-diffusion.cpp:868 - apply_loras completed, taking 1.58s
 }
 pretty_progress(total_tensors_processed + current_idx, total_tensors_to_process, (ggml_time_ms() - t_start) / 1000.0f);
Just a reminder, since the original review got marked resolved: the progress here should display the average time (note the (1000.0f * tensor_count)
in the original code).
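In other words, something along these lines (a sketch: tensor_count stands for the number of tensors processed in this call, as in the original code):

// report average seconds per tensor, not the total elapsed time
pretty_progress(total_tensors_processed + current_idx, total_tensors_to_process,
                (ggml_time_ms() - t_start) / (1000.0f * tensor_count));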
Following the need in #772 and the discussion in #789.
It achieves up to 3x faster loading on SDXL models.
This PR introduces parallelization across the entire tensor processing and loading pipeline. Tensor preprocessing and deduplication are now distributed across a thread pool, using thread-local maps followed by a final merge to minimize contention. The core loading loop uses an atomic counter to dispatch tensors to worker threads, each with its own file handle, enabling truly concurrent I/O on non-zip archives (this operation is not thread-safe on zip files); this overlaps I/O and CPU work.
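A rough sketch of that dispatch scheme (hypothetical names and a simplified tensor table; the real code also merges thread-local maps and handles zip archives differently):

#include <atomic>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

struct TensorEntry {
    std::string name;
    long offset;
    size_t nbytes;
};

// Workers pull indices from a shared atomic counter; each opens its own file
// handle so reads can run concurrently on non-zip archives.
void load_all(const std::string& path, const std::vector<TensorEntry>& tensors, int n_threads) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&]() {
            std::FILE* f = std::fopen(path.c_str(), "rb");  // per-thread file handle
            if (!f) return;
            std::vector<char> buf;
            for (size_t i = next.fetch_add(1); i < tensors.size(); i = next.fetch_add(1)) {
                const TensorEntry& te = tensors[i];
                buf.resize(te.nbytes);
                std::fseek(f, te.offset, SEEK_SET);
                if (std::fread(buf.data(), 1, te.nbytes, f) != te.nbytes) break;
                // ... convert dtype if needed, then copy to the backend buffer ...
            }
            std::fclose(f);
        });
    }
    for (auto& w : workers) w.join();
}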
cc @wbruna