Conversation

rmatif
Contributor

@rmatif rmatif commented Sep 6, 2025

Following the need here #772 and the discussion here #789

It achieves up to 3x faster loading of SDXL models.

This PR parallelizes the entire tensor processing and loading pipeline. Tensor preprocessing and deduplication are now distributed across a thread pool, using thread-local maps followed by a final merge to minimize contention. The core loading loop uses an atomic counter to dispatch tensors to worker threads, each with its own file handle, enabling true concurrent I/O on non-zip archives (this operation is not thread-safe on zip files) and overlapping I/O with CPU work.
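
For illustration, a minimal sketch of that dispatch pattern, assuming a flat (non-zip) model file; `TensorEntry`, `load_all_tensors`, and the loop body are placeholders rather than the actual model.cpp code:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical tensor descriptor; the real one lives in model.cpp.
struct TensorEntry { /* name, file offset, byte size, ... */ };

void load_all_tensors(const std::vector<TensorEntry>& entries,
                      const char* model_path, int n_threads) {
    std::atomic<size_t> next_idx{0};  // shared work counter
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&]() {
            // Each worker opens its own handle so reads can run concurrently
            // (safe for flat files only; zip archives are not thread-safe here).
            std::FILE* fp = std::fopen(model_path, "rb");
            if (!fp) {
                return;
            }
            for (;;) {
                size_t i = next_idx.fetch_add(1);  // claim the next tensor
                if (i >= entries.size()) {
                    break;
                }
                // read + convert + copy_to_backend for entries[i] using fp ...
            }
            std::fclose(fp);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```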

cc @wbruna

@rmatif rmatif force-pushed the ref-tensor-loading branch from e2c6c10 to 55b7707 on September 6, 2025, 17:16
@rmatif
Contributor Author

rmatif commented Sep 6, 2025

The build is failing because I’m using std::unordered_map::merge and structured bindings, which are C++17 features and not available on the CI compiler. I need to figure out a workaround without sacrificing perf
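
For context, this is roughly the pattern that needs C++17 (the function and map contents here are illustrative, not the actual sd.cpp code):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Merge per-thread tensor-name maps into one map without copying elements.
std::unordered_map<std::string, size_t>
merge_thread_local_maps(std::vector<std::unordered_map<std::string, size_t>>& locals) {
    std::unordered_map<std::string, size_t> merged;
    for (auto& local : locals) {
        merged.merge(local);  // std::unordered_map::merge is C++17-only
    }
    for (const auto& [name, idx] : merged) {  // structured bindings, also C++17
        (void)name; (void)idx;                // e.g. build the final tensor list here
    }
    return merged;
}
```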

@Green-Sky
Contributor

ggml moved to c++17, so it would be reasonable to move sd.cpp too @leejet

@rmatif
Contributor Author

rmatif commented Sep 6, 2025

> ggml moved to c++17, so it would be reasonable to move sd.cpp too @leejet

I was just about to press enter to say the same :)
C++17

@hartmark
Contributor

hartmark commented Sep 7, 2025

Cool, I was about to implement the same thing myself, as I was also finding the loading quite slow.

@leejet
Owner

leejet commented Sep 7, 2025

> ggml moved to c++17, so it would be reasonable to move sd.cpp too @leejet

Sure, there's no problem with this.

@leejet
Owner

leejet commented Sep 7, 2025

I have updated sd.cpp to C++17.

@leejet
Owner

leejet commented Sep 7, 2025

I have added statistics for tensor loading time in #793.

> .\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors -p "a lovely cat" -v   -H 1024 -W 1024 --diffusion-fa

loading tensors completed, taking 6.39s (process: 0.03s, read: 4.68s, memcpy: 0.00s, convert: 0.30s, copy_to_backend: 1.16s)

It seems that the process time is not significant and is not a bottleneck. Therefore, I think there is no need to multi-thread this processing section, which would only add complexity to the code.

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

> I have updated sd.cpp to C++17.

Thanks! I've reverted the workaround.

> It seems that the process time is not significant and is not a bottleneck. Therefore, I think there is no need to multi-thread this processing section, which would only add complexity to the code.

Can you try mounting a ramdisk and loading the model from there? I'll share some numbers soon.

Master:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 4145.07MB (VRAM 0.00MB, RAM 4145.07MB): text_encoders 1118.92MB(RAM), diffusion_model 2931.68MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/mnt/ramdisk/Diff-InstructStar_q8_0.gguf' completed, taking 2.34s

PR:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 4145.07MB (VRAM 0.00MB, RAM 4145.07MB): text_encoders 1118.92MB(RAM), diffusion_model 2931.68MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/mnt/ramdisk/Diff-InstructStar_q8_0.gguf' completed, taking 1.01s

@leejet
Owner

leejet commented Sep 7, 2025

Ramdisk result

loading tensors completed, taking 5.85s (process: 0.01s, read: 4.28s, memcpy: 0.00s, convert: 0.19s, copy_to_backend: 1.15s)

@leejet
Owner

leejet commented Sep 7, 2025

Using a ramdisk didn't make much of a difference in speed. I'm not sure if it's because of the limitation of my memory bandwidth.

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

> Using a ramdisk didn't make much of a difference in speed. I'm not sure if it's because of the limitation of my memory bandwidth.

You can measure your memory bandwidth with mbw. Here's mine:

AVG     Method: MEMCPY  Elapsed: 0.07953        MiB: 1024.00000 Copy: 12875.013 MiB/s

If you're already capped by memory bandwidth, this PR won't make much of a difference.

@leejet
Owner

leejet commented Sep 7, 2025

It doesn't seem to be a memory bandwidth problem. Perhaps it's an issue with my ramdisk software.

AVG     Method: MEMCPY  Elapsed: 1.09857        MiB: 10000.00000        Copy: 9102.753 MiB/s

@wbruna
Contributor

wbruna commented Sep 7, 2025

On my Ryzen 5 3400G, RX 7600 XT, SSD storage, Linux 6.12:

Vulkan, cold cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 16.97s (process: 0.12s, read: 11.60s, memcpy: 0.00s, convert: 0.30s, copy_to_backend: 4.74s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 16.97s

Vulkan, hot cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 6.04s (process: 0.12s, read: 0.95s, memcpy: 0.00s, convert: 0.31s, copy_to_backend: 4.45s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 6.04s

PR, Vulkan, cold cache

[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 16.08s

PR, Vulkan, hot cache

[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 2.08s

For comparison:

ROCm, cold cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 15.55s (process: 0.11s, read: 11.90s, memcpy: 0.00s, convert: 0.32s, copy_to_backend: 2.98s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 15.55s

ROCm, hot cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 4.54s (process: 0.11s, read: 1.36s, memcpy: 0.00s, convert: 0.33s, copy_to_backend: 2.53s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 4.54s

(I can also test the PR on ROCm, but it takes a long time to build here 😅 )

$ mbw -t0 -q 1024
AVG Method: MEMCPY Elapsed: 0.13855 MiB: 1024.00000 Copy: 7390.958 MiB/s

Contributor

@wbruna wbruna left a comment

Looking good so far. Could we also get a clock reading between the preparation and the loading phase?
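
A rough sketch of what such a clock reading could look like (the wrapper function and print call are assumptions; `ggml_time_ms()` is ggml's existing millisecond timer):

```cpp
#include <cstdint>
#include <cstdio>

// ggml's millisecond timer, declared in ggml.h.
extern "C" int64_t ggml_time_ms(void);

// Sketch: take a clock reading between the preparation phase
// (preprocess_tensor / dedup) and the parallel loading phase.
void load_tensors_with_phase_timing() {
    int64_t t_prep_start = ggml_time_ms();
    // ... preprocess_tensor / dedup (preparation) ...
    int64_t t_prep_end = ggml_time_ms();
    std::printf("tensor preparation took %.2fs\n",
                (t_prep_end - t_prep_start) / 1000.0f);
    // ... parallel tensor loading starts here ...
}
```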

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

More numbers:

Master:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 6751.89MB (VRAM 0.00MB, RAM 6751.89MB): text_encoders 1757.36MB(RAM), diffusion_model 4900.07MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/ramdisk/RealVisXL_V5.0_fp16.safetensors' completed, taking 4.61s

PR:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 6751.89MB (VRAM 0.00MB, RAM 6751.89MB): text_encoders 1757.36MB(RAM), diffusion_model 4900.07MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/ramdisk/RealVisXL_V5.0_fp16.safetensors' completed, taking 1.34s

@leejet
Owner

leejet commented Sep 7, 2025

I have merged the changes from the master branch into your branch https://github.com/leejet/stable-diffusion.cpp/commits/ref-tensor-loading/. If you don't mind, I can directly push it to your branch. Here are some of my test results.

Master:

loading tensors completed, taking 6.39s (process: 0.03s, read: 4.68s, memcpy: 0.00s, convert: 0.30s, copy_to_backend: 1.16s)

PR:

loading tensors completed, taking 2.71s (process: 0.02s, read: 1.04s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 1.50s)

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

> If you don't mind, I can directly push it to your branch. Here are some of my test results.

Cool. Sure you can go ahead and push it!

@leejet
Owner

leejet commented Sep 7, 2025

Based on the above test results, I think that multithreading has very little effect on the speed optimization of preprocess_tensor and dedup. I prefer to keep this part using the original single-threaded processing method, and only use multithreading in other parts.

@Green-Sky
Contributor

> Based on the above test results, I think that multithreading has very little effect on the speed optimization of preprocess_tensor and dedup. I prefer to keep this part using the original single-threaded processing method, and only use multithreading in other parts.

Have you compared dedicated model conversion numbers, or just supplying --type with e.g. q5_k?

@leejet
Owner

leejet commented Sep 7, 2025

> Have you compared dedicated model conversion numbers, or just supplying --type with e.g. q5_k?

preprocess_tensor/dedup do not involve type conversion; they only handle tensor names and deduplication. Here is a run that includes type conversion:

loading tensors completed, taking 212.47s (process: 0.02s, read: 8.64s, memcpy: 0.00s, convert: 201.98s, copy_to_backend: 1.63s)

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

Added n_threads override.

| Number of threads | Load time (SDXL model) |
|-------------------|------------------------|
| t = 1             | 4.82s                  |
| t = 2             | 2.62s                  |
| t = 3             | 1.81s                  |
| t = 4             | 1.49s                  |
| t = 5             | 1.30s                  |

Beyond 5 threads we hit the memory bandwidth limit.

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

@leejet the tensor loading stats display seems completely broken when it comes to LoRA:

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.61s (process: 0.41s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size =  375.37 MB(VRAM) (2364 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/dmd2_sdxl_4step_lora_fp16.safetensors
  |==================================================| 2364/2364 - 17.24it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.09s (process: 0.04s, read: 0.02s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.05s)

@wbruna
Contributor

wbruna commented Sep 8, 2025

Cold cache:
[INFO ] model.cpp:2281 - loading tensors completed, taking 16.65s (process: 0.04s, read: 12.23s, memcpy: 0.00s, convert: 0.14s, copy_to_backend: 3.85s)

Hot cache:
[INFO ] model.cpp:2281 - loading tensors completed, taking 2.06s (process: 0.04s, read: 0.71s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.94s)

Tmpfs:
[INFO ] model.cpp:2281 - loading tensors completed, taking 2.04s (process: 0.04s, read: 0.68s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.91s)

As a baseline, this is 1 thread, cold/hot cache:
[INFO ] model.cpp:2281 - loading tensors completed, taking 20.34s (process: 0.12s, read: 14.45s, memcpy: 0.00s, convert: 0.33s, copy_to_backend: 5.00s)
[INFO ] model.cpp:2281 - loading tensors completed, taking 6.73s (process: 0.12s, read: 1.30s, memcpy: 0.00s, convert: 0.26s, copy_to_backend: 4.52s)

Speed peaks at 4 threads here (4-core CPU).

So, it looks like the parallel reads are in fact helping a bit (edit: see below). And depending on the system, a ramdisk could be pointless.

model.cpp Outdated
    return res;
}

-bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
+bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_threads_p) {
    int64_t process_time_ms = 0;
    int64_t read_time_ms    = 0;
    int64_t memcpy_time_ms  = 0;
Contributor

@wbruna wbruna Sep 8, 2025

These are being incremented by each thread, so... should at least be atomic?

And thinking about it, it may make more sense to have the time counters per-thread, and average the results.

Contributor Author

> These are being incremented by each thread, so... should at least be atomic?
>
> And thinking about it, it may make more sense to have the time counters per-thread, and average the results.

On paper it seems like a good idea, but in practice it can be misleading: with 36 threads, dividing by the thread count will bias the average. I'll push it anyway, and if it's not desired I'll revert.

Contributor

The problem with a simple sum is the total would end up bigger than the loading time. Correct, but very confusing :-)

A more meaningful measure could be to weight by time, regardless of the number of threads: something like (total read time on all threads) / (total time on all threads), then multiplying that by the time measured on the main thread, to get the "total read time".
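
A sketch of that weighting, under the assumption that the per-thread accumulators become atomics summed across workers (the names here are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative accumulators, summed across all worker threads.
std::atomic<int64_t> read_time_ms_total{0};   // time spent in reads
std::atomic<int64_t> busy_time_ms_total{0};   // read + convert + copy, all threads

// After the workers join, scale the summed read time so that the reported
// phases add up to the wall-clock loading time measured on the main thread.
int64_t scaled_read_time_ms(int64_t wall_clock_ms) {
    int64_t busy = busy_time_ms_total.load();
    if (busy == 0) {
        return 0;
    }
    double read_fraction = double(read_time_ms_total.load()) / double(busy);
    return int64_t(read_fraction * wall_clock_ms);
}
```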

@wbruna
Contributor

wbruna commented Sep 8, 2025

I tested the condition variable + mutex approach: f9a2adb

The reading speed got a little bit slower:

[INFO ] model.cpp:2324 - loading tensors completed, taking 2.13s (process: 0.05s, read: 0.74s, memcpy: 0.00s, convert: 0.12s, copy_to_backend: 1.04s)

But using serialized reads made it a little bit faster:

[INFO ] model.cpp:2324 - loading tensors completed, taking 1.94s (process: 0.04s, read: 0.69s, memcpy: 0.00s, convert: 0.03s, copy_to_backend: 1.03s)

I didn't fix the time counters for the multi-thread updates, though, so I wouldn't put too much trust in those values.
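
For reference, a serialized-read worker step could look roughly like this (a sketch, not the code from f9a2adb): the file access is guarded by a mutex while the convert/copy_to_backend work stays parallel.

```cpp
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <vector>

std::mutex read_mutex;  // only one thread touches the file at a time

// Hypothetical per-tensor worker step: serialize the read, but keep the
// convert / copy_to_backend work on the buffer fully parallel.
void process_tensor(std::FILE* fp, long offset, size_t nbytes, std::vector<char>& buf) {
    buf.resize(nbytes);
    {
        std::lock_guard<std::mutex> lock(read_mutex);
        std::fseek(fp, offset, SEEK_SET);
        if (std::fread(buf.data(), 1, nbytes, fp) != nbytes) {
            buf.clear();  // read error; real code would report it
            return;
        }
    }
    // ... convert + copy_to_backend on buf, outside the lock ...
}
```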

@rmatif
Contributor Author

rmatif commented Sep 8, 2025

> Based on the above test results, I think that multithreading has very little effect on the speed optimization of preprocess_tensor and dedup. I prefer to keep this part using the original single-threaded processing method, and only use multithreading in other parts.

For LoRA it does:

Before:

[INFO ] stable-diffusion.cpp:848  - attempting to apply 1 LoRAs
[INFO ] model.cpp:1043 - load /ramdisk/NatsukiAoi ag4o.safetensors using safetensors format
[DEBUG] model.cpp:1150 - init from '/ramdisk/NatsukiAoi ag4o.safetensors', prefix = ''
[INFO ] lora.hpp:119  - loading LoRA from '/ramdisk/NatsukiAoi ag4o.safetensors'
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 5.00it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 1.38s (process: 1.17s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size =  324.78 MB(VRAM) (2166 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 4.15it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.30s (process: 0.06s, read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.02s)
[DEBUG] lora.hpp:161  - lora type: ".lora_down"/".lora_up"
[DEBUG] lora.hpp:163  - finished loaded lora
[DEBUG] lora.hpp:860  - (2166 / 2166) LoRA tensors will be applied
[DEBUG] ggml_extend.hpp:1425 - lora compute buffer size: 101.56 MB(VRAM)
[DEBUG] lora.hpp:860  - (2166 / 2166) LoRA tensors will be applied
[INFO ] stable-diffusion.cpp:825  - lora 'NatsukiAoi ag4o' applied, taking 3.16s
[INFO ] stable-diffusion.cpp:868  - apply_loras completed, taking 3.16s

After:

[INFO ] stable-diffusion.cpp:848  - attempting to apply 1 LoRAs
[INFO ] model.cpp:1043 - load /ramdisk/NatsukiAoi ag4o.safetensors using safetensors format
[DEBUG] model.cpp:1150 - init from '/ramdisk/NatsukiAoi ag4o.safetensors', prefix = ''
[INFO ] lora.hpp:120  - loading LoRA from '/ramdisk/NatsukiAoi ag4o.safetensors'
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 17.86it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.10s (process: 0.05s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size =  324.78 MB(VRAM) (2166 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 16.13it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.10s (process: 0.04s, read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.08s)
[DEBUG] lora.hpp:174  - lora type: ".lora_down"/".lora_up"
[DEBUG] lora.hpp:176  - finished loaded lora
[DEBUG] lora.hpp:873  - (2166 / 2166) LoRA tensors will be applied
[DEBUG] ggml_extend.hpp:1425 - lora compute buffer size: 101.56 MB(VRAM)
[DEBUG] lora.hpp:873  - (2166 / 2166) LoRA tensors will be applied
[INFO ] stable-diffusion.cpp:825  - lora 'NatsukiAoi ag4o' applied, taking 1.58s
[INFO ] stable-diffusion.cpp:868  - apply_loras completed, taking 1.58s

}
pretty_progress(total_tensors_processed + current_idx, total_tensors_to_process, (ggml_time_ms() - t_start) / 1000.0f);
Contributor

Just a reminder, since the original review got marked resolved: the progress here should display the average time per tensor (note the (1000.0f * tensor_count) in the original code).
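
A hedged sketch of the fix being asked for, reusing the variable names from the snippet above (the exact divisor in the original code may differ):

```cpp
// Divide by the number of tensors processed so far, so the progress bar shows
// average time per tensor rather than total elapsed seconds.
size_t done = total_tensors_processed + current_idx;
pretty_progress(done, total_tensors_to_process,
                (ggml_time_ms() - t_start) / (1000.0f * (done > 0 ? done : 1)));
```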
