Conversation

rmatif
Contributor

@rmatif rmatif commented Sep 6, 2025

Following the need here #772 and the discussion here #789

It achieves up to 3x faster loading of SDXL models.

This PR parallelizes the entire tensor processing and loading pipeline. Tensor preprocessing and deduplication are now distributed across a thread pool, using thread-local maps followed by a final merge to minimize contention. The core loading loop uses an atomic counter to dispatch tensors to worker threads, each with its own file handle, enabling true concurrent I/O on non-zip archives (this operation is not thread-safe on zip files) and overlapping I/O with CPU work.
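
For illustration, a minimal sketch of that dispatch pattern, assuming a flat (non-zip) model file; `TensorEntry`, `load_all_tensors`, and the loop body are placeholders rather than the actual model.cpp code:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical tensor descriptor; the real one lives in model.cpp.
struct TensorEntry { /* name, file offset, byte size, ... */ };

void load_all_tensors(const std::vector<TensorEntry>& entries,
                      const char* model_path, int n_threads) {
    std::atomic<size_t> next_idx{0};  // shared work counter
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&]() {
            // Each worker opens its own handle so reads can run concurrently
            // (safe for flat files only; zip archives are not thread-safe here).
            std::FILE* fp = std::fopen(model_path, "rb");
            if (!fp) {
                return;
            }
            for (;;) {
                size_t i = next_idx.fetch_add(1);  // claim the next tensor
                if (i >= entries.size()) {
                    break;
                }
                // read + convert + copy_to_backend for entries[i] using fp ...
            }
            std::fclose(fp);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```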

cc @wbruna

@rmatif rmatif force-pushed the ref-tensor-loading branch from e2c6c10 to 55b7707 on September 6, 2025, 17:16
@rmatif
Contributor Author

rmatif commented Sep 6, 2025

The build is failing because I’m using std::unordered_map::merge and structured bindings, which are C++17 features and not available on the CI compiler. I need to figure out a workaround without sacrificing perf
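
For context, this is roughly the pattern that needs C++17 (the function and map contents here are illustrative, not the actual sd.cpp code):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Merge per-thread tensor-name maps into one map without copying elements.
std::unordered_map<std::string, size_t>
merge_thread_local_maps(std::vector<std::unordered_map<std::string, size_t>>& locals) {
    std::unordered_map<std::string, size_t> merged;
    for (auto& local : locals) {
        merged.merge(local);  // std::unordered_map::merge is C++17-only
    }
    for (const auto& [name, idx] : merged) {  // structured bindings, also C++17
        (void)name; (void)idx;                // e.g. build the final tensor list here
    }
    return merged;
}
```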

@Green-Sky
Contributor

ggml moved to c++17, so it would be reasonable to move sd.cpp too @leejet

@rmatif
Contributor Author

rmatif commented Sep 6, 2025

> ggml moved to c++17, so it would be reasonable to move sd.cpp too @leejet

I was just about to press enter to say the same :)
C++17

@hartmark
Contributor

hartmark commented Sep 7, 2025

Cool, I was about to implement the same thing myself, as I was also finding the loading quite slow.

@leejet
Owner

leejet commented Sep 7, 2025

> ggml moved to c++17, so it would be reasonable to move sd.cpp too @leejet

Sure, there's no problem with this.

@leejet
Owner

leejet commented Sep 7, 2025

I have updated sd.cpp to C++17.

@leejet
Owner

leejet commented Sep 7, 2025

I have added statistics for tensor loading time in #793.

> .\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors -p "a lovely cat" -v   -H 1024 -W 1024 --diffusion-fa

loading tensors completed, taking 6.39s (process: 0.03s, read: 4.68s, memcpy: 0.00s, convert: 0.30s, copy_to_backend: 1.16s)

It seems that the process time is not significant and is not a bottleneck. Therefore, I think there is no need to multi-thread this processing section, which would only add complexity to the code.

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

> I have updated sd.cpp to C++17.

Thanks! I've reverted the workaround.

> It seems that the process time is not significant and is not a bottleneck. Therefore, I think there is no need to multi-thread this processing section, which would only add complexity to the code.

Can you try mounting a ramdisk and loading the model from there? I'll share some numbers soon.

Master:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 4145.07MB (VRAM 0.00MB, RAM 4145.07MB): text_encoders 1118.92MB(RAM), diffusion_model 2931.68MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/mnt/ramdisk/Diff-InstructStar_q8_0.gguf' completed, taking 2.34s

PR:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 4145.07MB (VRAM 0.00MB, RAM 4145.07MB): text_encoders 1118.92MB(RAM), diffusion_model 2931.68MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/mnt/ramdisk/Diff-InstructStar_q8_0.gguf' completed, taking 1.01s

@leejet
Owner

leejet commented Sep 7, 2025

Ramdisk result

loading tensors completed, taking 5.85s (process: 0.01s, read: 4.28s, memcpy: 0.00s, convert: 0.19s, copy_to_backend: 1.15s)

@leejet
Owner

leejet commented Sep 7, 2025

Using a ramdisk didn't make much of a difference in speed. I'm not sure if it's because of the limitation of my memory bandwidth.

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

> Using a ramdisk didn't make much of a difference in speed. I'm not sure if it's because of the limitation of my memory bandwidth.

You can measure your memory bandwidth with mbw. Here's mine:

AVG     Method: MEMCPY  Elapsed: 0.07953        MiB: 1024.00000 Copy: 12875.013 MiB/s

If you're already capped by memory bandwidth, this PR won't make much of a difference.

@leejet
Owner

leejet commented Sep 7, 2025

It doesn't seem to be a memory bandwidth problem. Perhaps it's an issue with my ramdisk software.

AVG     Method: MEMCPY  Elapsed: 1.09857        MiB: 10000.00000        Copy: 9102.753 MiB/s

@wbruna
Contributor

wbruna commented Sep 7, 2025

On my Ryzen 5 3400G, RX 7600 XT, SSD storage, Linux 6.12:

Vulkan, cold cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 16.97s (process: 0.12s, read: 11.60s, memcpy: 0.00s, convert: 0.30s, copy_to_backend: 4.74s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 16.97s

Vulkan, hot cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 6.04s (process: 0.12s, read: 0.95s, memcpy: 0.00s, convert: 0.31s, copy_to_backend: 4.45s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 6.04s

PR, Vulkan, cold cache

[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 16.08s

PR, Vulkan, hot cache

[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 2.08s

For comparison:

ROCm, cold cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 15.55s (process: 0.11s, read: 11.90s, memcpy: 0.00s, convert: 0.32s, copy_to_backend: 2.98s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 15.55s

ROCm, hot cache

[INFO ] model.cpp:2216 - loading tensors completed, taking 4.54s (process: 0.11s, read: 1.36s, memcpy: 0.00s, convert: 0.33s, copy_to_backend: 2.53s)
[INFO ] stable-diffusion.cpp:641 - total params memory size = 6751.89MB (VRAM 6751.89MB, RAM 0.00MB): text_encoders 1757.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:660 - loading model from './cyberrealisticXL_v60.safetensors' completed, taking 4.54s

(I can also test the PR on ROCm, but it takes a long time to build here 😅 )

$ mbw -t0 -q 1024
AVG Method: MEMCPY Elapsed: 0.13855 MiB: 1024.00000 Copy: 7390.958 MiB/s

Contributor

@wbruna wbruna left a comment

Looking good so far. Could we also get a clock reading between the preparation and the loading phase?
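
A rough sketch of what such a clock reading could look like (the wrapper function and print call are assumptions; `ggml_time_ms()` is ggml's existing millisecond timer):

```cpp
#include <cstdint>
#include <cstdio>

// ggml's millisecond timer, declared in ggml.h.
extern "C" int64_t ggml_time_ms(void);

// Sketch: take a clock reading between the preparation phase
// (preprocess_tensor / dedup) and the parallel loading phase.
void load_tensors_with_phase_timing() {
    int64_t t_prep_start = ggml_time_ms();
    // ... preprocess_tensor / dedup (preparation) ...
    int64_t t_prep_end = ggml_time_ms();
    std::printf("tensor preparation took %.2fs\n",
                (t_prep_end - t_prep_start) / 1000.0f);
    // ... parallel tensor loading starts here ...
}
```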

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

More numbers:

Master:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 6751.89MB (VRAM 0.00MB, RAM 6751.89MB): text_encoders 1757.36MB(RAM), diffusion_model 4900.07MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/ramdisk/RealVisXL_V5.0_fp16.safetensors' completed, taking 4.61s

PR:

[INFO ] stable-diffusion.cpp:641  - total params memory size = 6751.89MB (VRAM 0.00MB, RAM 6751.89MB): text_encoders 1757.36MB(RAM), diffusion_model 4900.07MB(RAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:660  - loading model from '/ramdisk/RealVisXL_V5.0_fp16.safetensors' completed, taking 1.34s

@leejet
Owner

leejet commented Sep 7, 2025

I have merged the changes from the master branch into your branch https://github.com/leejet/stable-diffusion.cpp/commits/ref-tensor-loading/. If you don't mind, I can directly push it to your branch. Here are some of my test results.

Master:

loading tensors completed, taking 6.39s (process: 0.03s, read: 4.68s, memcpy: 0.00s, convert: 0.30s, copy_to_backend: 1.16s)

PR:

loading tensors completed, taking 2.71s (process: 0.02s, read: 1.04s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 1.50s)

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

> If you don't mind, I can directly push it to your branch. Here are some of my test results.

Cool. Sure you can go ahead and push it!

@leejet
Owner

leejet commented Sep 7, 2025

Based on the above test results, I think that multithreading has very little effect on the speed optimization of preprocess_tensor and dedup. I prefer to keep this part using the original single-threaded processing method, and only use multithreading in other parts.

@Green-Sky
Contributor

> Based on the above test results, I think that multithreading has very little effect on the speed optimization of preprocess_tensor and dedup. I prefer to keep this part using the original single-threaded processing method, and only use multithreading in other parts.

Have you compared dedicated model conversion numbers, or just supplying --type with e.g. q5_k?

@leejet
Owner

leejet commented Sep 7, 2025

> Have you compared dedicated model conversion numbers, or just supplying --type with e.g. q5_k?

preprocess_tensor/dedup do not involve type conversion; they only handle tensor names and deduplication. Here is a run that includes type conversion:

loading tensors completed, taking 212.47s (process: 0.02s, read: 8.64s, memcpy: 0.00s, convert: 201.98s, copy_to_backend: 1.63s)

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

Added n_threads override.

| Number of threads | Load time (SDXL model) |
|-------------------|------------------------|
| t = 1             | 4.82s                  |
| t = 2             | 2.62s                  |
| t = 3             | 1.81s                  |
| t = 4             | 1.49s                  |
| t = 5             | 1.30s                  |

Beyond 5 threads we hit the memory bandwidth limit.

@rmatif
Contributor Author

rmatif commented Sep 7, 2025

@leejet the tensor loading stats display seems completely broken when it comes to LoRA:

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.61s (process: 0.41s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size =  375.37 MB(VRAM) (2364 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/dmd2_sdxl_4step_lora_fp16.safetensors
  |==================================================| 2364/2364 - 17.24it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.09s (process: 0.04s, read: 0.02s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.05s)

@wbruna
Contributor

wbruna commented Sep 8, 2025

Cold cache:
[INFO ] model.cpp:2281 - loading tensors completed, taking 16.65s (process: 0.04s, read: 12.23s, memcpy: 0.00s, convert: 0.14s, copy_to_backend: 3.85s)

Hot cache:
[INFO ] model.cpp:2281 - loading tensors completed, taking 2.06s (process: 0.04s, read: 0.71s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.94s)

Tmpfs:
[INFO ] model.cpp:2281 - loading tensors completed, taking 2.04s (process: 0.04s, read: 0.68s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.91s)

As a baseline, this is 1 thread, cold/hot cache:
[INFO ] model.cpp:2281 - loading tensors completed, taking 20.34s (process: 0.12s, read: 14.45s, memcpy: 0.00s, convert: 0.33s, copy_to_backend: 5.00s)
[INFO ] model.cpp:2281 - loading tensors completed, taking 6.73s (process: 0.12s, read: 1.30s, memcpy: 0.00s, convert: 0.26s, copy_to_backend: 4.52s)

Speed peaks at 4 threads here (4-core CPU).

So, it looks like the parallel reads are in fact helping a bit (edit: see below). And depending on the system, a ramdisk could be pointless.

model.cpp Outdated
    return res;
}

-bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
+bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_threads_p) {
    int64_t process_time_ms = 0;
    int64_t read_time_ms    = 0;
    int64_t memcpy_time_ms  = 0;
Contributor

@wbruna wbruna Sep 8, 2025

These are being incremented by each thread, so... should at least be atomic?

And thinking about it, it may make more sense to have the time counters per-thread, and average the results.

Contributor Author

> These are being incremented by each thread, so... should at least be atomic?
>
> And thinking about it, it may make more sense to have the time counters per-thread, and average the results.

On paper it seems like a good idea, but in practice it can be misleading: with 36 threads, dividing by the thread count will bias the average. I'll push it anyway, and if it's not desired I'll revert.

Contributor

The problem with a simple sum is the total would end up bigger than the loading time. Correct, but very confusing :-)

A more meaningful measure could be to weight by time, regardless of the number of threads: something like (total read time on all threads) / (total time on all threads), then multiplying that by the time measured on the main thread, to get the "total read time".
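
A sketch of that weighting, under the assumption that the per-thread accumulators become atomics summed across workers (the names here are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative accumulators, summed across all worker threads.
std::atomic<int64_t> read_time_ms_total{0};   // time spent in reads
std::atomic<int64_t> busy_time_ms_total{0};   // read + convert + copy, all threads

// After the workers join, scale the summed read time so that the reported
// phases add up to the wall-clock loading time measured on the main thread.
int64_t scaled_read_time_ms(int64_t wall_clock_ms) {
    int64_t busy = busy_time_ms_total.load();
    if (busy == 0) {
        return 0;
    }
    double read_fraction = double(read_time_ms_total.load()) / double(busy);
    return int64_t(read_fraction * wall_clock_ms);
}
```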

@wbruna
Contributor

wbruna commented Sep 8, 2025

I tested the condition variable + mutex approach: f9a2adb

The reading speed got a little bit slower:

[INFO ] model.cpp:2324 - loading tensors completed, taking 2.13s (process: 0.05s, read: 0.74s, memcpy: 0.00s, convert: 0.12s, copy_to_backend: 1.04s)

But using serialized reads made it a little bit faster:

[INFO ] model.cpp:2324 - loading tensors completed, taking 1.94s (process: 0.04s, read: 0.69s, memcpy: 0.00s, convert: 0.03s, copy_to_backend: 1.03s)

I didn't fix the time counters for the multi-thread updates, though, so I wouldn't put too much trust in those values.
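
For reference, a serialized-read worker step could look roughly like this (a sketch, not the code from f9a2adb): the file access is guarded by a mutex while the convert/copy_to_backend work stays parallel.

```cpp
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <vector>

std::mutex read_mutex;  // only one thread touches the file at a time

// Hypothetical per-tensor worker step: serialize the read, but keep the
// convert / copy_to_backend work on the buffer fully parallel.
void process_tensor(std::FILE* fp, long offset, size_t nbytes, std::vector<char>& buf) {
    buf.resize(nbytes);
    {
        std::lock_guard<std::mutex> lock(read_mutex);
        std::fseek(fp, offset, SEEK_SET);
        if (std::fread(buf.data(), 1, nbytes, fp) != nbytes) {
            buf.clear();  // read error; real code would report it
            return;
        }
    }
    // ... convert + copy_to_backend on buf, outside the lock ...
}
```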

@rmatif
Contributor Author

rmatif commented Sep 8, 2025

> Based on the above test results, I think that multithreading has very little effect on the speed optimization of preprocess_tensor and dedup. I prefer to keep this part using the original single-threaded processing method, and only use multithreading in other parts.

For LoRA it does:

Before:

[INFO ] stable-diffusion.cpp:848  - attempting to apply 1 LoRAs
[INFO ] model.cpp:1043 - load /ramdisk/NatsukiAoi ag4o.safetensors using safetensors format
[DEBUG] model.cpp:1150 - init from '/ramdisk/NatsukiAoi ag4o.safetensors', prefix = ''
[INFO ] lora.hpp:119  - loading LoRA from '/ramdisk/NatsukiAoi ag4o.safetensors'
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 5.00it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 1.38s (process: 1.17s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size =  324.78 MB(VRAM) (2166 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 4.15it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.30s (process: 0.06s, read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.02s)
[DEBUG] lora.hpp:161  - lora type: ".lora_down"/".lora_up"
[DEBUG] lora.hpp:163  - finished loaded lora
[DEBUG] lora.hpp:860  - (2166 / 2166) LoRA tensors will be applied
[DEBUG] ggml_extend.hpp:1425 - lora compute buffer size: 101.56 MB(VRAM)
[DEBUG] lora.hpp:860  - (2166 / 2166) LoRA tensors will be applied
[INFO ] stable-diffusion.cpp:825  - lora 'NatsukiAoi ag4o' applied, taking 3.16s
[INFO ] stable-diffusion.cpp:868  - apply_loras completed, taking 3.16s

After:

[INFO ] stable-diffusion.cpp:848  - attempting to apply 1 LoRAs
[INFO ] model.cpp:1043 - load /ramdisk/NatsukiAoi ag4o.safetensors using safetensors format
[DEBUG] model.cpp:1150 - init from '/ramdisk/NatsukiAoi ag4o.safetensors', prefix = ''
[INFO ] lora.hpp:120  - loading LoRA from '/ramdisk/NatsukiAoi ag4o.safetensors'
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 17.86it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.10s (process: 0.05s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] ggml_extend.hpp:1597 - lora params backend buffer size =  324.78 MB(VRAM) (2166 tensors)
[DEBUG] model.cpp:2042 - loading tensors from /ramdisk/NatsukiAoi ag4o.safetensors
  |==================================================| 2166/2166 - 16.13it/s

[INFO ] model.cpp:2281 - loading tensors completed, taking 0.10s (process: 0.04s, read: 0.01s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.08s)
[DEBUG] lora.hpp:174  - lora type: ".lora_down"/".lora_up"
[DEBUG] lora.hpp:176  - finished loaded lora
[DEBUG] lora.hpp:873  - (2166 / 2166) LoRA tensors will be applied
[DEBUG] ggml_extend.hpp:1425 - lora compute buffer size: 101.56 MB(VRAM)
[DEBUG] lora.hpp:873  - (2166 / 2166) LoRA tensors will be applied
[INFO ] stable-diffusion.cpp:825  - lora 'NatsukiAoi ag4o' applied, taking 1.58s
[INFO ] stable-diffusion.cpp:868  - apply_loras completed, taking 1.58s

}
pretty_progress(total_tensors_processed + current_idx, total_tensors_to_process, (ggml_time_ms() - t_start) / 1000.0f);
Contributor

Just a reminder, since the original review got marked resolved: the progress here should display the average time per tensor (note the (1000.0f * tensor_count) in the original code).
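
A hedged sketch of the fix being asked for, reusing the variable names from the snippet above (the exact divisor in the original code may differ):

```cpp
// Divide by the number of tensors processed so far, so the progress bar shows
// average time per tensor rather than total elapsed seconds.
size_t done = total_tensors_processed + current_idx;
pretty_progress(done, total_tensors_to_process,
                (ggml_time_ms() - t_start) / (1000.0f * (done > 0 ? done : 1)));
```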
