Vulkan k-quant mmq and ggml-backend offload functionality #6155
Conversation
Clean up cpu assist code, replaced with ggml-backend offload function
Note: this is on master, I was trying to test it there first as a baseline, but the same happens on the PR. I have a 3080 and a 3090 Ti, and I am having some trouble getting ggml-vulkan to use the 3090 Ti. It looks like it only finds the 3080.
@slaren Because Vulkan supports a lot more devices, I haven't settled on a multi-GPU default yet. Currently it just uses the first device by default; if you want more, you need to set the corresponding environment variable. We could use this chance to discuss and set a better default as well. I guess all dedicated GPUs would be sane?
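The "all dedicated GPUs, with integrated as a fallback" policy discussed here can be sketched in plain C++ over mock device descriptors. The struct and names below are illustrative stand-ins, not the actual ggml-vulkan code (the real implementation would query `VkPhysicalDeviceType` via `vkGetPhysicalDeviceProperties`):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mirror of VkPhysicalDeviceType; the real enum lives in vulkan.h.
enum class DeviceType { Discrete, Integrated, Other };

struct DeviceInfo {
    std::string name;
    DeviceType  type;
};

// Default policy sketch: use every discrete GPU; if none exist, fall back
// to the integrated GPUs; otherwise return an empty selection.
std::vector<size_t> default_device_selection(const std::vector<DeviceInfo>& devices) {
    std::vector<size_t> discrete, integrated;
    for (size_t i = 0; i < devices.size(); ++i) {
        if (devices[i].type == DeviceType::Discrete)   discrete.push_back(i);
        if (devices[i].type == DeviceType::Integrated) integrated.push_back(i);
    }
    return discrete.empty() ? integrated : discrete;
}
```

With this policy, a system with a 3080 and a 3090 Ti would select both cards by default, while a laptop with only an iGPU would still get a working device.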
Sorry, I had forgotten about
build: d0d5de4 (2464)
build: 86386e2 (2460)
I would think so. It will still take a while, but in the future I would like to improve the device selection, allow the users to select the devices that they want to use by name, and improve the defaults. It is already possible to select what devices to use with
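Selecting devices via an environment variable usually comes down to parsing a comma-separated index list. A minimal sketch of that parsing step (the `"0,2"` format and the function name are assumptions for illustration; the real selection mechanism may differ):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse a comma-separated device index list such as "0,2" into indices.
// Empty segments are skipped, so "0,,2" parses the same as "0,2".
std::vector<int> parse_device_list(const std::string& value) {
    std::vector<int> out;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            out.push_back(std::stoi(item));
        }
    }
    return out;
}
```

Selecting by name instead of by index would layer a lookup over the enumerated device list on top of the same parsing.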
I found two issues that I have to fix before this can get merged:
I'll look into these problems over the next few days.
Thanks for this update! Good news: restarting is fixed!
With this PR, k-quant performance on my card has almost reached, and in q4_k_m even surpassed, ROCm's speed. RAM usage is slightly higher than on master and still higher than with ROCm. Model: Llama 2. GPU: AMD Radeon RX 5700 XT (RADV NAVI10)
Here are my numbers (only listing the quants where the pp result changed):
And a couple of tests with -ngl 0:
Fix validation issue
@slaren Do you have an idea why test-backend-ops doesn't build with Vulkan support with CMake? It works fine with cuBLAS.
Maybe
Yeah, it was not populated yet. Thanks. |
Interestingly enough, I'm actually seeing 25% slower prompt processing speed with this PR compared to master. This only happens on the K-quants; inference speed remains the same. Maybe this is an architecture thing, but it could also be because I'm running in fp32 mode, whereas the other commenters all have fp16-capable cards.
I fixed GET_ROWS and pulled upstream changes, seems to work. There might be an issue with f16, but that's not that important. I just have to fix UMA, then this PR can be merged.
Thanks for testing this. I can confirm it's related to fp16 on AMD (GCN?) GPUs. I can only guess it's related to register pressure, since float32 uses twice the space for its variables. The k-quants need more registers to dequantize, and now that happens in the same shader as the matrix multiplication itself. I'll have to take a look in the future on whether I can mitigate that.
Since I reported previously on memory consumption, adding a console log with the latest changes. At the moment both RAM and VRAM usage are significantly higher than with CLBlast (which is missing a backend and some operations): around 400MB more VRAM and 4GB more RAM. At the same time, the main benefits of Vulkan now are higher prompt processing speed and more stable generation speed over time.
Thanks for the report, I don't pay enough attention to RAM use since my development server has 128GB. If Vulkan uses 4GB more RAM for the same number of offloaded layers, it's probably some issue with the staging buffers. I'll take a look at some point.
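For context, a staging buffer is a host-visible bounce buffer used to copy data into device-local memory; if one is allocated per tensor at full tensor size, host RAM grows with the model. Uploading in chunks through one fixed-size staging buffer keeps host RAM constant. The sketch below is purely illustrative: the `memcpy` into `device` stands in for a mapped-memory write followed by a `vkCmdCopyBuffer` submission, and none of the names come from the actual ggml-vulkan code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Upload `n` bytes through a fixed-size host-visible staging buffer in
// chunks, so host RAM use stays bounded by the staging buffer size
// regardless of tensor size. `device` models device-local memory here.
void staged_upload(const unsigned char* src, size_t n,
                   std::vector<unsigned char>& device,
                   std::vector<unsigned char>& staging) {
    device.resize(n);
    for (size_t off = 0; off < n; off += staging.size()) {
        const size_t chunk = std::min(staging.size(), n - off);
        std::memcpy(staging.data(), src + off, chunk);            // host -> staging
        std::memcpy(device.data() + off, staging.data(), chunk);  // staging -> "device"
    }
}
```

The trade-off is extra copy submissions per tensor versus a flat, predictable host-memory footprint.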
I checked again and it seems that Vulkan UMA has not regressed. It works fine if you put no tensors or all tensors on GPU, but slows down if you do anything else. I should fix that in the future, but it is not necessary to hold back this PR for that reason, since master is in the same state. I'll wait for the CI checks and merge this afterwards.
* Fix Vulkan no kv offload incoherence
* Add k-quant mul mat mat shaders
* Rework working buffer allocation, reduces vram use noticeably
* Clean up cpu assist code, replaced with ggml-backend offload function
* Default to all dedicated GPUs
* Add fallback for integrated GPUs if no dedicated GPUs are found
* Add debug info which device is allocating memory
* Fix Intel dequant issue
* Fix validation issue
* Fix Vulkan GGML_OP_GET_ROWS implementation
* Clean up merge artifacts
* Remove Vulkan warning
I added k-quant mmq shaders and cleaned up the cpu-assist functions, as they are now replaced by the ggml-backend offload code.
I also reworked the working buffer allocation code; it should now use noticeably less VRAM.
Should hopefully fix #5848
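The working-buffer rework can be illustrated with a grow-only scratch buffer that is reused across operations, so peak memory tracks the single largest request instead of the sum of per-op allocations. This is a conceptual sketch, not the actual ggml-vulkan implementation (a host `std::vector` stands in for a device allocation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Grow-only working buffer: each request reuses the existing allocation
// when it is large enough, and grows it otherwise. Peak usage equals the
// largest single request, not the sum of all requests.
class WorkingBuffer {
    std::vector<unsigned char> data_;  // stand-in for a device-local allocation
public:
    void* request(size_t size) {
        if (size > data_.size()) {
            data_.resize(size);
        }
        return data_.data();
    }
    size_t capacity() const { return data_.size(); }
};
```

Usage: requesting 100 bytes, then 50, leaves capacity at 100; only a later request above 100 grows the buffer again.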