UPSTREAM PR #17046: kv-cache : pad the size of the small SWA cache for performance #115
Conversation
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: KV Cache Padding Optimization

Overview

Pull Request #115 pads the SWA KV cache size to a multiple of 256 to optimize CUDA Flash Attention performance. While the change achieves the intended GPU performance improvements (1-15% in token generation), it introduces measurable CPU overhead in container operations.

Key Findings

Highest Performance Impact:
Core Function Impact Assessment:
Power Consumption Analysis:
Technical Root Cause:
GitHub Code Review Insights:
Actionable Recommendations
The changes achieve their CUDA performance goals while introducing acceptable CPU overhead in non-critical container operations.
d2c30c6 to 7b2551a
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: KV Cache Padding Optimization

Overview

Pull Request #115 pads the SWA KV cache size to a multiple of 256 for GPU performance, particularly benefiting the CUDA Flash Attention implementation. While the change targets performance improvements, analysis reveals an unrelated performance regression in container operations.

Key Findings

Performance Regression Identified:
Power Consumption Analysis:
Flame Graph and CFG Analysis:
GitHub Code Review Assessment:
Core Function Impact:
Actionable Recommendations
The core KV cache padding change is a sound optimization for GPU acceleration, while the container performance issue requires separate compiler-level investigation.
0ad40ce to 0fa8f01
9ea0205 to 1308d3f
Mirrored from ggml-org/llama.cpp#17046
fix #17037 #17058
cont #16812
The small SWA caches can be padded to 256 without concerns about memory usage. Pad the SWA cache size to 256. This is friendly for the CUDA backend, since the FA implementation benefits from round sizes of the K/V tensors.

```
GGML_CUDA=ON CUDA_VISIBLE_DEVICES=0 ./scripts/compare-commits.sh a8ca18b4b d2c30c61a llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -m /home/ggerganov/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -ngl 99 -d 4096,8192,16384,32768 -ub 512,4096 -b 4096 -fa 1 -n 32 -mmp 0
```

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
```
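As a rough sketch of the padding described above (not the actual patch), rounding a requested cache size up to the next multiple of 256 can be done with GGML_PAD-style bit arithmetic; the helper name `pad_kv_cache_size` below is hypothetical.

```cpp
#include <cstdint>
#include <cstdio>

// Round `size` up to the next multiple of `pad` (pad must be a power of two).
// This mirrors the GGML_PAD-style rounding used in llama.cpp/ggml; the
// function name is illustrative, not the one used in the actual patch.
static uint32_t pad_kv_cache_size(uint32_t size, uint32_t pad = 256) {
    return (size + pad - 1) & ~(pad - 1);
}

int main() {
    // Example: a small SWA cache request is rounded up so the K/V tensors
    // have "round" sizes, which the CUDA FA kernels handle more efficiently.
    const uint32_t requested[] = { 1, 300, 4097, 8192 };
    for (uint32_t n : requested) {
        printf("requested = %5u -> padded = %5u\n", n, pad_kv_cache_size(n));
    }
    return 0;
}
```

For a power-of-two pad such as 256, the mask form avoids a division and matches how ggml's GGML_PAD macro rounds buffer and tensor sizes.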