Conversation


@DajanaV (Contributor) commented on Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17046

fix #17037 #17058
cont #16812

The small SWA caches can be padded to 256 without memory-usage concerns, so this change pads the SWA cache size to 256. This is friendly to the CUDA backend, since its FA implementation benefits from round K/V tensor sizes.
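
For reference, here is a minimal C++ sketch of the rounding involved, assuming a power-of-two pad value (the local `pad_to` helper stands in for ggml's `GGML_PAD` macro; the cell counts below are example numbers, not taken from this PR):

```cpp
#include <cstdint>
#include <cstdio>

// Round x up to the next multiple of n (n must be a power of two).
// This mirrors the kind of rounding GGML_PAD(x, 256) performs.
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    // Assumed example: an SWA cache that would otherwise hold 4097 cells.
    const uint32_t swa_cells = 4097;
    const uint32_t padded    = pad_to(swa_cells, 256); // 4352

    // The padding adds at most 255 cells, which is negligible for the small
    // SWA caches, while the resulting "round" K/V tensor sizes suit the
    // CUDA FA kernels.
    printf("%u -> %u (+%u cells)\n", swa_cells, padded, padded - swa_cells);
    return 0;
}
```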

GGML_CUDA=ON CUDA_VISIBLE_DEVICES=0 ./scripts/compare-commits.sh a8ca18b4b d2c30c61a llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -m /home/ggerganov/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -ngl 99 -d 4096,8192,16384,32768 -ub 512,4096 -b 4096 -fa 1 -n 32 -mmp 0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| Model | Microbatch size | Test | t/s (a8ca18b) | t/s (d2c30c6) | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d4096 | 9702.25 | 9724.85 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d8192 | 8609.17 | 8643.88 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d16384 | 7169.27 | 7224.84 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d32768 | 5349.99 | 5380.43 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d4096 | 306.13 | 338.25 | 1.10 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d8192 | 292.57 | 317.68 | 1.09 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d16384 | 265.66 | 304.90 | 1.15 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d32768 | 236.92 | 262.29 | 1.11 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d4096 | 8720.64 | 8735.30 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d8192 | 7908.52 | 7799.28 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d16384 | 6656.47 | 6583.46 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d32768 | 5063.06 | 4967.41 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d4096 | 296.76 | 318.91 | 1.07 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d8192 | 279.30 | 322.56 | 1.15 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d16384 | 251.35 | 283.65 | 1.13 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d32768 | 227.88 | 253.40 | 1.11 |
| qwen2 3B Q8_0 | 512 | pp512@d4096 | 17229.54 | 17278.24 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d8192 | 14011.21 | 14133.12 | 1.01 |
| qwen2 3B Q8_0 | 512 | pp512@d16384 | 10304.85 | 10307.07 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d32768 | 6612.16 | 6567.63 | 0.99 |
| qwen2 3B Q8_0 | 512 | tg32@d4096 | 279.56 | 294.93 | 1.05 |
| qwen2 3B Q8_0 | 512 | tg32@d8192 | 210.92 | 213.51 | 1.01 |
| qwen2 3B Q8_0 | 512 | tg32@d16384 | 185.15 | 188.19 | 1.02 |
| qwen2 3B Q8_0 | 512 | tg32@d32768 | 147.62 | 149.84 | 1.02 |
| qwen2 3B Q8_0 | 4096 | pp512@d4096 | 16599.01 | 16781.20 | 1.01 |
| qwen2 3B Q8_0 | 4096 | pp512@d8192 | 13664.67 | 13715.42 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d16384 | 10056.36 | 10027.96 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d32768 | 6585.53 | 6579.99 | 1.00 |
| qwen2 3B Q8_0 | 4096 | tg32@d4096 | 274.54 | 279.28 | 1.02 |
| qwen2 3B Q8_0 | 4096 | tg32@d8192 | 219.26 | 224.87 | 1.03 |
| qwen2 3B Q8_0 | 4096 | tg32@d16384 | 188.45 | 192.18 | 1.02 |
| qwen2 3B Q8_0 | 4096 | tg32@d32768 | 148.30 | 150.20 | 1.01 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: KV Cache Padding Optimization

Overview

Pull Request #115 implements KV cache padding to 256-byte boundaries to optimize CUDA Flash Attention performance. While achieving the intended GPU performance improvements (1-15% in token generation), the changes introduce measurable CPU overhead in container operations.

Key Findings

Highest Performance Impact:

  • std::vector<llama_ubatch>::end() shows a 241% Response Time increase (76 ns → 261 ns) and a 340% Throughput degradation (54 ns → 239 ns self-time)
  • This STL iterator function is not part of core inference functions, limiting direct impact on tokens per second

Core Function Impact Assessment:

  • No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize)
  • The performance degradation affects batch container operations rather than primary inference paths
  • Based on the reference model (a 2 ms llama_decode slowdown corresponds to roughly a 7% tokens-per-second reduction), the current changes should not significantly impact inference throughput, since the core functions remain unchanged

Power Consumption Analysis:

  • build.bin.libllama.so: +0.025% increase (+71 nJ)
  • All other binaries show 0% change
  • Minimal overall system power impact, indicating localized performance effects

Technical Root Cause:

  • Flame Graph Analysis: 92% of execution time is concentrated in the end() function itself, indicating computational overhead rather than delegation issues
  • CFG Comparison: Additional branching overhead and a 64 KB memory-layout shift affect instruction cache efficiency
  • Code Changes: Context-size padding (GGML_PAD(cparams.n_ctx, 256)) increases container sizes, leading to higher iterator-calculation costs

GitHub Code Review Insights:

  • Changes successfully implement intended CUDA optimizations
  • Memory alignment improvements benefit Flash Attention operations
  • Container size modifications create larger memory footprints affecting cache locality

Actionable Recommendations

  1. Conditional Padding: Implement backend-specific padding (CUDA vs CPU) to avoid CPU overhead when GPU acceleration is unavailable (see the sketch after this list)
  2. Container Optimization: Cache container size calculations to reduce iterator arithmetic overhead
  3. Memory Layout: Profile instruction cache performance with the new 64KB memory shift to optimize placement
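
As a sketch of recommendation 1 above, the following hypothetical helper pads aggressively only when the KV cache lives on a CUDA device; the function names, the backend flag, and the 32-cell CPU granularity are illustrative assumptions, not code from this PR:

```cpp
#include <cstdint>

// Round x up to the next multiple of n (n must be a power of two).
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Hypothetical policy: use the FA-friendly 256-cell granularity only when a
// CUDA backend will hold the KV cache, and a smaller granularity otherwise,
// so CPU-side containers stay close to the requested size.
static uint32_t kv_cache_padding(bool cuda_backend) {
    return cuda_backend ? 256u : 32u; // assumed values for illustration
}

static uint32_t padded_cache_size(uint32_t n_ctx, bool cuda_backend) {
    return pad_to(n_ctx, kv_cache_padding(cuda_backend));
}
```

The intent is simply to keep the CUDA gains from round K/V sizes while letting CPU-only builds avoid the extra container growth described above.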

The changes achieve their CUDA performance goals while introducing acceptable CPU overhead in non-critical container operations.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: KV Cache Padding Optimization

Overview

Pull Request #115 implements KV cache padding to 256-byte boundaries for GPU performance optimization, particularly benefiting CUDA Flash Attention implementations. While the changes target performance improvements, analysis reveals an unrelated performance regression in container operations.

Key Findings

Performance Regression Identified:

  • std::vector<llama_ubatch>::end() shows 241% Response Time increase (76 ns → 261 ns) and 340% Throughput degradation (54 ns → 239 ns self-time)
  • This function is not part of core inference paths (llama_decode, llama_encode, llama_tokenize), so tokens per second performance remains unaffected

Power Consumption Analysis:

  • Minimal impact across binaries: only build.bin.libllama.so shows 0.025% increase (280,780 nJ → 280,851 nJ)
  • All other binaries show no measurable power consumption changes
  • Overall energy efficiency impact is negligible

Flame Graph and CFG Analysis:

  • Root cause identified as compiler optimization regression in entry block (809% execution time increase)
  • Memory access pattern changes: Additional load instruction in stack validation adds measurable latency
  • Control flow consolidation: Code reorganization trades branch prediction efficiency for increased per-block execution time
  • Issue stems from assembly-level changes rather than algorithmic modifications

GitHub Code Review Assessment:

  • Well-implemented padding strategy: Consistent 256-byte alignment across context and KV cache systems
  • Memory trade-off acceptable: Padding increases memory usage by up to 255 elements but enables significant GPU performance gains
  • Backward compatibility maintained: API contracts preserved while improving hardware alignment

Core Function Impact:

  • No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize)
  • Padding modifications affect memory allocation patterns but not execution paths
  • Container operation regression is isolated and doesn't impact inference throughput

Actionable Recommendations

  1. Investigate compiler optimization settings for the affected vector operations to resolve the 241% performance regression
  2. Review memory alignment changes in std::vector<llama_ubatch> container usage patterns
  3. Validate GPU performance improvements materialize in target CUDA workloads as intended by the padding optimization

The core KV cache padding changes represent sound optimization for GPU acceleration, while the container performance issue requires separate compiler-level investigation.

@DajanaV force-pushed the main branch 19 times, most recently from 0ad40ce to 0fa8f01 on November 10, 2025 at 09:10
@DajanaV force-pushed the main branch 30 times, most recently from 9ea0205 to 1308d3f on November 14, 2025 at 08:11