Conversation


@DajanaV (Contributor) commented on Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17046

fix #17037 #17058
cont #16812

The small SWA caches can be padded to 256 without memory-usage concerns, so this change pads the SWA cache size to 256. This is friendly to the CUDA backend, since its FA implementation benefits from round K/V tensor sizes.
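
For reference, here is a minimal C++ sketch of the rounding involved, assuming a power-of-two pad value (the local `pad_to` helper stands in for ggml's `GGML_PAD` macro; the cell counts below are example numbers, not taken from this PR):

```cpp
#include <cstdint>
#include <cstdio>

// Round x up to the next multiple of n (n must be a power of two).
// This mirrors the kind of rounding GGML_PAD(x, 256) performs.
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    // Assumed example: an SWA cache that would otherwise hold 4097 cells.
    const uint32_t swa_cells = 4097;
    const uint32_t padded    = pad_to(swa_cells, 256); // 4352

    // The padding adds at most 255 cells, which is negligible for the small
    // SWA caches, while the resulting "round" K/V tensor sizes suit the
    // CUDA FA kernels.
    printf("%u -> %u (+%u cells)\n", swa_cells, padded, padded - swa_cells);
    return 0;
}
```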

GGML_CUDA=ON CUDA_VISIBLE_DEVICES=0 ./scripts/compare-commits.sh a8ca18b4b d2c30c61a llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -m /home/ggerganov/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -ngl 99 -d 4096,8192,16384,32768 -ub 512,4096 -b 4096 -fa 1 -n 32 -mmp 0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| Model | Microbatch size | Test | t/s (a8ca18b) | t/s (d2c30c6) | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d4096 | 9702.25 | 9724.85 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d8192 | 8609.17 | 8643.88 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d16384 | 7169.27 | 7224.84 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 512 | pp512@d32768 | 5349.99 | 5380.43 | 1.01 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d4096 | 306.13 | 338.25 | 1.10 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d8192 | 292.57 | 317.68 | 1.09 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d16384 | 265.66 | 304.90 | 1.15 |
| gpt-oss 20B MXFP4 MoE | 512 | tg32@d32768 | 236.92 | 262.29 | 1.11 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d4096 | 8720.64 | 8735.30 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d8192 | 7908.52 | 7799.28 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d16384 | 6656.47 | 6583.46 | 0.99 |
| gpt-oss 20B MXFP4 MoE | 4096 | pp512@d32768 | 5063.06 | 4967.41 | 0.98 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d4096 | 296.76 | 318.91 | 1.07 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d8192 | 279.30 | 322.56 | 1.15 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d16384 | 251.35 | 283.65 | 1.13 |
| gpt-oss 20B MXFP4 MoE | 4096 | tg32@d32768 | 227.88 | 253.40 | 1.11 |
| qwen2 3B Q8_0 | 512 | pp512@d4096 | 17229.54 | 17278.24 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d8192 | 14011.21 | 14133.12 | 1.01 |
| qwen2 3B Q8_0 | 512 | pp512@d16384 | 10304.85 | 10307.07 | 1.00 |
| qwen2 3B Q8_0 | 512 | pp512@d32768 | 6612.16 | 6567.63 | 0.99 |
| qwen2 3B Q8_0 | 512 | tg32@d4096 | 279.56 | 294.93 | 1.05 |
| qwen2 3B Q8_0 | 512 | tg32@d8192 | 210.92 | 213.51 | 1.01 |
| qwen2 3B Q8_0 | 512 | tg32@d16384 | 185.15 | 188.19 | 1.02 |
| qwen2 3B Q8_0 | 512 | tg32@d32768 | 147.62 | 149.84 | 1.02 |
| qwen2 3B Q8_0 | 4096 | pp512@d4096 | 16599.01 | 16781.20 | 1.01 |
| qwen2 3B Q8_0 | 4096 | pp512@d8192 | 13664.67 | 13715.42 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d16384 | 10056.36 | 10027.96 | 1.00 |
| qwen2 3B Q8_0 | 4096 | pp512@d32768 | 6585.53 | 6579.99 | 1.00 |
| qwen2 3B Q8_0 | 4096 | tg32@d4096 | 274.54 | 279.28 | 1.02 |
| qwen2 3B Q8_0 | 4096 | tg32@d8192 | 219.26 | 224.87 | 1.03 |
| qwen2 3B Q8_0 | 4096 | tg32@d16384 | 188.45 | 192.18 | 1.02 |
| qwen2 3B Q8_0 | 4096 | tg32@d32768 | 148.30 | 150.20 | 1.01 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: KV Cache Padding Optimization

Overview

Pull Request #115 implements KV cache padding to 256-byte boundaries to optimize CUDA Flash Attention performance. While achieving the intended GPU performance improvements (1-15% in token generation), the changes introduce measurable CPU overhead in container operations.

Key Findings

Highest Performance Impact:

  • std::vector<llama_ubatch>::end() shows a 241% Response Time increase (76 ns → 261 ns) and a 340% Throughput degradation (54 ns → 239 ns self-time)
  • This STL iterator function is not part of core inference functions, limiting direct impact on tokens per second

Core Function Impact Assessment:

  • No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize)
  • The performance degradation affects batch container operations rather than primary inference paths
  • Based on the reference model (a 2 ms llama_decode slowdown corresponds to roughly a 7% tokens-per-second reduction), the current changes should not significantly impact inference throughput, since the core functions remain unchanged

Power Consumption Analysis:

  • build.bin.libllama.so: +0.025% increase (+71 nJ)
  • All other binaries show 0% change
  • Minimal overall system power impact, indicating localized performance effects

Technical Root Cause:

  • Flame Graph Analysis: 92% of execution time is concentrated in the end() function itself, indicating computational overhead rather than delegation issues
  • CFG Comparison: Additional branching overhead and a 64 KB memory-layout shift affect instruction cache efficiency
  • Code Changes: Context-size padding (GGML_PAD(cparams.n_ctx, 256)) increases container sizes, leading to higher iterator-calculation costs

GitHub Code Review Insights:

  • Changes successfully implement intended CUDA optimizations
  • Memory alignment improvements benefit Flash Attention operations
  • Container size modifications create larger memory footprints affecting cache locality

Actionable Recommendations

  1. Conditional Padding: Implement backend-specific padding (CUDA vs CPU) to avoid CPU overhead when GPU acceleration is unavailable (see the sketch after this list)
  2. Container Optimization: Cache container size calculations to reduce iterator arithmetic overhead
  3. Memory Layout: Profile instruction cache performance with the new 64KB memory shift to optimize placement
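
As a sketch of recommendation 1 above, the following hypothetical helper pads aggressively only when the KV cache lives on a CUDA device; the function names, the backend flag, and the 32-cell CPU granularity are illustrative assumptions, not code from this PR:

```cpp
#include <cstdint>

// Round x up to the next multiple of n (n must be a power of two).
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Hypothetical policy: use the FA-friendly 256-cell granularity only when a
// CUDA backend will hold the KV cache, and a smaller granularity otherwise,
// so CPU-side containers stay close to the requested size.
static uint32_t kv_cache_padding(bool cuda_backend) {
    return cuda_backend ? 256u : 32u; // assumed values for illustration
}

static uint32_t padded_cache_size(uint32_t n_ctx, bool cuda_backend) {
    return pad_to(n_ctx, kv_cache_padding(cuda_backend));
}
```

The intent is simply to keep the CUDA gains from round K/V sizes while letting CPU-only builds avoid the extra container growth described above.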

The changes achieve their CUDA performance goals while introducing acceptable CPU overhead in non-critical container operations.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: KV Cache Padding Optimization

Overview

Pull Request #115 implements KV cache padding to 256-byte boundaries for GPU performance optimization, particularly benefiting CUDA Flash Attention implementations. While the changes target performance improvements, analysis reveals an unrelated performance regression in container operations.

Key Findings

Performance Regression Identified:

  • std::vector<llama_ubatch>::end() shows 241% Response Time increase (76 ns → 261 ns) and 340% Throughput degradation (54 ns → 239 ns self-time)
  • This function is not part of core inference paths (llama_decode, llama_encode, llama_tokenize), so tokens per second performance remains unaffected

Power Consumption Analysis:

  • Minimal impact across binaries: only build.bin.libllama.so shows 0.025% increase (280,780 nJ → 280,851 nJ)
  • All other binaries show no measurable power consumption changes
  • Overall energy efficiency impact is negligible

Flame Graph and CFG Analysis:

  • Root cause identified as compiler optimization regression in entry block (809% execution time increase)
  • Memory access pattern changes: Additional load instruction in stack validation adds measurable latency
  • Control flow consolidation: Code reorganization trades branch prediction efficiency for increased per-block execution time
  • Issue stems from assembly-level changes rather than algorithmic modifications

GitHub Code Review Assessment:

  • Well-implemented padding strategy: Consistent 256-byte alignment across context and KV cache systems
  • Memory trade-off acceptable: Padding increases memory usage by up to 255 elements but enables significant GPU performance gains
  • Backward compatibility maintained: API contracts preserved while improving hardware alignment

Core Function Impact:

  • No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize)
  • Padding modifications affect memory allocation patterns but not execution paths
  • Container operation regression is isolated and doesn't impact inference throughput

Actionable Recommendations

  1. Investigate compiler optimization settings for the affected vector operations to resolve the 241% performance regression
  2. Review memory alignment changes in std::vector<llama_ubatch> container usage patterns
  3. Validate GPU performance improvements materialize in target CUDA workloads as intended by the padding optimization

The core KV cache padding changes represent sound optimization for GPU acceleration, while the container performance issue requires separate compiler-level investigation.

@DajanaV force-pushed the main branch 19 times, most recently from 0ad40ce to 0fa8f01 on November 10, 2025 at 09:10
@DajanaV force-pushed the main branch 30 times, most recently from 9ea0205 to 1308d3f on November 14, 2025 at 08:11