
[Kernels] Fix UInt32 overflow in flash attention benchmark FLOP count#6623

Open
gabrieldemarmiesse wants to merge 1 commit into
modular:main from
gabrieldemarmiesse:fix-flash-attn-bench-flop-overflow

@gabrieldemarmiesse gabrieldemarmiesse commented May 29, 2026

Contributor

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • Performance improvement (includes benchmark results below)
  • Documentation update
  • New feature or public API (requires prior proposal or issue approval)
  • Refactor / internal cleanup (no user-visible change)
  • Build, CI, or tooling change

Motivation

What changed

bench_kv_cache_ragged_flash_attention computed 4 * num_q_heads * (cache+seq) * seq * head_dim in UInt32 before casting to Int. At large shapes this intermediate exceeds 2³² and wraps modulo 2³², so the benchmark reports bogus throughput. For example, a bs=8, seq=2048, cache=2048 prefill landed exactly on a multiple of 2³² and printed 0.0 GFLOPS/s.
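The wraparound can be reproduced outside the benchmark. The following is a minimal Python sketch, not the benchmark's Mojo code: it emulates UInt32 arithmetic by masking every intermediate to 32 bits. The helper names are hypothetical, and `num_q_heads=32` and `head_dim=128` are assumed values, since the description above does not state them. Only the formula and the seq/cache lengths come from the description.

```python
UINT32_MASK = (1 << 32) - 1


def flops_uint32(num_q_heads, cache_len, seq_len, head_dim):
    """Evaluate the FLOP formula with every intermediate wrapped to 32 bits."""
    acc = 4
    for factor in (num_q_heads, cache_len + seq_len, seq_len, head_dim):
        acc = (acc * factor) & UINT32_MASK  # emulate unsigned 32-bit wraparound
    return acc


def flops_wide(num_q_heads, cache_len, seq_len, head_dim):
    """Evaluate the same formula without wrapping (Python ints do not overflow)."""
    return 4 * num_q_heads * (cache_len + seq_len) * seq_len * head_dim


# Assumed head configuration (hypothetical, not taken from the PR).
prefill = dict(num_q_heads=32, cache_len=2048, seq_len=2048, head_dim=128)
decode = dict(num_q_heads=32, cache_len=2048, seq_len=1, head_dim=128)

print(flops_wide(**prefill))    # 137438953472, which is 2**37
print(flops_uint32(**prefill))  # 0, because 2**37 is a multiple of 2**32
print(flops_uint32(**decode) == flops_wide(**decode))  # True: small product, no wrap
```

Under these assumptions the prefill product is 2³⁷, so the wrapped value is exactly zero, which is consistent with the 0.0 reading described above. With seq=2048, cache=2048 and head_dim=128 the product is a multiple of 2³² for any num_q_heads.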

Testing

After the fix the same shape reports ~529 TFLOP/s (~53% of H100 bf16 peak); decode shapes (small products, never overflowed) are unchanged.

Checklist

  • The linked issue above has been reviewed by a maintainer and is
    agreed-upon, or this is a trivial fix that does not need prior
    approval
  • PR is small and focused — I've split larger changes into a sequence of
    smaller PRs where possible (see
    pull request sizes)
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description (see
    AI Tool Use Policy)
    Assisted-by Claude

BEGIN_PUBLIC
[Kernels] Fix UInt32 overflow in flash attention benchmark FLOP count

The FLOP accounting in bench_kv_cache_ragged_flash_attention computed
`4 * num_q_heads * (cache+seq) * seq * head_dim` entirely in UInt32
before casting to Int. At large shapes this intermediate exceeds 2^32
and wraps around, reporting a bogus throughput (e.g. exactly 0.0
GFLOPS/s when the product lands on a multiple of 2^32).

Compute the product in Int so the FLOP count no longer overflows.
Small (decode) shapes are unaffected; large (prefill) shapes now report
correct throughput.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Gabriel <[email protected]>
@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as ready for review May 29, 2026 09:45
@gabrieldemarmiesse gabrieldemarmiesse requested a review from a team as a code owner May 29, 2026 09:45