gabrieldemarmiesse · 2026-05-29T09:44:44Z

Type of change

Bug fix (non-breaking change that fixes an issue)
Performance improvement (includes benchmark results below)
Documentation update
New feature or public API (requires prior proposal or issue approval)
Refactor / internal cleanup (no user-visible change)
Build, CI, or tooling change

Motivation

What changed

`bench_kv_cache_ragged_flash_attention` computed `4 * num_q_heads * (cache+seq) * seq * head_dim` in UInt32 before casting to Int. At large shapes this intermediate exceeds 2³² and wraps, so the benchmark reports bogus throughput. For example, a bs=8, seq=2048, cache=2048 prefill landed exactly on a multiple of 2³² and printed 0.0 GFLOPS/s. The product is now computed in Int, so the FLOP count no longer overflows.
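The wraparound can be reproduced outside the benchmark. The following is a minimal Python sketch, not the Mojo benchmark code: it emulates 32-bit unsigned arithmetic by masking each intermediate product. The values `num_q_heads = 32` and `head_dim = 128` are assumptions chosen for illustration (the PR does not state them), and the helper names are hypothetical. Batch size is left out because it does not appear in the quoted expression.

```python
UINT32_MASK = 0xFFFFFFFF  # 2**32 - 1


def flops_uint32(num_q_heads, cache_len, seq_len, head_dim):
    """Emulate the product with every intermediate wrapped to 32 bits."""
    acc = 4
    for factor in (num_q_heads, cache_len + seq_len, seq_len, head_dim):
        acc = (acc * factor) & UINT32_MASK  # unsigned 32-bit wraparound
    return acc


def flops_wide(num_q_heads, cache_len, seq_len, head_dim):
    """Same product without wrapping (Python ints are arbitrary precision)."""
    return 4 * num_q_heads * (cache_len + seq_len) * seq_len * head_dim


# Assumed model shape; only seq and cache come from the PR description.
num_q_heads, head_dim = 32, 128
cache_len = seq_len = 2048

wrapped = flops_uint32(num_q_heads, cache_len, seq_len, head_dim)
exact = flops_wide(num_q_heads, cache_len, seq_len, head_dim)

print(wrapped)  # 0
print(exact)    # 137438953472, i.e. 2**37
```

Under these assumed values every factor is a power of two, so the exact product is 2³⁷ and the wrapped result is exactly 0, which matches the 0.0 GFLOPS/s symptom. A decode-like shape with `seq_len = 1` stays far below 2³² and gives the same value in both helpers.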

Testing

After the fix, the same shape reports ~529 TFLOP/s (~53% of H100 bf16 peak). Decode shapes, whose products are small and never overflowed, are unchanged.

Checklist

The linked issue above has been reviewed by a maintainer and is agreed-upon, or this is a trivial fix that does not need prior approval
PR is small and focused: I've split larger changes into a sequence of smaller PRs where possible (see pull request sizes)
I ran ./bazelw run format to format my changes
I added or updated tests to cover my changes
If AI tools assisted with this contribution, I have included an Assisted-by: trailer in my commit message or this PR description (see AI Tool Use Policy)

Assisted-by Claude

BEGIN_PUBLIC
[Kernels] Fix UInt32 overflow in flash attention benchmark FLOP count

The FLOP accounting in bench_kv_cache_ragged_flash_attention computed `4 * num_q_heads * (cache+seq) * seq * head_dim` entirely in UInt32 before casting to Int. At large shapes this intermediate exceeds 2^32 and wraps around, reporting a bogus throughput (e.g. exactly 0.0 GFLOPS/s when the product lands on a multiple of 2^32).

Compute the product in Int so the FLOP count no longer overflows. Small (decode) shapes are unaffected; large (prefill) shapes now report correct throughput.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Gabriel <[email protected]>

github-actions Bot added the waiting-on-review label May 29, 2026

gabrieldemarmiesse marked this pull request as ready for review May 29, 2026 09:45

gabrieldemarmiesse requested a review from a team as a code owner May 29, 2026 09:45

[Kernels] Fix UInt32 overflow in flash attention benchmark FLOP count #6623

gabrieldemarmiesse wants to merge 1 commit into modular:main from gabrieldemarmiesse:fix-flash-attn-bench-flop-overflow
