[Feat] Add GQA/MHA FlashAttention-3 baseline #56
Conversation
Summary of Changes

Hello @RMLYC, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request upgrades the benchmarking infrastructure by introducing a dedicated mechanism for comparing custom operations against established baselines. Specifically, it integrates FlashAttention-3 baselines for GQA and MHA, providing a robust way to measure performance improvements. The changes also refine the output reporting to cover forward- and backward-pass metrics for both the primary operation and its baseline.

Highlights

- New baseline_program interface in Benchmark.py
- New baseline_profile interface in Benchmark.py
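The review below refers to these two hooks repeatedly. As a rough orientation only, here is a minimal sketch of how such a hook pair could sit in a benchmark base class; the class shape, the methods other than baseline_program/baseline_profile, and the signatures are assumptions for illustration, not the PR's actual Benchmark.py code.

```python
class Benchmark:
    """Illustrative sketch only; the real Benchmark.py interface may differ."""

    def program(self, *inputs):
        """The custom operator under test (the 'tl-' metrics in the CSV)."""
        raise NotImplementedError

    def baseline_program(self, *inputs):
        """Reference implementation, e.g. a FlashAttention-3 call."""
        raise NotImplementedError

    def profile(self, warmup: int = 10, rep: int = 100) -> dict:
        """Measure self.program and return latency/TFLOPs/bandwidth metrics."""
        raise NotImplementedError

    def baseline_profile(self, warmup: int = 10, rep: int = 100) -> dict:
        """Apply the same measurement protocol to self.baseline_program."""
        raise NotImplementedError
```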
Code Review
This pull request introduces a baseline profiling mechanism for GQA and MHA operators using FlashAttention-3. The changes include adding baseline_profile and baseline_program to the benchmarking framework, implementing them for GQA and MHA, and updating the profiling script to capture and display these new metrics. My review has identified a few critical issues related to incorrect return values in the new baseline programs and a bug in the profiling script's error handling. I've also suggested an improvement to ensure profiling consistency by using torch.no_grad().
benchmarks/flash_attn/gqa.py (Outdated)
```python
if isinstance(out, tuple):
    out = out[0]

return out
```
benchmarks/flash_attn/mha.py (Outdated)
```python
if isinstance(out, tuple):
    out = out[0]

return out
```
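For context on the two snippets above: both baseline programs appear to normalize the return value of a FlashAttention call before handing it to the benchmark. A minimal sketch of what such a baseline might look like is shown below; the import path, function signature, and tensor layout are assumptions about the FlashAttention-3 package (some releases return an (out, softmax_lse) tuple, which is what the isinstance check guards against), not the code from this PR.

```python
import torch

# Assumption: FlashAttention-3's interface; depending on the release,
# flash_attn_func returns either the output tensor or (out, softmax_lse).
from flash_attn_interface import flash_attn_func


def fa3_attention_baseline(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           causal: bool = True) -> torch.Tensor:
    """Hypothetical GQA/MHA baseline.

    q: (batch, seqlen, heads_q, head_dim); k, v: (batch, seqlen, heads_kv,
    head_dim), with heads_q a multiple of heads_kv (heads_q == heads_kv for MHA).
    """
    out = flash_attn_func(q, k, v, causal=causal)
    # Normalize so the benchmark always receives a plain tensor.
    if isinstance(out, tuple):
        out = out[0]
    return out
```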
tests/profile_run.py (Outdated)
```python
error_result = {
    **params, 'tl-latency(ms)': None,
    'tl-TFlops': None,
    'tl-Bandwidth(GB/s)': None,
    'Baseline-latency(ms)': None,
    'Baseline-TFlops': None,
    'Baseline-Bandwidth(GB/s)': None
}
```
The keys in error_result do not match the new fieldnames defined for the CSV output. The fieldnames now include prefixes like fwd- and bwd- (e.g., 'fwd-tl-latency(ms)'), but error_result uses keys without these prefixes (e.g., 'tl-latency(ms)'). This will cause issues when writing to the CSV file. The error_result dictionary should contain all the output fieldnames with None as their values.
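For reference, csv.DictWriter rejects rows containing keys outside fieldnames (its default is extrasaction='raise'), which is how the stale keys would surface at runtime. A small, self-contained illustration with shortened, made-up field names:

```python
import csv
import io

fieldnames = ["batch", "fwd-tl-latency(ms)", "bwd-tl-latency(ms)"]
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
writer.writeheader()

# A row still keyed with the old, unprefixed name is rejected outright.
try:
    writer.writerow({"batch": 1, "tl-latency(ms)": None})
except ValueError as exc:
    print(exc)  # dict contains fields not in fieldnames: 'tl-latency(ms)'
```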
A more robust way to construct error_result would be to derive the keys from fieldnames dynamically.
```python
output_fields = [f for f in fieldnames if f not in params]
error_result = {**params, **{f: None for f in output_fields}}
```

Benchmark.py (baseline_profile)

```python
# Warmup to get rid of CUDA lazy initialization effects.
for _ in range(warmup):
    _ = baseline_op(*inputs)
torch.cuda.synchronize(device=device)

# CUDA event-based timing for higher precision.
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
for _ in range(rep):
    _ = baseline_op(*inputs)
end_event.record()

torch.cuda.synchronize(device=device)
total_ms = start_event.elapsed_time(end_event)
```
The profile method wraps the benchmarked operation within a torch.no_grad() context to prevent gradient computations from affecting performance measurements. The baseline_profile method is missing this, which could lead to inaccurate performance metrics if the baseline_op has gradient tracking enabled. It's recommended to add with torch.no_grad(): for consistency and correctness, especially for forward pass benchmarks.
Suggested change:

```python
with torch.no_grad():
    # Warmup to get rid of CUDA lazy initialization effects.
    for _ in range(warmup):
        _ = baseline_op(*inputs)
    torch.cuda.synchronize(device=device)

    # CUDA event-based timing for higher precision.
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    start_event.record()
    for _ in range(rep):
        _ = baseline_op(*inputs)
    end_event.record()

    torch.cuda.synchronize(device=device)
    total_ms = start_event.elapsed_time(end_event)
```
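Two small follow-up notes, outside the scope of the suggestion above: elapsed_time here spans all rep calls, so the per-call latency is the total divided by rep, and torch.inference_mode() is a slightly stricter alternative to torch.no_grad() for forward-only timing. A self-contained sketch (the helper name and defaults are made up, not part of this PR):

```python
import torch


def time_forward_ms(op, inputs, warmup: int = 10, rep: int = 100) -> float:
    """Hypothetical helper: mean forward latency of `op` in milliseconds."""
    with torch.inference_mode():  # stricter than no_grad for inference-only timing
        for _ in range(warmup):
            op(*inputs)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(rep):
            op(*inputs)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / rep  # elapsed_time spans all reps
```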