Conversation

@RMLYC commented on Dec 2, 2025

  1. Add a baseline_program interface in Benchmark.py.
  2. Add a baseline_profile interface in Benchmark.py.
  3. Add a GQA/MHA FlashAttention-3 baseline (see the sketch after this list).
  4. Add CTA swizzle in the MHA BWD kernel.
  5. Move profiling from tests to benchmarks.
  6. Remove the tilelang submodule.
  7. Update ci.yml.
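
As context for items 1–3, here is a minimal, hypothetical sketch of what a FlashAttention-3 baseline_program could look like for an MHA benchmark. This is not the PR's actual code: the flash_attn_interface import and the flash_attn_func signature are assumptions about how FlashAttention-3 is commonly packaged, and the MHABenchmark class is illustrative. The tuple handling mirrors the snippet discussed in the review comments further down.

import torch

try:
    # FlashAttention-3 is commonly shipped as `flash_attn_interface` (an assumption here).
    from flash_attn_interface import flash_attn_func
except ImportError:
    flash_attn_func = None


class MHABenchmark:
    """Illustrative wrapper; the real Benchmark.py interface may differ."""

    def __init__(self, causal: bool = True):
        self.causal = causal

    def baseline_program(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        if flash_attn_func is None:
            raise RuntimeError("flash-attn 3 is not installed")
        # FlashAttention kernels expect contiguous (batch, seqlen, heads, head_dim) inputs.
        out = flash_attn_func(q.contiguous(), k.contiguous(), v.contiguous(),
                              causal=self.causal)
        # Some versions return (out, softmax_lse); keep only the attention output.
        if isinstance(out, tuple):
            out = out[0]
        return out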

@gemini-code-assist (Contributor)

Summary of Changes

Hello @RMLYC, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the benchmarking infrastructure by introducing a dedicated mechanism to compare custom operations against established baselines. Specifically, it integrates FlashAttention-3 for GQA and MHA, providing a robust way to measure performance improvements. The changes also refine the output reporting to offer a more comprehensive view of both forward and backward pass metrics for both the primary operation and its baseline.

Highlights

  • Baseline Benchmarking Framework: Introduced baseline_program and baseline_profile interfaces in Benchmark.py to enable performance comparison against a specified baseline operation using CUDA event-based timing.
  • FlashAttention-3 Integration: Added FlashAttention-3 as a baseline for GQA (Grouped Query Attention) and MHA (Multi-Head Attention) forward and backward passes, allowing direct performance evaluation against this optimized implementation.
  • Enhanced Benchmarking Output: Updated the profile_run.py script to parse and display detailed performance metrics (latency, TFlops, Bandwidth) for both the main operation and the new baseline, with separate reporting for forward and backward passes. The output table now dynamically filters out empty columns for better readability (a small sketch of this filtering follows the list).
  • Tensor Contiguity and Backward Pass Profiling: Ensured input tensors (Q, K, V) are contiguous in GQA benchmarks for compatibility with FlashAttention-3 and removed the --disable_bwd flag from profiling commands, enabling backward pass profiling by default.
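
To make the "filters empty columns" behavior concrete, here is a minimal sketch rather than the PR's actual profile_run.py code. It assumes results are collected as a list of dicts whose keys mix parameter columns with fwd-/bwd-prefixed metric columns, and it drops any column that is None in every row before writing the CSV.

import csv

def write_results(rows, path="results.csv"):
    # Keep a column only if at least one row has a real value for it.
    fieldnames = [key for key in rows[0]
                  if any(row.get(key) is not None for row in rows)]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
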
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a baseline profiling mechanism for GQA and MHA operators using FlashAttention-3. The changes include adding baseline_profile and baseline_program to the benchmarking framework, implementing them for GQA and MHA, and updating the profiling script to capture and display these new metrics. My review has identified a few critical issues related to incorrect return values in the new baseline programs and a bug in the profiling script's error handling. I've also suggested an improvement to ensure profiling consistency by using torch.no_grad().

Comment on lines 65 to 68
        if isinstance(out, tuple):
            out = out[0]

        return out

critical

The baseline_program method has a logic error. If out is not an instance of a tuple, the function implicitly returns None. It should return out in that case as well.

        if isinstance(out, tuple):
            out = out[0]
        return out

Comment on lines 60 to 63
        if isinstance(out, tuple):
            out = out[0]

        return out

critical

The baseline_program method has a logic error. If out is not an instance of a tuple, the function implicitly returns None. It should return out in that case as well.

        if isinstance(out, tuple):
            out = out[0]
        return out
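
As an illustrative aside (not part of the PR), a small pytest-style guard would surface this class of bug early, since the profiler would otherwise silently compare against None; the bench, q, k, and v fixture names below are hypothetical and stand for whichever Benchmark subclass and inputs are under test:

import torch

def test_baseline_program_returns_tensor(bench, q, k, v):
    # baseline_program must return the attention output, never an implicit None.
    out = bench.baseline_program(q, k, v)
    assert isinstance(out, torch.Tensor), "baseline_program should return the attention output tensor"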

Comment on lines 281 to 288
error_result = {
    **params, 'tl-latency(ms)': None,
    'tl-TFlops': None,
    'tl-Bandwidth(GB/s)': None,
    'Baseline-latency(ms)': None,
    'Baseline-TFlops': None,
    'Baseline-Bandwidth(GB/s)': None
}

critical

The keys in error_result do not match the new fieldnames defined for the CSV output. The fieldnames now include prefixes like fwd- and bwd- (e.g., 'fwd-tl-latency(ms)'), but error_result uses keys without these prefixes (e.g., 'tl-latency(ms)'). This will cause issues when writing to the CSV file. The error_result dictionary should contain all the output fieldnames with None as their values.

A more robust way to construct error_result would be to derive the keys from fieldnames dynamically.

            output_fields = [f for f in fieldnames if f not in params]
            error_result = {**params, **{f: None for f in output_fields}}
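
For context on why the mismatch breaks the CSV: csv.DictWriter raises a ValueError when a row contains keys that are not in fieldnames (unless extrasaction='ignore' is set), and it writes None values as empty cells. Below is a minimal sketch of the dynamic construction in place, assuming the surrounding loop shape, a csv.DictWriter named writer, and a hypothetical run_benchmark helper rather than the PR's exact code:

def write_row_or_placeholder(writer, fieldnames, params, run_benchmark):
    """Write one benchmark row; on failure, emit blanks for every metric column."""
    try:
        writer.writerow(run_benchmark(params))
    except Exception:
        # Derive the metric columns from fieldnames so the placeholder row always
        # matches the CSV header, whatever fwd-/bwd-prefixed fields are in use.
        output_fields = [f for f in fieldnames if f not in params]
        writer.writerow({**params, **{f: None for f in output_fields}})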

Comment on lines +137 to +152
# Warmup to get rid of CUDA lazy initialization effects.
for _ in range(warmup):
_ = baseline_op(*inputs)
torch.cuda.synchronize(device=device)

# CUDA event-based timing for higher precision.
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
for _ in range(rep):
_ = baseline_op(*inputs)
end_event.record()

torch.cuda.synchronize(device=device)
total_ms = start_event.elapsed_time(end_event)

high

The profile method wraps the benchmarked operation within a torch.no_grad() context to prevent gradient computations from affecting performance measurements. The baseline_profile method is missing this, which could lead to inaccurate performance metrics if the baseline_op has gradient tracking enabled. It's recommended to add with torch.no_grad(): for consistency and correctness, especially for forward pass benchmarks.

Suggested change (wrap the block quoted above in torch.no_grad()):

with torch.no_grad():
    # Warmup to get rid of CUDA lazy initialization effects.
    for _ in range(warmup):
        _ = baseline_op(*inputs)
    torch.cuda.synchronize(device=device)

    # CUDA event-based timing for higher precision.
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    start_event.record()
    for _ in range(rep):
        _ = baseline_op(*inputs)
    end_event.record()

    torch.cuda.synchronize(device=device)
    total_ms = start_event.elapsed_time(end_event)

@RMLYC merged commit 9fec031 into tile-ai:refactor on Dec 15, 2025
7 of 8 checks passed