[Kernels] Add causal_conv1d kernel benchmark #6624

Open
gabrieldemarmiesse wants to merge 1 commit into modular:main from gabrieldemarmiesse:add-causal-conv1d-fwd-benchmark

@gabrieldemarmiesse gabrieldemarmiesse commented May 29, 2026

Contributor

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • Performance improvement (includes benchmark results below)
  • Documentation update
  • New feature or public API (requires prior proposal or issue approval)
  • Refactor / internal cleanup (no user-visible change)
  • Build, CI, or tooling change

Motivation

I want to improve the performance of the causal-conv1d kernel implemented in MAX by reusing code I already wrote for my own package. To justify that change I need to show that my version has better numbers, so the first step is a benchmark.

What changed

Added a benchmark for the causal-conv1d forward GPU kernel.

Testing

Run the benchmark.

Checklist

  • The linked issue above has been reviewed by a maintainer and is
    agreed-upon, or this is a trivial fix that does not need prior
    approval
  • PR is small and focused — I've split larger changes into a sequence of
    smaller PRs where possible (see
    pull request sizes)
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description (see
    AI Tool Use Policy)

Assisted by Claude

BEGIN_PUBLIC
[Kernels][GPU] Add causal_conv1d forward GPU benchmark

Adds a kernel-time benchmark for the channel-first causal_conv1d forward
GPU kernel (state_space). It mirrors the validated test launch config
(kNThreads=128, kNElts=4), times the kernel via the Bench/Bencher
iter_custom harness, and reports achieved memory bandwidth (the op is
memory-bound), counting 2 * batch * dim * seqlen * sizeof(dtype) bytes
moved.

dtype and conv width are compile-time defines (default bfloat16, width=4
to match the common Mamba config); batch, dim, seqlen and the SiLU
activation flag are runtime args.

Since causal_conv1d lives in //max:state_space, which the globbed GPU
benchmark deps don't include, the target is declared explicitly (and
excluded from the glob) following the existing bench_conv2d/bench_conv3d
pattern.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Gabriel <[email protected]>
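
To make the bandwidth figure in the commit message concrete, here is a minimal Python sketch of the arithmetic. It is not the benchmark's code. The helper names, the dtype size table, and the example shape and kernel time are all hypothetical, and reading the factor of 2 as one read of the input plus one write of the output is an assumption.

```python
# Sketch of the bandwidth arithmetic described in the commit message.
# All names below are hypothetical; they are not taken from the benchmark.

# Bytes per element for a few dtypes (assumed list, bfloat16 is the default).
DTYPE_SIZES = {"bfloat16": 2, "float16": 2, "float32": 4}


def bytes_moved(batch: int, dim: int, seqlen: int, dtype: str = "bfloat16") -> int:
    """2 * batch * dim * seqlen * sizeof(dtype), as stated in the commit message.

    Assumption: the factor of 2 stands for reading the input once and
    writing the output once.
    """
    return 2 * batch * dim * seqlen * DTYPE_SIZES[dtype]


def achieved_bandwidth_gb_per_s(
    batch: int, dim: int, seqlen: int, kernel_time_s: float, dtype: str = "bfloat16"
) -> float:
    """Bytes moved divided by the measured kernel time, in GB/s (1 GB = 1e9 bytes)."""
    if kernel_time_s <= 0:
        raise ValueError("kernel_time_s must be positive")
    return bytes_moved(batch, dim, seqlen, dtype) / kernel_time_s / 1e9


if __name__ == "__main__":
    # Made-up shape and timing, for illustration only (not a measurement).
    batch, dim, seqlen = 8, 4096, 2048
    print(bytes_moved(batch, dim, seqlen))  # 268435456
    print(achieved_bandwidth_gb_per_s(batch, dim, seqlen, kernel_time_s=1e-3))
```

With these illustrative values, the transfer is 268,435,456 bytes, so a hypothetical 1 ms kernel time would correspond to roughly 268 GB/s.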
@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as ready for review May 29, 2026 12:34
@gabrieldemarmiesse gabrieldemarmiesse requested a review from a team as a code owner May 29, 2026 12:34
@gabrieldemarmiesse gabrieldemarmiesse changed the title from "[Kernels][GPU] Add causal_conv1d forward GPU benchmark" to "[Kernels] Use causal_conv1d kernel from Tri Dao for 10-90% speedup" May 29, 2026
@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as draft May 29, 2026 12:36
@gabrieldemarmiesse gabrieldemarmiesse changed the title from "[Kernels] Use causal_conv1d kernel from Tri Dao for 10-90% speedup" to "[Kernels] Add causal_conv1d kernel benchmark" May 29, 2026
@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as ready for review May 29, 2026 12:37