[inductor][cutlass] SiLU activation fusion via EVT + XPU GEMM template improvements#186198

Draft
xuhancn wants to merge 2 commits into
pytorch:mainfrom
xuhancn:xu_evt_silu_pattern_match

Conversation

xuhancn (Collaborator) commented Jun 4, 2026

Summary

Add SiLU activation fusion into CUTLASS GEMM epilogues via a post-codegen AST rewriting framework (_fuse_activations), plus XPU-specific GEMM template improvements.

Depends on: #186197 (neg/constant EVT ops)
Related: #185824 (same _fuse_activations design for gelu; compatible — will merge cleanly)

Changes

Activation Fusion Framework (python_evt.py)

Post-codegen AST rewriter that pattern-matches decomposed activations in generated EVT Python code and folds them back into native CUTLASS functor calls:

  1. EVT code is generated normally (all decomposed ops are accepted via the neg/constant ops from #186197)
  2. Output is parsed as Python AST
  3. All temporaries are inlined to build full expression trees
  4. Known activation decompositions are pattern-matched
  5. Matched patterns are replaced with single native functor calls
  6. Dead-code elimination removes unused temporaries
_ACTIVATION_PATTERNS = (
    _ActivationPattern("silu", "0.0 -", _match_silu),
)
  • _match_silu: Matches x / (1 + exp(0.0 - x)) with commutativity-aware addition matching
  • silu() op: New operation in CutlassEVTOpsMixIn mapping to native CUTLASS SiLU functor
  • erf() op: New operation for future gelu support
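
The matching and replacement steps (4 and 5) can be sketched on a single, already-inlined expression using Python's standard `ast` module. This is a minimal illustration, not the PR's `_fuse_activations` code: the names `match_silu`, `FoldSilu`, and `fuse` are hypothetical, temporaries are assumed to be inlined already, and dead-code elimination is omitted.

```python
import ast


def _is_const(node, value):
    # True for a numeric literal equal to `value` (e.g. 1 or 1.0).
    return (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and float(node.value) == value)


def _same(a, b):
    # Structural equality; ast.dump ignores line/column info by default.
    return ast.dump(a) == ast.dump(b)


def match_silu(node):
    """Return the sub-expression x if node is x / (1 + exp(0.0 - x)), else None."""
    if not (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Div)):
        return None
    x, denom = node.left, node.right
    if not (isinstance(denom, ast.BinOp) and isinstance(denom.op, ast.Add)):
        return None
    # Addition is commutative: accept both 1 + exp(...) and exp(...) + 1.
    for one, call in ((denom.left, denom.right), (denom.right, denom.left)):
        if not _is_const(one, 1.0):
            continue
        if not (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name)
                and call.func.id == "exp"
                and len(call.args) == 1):
            continue
        neg = call.args[0]
        if (isinstance(neg, ast.BinOp) and isinstance(neg.op, ast.Sub)
                and _is_const(neg.left, 0.0) and _same(neg.right, x)):
            return x
    return None


class FoldSilu(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite children first (bottom-up)
        x = match_silu(node)
        if x is None:
            return node
        call = ast.Call(func=ast.Name(id="silu", ctx=ast.Load()),
                        args=[x], keywords=[])
        return ast.copy_location(call, node)


def fuse(expr_src):
    tree = ast.parse(expr_src, mode="eval")
    tree = ast.fix_missing_locations(FoldSilu().visit(tree))
    return ast.unparse(tree)  # requires Python 3.9+
```

With this sketch, `fuse("acc / (1.0 + exp(0.0 - acc))")` and the commuted form `fuse("acc / (exp(0.0 - acc) + 1.0)")` both yield `silu(acc)`, while a sigmoid such as `1.0 / (1.0 + exp(0.0 - acc))` is left alone because the numerator does not match the argument of the negation.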

XPU GEMM Template (gemm_template.py)

  • _device_cutlass_config: Device-specific config lookup (reads config.xpu or config.cuda based on target)
  • _sort_ops(): XPU-specific tile scoring heuristic — penalizes tiles where tile_M > problem M (wasting computation), prefers larger tiles (better data reuse)
  • Kernel naming uniquification: Appends KERNEL_NAME placeholder to CUTLASS struct names to prevent SYCL kernel class name collisions when multiple .so files define same-named GEMM structs with different epilogues
  • _stride_compatible: Support different-length strides for view/reshape compatibility
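
One way to picture the tile scoring described for _sort_ops() is a sort key that first penalizes overshoot in M and then prefers larger tiles. The sketch below is only an assumption about the shape of such a heuristic: `Tile`, `tile_score`, and `sort_tiles` are hypothetical names, and the PR's actual weights and tie-breaking are not shown in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    m: int
    n: int
    k: int


def tile_score(tile, problem_m):
    # Lower is better. Rows beyond the problem's M are pure padding work,
    # so overshoot dominates; among equal overshoot, a larger M x N tile
    # is preferred for data reuse (hence the negated area).
    overshoot = max(0, tile.m - problem_m)
    return (overshoot, -(tile.m * tile.n))


def sort_tiles(tiles, problem_m):
    return sorted(tiles, key=lambda t: tile_score(t, problem_m))
```

For a problem with M = 128, this ordering puts a 128x128 tile ahead of a 64x64 tile, and both ahead of a 256x128 tile whose upper half would be padding.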

Motivation

Inductor decomposes silu(x) into x / (1 + exp(-x)), which produces five or more primitive ops. Each primitive op (neg, constant, exp, add, truediv) is now supported in EVT via #186197, but the decomposed form adds unnecessary compute nodes to the epilogue. The _fuse_activations rewriter folds these back into silu(x), which maps directly to CUTLASS's native SiLU functor. This reduces epilogue complexity and should allow better hardware utilization.
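
As a sanity check on the decomposition, the plain-Python sketch below evaluates the five primitive steps one at a time and compares the result with the usual definition silu(x) = x * sigmoid(x). The function names are illustrative only and are not taken from Inductor.

```python
import math


def silu_reference(x):
    # Textbook definition: x * sigmoid(x).
    return x * (1.0 / (1.0 + math.exp(-x)))


def silu_decomposed(x):
    # The primitive ops named above, one per line.
    one = 1.0                 # constant
    neg = 0.0 - x             # neg, written as a subtraction
    e = math.exp(neg)         # exp
    denom = one + e           # add
    return x / denom          # truediv
```

Both functions agree to floating-point rounding, which is why folding the five-op form back into a single silu(x) call preserves the result.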

The XPU tile scoring heuristic addresses a performance issue where alphabetical sort selects tiles with tile_M >> problem_M, wasting ~50% of compute on padding.
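
The ~50% figure corresponds to a tile whose M extent is twice the problem's M. A small arithmetic sketch (the helper name `padded_fraction` is hypothetical, and only the M dimension is considered):

```python
import math


def padded_fraction(problem_m, tile_m):
    # Rows are computed in whole tiles, so M is rounded up to a multiple
    # of tile_m; everything past problem_m is padding.
    computed_rows = math.ceil(problem_m / tile_m) * tile_m
    return (computed_rows - problem_m) / computed_rows
```

For example, `padded_fraction(128, 256)` is 0.5, while `padded_fraction(128, 128)` is 0.0.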

Design Compatibility with #185824

This PR implements the same _fuse_activations design as #185824 (which adds gelu support). When #185824 lands:

  • _fuse_activations function: merge keeps theirs (superset with gelu)
  • constant(): identical implementation (str(float(value)))
  • erf(): identical
  • Our additions are net-new: neg(), silu(), _match_silu, silu _ActivationPattern
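
The string-emitting behavior described for neg() and constant() can be sketched as below. These are standalone stand-ins, not the methods on CutlassEVTOpsMixIn; the `dtype` parameter is accepted and ignored here as an assumption. The `has_unary_op` helper is hypothetical and only demonstrates why a subtraction from 0.0 avoids a unary-minus node in the parsed AST.

```python
import ast


def neg(x: str) -> str:
    # Emit a binary subtraction instead of unary minus.
    return f"(0.0 - {x})"


def constant(value, dtype=None) -> str:
    # Always emit a float literal, e.g. 1 -> "1.0".
    return str(float(value))


def has_unary_op(src: str) -> bool:
    # True if parsing `src` produces any ast.UnaryOp node.
    return any(isinstance(n, ast.UnaryOp) for n in ast.walk(ast.parse(src)))
```

Composing them as `f"{constant(1)} + exp({neg('acc')})"` gives `1.0 + exp((0.0 - acc))`, which contains no UnaryOp node, whereas the direct spelling `exp(-acc)` does.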

Testing

  • test_py_codegen_silu_fused: Validates decomposed SiLU is folded to silu() call
  • test_py_codegen_sigmoid_decomposed: Validates sigmoid is NOT incorrectly folded
  • test_evt_silu_fusion: End-to-end silu(mm) epilogue fusion
  • test_evt_llama_mlp_pattern: End-to-end silu([email protected]) * ([email protected]) pattern
  • test_evt_aux_load_mul: End-to-end GEMM * external buffer via AuxLoad

cc @eellison @etaf

pytorch-bot Bot commented Jun 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/186198

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 26 Pending, 2 Unrelated Failures, 7 Unclassified Failures

As of commit bdd6e0c with merge base cc46af7 (image):

NEW FAILURE - The following job has failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

xuhancn added the ciflow/trunk, topic: not user facing, and ciflow/xpu labels and removed the open source label on Jun 4, 2026
xuhancn force-pushed the xu_evt_silu_pattern_match branch from 98a3351 to daad671 on June 4, 2026 07:39
xuhancn marked this pull request as draft on June 4, 2026 07:44
xuhancn force-pushed the xu_evt_silu_pattern_match branch from daad671 to bdd6e0c on June 4, 2026 08:12
xuhancn changed the title from "[inductor][cutlass] SiLU reconstituter and XPU GEMM template improvements" to "[inductor][cutlass] SiLU activation fusion via EVT + XPU GEMM template improvements" on Jun 4, 2026
xuhancn and others added 2 commits June 5, 2026 00:02
…e fusion

Add two missing EVT ops needed by decomposed SiLU (x / (1 + exp(-x))):

- neg(x): Emits (0.0 - x) since CUTLASS PythonASTFrontend lacks visit_UnaryOp
- constant(value, dtype): Returns str(float(value)) for CUTLASS literal parsing

Without these ops, any epilogue containing neg or constant raises
NotImplementedError, preventing SiLU (and sigmoid) from being fused into
CUTLASS GEMM epilogues.

Tests:
- test_py_codegen_neg_constant: validates neg + constant + exp composition
- test_py_codegen_sigmoid_decomposed: validates full sigmoid decomposition 1/(1+exp(-x))

Co-authored-by: Copilot <[email protected]>
…e improvements

Add post-codegen activation fusion framework (_fuse_activations) that
pattern-matches decomposed activations in generated EVT Python code and
folds them back into native CUTLASS functor calls. This approach is
simpler and more extensible than per-op reconstitution.

SiLU support:
- _match_silu: matches x/(1+exp(0.0-x)) pattern in AST
- _ActivationPattern('silu', '0.0 -', _match_silu) entry
- silu() and erf() ops added to CutlassEVTOpsMixIn
- Dead-code elimination removes unused decomposition temporaries

XPU GEMM template improvements:
- _device_cutlass_config: device-specific config lookup (xpu/cuda)
- _sort_ops: XPU tile scoring heuristic (penalize tile_M > problem M)
- Kernel naming uniquification: KERNEL_NAME placeholder in struct names
  to avoid symbol collisions across compiled .so files
- _stride_compatible: support different-length strides for view/reshape

Tests:
- test_py_codegen_silu_fused: validates SiLU is folded to silu() call
- test_evt_silu_fusion: end-to-end silu(mm) epilogue fusion
- test_evt_llama_mlp_pattern: silu([email protected]) * ([email protected]) pattern
- test_evt_aux_load_mul: GEMM * external buffer via AuxLoad

Co-authored-by: Copilot <[email protected]>
Labels

ciflow/inductor, ciflow/torchtitan, ciflow/trunk, ciflow/xpu, module: inductor, open source, topic: not user facing

2 participants