[Flydsl] flydsl opt based prefill gdn hip #3513

Draft
huizzhan wants to merge 6 commits into main from dev/huizzhan/yiijin_prefill_gdn_hip

@huizzhan
Contributor

@huizzhan huizzhan commented Jun 3, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3513 --add-label <label>

Squash of 20 commits on dev/huizzhan/yiijin_prefill_gdn_hip
(base 9bab838..tip 9f7ac5d). Combined scope:

* HIP path for Gated Delta Rule prefill (K5 hidden-state recurrence):
  * New csrc kernel ``chunk_gated_delta_rule_fwd_h.cu`` + header /
    pybind plumbing and a Python host shim
    (``aiter/ops/chunk_gated_delta_rule_fwd_h.py``).
  * Best-perf HIP kernel re-targeted to the VK hidden-state layout
    (``[V, K]``) so it can replace the Triton K5 in the opt_vk pipeline.
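For readers unfamiliar with the recurrence behind K5, the following is a minimal token-by-token sketch of the standard gated delta rule with the state stored in a ``[V, K]`` layout. It is an illustration under assumptions, not the kernel in this PR: the real K5 works chunk by chunk on the GPU, and the function name ``gated_delta_step`` and its argument names are hypothetical.

```python
import math


def gated_delta_step(S, q, k, v, g, beta):
    """Apply one token of the gated delta rule; state S has shape [V, K].

    S    : list of V rows, each a list of K floats (updated in place)
    q, k : length-K vectors; v : length-V vector
    g    : log-space decay for this token (assumed <= 0)
    beta : write strength, assumed to lie in [0, 1]
    Returns the length-V output S_t @ q.
    """
    decay = math.exp(g)
    out = []
    for i, row in enumerate(S):
        # Decay the old row, then read what it currently predicts for key k.
        for j in range(len(row)):
            row[j] *= decay
        pred = sum(r * kj for r, kj in zip(row, k))
        # Delta rule: move the stored value toward v[i] by a fraction beta.
        delta = beta * (v[i] - pred)
        for j in range(len(row)):
            row[j] += delta * k[j]
        # Read out with the query after the update.
        out.append(sum(r * qj for r, qj in zip(row, q)))
    return out
```

With ``[V, K]`` storage, each output element reads one contiguous row of the state, which is one plausible reason a kernel might prefer this layout; the PR itself does not state the motivation.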

* Triton VK pipeline (chunk gated delta rule prefill):
  * Optimized K1+K2 (cumsum + KK^T), K3+K4 (tril solve + w/u recompute),
    K5 (chunk_delta_h) and K6 (chunk_o) for the ASW shape set.
  * Added ``exp2`` lowering for both Triton and HIP (g pre-scaled to
    log2 space) and the matching ``USE_EXP2`` plumbing through the
    end-to-end opt_vk wrapper.
  * Refined L2 norm, wy-representation, and solve_tril utilities, and
    switched autotune-control to support 3D HIP g.
  * Made the Triton VK pipeline compatible with the Qwen3-Next impl.
  * Added bf16 ``initial_state`` / ``final_state`` support in the ATOM
    path (kernel keeps the fp32 accumulator unchanged; bf16 only
    affects HBM bandwidth/footprint of the SSM state).
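The ``exp2`` item above relies on the identity exp(g) = 2^(g * log2(e)): if ``g`` is multiplied by log2(e) once up front, every later decay factor can be computed with a base-2 exponential. A small CPU-side sketch of that identity follows; the helper names are hypothetical and nothing here reflects how ``USE_EXP2`` is actually wired in the kernels.

```python
import math

LOG2_E = 1.0 / math.log(2.0)  # log2(e), the one-time pre-scale factor


def prescale_to_log2(g):
    """Convert natural-log gate values to log2 space."""
    return [x * LOG2_E for x in g]


def decay_exp(g):
    """Reference decay factors using the natural exponential."""
    return [math.exp(x) for x in g]


def decay_exp2(g_log2):
    """Decay factors computed from pre-scaled gates with a base-2 power."""
    return [2.0 ** x for x in g_log2]


if __name__ == "__main__":
    gates = [-0.01, -0.5, -3.0]
    ref = decay_exp(gates)
    fast = decay_exp2(prescale_to_log2(gates))
    # The two paths agree up to floating-point rounding.
    print(all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(ref, fast)))
```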

* FlyDSL K5:
  * New end-to-end FlyDSL K5 wrapper hooked into
    ``chunk_gated_delta_rule_opt_vk`` via ``use_chunk_flydsl``.
  * Optimized FlyDSL prefill GDR kernel and refined the BV autotune /
    rule-based heuristic; added ``sweep_flydsl_k5_bv.py`` for BV
    calibration; removed the legacy tuned-BV CSV in favour of the
    rule-based path.
  * Switched ``g`` layout from BTH to BHT to match Triton VK / HIP.
  * Made the FlyDSL wrapper API compatible with the upstream
    ``@tensor_cache`` plumbing (chunk indices/offsets / rebased
    cu_seqlens all cache-warm across forwards, no per-forward
    ``.tolist()`` D2H).
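The cache-warm behaviour described above can be pictured with a host-side sketch: chunk indices derived from ``cu_seqlens`` are computed once and reused on later forwards with the same sequence lengths. How ``@tensor_cache`` is implemented is not described here, so this example swaps in ``functools.lru_cache`` keyed on a tuple; the function name ``chunk_indices`` and its signature are assumptions for illustration only.

```python
from functools import lru_cache


@lru_cache(maxsize=32)
def chunk_indices(cu_seqlens, chunk_size):
    """Return (sequence_id, chunk_id) pairs for every chunk in a packed batch.

    cu_seqlens : tuple of cumulative sequence lengths, e.g. (0, 100, 164)
    chunk_size : number of tokens per chunk
    """
    pairs = []
    for seq_id in range(len(cu_seqlens) - 1):
        seq_len = cu_seqlens[seq_id + 1] - cu_seqlens[seq_id]
        num_chunks = -(-seq_len // chunk_size)  # ceiling division
        pairs.extend((seq_id, c) for c in range(num_chunks))
    return tuple(pairs)


if __name__ == "__main__":
    first = chunk_indices((0, 100, 164), 64)
    second = chunk_indices((0, 100, 164), 64)  # served from the cache
    print(first)
    print(first is second, chunk_indices.cache_info().hits)
```

In a real GPU path the key would have to be derived without copying the tensor back to the host on every call, which is the cost the commit message says the wrapper now avoids.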

* Tests / utils:
  * Reworked ``test_flydsl_linear_attention_prefill.py``: per-shape K5
    perf comparison across vLLM, Triton opt_vk, Triton origin_opt and
    FlyDSL backends; bench333 trace-derived shape parametrization;
    correctness coverage against the PyTorch fp32 reference for all
    three K5 backends.
  * Moved ``test_gated_delta_rule.py`` out of ``triton_tests/`` and
    extended it for the new HIP / VK paths.
  * Updated ``split_tests.sh`` and ``optCompilerConfig.json`` for the
    new test layout.

Squashed commits (chronological, oldest first):

  075506b add hip kernel for chunk_gated_delta_rule_fwd (ganyi)
  6f67fb08 update best perf hip kernel for hidden state vk layout (ganyi)
  a3abc4f add exp2 optimization for triton/hip kernels and refine hip api (ganyi)
  0973db5 refine code for copilot suggestions (ganyi)
  cb4b45a add bf16 initial_state/final_state support for ATOM impl (ganyi)
  ce7c1b4 switch autotune control and add 3D HIP g support (yijin)
  5c18312 refine asw cases perf for l2norm, k12 and k34 (ganyi)
  7ad34fe optimize gdn triton-vk kernels for asw cases (ganyi)
  e01cc38 fix exp_ln2 usage (ganyi)
  1ee373c Optimize FlyDSL K5 prefill GDR (ganyi)
  1923d47 Refine and g layout from BTH to BHT (ganyi)
  70c84db fix head-major g usage in flydsl/triton-vk test (ganyi)
  66a16bf refine test and fix code style (YiJin)
  a8fbd08 Refine flydsl api (ganyi)
  f217652 Remove csv (huizzhan)
  6476920 make triton compatible with qwen3next impl (ganyi)
  e22e970 make tensor cache working (ganyi)
  dcc769e format (ganyi)
  090bb84 add flydsl with tensor cache compatible interface (ganyi)
  9f7ac5d format (ganyi)

Co-authored-by: Cursor <[email protected]>
@huizzhan huizzhan force-pushed the dev/huizzhan/yiijin_prefill_gdn_hip branch from 9f7ac5d to afcc2ce Compare June 3, 2026 09:55
@huizzhan huizzhan force-pushed the dev/huizzhan/yiijin_prefill_gdn_hip branch from 50c94a9 to c2da5b6 Compare June 3, 2026 11:47
@huizzhan huizzhan force-pushed the dev/huizzhan/yiijin_prefill_gdn_hip branch from c2da5b6 to 649ae22 Compare June 4, 2026 04:32
@huizzhan huizzhan force-pushed the dev/huizzhan/yiijin_prefill_gdn_hip branch from f87db42 to 8539751 Compare June 4, 2026 04:59
@huizzhan huizzhan force-pushed the dev/huizzhan/yiijin_prefill_gdn_hip branch from 8539751 to f923d12 Compare June 4, 2026 07:29
@huizzhan huizzhan force-pushed the dev/huizzhan/yiijin_prefill_gdn_hip branch from dba8581 to c6be31f Compare June 4, 2026 08:46