[Flydsl] flydsl opt based prefill gdn hip #3513
Draft
huizzhan wants to merge 6 commits into
Squash of 20 commits on dev/huizzhan/yiijin_prefill_gdn_hip (base 9bab838..tip 9f7ac5d).

Combined scope:

* HIP path for Gated Delta Rule prefill (K5 hidden-state recurrence):
  * New csrc kernel ``chunk_gated_delta_rule_fwd_h.cu`` + header / pybind plumbing and a Python host shim (``aiter/ops/chunk_gated_delta_rule_fwd_h.py``).
  * Best-perf HIP kernel re-targeted to the VK hidden-state layout (``[V, K]``) so it can replace the Triton K5 in the opt_vk pipeline.
* Triton VK pipeline (chunk gated delta rule prefill):
  * Optimized K1+K2 (cumsum + KK^T), K3+K4 (tril solve + w/u recompute), K5 (chunk_delta_h) and K6 (chunk_o) for the ASW shape set.
  * Added ``exp2`` lowering for both Triton and HIP (g pre-scaled to log2 space) and the matching ``USE_EXP2`` plumbing through the end-to-end opt_vk wrapper.
  * Refined L2 norm, wy-representation, and solve_tril utilities, and switched autotune-control to support 3D HIP g.
  * Made the Triton VK pipeline compatible with the Qwen3-Next impl.
  * Added bf16 ``initial_state`` / ``final_state`` support in the ATOM path (kernel keeps the fp32 accumulator unchanged; bf16 only affects HBM bandwidth/footprint of the SSM state).
* FlyDSL K5:
  * New end-to-end FlyDSL K5 wrapper hooked into ``chunk_gated_delta_rule_opt_vk`` via ``use_chunk_flydsl``.
  * Optimized FlyDSL prefill GDR kernel and refined the BV autotune / rule-based heuristic; added ``sweep_flydsl_k5_bv.py`` for BV calibration; removed the legacy tuned-BV CSV in favour of the rule-based path.
  * Switched ``g`` layout from BTH to BHT to match Triton VK / HIP.
  * Made the FlyDSL wrapper API compatible with the upstream ``@tensor_cache`` plumbing (chunk indices/offsets / rebased cu_seqlens all cache-warm across forwards, no per-forward ``.tolist()`` D2H).
* Tests / utils:
  * Reworked ``test_flydsl_linear_attention_prefill.py``: per-shape K5 perf comparison across vLLM, Triton opt_vk, Triton origin_opt and FlyDSL backends; bench333 trace-derived shape parametrization; correctness coverage against the PyTorch fp32 reference for all three K5 backends.
  * Moved ``test_gated_delta_rule.py`` out of ``triton_tests/`` and extended it for the new HIP / VK paths.
  * Updated ``split_tests.sh`` and ``optCompilerConfig.json`` for the new test layout.

Squashed commits (chronological, oldest first):

  075506b add hip kernel for chunk_gated_delta_rule_fwd (ganyi)
  6f67fb08 update best perf hip kernel for hidden state vk layout (ganyi)
  a3abc4f add exp2 optimization for triton/hip kernels and refine hip api (ganyi)
  0973db5 refine code for copilot suggestions (ganyi)
  cb4b45a add bf16 initial_state/final_state support for ATOM impl (ganyi)
  ce7c1b4 switch autotune control and add 3D HIP g support (yijin)
  5c18312 refine asw cases perf for l2norm, k12 and k34 (ganyi)
  7ad34fe optimize gdn triton-vk kernels for asw cases (ganyi)
  e01cc38 fix exp_ln2 usage (ganyi)
  1ee373c Optimize FlyDSL K5 prefill GDR (ganyi)
  1923d47 Refine and g layout from BTH to BHT (ganyi)
  70c84db fix head-major g usage in flydsl/triton-vk test (ganyi)
  66a16bf refine test and fix code style (YiJin)
  a8fbd08 Refine flydsl api (ganyi)
  f217652 Remove csv (huizzhan)
  6476920 make triton compatible with qwen3next impl (ganyi)
  e22e970 make tensor cache working (ganyi)
  dcc769e format (ganyi)
  090bb84 add flydsl with tensor cache compatible interface (ganyi)
  9f7ac5d format (ganyi)

Co-authored-by: Cursor <[email protected]>
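
For reviewers unfamiliar with the ``exp2`` lowering mentioned above, here is a minimal pure-Python sketch of the identity it relies on: ``exp(g) == 2 ** (g * log2(e))``. Because the scale is a constant, it commutes with the per-chunk cumulative sum, so ``g`` can be pre-scaled once on the host. This is not code from this PR; the helper names (``chunk_decay_exp``, ``chunk_decay_exp2``, ``LOG2_E``) are hypothetical and only illustrate the math, not the actual kernel or wrapper API.

```python
import math
from itertools import accumulate

# log2(e): multiplying a natural-log-space gate by this moves it to log2 space.
LOG2_E = 1.0 / math.log(2.0)


def chunk_decay_exp(gates):
    """Reference path: cumulative sum of log-space gates, then exp()."""
    return [math.exp(c) for c in accumulate(gates)]


def chunk_decay_exp2(gates):
    """exp2 path (illustrative): pre-scale g to log2 space, cumsum, then 2**x.

    Scaling by a constant commutes with the cumulative sum, so the
    pre-scaling can happen once before the recurrence.
    """
    g_log2 = [g * LOG2_E for g in gates]
    return [2.0 ** c for c in accumulate(g_log2)]


if __name__ == "__main__":
    gates = [-0.1, -0.5, -2.0, -0.03]  # hypothetical log-decay values
    ref = chunk_decay_exp(gates)
    fast = chunk_decay_exp2(gates)
    for r, f in zip(ref, fast):
        assert math.isclose(r, f, rel_tol=1e-12)
```

The two paths agree up to floating-point rounding in this sketch; in lower-precision kernels the rounding behaviour may differ, which is presumably why correctness is checked against the PyTorch fp32 reference.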
Force-pushed from 9f7ac5d to afcc2ce
Force-pushed from 50c94a9 to c2da5b6
Force-pushed from c2da5b6 to 649ae22
Force-pushed from f87db42 to 8539751
Force-pushed from 8539751 to f923d12
Force-pushed from dba8581 to c6be31f