Tune Kimi2.5 A8W8 BPreshuffle GEMMs #3520

Open
XiaobingSuper wants to merge 2 commits into ROCm:main from XiaobingSuper:xiaobing/kimik25-a8w8-bpreshuffle-tune

@XiaobingSuper
Contributor

Summary

  • Add Kimi-K2.5-MXFP4-AttnFP8 TP4 A8W8 BPreshuffle untuned GEMM shapes.
  • Add the validated tuned config generated from those shapes for the attention FP8 PT/PC GEMM path.

Benchmark

Tested in Docker (vllm_atom_vllm_0603) on GPUs 0,1,2,3. The baseline uses the default A8W8 BPreshuffle config; the tuned run uses the new Kimi2.5 config.

| CON | Baseline Output tok/s | Tuned Output tok/s | Output Gain | Baseline Total tok/s | Tuned Total tok/s | Total Gain | Baseline TPOT ms | Tuned TPOT ms | TPOT Gain | Failed |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 375.83 | 385.34 | +2.53% | 4134.13 | 4238.73 | +2.53% | 9.05 | 8.80 | +2.84% | 0 / 0 |
| 8 | 607.54 | 628.22 | +3.40% | 6682.95 | 6910.42 | +3.40% | 10.54 | 10.07 | +4.67% | 0 / 0 |
| 16 | 859.03 | 879.29 | +2.36% | 9449.37 | 9672.23 | +2.36% | 14.06 | 13.95 | +0.79% | 0 / 0 |
| 32 | 1221.08 | 1267.37 | +3.79% | 13431.93 | 13941.05 | +3.79% | 18.98 | 18.79 | +1.01% | 0 / 0 |
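
For readers who want to check the gain columns, here is a minimal sketch that recomputes them from the reported numbers. The formulas are inferred from the figures rather than stated in the PR: the throughput gains match tuned / baseline - 1, while the TPOT gains match baseline / tuned - 1 (lower latency is better). The function names are illustrative only.

```python
# Rows copied from the benchmark table above:
# (CON, baseline output tok/s, tuned output tok/s, baseline TPOT ms, tuned TPOT ms)
ROWS = [
    (4, 375.83, 385.34, 9.05, 8.80),
    (8, 607.54, 628.22, 10.54, 10.07),
    (16, 859.03, 879.29, 14.06, 13.95),
    (32, 1221.08, 1267.37, 18.98, 18.79),
]


def throughput_gain_pct(baseline, tuned):
    """Percent gain for a higher-is-better metric (tok/s)."""
    return round((tuned / baseline - 1.0) * 100.0, 2)


def latency_gain_pct(baseline_ms, tuned_ms):
    """Percent gain for a lower-is-better metric (TPOT in ms).

    Assumption: the PR appears to report baseline / tuned - 1.
    """
    return round((baseline_ms / tuned_ms - 1.0) * 100.0, 2)


def gains(rows=ROWS):
    """Map each CON value to (output gain %, TPOT gain %)."""
    return {
        con: (throughput_gain_pct(b_out, t_out), latency_gain_pct(b_tpot, t_tpot))
        for con, b_out, t_out, b_tpot, t_tpot in rows
    }


if __name__ == "__main__":
    for con, (out_gain, tpot_gain) in gains().items():
        print(f"CON={con:>2}  output {out_gain:+.2f}%  TPOT {tpot_gain:+.2f}%")
```

The recomputed values agree with the Output Gain and TPOT Gain columns to two decimal places.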

Test plan

  • Ran the BPreshuffle tuner twice and kept the second validated result.
  • Benchmarked vLLM serving with CON=4,8,16,32.
  • Verified all benchmark runs completed with 0 failed requests.

Made with Cursor

XiaobingSuper and others added 2 commits June 3, 2026 08:33
Add the validated TP4 attention GEMM tuning results for Kimi-K2.5-MXFP4-AttnFP8 so runtime dispatch can use the measured BPreshuffle kernels.

Co-authored-by: Cursor <[email protected]>
Include the untuned TP4 attention GEMM shape list used to generate and validate the Kimi-K2.5 BPreshuffle tuning config.

Co-authored-by: Cursor <[email protected]>
@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3520 --add-label <label>

XiaobingSuper requested a review from valarLip June 3, 2026 13:47

Copilot AI left a comment

Pull request overview

Adds model-specific A8W8 BPreshuffle GEMM shape lists and corresponding tuned kernel selections for Kimi-K2.5-MXFP4-AttnFP8 (TP4), intended to improve the FP8 PT/PC attention GEMM path performance on gfx950.

Changes:

  • Add Kimi2.5 BPreshuffle untuned GEMM shape CSV (M sweep for several (N,K) pairs, FP8 weights).
  • Add Kimi2.5 BPreshuffle tuned GEMM CSV with per-shape selected kernels/IDs and measured timings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| aiter/configs/model_configs/a8w8_bpreshuffle_untuned_gemm_kimik25.csv | New untuned shape list (M,N,K,q_dtype_w) for Kimi2.5 BPreshuffle tuning inputs. |
| aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_kimik25.csv | New tuned results mapping shapes → kernel selections/metrics for Kimi2.5 BPreshuffle GEMMs. |
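
To illustrate how the untuned shape list is organized (an M sweep for each (N,K) pair), here is a minimal sketch that groups the M values by (N,K). Only the column names M, N, K, q_dtype_w come from the review summary; the shape values and the `fp8` dtype string in the sample are hypothetical placeholders, not contents of the real CSV, and `group_m_sweeps` is an illustrative helper rather than part of aiter.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the untuned CSV's column layout (M,N,K,q_dtype_w).
# The numbers and the "fp8" dtype string are placeholders for illustration.
SAMPLE_CSV = """\
M,N,K,q_dtype_w
1,1024,4096,fp8
16,1024,4096,fp8
1,2048,1024,fp8
32,2048,1024,fp8
16,2048,1024,fp8
"""


def group_m_sweeps(csv_text):
    """Return {(N, K): sorted list of M values} for an untuned shape list."""
    sweeps = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sweeps[(int(row["N"]), int(row["K"]))].append(int(row["M"]))
    return {nk: sorted(ms) for nk, ms in sweeps.items()}


if __name__ == "__main__":
    for (n, k), ms in group_m_sweeps(SAMPLE_CSV).items():
        print(f"N={n} K={k}: M sweep {ms}")
```

To inspect the real file, the same function can be given the text of a8w8_bpreshuffle_untuned_gemm_kimik25.csv, assuming its header matches the column names listed above.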
