XiaobingSuper · 2026-06-03T13:47:14Z

Summary

Add Kimi-K2.5-MXFP4-AttnFP8 TP4 A8W8 BPreshuffle untuned GEMM shapes.
Add the validated tuned config generated from those shapes for the attention FP8 PT/PC GEMM path.

Benchmark

Tested with Docker vllm_atom_vllm_0603 on GPUs 0,1,2,3. The baseline uses the default A8W8 BPreshuffle config; the tuned run uses the new Kimi2.5 config.

Concurrency (CON)	Baseline Output tok/s	Tuned Output tok/s	Output Gain	Baseline Total tok/s	Tuned Total tok/s	Total Gain	Baseline TPOT ms	Tuned TPOT ms	TPOT Gain	Failed requests (baseline / tuned)
4	375.83	385.34	+2.53%	4134.13	4238.73	+2.53%	9.05	8.80	+2.84%	0 / 0
8	607.54	628.22	+3.40%	6682.95	6910.42	+3.40%	10.54	10.07	+4.67%	0 / 0
16	859.03	879.29	+2.36%	9449.37	9672.23	+2.36%	14.06	13.95	+0.79%	0 / 0
32	1221.08	1267.37	+3.79%	13431.93	13941.05	+3.79%	18.98	18.79	+1.01%	0 / 0
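
The gain columns can be re-derived from the raw numbers. The sketch below is not part of the PR; it assumes throughput gain is `tuned / baseline - 1` and TPOT gain is `baseline / tuned - 1` (a speedup ratio, since lower TPOT is better), which is the convention that reproduces the reported percentages. The function names are illustrative.

```python
# Rows copied from the benchmark table above:
# (CON, baseline output tok/s, tuned output tok/s, baseline TPOT ms, tuned TPOT ms)
ROWS = [
    (4, 375.83, 385.34, 9.05, 8.80),
    (8, 607.54, 628.22, 10.54, 10.07),
    (16, 859.03, 879.29, 14.06, 13.95),
    (32, 1221.08, 1267.37, 18.98, 18.79),
]


def throughput_gain(baseline, tuned):
    """Relative gain in percent for a higher-is-better metric (tok/s)."""
    return (tuned / baseline - 1.0) * 100.0


def latency_gain(baseline_ms, tuned_ms):
    """Speedup-style gain in percent for a lower-is-better metric (TPOT)."""
    return (baseline_ms / tuned_ms - 1.0) * 100.0


def summarize(rows=ROWS):
    """Map CON -> (output gain %, TPOT gain %), rounded to two decimals."""
    return {
        con: (
            round(throughput_gain(base_out, tuned_out), 2),
            round(latency_gain(base_tpot, tuned_tpot), 2),
        )
        for con, base_out, tuned_out, base_tpot, tuned_tpot in rows
    }


if __name__ == "__main__":
    for con, (out_gain, tpot_gain) in summarize().items():
        print(f"CON={con}: output {out_gain:+.2f}%, TPOT {tpot_gain:+.2f}%")
```

Note that the TPOT gain uses the tuned value as the denominator, so it is slightly larger than a plain percentage reduction in latency would be.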

Test plan

Ran the BPreshuffle tuner twice and kept the second, validated result.
Benchmarked vLLM serving with CON=4,8,16,32.
Verified all benchmark runs completed with 0 failed requests.

Made with Cursor

github-actions · 2026-06-03T13:47:35Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3520 --add-label <label>`.

Copilot

Pull request overview

Adds model-specific A8W8 BPreshuffle GEMM shape lists and the corresponding tuned kernel selections for Kimi-K2.5-MXFP4-AttnFP8 (TP4), intended to improve the performance of the FP8 PT/PC attention GEMM path on gfx950.

Changes:

Add Kimi2.5 BPreshuffle untuned GEMM shape CSV (M sweep for several (N,K) pairs, FP8 weights).
Add Kimi2.5 BPreshuffle tuned GEMM CSV with per-shape selected kernels/IDs and measured timings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
aiter/configs/model_configs/a8w8_bpreshuffle_untuned_gemm_kimik25.csv	New untuned shape list (M,N,K,q_dtype_w) for Kimi2.5 BPreshuffle tuning inputs.
aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_kimik25.csv	New tuned results mapping shapes → kernel selections/metrics for Kimi2.5 BPreshuffle GEMMs.
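
One way to sanity-check that the tuned file covers every shape in the untuned list is sketched below. It is a minimal illustration, not the project's tooling: it assumes both CSVs expose `M`, `N`, and `K` columns (the untuned header `M,N,K,q_dtype_w` is from the description above; the tuned header, the `kernel_name` column, and the sample rows are hypothetical placeholders, not data from this PR).

```python
import csv
import io


def load_shapes(csv_text):
    """Return the set of (M, N, K) tuples found in a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(int(row["M"]), int(row["N"]), int(row["K"])) for row in reader}


def missing_shapes(untuned_text, tuned_text):
    """Shapes listed for tuning that have no entry in the tuned results."""
    return sorted(load_shapes(untuned_text) - load_shapes(tuned_text))


# Placeholder rows for illustration only; real files live under
# aiter/configs/model_configs/ and have different contents.
UNTUNED_SAMPLE = "M,N,K,q_dtype_w\n1,128,256,fp8\n2,128,256,fp8\n"
TUNED_SAMPLE = "M,N,K,kernel_name\n1,128,256,example_kernel\n"

if __name__ == "__main__":
    print(missing_shapes(UNTUNED_SAMPLE, TUNED_SAMPLE))
```

To run it against the real files, read each file's text with `open(path).read()` and pass the two strings to `missing_shapes`; an empty list means every untuned shape has a tuned entry.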

XiaobingSuper and others added 2 commits June 3, 2026 08:33

tune Kimi2.5 A8W8 BPreshuffle GEMMs

43d1b04

Add the validated TP4 attention GEMM tuning results for Kimi-K2.5-MXFP4-AttnFP8 so runtime dispatch can use the measured BPreshuffle kernels.

Co-authored-by: Cursor <[email protected]>

add Kimi2.5 A8W8 BPreshuffle tuning shapes

23aa6d7

Include the untuned TP4 attention GEMM shape list used to generate and validate the Kimi-K2.5 BPreshuffle tuning config.

Co-authored-by: Cursor <[email protected]>

XiaobingSuper requested review from a team and Copilot June 3, 2026 13:47

Copilot started reviewing on behalf of XiaobingSuper June 3, 2026 13:47

XiaobingSuper requested a review from valarLip June 3, 2026 13:47

Copilot AI reviewed Jun 3, 2026

Tune Kimi2.5 A8W8 BPreshuffle GEMMs #3520

XiaobingSuper wants to merge 2 commits into ROCm:main from XiaobingSuper:xiaobing/kimik25-a8w8-bpreshuffle-tune
