- Target / CPU:
amdgcn-amd-amdhsa / gfx1250
- Toolchain:
AMD clang version 23.0.0git (https://github.com/ROCm/llvm-project.git 43215c7+PATCHED:d17c5aa0e3ea29cde402f58f27e39b6034effa27)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-versions/gfx1250-7.14.0a20260521/lib/llvm/bin
Issue
gfx1250-assembler-bug-mx-scales.zip
For v_wmma_scale_f32_16x16x128_f8f6f4, the ISA constrains the legal
(matrix_a_fmt, matrix_a_scale_fmt, matrix_b_fmt, matrix_b_scale_fmt)
tuples (see table-valid-combinations.txt): matrix-format classes FP8/BF8
and FP6/BF6 require scale E8; class FP4 allows E8 / E5M3 / E4M3;
when both sides are class FP4, the two scales must match.
The integrated assembler ignores these joint constraints on the
matrix_*_fmt / matrix_*_scale_fmt modifiers and accepts arbitrary
combinations, emitting encodings that are not legal per the ISA. Out of
the 225 (A fmt, A scale, B fmt, B scale) tuples, only 43 are valid;
the other 182 are all silently accepted today.
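The 43 / 182 split can be cross-checked directly from the rules as stated above. The following is a minimal sketch, not the contents of enumerate.py. It assumes the five matrix formats and three scale formats named in this report, and the short names (FP8, E8, ...) are shorthand for illustration rather than the exact assembler operand spellings.

```python
from itertools import product

# Assumption: the five matrix formats and three scale formats named in this
# report; short names are illustrative, not the assembler's operand spellings.
FORMATS = ("FP8", "BF8", "FP6", "BF6", "FP4")
SCALES = ("E8", "E5M3", "E4M3")


def is_valid(a_fmt, a_scale, b_fmt, b_scale):
    """Apply the joint constraints as described in this report."""
    # Non-FP4 classes (FP8/BF8, FP6/BF6) require scale E8 on that side.
    for fmt, scale in ((a_fmt, a_scale), (b_fmt, b_scale)):
        if fmt != "FP4" and scale != "E8":
            return False
    # When both sides are FP4, the two scales must match.
    if a_fmt == "FP4" and b_fmt == "FP4" and a_scale != b_scale:
        return False
    return True


tuples = list(product(FORMATS, SCALES, FORMATS, SCALES))
valid = [t for t in tuples if is_valid(*t)]
invalid = [t for t in tuples if not is_valid(*t)]
print(len(tuples), len(valid), len(invalid))  # prints: 225 43 182
```

The count breaks down as 16 (neither side FP4) + 12 (only B is FP4) + 12 (only A is FP4) + 3 (both FP4 with matching scales) = 43.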
Steps to reproduce
The reproducer is a single self-contained script with no arguments
required. From the directory containing it:
python3 enumerate.py --clean
enumerate.py enumerates all 225 tuples, runs each through llvm-mc
(/opt/rocm/llvm/bin/llvm-mc by default; override with --llvm-mc PATH
or $LLVM_MC), splits the cases into four buckets under results/
(valid-accepted/, valid-rejected/, invalid-accepted/,
invalid-rejected/) as individual .s files, and prints a summary.
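The four-bucket split amounts to crossing ISA validity with the assembler's verdict. The sketch below illustrates that mapping only; the helper names bucket and case_name are hypothetical and are not taken from enumerate.py, and the file-name pattern is inferred from the single example path A-FP8-E8__B-FP8-E5M3.s shown in this report.

```python
def bucket(is_valid, accepted):
    """Map (ISA-valid?, assembler-accepted?) to a results/ subdirectory name."""
    validity = "valid" if is_valid else "invalid"
    outcome = "accepted" if accepted else "rejected"
    return f"{validity}-{outcome}"


def case_name(a_fmt, a_scale, b_fmt, b_scale):
    """File name pattern inferred from A-FP8-E8__B-FP8-E5M3.s (assumption)."""
    return f"A-{a_fmt}-{a_scale}__B-{b_fmt}-{b_scale}.s"


# An invalid tuple that llvm-mc assembles without error lands in the bug bucket.
print(bucket(is_valid=False, accepted=True),
      case_name("FP8", "E8", "FP8", "E5M3"))
# prints: invalid-accepted A-FP8-E8__B-FP8-E5M3.s
```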
Reproduce any single failing case (each .s file is self-contained and carries its own header):
/opt/rocm/llvm/bin/llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx1250 \
-filetype=obj -o /dev/null \
results/invalid-accepted/A-FP8-E8__B-FP8-E5M3.s ; echo $?
# 0 (expected: non-zero with a diagnostic on matrix_b_scale_fmt)
CI gate that flips when any fix lands:
N=$(ls results/invalid-accepted/*.s | wc -l)
fail=0
for f in results/invalid-accepted/*.s; do
/opt/rocm/llvm/bin/llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx1250 \
-filetype=obj -o /dev/null "$f" 2>/dev/null || fail=$((fail + 1))
done
[ "$fail" -eq 0 ] \
&& echo "STILL BUGGY: $N/$N invalid combinations accepted" \
|| echo "PROGRESS: $fail/$N now rejected"
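The shell snippet above only prints a status line and does not set an exit code. For a CI job that needs an exit status, a stricter variant could look like the sketch below. It is an assumption-laden illustration rather than part of the reproducer: the function names are hypothetical, the llvm-mc invocation simply mirrors the command line used above, and unlike the shell gate it returns 0 only once every invalid case is rejected.

```python
import subprocess

LLVM_MC = "/opt/rocm/llvm/bin/llvm-mc"  # default path used in this report


def assembles(path, llvm_mc=LLVM_MC):
    """Return True if llvm-mc accepts the file (exit status 0)."""
    cmd = [llvm_mc, "-triple=amdgcn-amd-amdhsa", "-mcpu=gfx1250",
           "-filetype=obj", "-o", "/dev/null", str(path)]
    return subprocess.run(cmd, stderr=subprocess.DEVNULL).returncode == 0


def gate(paths, accepts=assembles):
    """Return (exit_code, message); exit code 0 only if every case is rejected."""
    paths = list(paths)
    total = len(paths)
    if total == 0:
        # Guard against a missing or empty results directory passing silently.
        return 2, "ERROR: no invalid-accepted cases found"
    rejected = sum(1 for p in paths if not accepts(p))
    if rejected == total:
        return 0, f"FIXED: {rejected}/{total} invalid combinations rejected"
    if rejected == 0:
        return 1, f"STILL BUGGY: {total}/{total} invalid combinations accepted"
    return 1, f"PROGRESS: {rejected}/{total} now rejected"
```

A CI step could call gate(sorted(pathlib.Path("results/invalid-accepted").glob("*.s"))), print the message, and pass the returned code to sys.exit. The accepts parameter exists so the logic can be exercised without llvm-mc installed.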
Results / summary
enumerate.py output on the toolchain above:
============================================================
Summary (total tested: 225)
============================================================
valid-accepted : 43 (expected behavior)
invalid-rejected : 0 (expected behavior)
invalid-accepted : 182 (BUG: assembler should reject)
valid-rejected : 0 (BUG: assembler should accept)
------------------------------------------------------------
correct : 43 / 225
incorrect (bugs) : 182 / 225
- 182 / 225 (A fmt, A scale, B fmt, B scale) tuples are invalid, and
the assembler accepts every one of them; none of the 182 is rejected.
- All 43 valid tuples are accepted, so nothing is wrongly rejected on
the valid side.
- All 182 bug repros are persisted as standalone .s files in
results/invalid-accepted/; the CI gate above stays red until the
assembler starts rejecting at least one of them.