[Issue]: gfx1201 (RDNA4): ISA1201 Tensile GEMM MT64x64x64 computes out-of-bounds address (GPU page fault) on a column-major B operand #7992

@ulises-c


Problem Description

Summary

On gfx1201 (RDNA4) under ROCm 7.2.3, a Tensile-generated GEMM kernel (Cijk_…_MT64x64x64_…_ISA1201) triggers a reproducible GPU page fault ("page not present"). The faulting access lands at a host-VA address at least ~1 GB outside the bounds of both GEMM operands, indicating that the kernel computes an out-of-bounds address rather than reading a valid-but-unexpected buffer. The fault reproduces deterministically on a clean GPU and persists under full kernel serialization (AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1), so it is not an asynchronous race. The same kernel and tile succeed on other problem sizes in the same workload; the fault has only been observed on calls whose B operand is column-major (stride=(1, K)).

The workload is Gemma-4-31B NF4 QLoRA fine-tuning (PyTorch + bitsandbytes); the fault is in the rocBLAS GEMM consuming the dequantized weight, not in the bitsandbytes dequant kernel (verified — see "Ruled out").

Background information

I am running SFT on Gemma-4-31B on an AMD Radeon AI PRO R9700, and training kept crashing. If needed, I also have wandb.ai logs (system and training info).

More in-depth notes and the experiments I ran are in this PR in my project: ulises-c/csen-346#101
I created an issue on my own repo to track what to post here: ulises-c/csen-346#113

Environment

GPU            AMD Radeon AI PRO R9700, gfx1201 / RDNA4 (Navi 48), 32 GB; VBIOS 113-1E4990U-S83, SKU 1E4990U, rev 0xc0
ROCm / HIP     7.2.53211-364a905 (ROCm 7.2.3, HIP 7.2.26015)
rocBLAS        7.2.3-1 (Tensile bundled; no separate package)
hipBLASLt      7.2.3-1
OS / kernel    CachyOS, kernel 6.x (7.0.10-2-cachyos), in-kernel (inbox) amdgpu driver (not amdgpu-dkms), linux-firmware-amdgpu 20260519
PyTorch        2.11.0+rocm7.2 (torch.version.hip = 7.2.26015)
bitsandbytes   0.49.2
Backend        rocBLAS Tensile path (TORCH_USE_HIPBLASLT=0; see "Ruled out": hipBLASLt has no kernel for this shape and falls back to Tensile regardless)

The correct-arch Tensile library is present and loaded: /opt/rocm/lib/rocblas/library/ contains both TensileLibrary_lazy_gfx1201.dat and Kernels.so-000-gfx1201.hsaco, and the faulting kernel is ISA1201 (gfx1201's ISA). This rules out the wrong-library-lookup failure mode of #7192 — we are running the genuine gfx1201 kernel, which itself faults.
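The library check described above can be repeated with a few lines of Python. This is a minimal sketch: the two file names are the ones quoted in this report, the directory is the default install path mentioned above, and the helper name `missing_tensile_files` is made up for illustration.

```python
from pathlib import Path

# File names quoted in this report; the directory passed in below is the
# default ROCm install location and may differ on other systems.
EXPECTED_FILES = (
    "TensileLibrary_lazy_gfx1201.dat",
    "Kernels.so-000-gfx1201.hsaco",
)


def missing_tensile_files(library_dir, expected=EXPECTED_FILES):
    """Return the expected Tensile artifacts that are absent from library_dir."""
    root = Path(library_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_tensile_files("/opt/rocm/lib/rocblas/library")
print("missing gfx1201 files:", missing if missing else "none")
```

An empty result only shows that the files exist on disk; that the gfx1201 kernel is the one actually dispatched still rests on the ISA1201 tag in the logged kernel name.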

Faulting kernel (full ShaderName)

Captured at AMD_LOG_LEVEL=3 on a clean GPU. The same kernel variant faults in both the forward and backward GEMM of the workload:

Cijk_Ailk_Bjlk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT64x64x64_MI16x16x1_SN_LDSB0_AFC1_AFEM1_AFEM1_ASEM1_CLR1_CADS0_DTLA0_DTLB0_DTVA0_DTVB1_EPS0_FDSI0_GRPM1_GRVWA8_GRVWB8_GSUAMB_GLS0_ISA1201_IU1_K1_LDSTI0_LBSPPA1024_LBSPPB0_LBSPPM0_LPA32_LPB0_LPM0_LRVW8_LWPMn1_MIAV1_MIWT2_2_MO40_NTn1_NTA0_NTB0_NTC0_NTD0_NTM0_NEPBS0_NLCA1_NLCB2_ONLL0_PGR2_PLR1_PKA0_SIA3_SS0_SPO0_SRVW0_SSO0_SVW8_SK0_SKFTR0_SKXCCM0_TLDS0_ULSGRO0_USL1_UIOFGRO0_USFGROn1_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS32_WG32_4_1

Tile: MT64x64x64, MI16x16x1, ISA1201. Flags on the B-operand path, expanded to their Tensile parameter names: DTVB1 (DirectToVgprB = 1), LBSPPB0 (LdsBlockSizePerPadB = 0), NLCB2 (NumLoadsCoalescedB = 2), VWA2_VWB1 (VectorWidthA = 2, VectorWidthB = 1).
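To compare this kernel against other variants without reading the name by eye, the flags can be pulled out of the ShaderName programmatically. The sketch below assumes that the name is a plain underscore-separated list of `<abbreviation><value>` tokens and that a leading `n` in a value stands for a minus sign (as in LWPMn1); both are inferred from how the string looks, not taken from Tensile documentation. `flag_value` is a hypothetical helper.

```python
import re

# Full ShaderName copied from the log in this report.
SHADER_NAME = (
    "Cijk_Ailk_Bjlk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT64x64x64_MI16x16x1_SN_"
    "LDSB0_AFC1_AFEM1_AFEM1_ASEM1_CLR1_CADS0_DTLA0_DTLB0_DTVA0_DTVB1_EPS0_"
    "FDSI0_GRPM1_GRVWA8_GRVWB8_GSUAMB_GLS0_ISA1201_IU1_K1_LDSTI0_LBSPPA1024_"
    "LBSPPB0_LBSPPM0_LPA32_LPB0_LPM0_LRVW8_LWPMn1_MIAV1_MIWT2_2_MO40_NTn1_"
    "NTA0_NTB0_NTC0_NTD0_NTM0_NEPBS0_NLCA1_NLCB2_ONLL0_PGR2_PLR1_PKA0_SIA3_"
    "SS0_SPO0_SRVW0_SSO0_SVW8_SK0_SKFTR0_SKXCCM0_TLDS0_ULSGRO0_USL1_"
    "UIOFGRO0_USFGROn1_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS32_WG32_4_1"
)


def flag_value(shader_name, key):
    """Return the integer that follows `key` in a ShaderName, or None.

    Only tokens of the exact form <key><digits> or <key>n<digits> match, so
    "VWB" does not match "GRVWB8". The 'n' prefix is read as a minus sign,
    which is an assumption.
    """
    pattern = re.compile(re.escape(key) + r"(n?)(\d+)")
    for token in shader_name.split("_"):
        match = pattern.fullmatch(token)
        if match:
            value = int(match.group(2))
            return -value if match.group(1) else value
    return None


for key in ("DTVA", "DTVB", "LBSPPA", "LBSPPB", "NLCA", "NLCB", "VWA", "VWB"):
    print(key, flag_value(SHADER_NAME, key))
```

Running the same helper over the name of a kernel that does not fault would show which A/B flags actually differ between the two.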

The faulting call — operands vs. fault address

GEMM dimensions (derived from the logged operands): M=608, N=5376, K=21504, bf16, batch 1. A is row-major contiguous (M×K); B is column-major (K×N), stride=(1, 21504); C is (M×N) bf16 (~6 MB). The pair logged immediately before the fault:

A (gemm input)   shape=(1, 608, 21504)  stride=(13074432, 21504, 1)  contig=True   ptr=0x7f655ece8000  end=0x7f65605d8000
B (dequant wt)   shape=(21504, 5376)    stride=(1, 21504)            contig=False  ptr=0x7f63f01a0000  end=0x7f63fde20000
FAULT: 0x7f6459a00000
Operand                                      Range                            Fault inside?
A (1,608,21504) row-major contig             0x7f655ece8000-0x7f65605d8000    No: fault ~4.1 GiB below ptr
B (21504,5376) col-major stride=(1,21504)    0x7f63f01a0000-0x7f63fde20000    No: fault ~1.4 GiB above end

A full scan of the run logs found 0 GEMM operands that bracket 0x7f6459a00000. The fault sits in unmapped space at least ~1 GB from either operand, not adjacent to any buffer boundary, which rules out a simple off-by-N or undersized-descriptor read.
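The range check can be redone from the logged values alone. The sketch below recomputes each operand's byte extent from its pointer, shape, and element strides (2 bytes per bf16 element), then measures how far the fault address lies from it. `extent` is an illustrative helper and assumes non-negative strides.

```python
BF16_BYTES = 2
FAULT = 0x7F6459A00000

# (ptr, shape, element strides) exactly as logged before the fault.
OPERANDS = {
    "A": (0x7F655ECE8000, (1, 608, 21504), (13074432, 21504, 1)),
    "B": (0x7F63F01A0000, (21504, 5376), (1, 21504)),
}


def extent(ptr, shape, stride, itemsize):
    """Half-open byte range [start, end) touched by a strided tensor."""
    if any(dim == 0 for dim in shape):
        return ptr, ptr
    # Offset (in elements) of the last addressable element.
    last = sum((dim - 1) * step for dim, step in zip(shape, stride))
    return ptr, ptr + (last + 1) * itemsize


for name, (ptr, shape, stride) in OPERANDS.items():
    start, end = extent(ptr, shape, stride, BF16_BYTES)
    inside = start <= FAULT < end
    # Distance from the fault to the nearest edge of the operand.
    gap = start - FAULT if FAULT < start else FAULT - end
    print(f"{name}: [{start:#x}, {end:#x}) inside={inside} gap={gap / 2**30:.2f} GiB")
```

With the logged values, the computed end addresses match the logged `end=` fields, and the gaps come out as 4.08 GiB (fault below A's start) and 1.43 GiB (fault above B's end).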

Memory access fault by GPU node-1 on address 0x7f6459a00000.
Reason: Page not present or supervisor privilege.

(Address varies run to run — e.g. 0x7f0ac2e00000, 0x7f29eb600000 in other runs — but is consistently ~1 GB from the operands and 2 MB-aligned.)
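The alignment claim is easy to verify for the three fault addresses quoted in this report. A small sketch follows; 2 MB is taken to mean 2 MiB (0x200000), which happens to equal the 2048KB recommended allocation granule that rocminfo reports for the GPU pool, although whether the two are related is not established here.

```python
TWO_MIB = 2 * 1024 * 1024

# Fault addresses quoted in this report (one per run).
FAULT_ADDRESSES = (0x7F6459A00000, 0x7F0AC2E00000, 0x7F29EB600000)


def is_aligned(addr, alignment=TWO_MIB):
    """True if addr is a whole multiple of alignment."""
    return addr % alignment == 0


for addr in FAULT_ADDRESSES:
    print(f"{addr:#x}: 2 MiB aligned = {is_aligned(addr)}")
```

For comparison, the logged A pointer 0x7f655ece8000 is not 2 MiB-aligned, so the alignment is a property of the fault addresses rather than of every address in the run.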

Reproducibility

  • Deterministic on a clean GPU: faults reproducibly within a few steps of resuming the workload.
  • Persists under serialization: AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 → not an async race.
  • Both directions: the forward-recompute GEMM and the backward GEMM dispatch the identical tile, and both fault.
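For anyone re-running the workload under the same conditions, the serialization and logging settings can be applied by a small launcher. This is a sketch: the three variables and their values are the ones named in this report, while `debug_environment`, `run_serialized`, and the command passed on the command line are placeholders.

```python
import os
import subprocess
import sys

# Settings named in this report: full kernel serialization plus verbose HIP logging.
DEBUG_ENV = {
    "AMD_SERIALIZE_KERNEL": "3",
    "HIP_LAUNCH_BLOCKING": "1",
    "AMD_LOG_LEVEL": "3",
}


def debug_environment(base=None):
    """Copy of the environment (or of `base`) with the debug settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_ENV)
    return env


def run_serialized(argv):
    """Run argv as a child process under the debug environment; return its exit code."""
    return subprocess.run(argv, env=debug_environment()).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python launcher.py python train.py   (train.py is a placeholder)
    sys.exit(run_serialized(sys.argv[1:]))
```

Setting the variables in the child's environment, rather than inside the training script, means they are in place before the HIP runtime is loaded.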

Minimal reproducer

TODO — rocblas-bench line (last artifact, capture in progress). rocblas-bench is installed (/opt/rocm/bin/rocblas-bench). Capturing one ROCBLAS_LAYER=2 run emits the exact rocblas-bench -f gemm_ex --transposeA … --transposeB … -m 608 -n 5376 -k 21504 --lda … --ldb … --ldc … line for the faulting GEMM; running that line directly reproduces the fault independent of PyTorch/bitsandbytes, isolating it entirely within rocBLAS/Tensile. Will be attached here as a follow-up.

Until then the trigger is reachable via any gemm_ex with this selection: bf16, M=608/N=5376/K=21504, transposed/column-major B (the DTVB1 path), on gfx1201 with the Tensile backend.
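As a stopgap before the rocblas-bench line is available, the same operand shapes and strides can be set up in plain PyTorch without bitsandbytes. This is only a sketch: it has not been confirmed to dispatch the same Tensile kernel or to fault, and the function names are made up. It assumes a ROCm build of PyTorch, where the GPU is addressed as "cuda".

```python
M, N, K = 608, 5376, 21504


def expected_layouts(m=M, n=N, k=K):
    """Shapes and element strides of the operands as logged before the fault."""
    return {
        "A": {"shape": (1, m, k), "stride": (m * k, k, 1)},
        # B is the transposed view of an (n, k) row-major weight.
        "B": {"shape": (k, n), "stride": (1, k)},
    }


def run_once():
    """Issue one bf16 matmul with a column-major B operand on the GPU."""
    import torch  # imported lazily so expected_layouts() works without PyTorch

    a = torch.randn(1, M, K, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")
    b = w.t()  # (K, N) view with stride (1, K); no copy is made

    want = expected_layouts()
    assert tuple(a.shape) == want["A"]["shape"] and a.stride() == want["A"]["stride"]
    assert tuple(b.shape) == want["B"]["shape"] and b.stride() == want["B"]["stride"]

    out = torch.matmul(a, b)
    torch.cuda.synchronize()  # make an asynchronous fault surface at this point
    return tuple(out.shape)
```

Calling `run_once()` with AMD_LOG_LEVEL=3 set shows which kernel is selected; if it is not the MT64x64x64 ISA1201 kernel named above, the result says nothing about this bug either way.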

Ruled out (to save triage time)

Hypothesis                                          Evidence
Asynchronous / concurrency race                     Persists under AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1.
Caller passed an undersized / wrong-shape operand   Both operands logged and valid, with the expected layouts (A contiguous, B column-major); fault is at least ~1 GB from either boundary, not adjacent.
bitsandbytes NF4 dequant kernel                     Dequant kernels succeed every iteration; fault is in the GEMM consuming their output.
hipBLASLt                                           TORCH_USE_HIPBLASLT=1 was tried: hipBLASLt has no kernel for this MT64x64x64 DTVB1 shape (logs "Cannot find the function" ×6) and falls back to this same Tensile kernel; fault persists.
Allocator placement / fragmentation                 PYTORCH_HIP_ALLOC_CONF GC-threshold and expandable_segments changes do not move or eliminate the fault (expandable_segments is in fact ignored on gfx1201).
OOM                                                 VRAM at fault: ~21 GB allocated / ~28 GB reserved of 32 GB, within headroom.

Open / unconfirmed

  • Column-major B as the trigger is observed, not yet proven. Every faulting call has a stride=(1,K) B operand and the DTVB1 kernel flag, and the same tile succeeds on other shapes — strong correlation. Forcing the dequantized weight row-major at the framework (Python) level does not change the outcome and does not test the hypothesis: bitsandbytes sets the transpose via the BLAS transB flag (the A @ Wᵀ call structure), not via the tensor's physical stride, so a Python .contiguous() never reaches the rocBLAS descriptor (the DTVB1 kernel still dispatches and still faults). The definitive test is at the rocBLAS layer: run the captured rocblas-bench line with --transposeB T (reproduces) vs. an equivalent --transposeB N (row-major B → DTVB0, a different tile) and compare. That experiment is folded into the reproducer capture above.
  • Output buffer not instrumented. A page-not-present is formally consistent with an OOB write to C rather than a wild read of an input. Considered implausible (C is ~6 MB and the fault is ~1 GB away) but not ruled out by instrumentation.

Impact

Blocks NF4 QLoRA fine-tuning of large models (Gemma-4-31B) on gfx1201/RDNA4 under ROCm 7.2 — any backward pass that dispatches this tile faults. No userspace workaround found (hipBLASLt lacks the kernel; allocator/serialization knobs don't help).

Possibly related

  • rocm-libraries#6166 — rocBLAS unit tests (dot_ex) on gfx1201 produce the identical Memory access fault … Page not present signature with no PyTorch/bitsandbytes in the picture. Strongest corroboration that gfx1201 rocBLAS kernels page-fault at the library level; likely an in-house-runnable repro of the same class of bug.
  • rocm-libraries#4097 — same kernel family (MT64x64x64 … DTVB1 … ISA1201) on Windows, but the missing-kernel (hipErrorNotFound) mode rather than a present-but-faulting kernel. Suggests the DTVB1 ISA1201 MT64x64x64 tiles are the trouble spot for gfx1201.
  • rocm-libraries#7192 — same GPU (R9700); a wrong-Tensile-file (gfx1200.dat) lookup. Ruled out as our mechanism (we confirmed the gfx1201 library is present and the kernel is ISA1201), included as same-hardware context.

Operating System

CachyOS, kernel 6.x (7.0.10-2-cachyos), in-kernel (inbox) amdgpu driver (not amdgpu-dkms), linux-firmware-amdgpu 20260519

CPU

Ryzen 9 5900X

GPU

AMD R9700 GPU AI PRO

ROCm Version

7.2.3

ROCm Component

Tensile

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

❯ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.18
Runtime Ext Version:     1.15
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 5900X 12-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 5900X 12-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4954
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            24
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 4
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1201
  Uuid:                    GPU-bdad470151adc4b7
  Marketing Name:          AMD Radeon AI PRO R9700
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      8192(0x2000) KB
    L3:                      65536(0x10000) KB
  Chip ID:                 30033(0x7551)
  ASIC Revision:           1(0x1)
  Cacheline Size:          256(0x100)
  Max Clock Freq. (MHz):   2350
  BDFID:                   2560
  Internal Node ID:        1
  Compute Unit:            64
  SIMDs per CU:            2
  Shader Engines:          4
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        2147483647(0x7fffffff)
    y                        65535(0xffff)
    z                        65535(0xffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 268
  SDMA engine uCode::      662
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    33406976(0x1fdc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1201
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
    ISA 2
      Name:                    amdgcn-amd-amdhsa--gfx12-generic
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response
