[Issue]: gfx1201 (RDNA4): ISA1201 Tensile GEMM MT64x64x64 computes out-of-bounds address (GPU page fault) on a column-major B operand

### Problem Description

# Summary

On **gfx1201 (RDNA4)** under **ROCm 7.2.3**, a Tensile-generated GEMM kernel (`Cijk_…_MT64x64x64_…_ISA1201`) triggers a reproducible **GPU page fault** ("page not present"). The faulting access lands at a host-VA address **more than 1 GiB outside the bounds of both GEMM operands** (about 1.4 GiB above the end of B and about 4.1 GiB below the start of A), which indicates that the kernel computes an out-of-bounds address rather than reading a valid-but-unexpected buffer. The fault reproduces on a clean GPU and **persists under full kernel serialization** (`AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1`), so it is not an asynchronous race. The same kernel and tile succeed on other problem sizes in the same workload; so far the fault has only been seen on calls whose **B operand is column-major** (`stride=(1, K)`).

The workload is Gemma-4-31B NF4 QLoRA fine-tuning (PyTorch + bitsandbytes); the fault is in the rocBLAS GEMM consuming the dequantized weight, **not** in the bitsandbytes dequant kernel (verified — see "Ruled out").

# Background information

I am running SFT on Gemma-4-31B on an AMD Radeon AI PRO R9700 and the run kept crashing. If needed, I also have wandb.ai logs (system and training info).

More in-depth notes and the experiments I ran are in this PR in my project: https://github.com/ulises-c/csen-346/pull/101
I also opened an issue on my own repo to track what to post here: https://github.com/ulises-c/csen-346/issues/113

# Environment

| | |
|---|---|
| GPU | AMD Radeon AI PRO R9700, **gfx1201 / RDNA4 (Navi 48)**, 32 GB; VBIOS `113-1E4990U-S83`, SKU 1E4990U, rev 0xc0 |
| ROCm / HIP | **7.2.53211-364a905** (ROCm 7.2.3, HIP 7.2.26015) |
| rocBLAS | **7.2.3-1** (Tensile bundled; no separate package) |
| hipBLASLt | **7.2.3-1** |
| OS / kernel | CachyOS, kernel **`7.0.10-2-cachyos`**, **in-kernel (inbox) amdgpu** driver (not amdgpu-dkms), linux-firmware-amdgpu `20260519` |
| PyTorch | **2.11.0+rocm7.2** (`torch.version.hip` = 7.2.26015) |
| bitsandbytes | 0.49.2 |
| Backend | rocBLAS Tensile path (`TORCH_USE_HIPBLASLT=0`; see "Ruled out" — hipBLASLt has no kernel for this shape and falls back to Tensile regardless) |

> The correct-arch Tensile library **is** present and loaded: `/opt/rocm/lib/rocblas/library/` contains both `TensileLibrary_lazy_gfx1201.dat` and `Kernels.so-000-gfx1201.hsaco`, and the faulting kernel is `ISA1201` (gfx1201's ISA). This rules out the wrong-library-lookup failure mode of [#7192](https://github.com/ROCm/rocm-libraries/issues/7192) — we are running the genuine gfx1201 kernel, which itself faults.
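
The presence check described above can be scripted so that it is easy to rerun after a package update. A minimal sketch, assuming the default `/opt/rocm` install prefix quoted in this report; `missing_tensile_files` is an illustrative helper name, not part of any ROCm tool:

```python
from pathlib import Path

# Default location quoted in this report; adjust for non-standard installs.
ROCBLAS_LIBRARY_DIR = Path("/opt/rocm/lib/rocblas/library")


def expected_files(arch):
    """The two per-architecture files this report looked for."""
    return [f"TensileLibrary_lazy_{arch}.dat", f"Kernels.so-000-{arch}.hsaco"]


def missing_tensile_files(arch, library_dir=ROCBLAS_LIBRARY_DIR):
    """Return the expected file names that are absent from library_dir."""
    root = Path(library_dir)
    return [name for name in expected_files(arch) if not (root / name).is_file()]


missing = missing_tensile_files("gfx1201")
print("missing:", missing if missing else "none")
```

An empty list means both files are present; it says nothing about which library rocBLAS actually loads at runtime, which is why the `ISA1201` suffix in the kernel name is the stronger evidence.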

# Faulting kernel (full ShaderName)

Captured at `AMD_LOG_LEVEL=3` on a clean GPU. The **same kernel variant faults in both the forward and backward GEMM** of the workload:

```
Cijk_Ailk_Bjlk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT64x64x64_MI16x16x1_SN_LDSB0_AFC1_AFEM1_AFEM1_ASEM1_CLR1_CADS0_DTLA0_DTLB0_DTVA0_DTVB1_EPS0_FDSI0_GRPM1_GRVWA8_GRVWB8_GSUAMB_GLS0_ISA1201_IU1_K1_LDSTI0_LBSPPA1024_LBSPPB0_LBSPPM0_LPA32_LPB0_LPM0_LRVW8_LWPMn1_MIAV1_MIWT2_2_MO40_NTn1_NTA0_NTB0_NTC0_NTD0_NTM0_NEPBS0_NLCA1_NLCB2_ONLL0_PGR2_PLR1_PKA0_SIA3_SS0_SPO0_SRVW0_SSO0_SVW8_SK0_SKFTR0_SKXCCM0_TLDS0_ULSGRO0_USL1_UIOFGRO0_USFGROn1_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS32_WG32_4_1
```

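Tile: `MT64x64x64`, `MI16x16x1`, `ISA1201`. B-side flags, expanded per Tensile's parameter-abbreviation scheme as I understand it: `DTVB1` (DirectToVgprB enabled), `LBSPPB0` (LdsBlockSizePerPadB = 0), `NLCB2` (NumLoadsCoalescedB = 2), `VWA2_VWB1` (VectorWidthA = 2, VectorWidthB = 1). The operand layout itself is encoded in the `Ailk_Bjlk` index order at the start of the name.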
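
To compare kernel names across log lines without eyeballing them, the flags called out above can be extracted mechanically. A rough sketch: `parse_flags` is a hypothetical helper, the regular expression only knows the prefixes discussed in this report, and it makes no attempt to interpret what a flag means.

```python
import re

# Prefixes discussed in this report; extend the tuple as needed.
_PREFIXES = ("MT", "MI", "ISA", "DTVA", "DTVB", "LBSPPA", "LBSPPB",
             "NLCA", "NLCB", "VWA", "VWB")

# A flag is "<PREFIX><digits or x>" delimited by "_" (or string start/end).
_FLAG_RE = re.compile(r"(?:^|_)(" + "|".join(_PREFIXES) + r")([0-9x]+)(?=_|$)")

# Full ShaderName from the fenced block above, split only for line length.
KERNEL_NAME = (
    "Cijk_Ailk_Bjlk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT64x64x64_MI16x16x1_SN_"
    "LDSB0_AFC1_AFEM1_AFEM1_ASEM1_CLR1_CADS0_DTLA0_DTLB0_DTVA0_DTVB1_EPS0_"
    "FDSI0_GRPM1_GRVWA8_GRVWB8_GSUAMB_GLS0_ISA1201_IU1_K1_LDSTI0_"
    "LBSPPA1024_LBSPPB0_LBSPPM0_LPA32_LPB0_LPM0_LRVW8_LWPMn1_MIAV1_MIWT2_2_"
    "MO40_NTn1_NTA0_NTB0_NTC0_NTD0_NTM0_NEPBS0_NLCA1_NLCB2_ONLL0_PGR2_PLR1_"
    "PKA0_SIA3_SS0_SPO0_SRVW0_SSO0_SVW8_SK0_SKFTR0_SKXCCM0_TLDS0_ULSGRO0_"
    "USL1_UIOFGRO0_USFGROn1_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS32_WG32_4_1"
)


def parse_flags(kernel_name):
    """Return {prefix: value} for the known prefixes found in a kernel name."""
    return dict(_FLAG_RE.findall(kernel_name))


flags = parse_flags(KERNEL_NAME)
print({key: flags[key] for key in ("MT", "MI", "ISA", "DTVA", "DTVB", "VWA", "VWB")})
```

Running the same function over the kernel names of the GEMMs that succeed would show which flags, if any, differ between the passing and faulting dispatches.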

# The faulting call — operands vs. fault address

GEMM dimensions (derived from the logged operands): **M=608, N=5376, K=21504**, bf16, batch 1. A is row-major contiguous (M×K); **B is column-major (K×N), `stride=(1, 21504)`**; C is (M×N) bf16 (~6 MB). The pair logged immediately before the fault:

```
A (gemm input)   shape=(1, 608, 21504)  stride=(13074432, 21504, 1)  contig=True   ptr=0x7f655ece8000  end=0x7f65605d8000
B (dequant wt)   shape=(21504, 5376)    stride=(1, 21504)            contig=False  ptr=0x7f63f01a0000  end=0x7f63fde20000
FAULT: 0x7f6459a00000
```

| Operand | range | fault inside? |
|---|---|---|
| A `(1,608,21504)` row-major contig | `0x7f655ece8000` … `0x7f65605d8000` | **No**: fault ~4.1 GiB *below* `ptr` |
| B `(21504,5376)` col-major `stride=(1,21504)` | `0x7f63f01a0000` … `0x7f63fde20000` | **No**: fault ~1.4 GiB *above* `end` |

A full scan of the run logs finds **0 GEMM operands that bracket `0x7f6459a00000`**. The fault sits in unmapped space more than 1 GiB from either operand, not adjacent to any buffer boundary, which rules out a simple off-by-N or undersized-descriptor read.

```
Memory access fault by GPU node-1 on address 0x7f6459a00000.
Reason: Page not present or supervisor privilege.
```

(The address varies from run to run, e.g. `0x7f0ac2e00000` and `0x7f29eb600000` in other runs, but it is consistently far from the operands, at least ~1 GB, and 2 MiB-aligned.)
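
The containment check is plain pointer arithmetic and can be redone from the logged values. A small sketch, assuming 2-byte bf16 elements and that each operand spans `ptr` to `ptr + numel * 2` (which matches the logged `end` values); the function names are illustrative:

```python
GIB = 1 << 30
TWO_MIB = 2 << 20
BF16_BYTES = 2

M, N, K = 608, 5376, 21504
FAULT = 0x7F6459A00000
OTHER_FAULTS = [0x7F0AC2E00000, 0x7F29EB600000]  # fault addresses from other runs

# name -> (base pointer from the log, element count)
OPERANDS = {
    "A": (0x7F655ECE8000, M * K),
    "B": (0x7F63F01A0000, K * N),
}


def extent(ptr, numel):
    """Half-open byte range [start, end) of a densely packed bf16 operand."""
    return ptr, ptr + numel * BF16_BYTES


def offset_outside(addr, start, end):
    """None if addr is inside [start, end); otherwise a signed byte offset:
    negative = below start, non-negative = counted upward from end."""
    if addr < start:
        return addr - start
    if addr >= end:
        return addr - end
    return None


for name, (ptr, numel) in OPERANDS.items():
    start, end = extent(ptr, numel)
    off = offset_outside(FAULT, start, end)
    where = "inside" if off is None else f"{off / GIB:+.2f} GiB from the range"
    print(f"{name}: {start:#x} .. {end:#x}  fault is {where}")

for addr in [FAULT, *OTHER_FAULTS]:
    print(f"{addr:#x} is 2 MiB-aligned: {addr % TWO_MIB == 0}")
```

With the values above, the computed ends match the logged `end` pointers, the fault is about 4.08 GiB below A and about 1.43 GiB above B, and all three fault addresses are 2 MiB-aligned.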

# Reproducibility

- **Deterministic on a clean GPU**, faults reproducibly within a few steps of resuming the workload.
- **Persists under serialization** — `AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1` → not an async race.
- **Both directions** — forward-recompute GEMM and backward GEMM dispatch the identical tile and both fault.
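
For anyone re-running the workload, the serialization and logging settings quoted in this report can be applied from a small launcher. A minimal sketch: `debug_env` and `run_serialized` are hypothetical helpers, and combining all four variables in a single run is a choice made here for convenience.

```python
import os
import subprocess

# Settings quoted in this report. The fault reproduces with the first two set;
# AMD_LOG_LEVEL=3 is what exposed the kernel name; the last keeps the rocBLAS path.
DEBUG_ENV = {
    "AMD_SERIALIZE_KERNEL": "3",
    "HIP_LAUNCH_BLOCKING": "1",
    "AMD_LOG_LEVEL": "3",
    "TORCH_USE_HIPBLASLT": "0",
}


def debug_env(base=None):
    """Copy of the base environment (default: os.environ) with DEBUG_ENV applied."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_ENV)
    return env


def run_serialized(cmd):
    """Run cmd (a list of arguments) under the debug environment; return its exit code."""
    return subprocess.run(cmd, env=debug_env(), check=False).returncode
```

A call such as `run_serialized(["python", "train.py"])` would launch the job under these settings, where `train.py` stands in for the actual training entry point.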

# Minimal reproducer

**TODO — `rocblas-bench` line (last artifact, capture in progress).** `rocblas-bench` is installed (`/opt/rocm/bin/rocblas-bench`). Capturing one `ROCBLAS_LAYER=2` run emits the exact `rocblas-bench -f gemm_ex --transposeA … --transposeB … -m 608 -n 5376 -k 21504 --lda … --ldb … --ldc …` line for the faulting GEMM; running that line directly reproduces the fault **independent of PyTorch/bitsandbytes**, isolating it entirely within rocBLAS/Tensile. **Will be attached here as a follow-up.**

Until then the trigger is reachable via any `gemm_ex` with this selection: bf16, M=608/N=5376/K=21504, **transposed/column-major B** (the `DTVB1` path), on gfx1201 with the Tensile backend.
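
As a stopgap while the `rocblas-bench` line is pending, the logged shapes and strides can be replayed from plain PyTorch with no bitsandbytes involved. This is only a sketch under stated assumptions: it expects a ROCm build of PyTorch (where the GPU is addressed as `"cuda"`), `expected_layout` and `run_once` are hypothetical helpers, and nothing guarantees that rocBLAS selects the same Tensile kernel for this call as it does inside the training run, so a clean pass here would not clear the kernel.

```python
M, N, K = 608, 5376, 21504  # dimensions of the faulting call


def expected_layout():
    """Shapes and element strides of A and B as logged before the fault."""
    a = {"shape": (1, M, K), "stride": (M * K, K, 1)}
    b = {"shape": (K, N), "stride": (1, K)}  # column-major view
    return a, b


def run_once(device="cuda"):
    """Build operands with the logged layout and run one bf16 matmul."""
    import torch  # imported lazily so expected_layout() works without PyTorch

    a = torch.randn(1, M, K, device=device, dtype=torch.bfloat16)
    w = torch.randn(N, K, device=device, dtype=torch.bfloat16)
    b = w.t()  # (K, N) view with stride (1, K); no copy is made

    exp_a, exp_b = expected_layout()
    assert tuple(a.shape) == exp_a["shape"] and a.stride() == exp_a["stride"]
    assert tuple(b.shape) == exp_b["shape"] and b.stride() == exp_b["stride"]

    out = torch.matmul(a, b)
    if device != "cpu":
        torch.cuda.synchronize()  # surface an asynchronous fault at this point
    return tuple(out.shape)
```

On the affected machine, `run_once()` should either return `(1, 608, 5376)` or terminate with the memory access fault; running it with `AMD_LOG_LEVEL=3` would show which kernel was actually dispatched.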

# Ruled out (to save triage time)

| Hypothesis | Evidence |
|---|---|
| Asynchronous / concurrency race | Persists under `AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1`. |
| Caller passed an undersized / wrong-shape operand | Both operands are logged and valid, with the expected layouts (A contiguous, B a column-major view); the fault is more than 1 GiB from either operand, not adjacent to a boundary. |
| bitsandbytes NF4 dequant kernel | Dequant kernels succeed every iteration; fault is in the GEMM consuming their output. |
| hipBLASLt | `TORCH_USE_HIPBLASLT=1` was tried — hipBLASLt has **no kernel** for this `MT64x64x64 DTVB1` shape (logs `Cannot find the function` ×6) and falls back to this same Tensile kernel; fault persists. |
| Allocator placement / fragmentation | `PYTORCH_HIP_ALLOC_CONF` GC-threshold and `expandable_segments` changes do not move or eliminate the fault (`expandable_segments` is in fact ignored on gfx1201). |
| OOM | VRAM at fault ~21 GB allocated / ~28 GB reserved of 32 GB — within headroom. |

# Open / unconfirmed

- **Column-major B as the trigger is observed, not yet proven.** Every faulting call has a `stride=(1,K)` B operand and dispatches the `DTVB1` kernel, while the same tile succeeds on other shapes: a strong correlation, not a proof. Forcing the dequantized weight row-major at the framework (Python) level does **not** change the outcome and does **not** test the hypothesis: bitsandbytes sets the transpose via the BLAS `transB` flag (the `A @ Wᵀ` call structure), not via the tensor's physical stride, so a Python `.contiguous()` never reaches the rocBLAS descriptor (the `DTVB1` kernel still dispatches and still faults). The definitive test is at the rocBLAS layer: run the captured `rocblas-bench` line with `--transposeB T` (expected to reproduce) and an equivalent line with `--transposeB N` (row-major B, which should select a different kernel), then compare. That experiment is folded into the reproducer capture above.
- **Output buffer not instrumented.** A page-not-present is formally consistent with an OOB *write* to C rather than a wild *read* of an input. Considered implausible (C is ~6 MB and the fault is ~1 GB away) but not ruled out by instrumentation.

# Impact

Blocks NF4 QLoRA fine-tuning of large models (Gemma-4-31B) on gfx1201/RDNA4 under ROCm 7.2: both the forward-recompute GEMM and the backward GEMM fault when they dispatch this tile at the shape above. No userspace workaround found (hipBLASLt lacks the kernel; allocator/serialization knobs don't help).

# Possibly related

- [rocm-libraries#6166](https://github.com/ROCm/rocm-libraries/issues/6166) — rocBLAS unit tests (`dot_ex`) on gfx1201 produce the **identical** `Memory access fault … Page not present` signature with **no** PyTorch/bitsandbytes in the picture. Strongest corroboration that gfx1201 rocBLAS kernels page-fault at the library level; likely an in-house-runnable repro of the same class of bug.
- [rocm-libraries#4097](https://github.com/ROCm/rocm-libraries/issues/4097) — same kernel family (`MT64x64x64 … DTVB1 … ISA1201`) on Windows, but the *missing-kernel* (`hipErrorNotFound`) mode rather than a present-but-faulting kernel. Suggests the `DTVB1` ISA1201 `MT64x64x64` tiles are the trouble spot for gfx1201.
- [rocm-libraries#7192](https://github.com/ROCm/rocm-libraries/issues/7192) — same GPU (R9700); a wrong-Tensile-file (`gfx1200.dat`) lookup. **Ruled out as our mechanism** (we confirmed the gfx1201 library is present and the kernel is `ISA1201`), included as same-hardware context.


### Operating System

CachyOS, kernel 7.0.10-2-cachyos, in-kernel (inbox) amdgpu driver (not amdgpu-dkms), linux-firmware-amdgpu 20260519

### CPU

Ryzen 9 5900X

### GPU

AMD Radeon AI PRO R9700

### ROCm Version

7.2.3

### ROCm Component

Tensile

### Steps to Reproduce

_No response_

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

```
❯ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.18
Runtime Ext Version:     1.15
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 5900X 12-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 5900X 12-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4954
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            24
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 4
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65740760(0x3eb1fd8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1201
  Uuid:                    GPU-bdad470151adc4b7
  Marketing Name:          AMD Radeon AI PRO R9700
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      8192(0x2000) KB
    L3:                      65536(0x10000) KB
  Chip ID:                 30033(0x7551)
  ASIC Revision:           1(0x1)
  Cacheline Size:          256(0x100)
  Max Clock Freq. (MHz):   2350
  BDFID:                   2560
  Internal Node ID:        1
  Compute Unit:            64
  SIMDs per CU:            2
  Shader Engines:          4
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        2147483647(0x7fffffff)
    y                        65535(0xffff)
    z                        65535(0xffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 268
  SDMA engine uCode::      662
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    33406976(0x1fdc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1201
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
    ISA 2
      Name:                    amdgcn-amd-amdhsa--gfx12-generic
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
*** Done ***
```

### Additional Information

_No response_
