Problem Description
Summary
On gfx1201 (RDNA4) under ROCm 7.2.3, a Tensile-generated GEMM kernel (Cijk_…_MT64x64x64_…_ISA1201) triggers a reproducible GPU page fault ("page not present"). The faulting access lands at a host-VA address more than 1 GB outside the bounds of both logged GEMM operands, which suggests the kernel computes an out-of-bounds address rather than reading a valid-but-unexpected buffer. The fault reproduces on a clean GPU and persists under full kernel serialization (AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1), so it is not an asynchronous race. The same kernel and tile succeed on other problem sizes in the same workload; every faulting call observed so far has a column-major B operand (stride=(1, K)).
The workload is Gemma-4-31B NF4 QLoRA fine-tuning (PyTorch + bitsandbytes); the fault is in the rocBLAS GEMM consuming the dequantized weight, not in the bitsandbytes dequant kernel (verified — see "Ruled out").
Background information
I am running SFT on Gemma-4-31B on an AMD Radeon AI PRO R9700, and training kept crashing. If needed, I also have wandb.ai logs (system and training info).
More in-depth notes and the experiments I ran are in this PR in my project: ulises-c/csen-346#101
I created an issue on my own repo to track what to post here: ulises-c/csen-346#113
Environment
| Component | Version / details |
|---|---|
| GPU | AMD Radeon AI PRO R9700, gfx1201 / RDNA4 (Navi 48), 32 GB; VBIOS 113-1E4990U-S83, SKU 1E4990U, rev 0xc0 |
| ROCm / HIP | 7.2.53211-364a905 (ROCm 7.2.3, HIP 7.2.26015) |
| rocBLAS | 7.2.3-1 (Tensile bundled; no separate package) |
| hipBLASLt | 7.2.3-1 |
| OS / kernel | CachyOS, kernel 6.x (7.0.10-2-cachyos), in-kernel (inbox) amdgpu driver (not amdgpu-dkms), linux-firmware-amdgpu 20260519 |
| PyTorch | 2.11.0+rocm7.2 (torch.version.hip = 7.2.26015) |
| bitsandbytes | 0.49.2 |
| Backend | rocBLAS Tensile path (TORCH_USE_HIPBLASLT=0; see "Ruled out": hipBLASLt has no kernel for this shape and falls back to Tensile regardless) |
The correct-arch Tensile library is present and loaded: /opt/rocm/lib/rocblas/library/ contains both TensileLibrary_lazy_gfx1201.dat and Kernels.so-000-gfx1201.hsaco, and the faulting kernel is ISA1201 (gfx1201's ISA). This rules out the wrong-library-lookup failure mode of #7192 — we are running the genuine gfx1201 kernel, which itself faults.
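The presence check described above can be repeated with a few lines of Python. This is a minimal sketch, not part of the original investigation: `find_arch_files` is a hypothetical helper name, and the directory is the one quoted above (it may differ on other installs).

```python
from pathlib import Path


def find_arch_files(library_dir, arch="gfx1201"):
    """Return the sorted names of files in library_dir whose name contains arch.

    An empty list means either the directory does not exist or no file
    mentions the architecture string.
    """
    root = Path(library_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file() and arch in p.name)


if __name__ == "__main__":
    # Path taken from the report; adjust if ROCm is installed elsewhere.
    for name in find_arch_files("/opt/rocm/lib/rocblas/library"):
        print(name)
```

On the affected machine the listing should include TensileLibrary_lazy_gfx1201.dat and Kernels.so-000-gfx1201.hsaco. A substring match on "gfx1201" deliberately does not match gfx1200 files.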
Faulting kernel (full ShaderName)
Captured at AMD_LOG_LEVEL=3 on a clean GPU. The same kernel variant faults in both the forward and backward GEMM of the workload:
Cijk_Ailk_Bjlk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT64x64x64_MI16x16x1_SN_LDSB0_AFC1_AFEM1_AFEM1_ASEM1_CLR1_CADS0_DTLA0_DTLB0_DTVA0_DTVB1_EPS0_FDSI0_GRPM1_GRVWA8_GRVWB8_GSUAMB_GLS0_ISA1201_IU1_K1_LDSTI0_LBSPPA1024_LBSPPB0_LBSPPM0_LPA32_LPB0_LPM0_LRVW8_LWPMn1_MIAV1_MIWT2_2_MO40_NTn1_NTA0_NTB0_NTC0_NTD0_NTM0_NEPBS0_NLCA1_NLCB2_ONLL0_PGR2_PLR1_PKA0_SIA3_SS0_SPO0_SRVW0_SSO0_SVW8_SK0_SKFTR0_SKXCCM0_TLDS0_ULSGRO0_USL1_UIOFGRO0_USFGROn1_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS32_WG32_4_1
Tile: MT64x64x64, MI16x16x1, ISA1201. Flags on the B-operand path, expanded per Tensile's parameter-name abbreviations: DTVB1 (DirectToVgprB enabled), LBSPPB0 (LdsBlockSizePerPadB = 0), NLCB2 (NumLoadsCoalescedB = 2), VWA2_VWB1 (VectorWidthA = 2, VectorWidthB = 1).
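To compare the faulting kernel against a kernel that succeeds, it helps to split the ShaderName into its abbreviated parameters and diff them. The sketch below is an assumption about the naming layout (underscore-separated tokens, each an alphabetic abbreviation followed by a value, with `n` marking a negative value), inferred only from the name above. `parse_kernel_name` and `diff_kernels` are hypothetical helpers, not Tensile APIs.

```python
import re

# Alphabetic abbreviation followed by a value such as 1, 64x64x64 or n1.
_TOKEN = re.compile(r"^([A-Za-z]+?)(n?\d[\dx]*)$")

FAULTING = (
    "Cijk_Ailk_Bjlk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT64x64x64_MI16x16x1_SN_LDSB0_"
    "AFC1_AFEM1_AFEM1_ASEM1_CLR1_CADS0_DTLA0_DTLB0_DTVA0_DTVB1_EPS0_FDSI0_GRPM1_"
    "GRVWA8_GRVWB8_GSUAMB_GLS0_ISA1201_IU1_K1_LDSTI0_LBSPPA1024_LBSPPB0_LBSPPM0_"
    "LPA32_LPB0_LPM0_LRVW8_LWPMn1_MIAV1_MIWT2_2_MO40_NTn1_NTA0_NTB0_NTC0_NTD0_"
    "NTM0_NEPBS0_NLCA1_NLCB2_ONLL0_PGR2_PLR1_PKA0_SIA3_SS0_SPO0_SRVW0_SSO0_SVW8_"
    "SK0_SKFTR0_SKXCCM0_TLDS0_ULSGRO0_USL1_UIOFGRO0_USFGROn1_VSn1_VWA2_VWB1_"
    "WSGRA0_WSGRB0_WS32_WG32_4_1"
)


def parse_kernel_name(name):
    """Split a kernel name into (bare_tokens, {abbreviation: value})."""
    bare, params = [], {}
    last_key = None
    for tok in name.split("_"):
        m = _TOKEN.match(tok)
        if m:
            last_key = m.group(1)
            params[last_key] = m.group(2)
        elif tok.isdigit() and last_key is not None:
            # Multi-part values (MIWT2_2, WG32_4_1) continue the previous key.
            params[last_key] += "_" + tok
        else:
            bare.append(tok)  # tokens without a value, e.g. Cijk, BBS, UserArgs
            last_key = None
    return bare, params


def diff_kernels(name_a, name_b):
    """Return {abbreviation: (value_in_a, value_in_b)} for parameters that differ."""
    pa, pb = parse_kernel_name(name_a)[1], parse_kernel_name(name_b)[1]
    keys = sorted(set(pa) | set(pb))
    return {k: (pa.get(k), pb.get(k)) for k in keys if pa.get(k) != pb.get(k)}


if __name__ == "__main__":
    bare, params = parse_kernel_name(FAULTING)
    print(bare[:3], params["MT"], params["DTVB"], params["NLCB"])
```

Passing the ShaderName of a succeeding GEMM from the same AMD_LOG_LEVEL=3 log as the second argument of `diff_kernels` lists only the parameters that differ, which narrows down which code path is unique to the faulting call.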
The faulting call — operands vs. fault address
GEMM dimensions (derived from the logged operands): M=608, N=5376, K=21504, bf16, batch 1. A is row-major contiguous (M×K); B is column-major (K×N), stride=(1, 21504); C is (M×N) bf16 (~6 MB). The pair logged immediately before the fault:
A (gemm input) shape=(1, 608, 21504) stride=(13074432, 21504, 1) contig=True ptr=0x7f655ece8000 end=0x7f65605d8000
B (dequant wt) shape=(21504, 5376) stride=(1, 21504) contig=False ptr=0x7f63f01a0000 end=0x7f63fde20000
FAULT: 0x7f6459a00000
| Operand | Range | Fault inside? |
|---|---|---|
| A (1,608,21504) row-major contig | 0x7f655ece8000 … 0x7f65605d8000 | No: fault ~4.08 GiB below ptr |
| B (21504,5376) col-major stride=(1,21504) | 0x7f63f01a0000 … 0x7f63fde20000 | No: fault ~1.43 GiB above end |
A full scan of the run logs finds no GEMM operand that brackets 0x7f6459a00000. The fault sits in unmapped space more than 1 GB from either logged operand, not adjacent to any buffer boundary, which rules out a simple off-by-N or undersized-descriptor read.
Memory access fault by GPU node-1 on address 0x7f6459a00000.
Reason: Page not present or supervisor privilege.
(Address varies run to run — e.g. 0x7f0ac2e00000, 0x7f29eb600000 in other runs — but is consistently ~1 GB from the operands and 2 MB-aligned.)
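The operand-versus-fault comparison can be checked mechanically from the two logged lines. The sketch below parses the `ptr=`/`end=` fields in the format shown above, reports where the fault address falls relative to each range, and tests 2 MiB alignment. `parse_ranges` and `locate` are hypothetical helper names. By this arithmetic, the fault is about 4.08 GiB below the start of A and about 1.43 GiB above the end of B.

```python
import re

LOG = """\
A (gemm input) shape=(1, 608, 21504) stride=(13074432, 21504, 1) contig=True ptr=0x7f655ece8000 end=0x7f65605d8000
B (dequant wt) shape=(21504, 5376) stride=(1, 21504) contig=False ptr=0x7f63f01a0000 end=0x7f63fde20000
"""
FAULT = 0x7F6459A00000
GIB = 1 << 30
TWO_MIB = 2 << 20


def parse_ranges(log):
    """Map operand label -> (ptr, end) for every line carrying ptr=/end= fields."""
    ranges = {}
    for line in log.splitlines():
        m = re.search(r"^(\w+) .*ptr=(0x[0-9a-fA-F]+) end=(0x[0-9a-fA-F]+)", line)
        if m:
            ranges[m.group(1)] = (int(m.group(2), 16), int(m.group(3), 16))
    return ranges


def locate(fault, start, end):
    """Return ('inside', 0), ('below', bytes before start) or ('above', bytes past end)."""
    if start <= fault < end:
        return ("inside", 0)
    if fault < start:
        return ("below", start - fault)
    return ("above", fault - end)


if __name__ == "__main__":
    for label, (start, end) in parse_ranges(LOG).items():
        where, dist = locate(FAULT, start, end)
        print(f"{label}: size={end - start} B, fault is {where} by {dist / GIB:.2f} GiB")
    print("2 MiB aligned:", FAULT % TWO_MIB == 0)
```

The logged sizes also match the shapes: 608 × 21504 × 2 bytes for A and 21504 × 5376 × 2 bytes for B, so neither descriptor is undersized for bf16.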
Reproducibility
- Deterministic on a clean GPU: faults reproducibly within a few steps of resuming the workload.
- Persists under serialization: AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1 → not an async race.
- Both directions: forward-recompute GEMM and backward GEMM dispatch the identical tile and both fault.
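For anyone re-running the workload, the serialized configuration can be applied from a small wrapper so the variables are not forgotten between runs. This is a sketch only: the environment variables are the ones named in this report, while `serialized_env`, `run_serialized`, and the `train.py` command in the comment are hypothetical.

```python
import os
import subprocess

# Variables named in this report: full kernel serialization plus verbose HIP logging.
DEBUG_ENV = {
    "AMD_SERIALIZE_KERNEL": "3",
    "HIP_LAUNCH_BLOCKING": "1",
    "AMD_LOG_LEVEL": "3",
}


def serialized_env(base=None):
    """Return a copy of base (default: the current environment) with DEBUG_ENV applied."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_ENV)
    return env


def run_serialized(cmd):
    """Run cmd (a list of arguments) under the serialized environment; return its exit code."""
    return subprocess.run(cmd, env=serialized_env()).returncode


# Example (hypothetical script name):
#   run_serialized(["python", "train.py"])
```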
Minimal reproducer
TODO: rocblas-bench line (last artifact, capture in progress). rocblas-bench is installed (/opt/rocm/bin/rocblas-bench). Capturing one ROCBLAS_LAYER=2 run emits the exact rocblas-bench -f gemm_ex --transposeA … --transposeB … -m 608 -n 5376 -k 21504 --lda … --ldb … --ldc … line for the faulting GEMM. Running that line directly should reproduce the fault independently of PyTorch/bitsandbytes, which would isolate it entirely within rocBLAS/Tensile. Will be attached here as a follow-up.
Until then the trigger is reachable via any gemm_ex with this selection: bf16, M=608/N=5376/K=21504, transposed/column-major B (the DTVB1 path), on gfx1201 with the Tensile backend.
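Once a ROCBLAS_LAYER=2 log exists, the faulting call has to be picked out of many GEMM lines. The sketch below filters bench lines by the `-m`, `-n`, `-k` values quoted above. It assumes each logged line starts with a `rocblas-bench` command, and it also accepts swapped m/n, since a framework holding row-major tensors may call column-major BLAS with the operands exchanged. `parse_bench_line`, `matches_gemm`, and `find_gemm_lines` are hypothetical helpers.

```python
import shlex


def _is_flag(tok):
    """True for option tokens like -m or --ldb, False for values (including negative numbers)."""
    if not tok.startswith("-"):
        return False
    try:
        float(tok)
        return False
    except ValueError:
        return True


def parse_bench_line(line):
    """Return {flag: value} for one rocblas-bench line, or None if it is not one."""
    try:
        toks = shlex.split(line)
    except ValueError:
        return None
    if not toks or not toks[0].endswith("rocblas-bench"):
        return None
    args, i = {}, 1
    while i < len(toks):
        if _is_flag(toks[i]):
            has_value = i + 1 < len(toks) and not _is_flag(toks[i + 1])
            args[toks[i]] = toks[i + 1] if has_value else None
            i += 2 if has_value else 1
        else:
            i += 1
    return args


def matches_gemm(args, m, n, k, allow_swapped=True):
    """True if the parsed line has the given sizes (optionally with m and n exchanged)."""
    try:
        dims = (int(args["-m"]), int(args["-n"]), int(args["-k"]))
    except (KeyError, TypeError, ValueError):
        return False
    return dims == (m, n, k) or (allow_swapped and dims == (n, m, k))


def find_gemm_lines(text, m=608, n=5376, k=21504):
    """Return every bench line in text whose sizes match the faulting GEMM."""
    hits = []
    for line in text.splitlines():
        args = parse_bench_line(line)
        if args is not None and matches_gemm(args, m, n, k):
            hits.append(line)
    return hits
```

Reading the captured log file and passing its contents to `find_gemm_lines` yields the candidate lines to run directly with /opt/rocm/bin/rocblas-bench.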
Ruled out (to save triage time)
| Hypothesis | Evidence |
|---|---|
| Asynchronous / concurrency race | Persists under AMD_SERIALIZE_KERNEL=3 HIP_LAUNCH_BLOCKING=1. |
| Caller passed an undersized / wrong-shape operand | Both operands are logged and valid, with the expected shapes and strides; the fault is more than 1 GB from either boundary, not adjacent. |
| bitsandbytes NF4 dequant kernel | Dequant kernels succeed every iteration; fault is in the GEMM consuming their output. |
| hipBLASLt | TORCH_USE_HIPBLASLT=1 was tried: hipBLASLt has no kernel for this MT64x64x64 DTVB1 shape (logs "Cannot find the function" ×6) and falls back to this same Tensile kernel; fault persists. |
| Allocator placement / fragmentation | PYTORCH_HIP_ALLOC_CONF GC-threshold and expandable_segments changes do not move or eliminate the fault (expandable_segments is in fact ignored on gfx1201). |
| OOM | VRAM at fault ~21 GB allocated / ~28 GB reserved of 32 GB, within headroom. |
Open / unconfirmed
- Column-major B as the trigger is observed, not yet proven. Every faulting call has a stride=(1,K) B operand and the DTVB1 kernel flag, and the same tile succeeds on other shapes, which is a strong correlation. Forcing the dequantized weight row-major at the framework (Python) level does not change the outcome and does not test the hypothesis: bitsandbytes sets the transpose via the BLAS transB flag (the A @ Wᵀ call structure), not via the tensor's physical stride, so a Python .contiguous() never reaches the rocBLAS descriptor (the DTVB1 kernel still dispatches and still faults). The definitive test is at the rocBLAS layer: run the captured rocblas-bench line with --transposeB T (expected to reproduce) vs. an equivalent --transposeB N (row-major B → DTVB0, a different tile) and compare. That experiment is folded into the reproducer capture above.
- Output buffer not instrumented. A page-not-present is formally consistent with an OOB write to C rather than a wild read of an input. Considered implausible (C is ~6 MB and the fault is ~1 GB away) but not ruled out by instrumentation.
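The transposeB comparison needs a second command line that differs only in the B layout. The sketch below derives it from a captured line by flipping `--transposeB` and resetting `--ldb` to the smallest valid value under the standard BLAS rule (ldb ≥ k when B is not transposed, ldb ≥ n when it is). `flip_transpose_b` is a hypothetical helper and the example line uses small made-up sizes, not the captured values. Note that the flipped call computes a different product from the same buffer, which is acceptable for a fault/no-fault test but not for checking results.

```python
import shlex


def flip_transpose_b(line):
    """Return the bench line with --transposeB toggled (N <-> T) and --ldb set to its minimum.

    Raises ValueError if --transposeB, -n, -k or --ldb is missing or if
    --transposeB is not N or T.
    """
    toks = shlex.split(line)

    def value(flag):
        return toks[toks.index(flag) + 1]

    def set_value(flag, new):
        toks[toks.index(flag) + 1] = str(new)

    old = value("--transposeB").upper()
    if old not in ("N", "T"):
        raise ValueError(f"unsupported --transposeB value: {old}")
    new = "N" if old == "T" else "T"
    n, k = int(value("-n")), int(value("-k"))
    set_value("--transposeB", new)
    # Column-major BLAS: B is k x n when not transposed, n x k when transposed.
    set_value("--ldb", k if new == "N" else n)
    return shlex.join(toks)


if __name__ == "__main__":
    # Illustrative line with made-up sizes, not the captured reproducer.
    example = ("./rocblas-bench -f gemm_ex --transposeA N --transposeB T "
               "-m 64 -n 32 -k 16 --lda 64 --ldb 32 --ldc 64")
    print(flip_transpose_b(example))
```

Running both the captured line and its flipped variant on a clean GPU, and recording which one faults, gives the evidence the first bullet asks for.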
Impact
Blocks NF4 QLoRA fine-tuning of large models (Gemma-4-31B) on gfx1201/RDNA4 under ROCm 7.2: any forward-recompute or backward pass that dispatches this tile on the faulting shape faults. No userspace workaround found (hipBLASLt lacks the kernel; allocator/serialization knobs don't help).
Possibly related
- rocm-libraries#6166: rocBLAS unit tests (dot_ex) on gfx1201 produce the identical Memory access fault … Page not present signature with no PyTorch/bitsandbytes in the picture. Strongest corroboration that gfx1201 rocBLAS kernels page-fault at the library level; likely an in-house-runnable repro of the same class of bug.
- rocm-libraries#4097: same kernel family (MT64x64x64 … DTVB1 … ISA1201) on Windows, but in the missing-kernel (hipErrorNotFound) mode rather than a present-but-faulting kernel. Suggests the DTVB1 ISA1201 MT64x64x64 tiles are the trouble spot for gfx1201.
- rocm-libraries#7192: same GPU (R9700); a wrong-Tensile-file (gfx1200.dat) lookup. Ruled out as our mechanism (we confirmed the gfx1201 library is present and the kernel is ISA1201); included as same-hardware context.
Operating System
CachyOS, kernel 6.x (7.0.10-2-cachyos), in-kernel (inbox) amdgpu driver (not amdgpu-dkms), linux-firmware-amdgpu 20260519
CPU
Ryzen 9 5900X
GPU
AMD Radeon AI PRO R9700
ROCm Version
7.2.3
ROCm Component
Tensile
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
❯ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.18
Runtime Ext Version: 1.15
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 9 5900X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 5900X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4954
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65740760(0x3eb1fd8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 65740760(0x3eb1fd8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65740760(0x3eb1fd8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65740760(0x3eb1fd8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1201
Uuid: GPU-bdad470151adc4b7
Marketing Name: AMD Radeon AI PRO R9700
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 8192(0x2000) KB
L3: 65536(0x10000) KB
Chip ID: 30033(0x7551)
ASIC Revision: 1(0x1)
Cacheline Size: 256(0x100)
Max Clock Freq. (MHz): 2350
BDFID: 2560
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 2
Shader Engines: 4
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 268
SDMA engine uCode:: 662
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33406976(0x1fdc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1201
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx12-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
No response