[Kernels][GPU] Support pre-Ampere NVIDIA GPUs (Turing/Volta) for bf16 LLMs #6659
msaelices wants to merge 8 commits into
BEGIN_PUBLIC
[Stdlib][GPU] Add pre-Ampere ldmatrix software fallback
`ld_matrix` lowers to the `llvm.nvvm.ldmatrix.sync.aligned.m8n8` NVVM intrinsic, which fails instruction selection on pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75); even raw `ldmatrix` PTX requires a newer ISA than is emitted for those targets. This made `max serve` abort with `LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.ldmatrix...x4.b16` when serving common LLMs on Turing GPUs such as the NVIDIA T4.
Emulate the instruction's data movement on these targets using portable shared-memory loads and warp shuffles. `ldmatrix.m8n8.b16` is pure 16-bit cell movement, so the emulation operates on 32-bit words and is independent of the element dtype; both the plain and `transpose` forms are covered. The native intrinsic path is unchanged for sm_80 and newer.
Validated on a Tesla T4 (sm_75): the emulated fragment layout matches the documented ldmatrix m8n8 mapping for the x1/x2/x4 variants and the transposed load.
END_PUBLIC
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
…ernels
BEGIN_PUBLIC
[Kernels][GPU] Route pre-Ampere NVIDIA matmul and attention to SIMT kernels
Turing (sm_75) and Volta (sm_70) tensor cores do not support bf16/tf32 MMA, so the tensor-core matmul and flash-attention kernels cannot run there: they emit bf16 `mma` (unavailable in hardware) and `ldmatrix`. As a result, common bf16 LLMs (Qwen3, Gemma 3, Llama 3.2) aborted during model compilation on these GPUs.
Route pre-Ampere NVIDIA GPUs (compute capability < 8.0) to the existing non-tensor-core fallbacks:
- `_matmul_gpu` dispatches to the naive SIMT matmul kernel.
- `flash_attention` dispatches to `mha_gpu_naive`.
Ampere and newer keep the tensor-core paths unchanged. Verified on a Tesla T4 (sm_75) that the dispatch predicate selects the naive paths and that a bf16 matmul through the SIMT kernel produces correct results.
END_PUBLIC
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds support/fallbacks for pre-Ampere NVIDIA GPUs by emulating ld_matrix where needed and preventing selection of tensor-core-dependent kernels (flash-attention / matmul) that won’t codegen or run correctly on sm_70/sm_75.
Changes:
- Emulate pre-Ampere `ld_matrix` via shared-memory loads + warp shuffles in `std/gpu/compute/mma.mojo`.
- Gate flash-attention tensor-core path behind a new GPU capability predicate to avoid pre-Ampere selection.
- Route pre-Ampere NVIDIA matmul/gemv dispatch to the SIMT naive kernel.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mojo/stdlib/std/gpu/compute/mma.mojo | Adds pre-Ampere NV ld_matrix emulation using warp shuffles. |
| max/kernels/src/nn/attention/gpu/mha.mojo | Adds tensor-core capability gate to avoid flash-attention on pre-Ampere NV GPUs. |
| max/kernels/src/linalg/matmul/gpu/init.mojo | Adds explicit pre-Ampere NV fallback to naive SIMT matmul and refactors naive dispatch into a helper. |
return SIMD[DType.uint32, 4](
    shuffle_idx(row[0], src),
    shuffle_idx(row[1], src),
    shuffle_idx(row[2], src),
    shuffle_idx(row[3], src),
)
The 4-word broadcast is intentional, and a single-word shuffle would be incorrect here.
In the non-transpose path each receiving lane needs word `lane % 4` of the source lane's row, and `lane % 4` is per-lane-variable. `shuffle_idx(var, src)` returns the value of `var` as evaluated by lane `src`, so `shuffle_idx(row_words[lane % 4], src)` would return src's `row_words[src % 4]`, not the receiving lane's `row_words[lane % 4]`. Because the word index differs across lanes, the source lane must expose all four words and each lane selects its own locally; that's what the 4 shuffles do. The transpose path (line 356/371) has the same constraint: `word_idx` is also derived from `lane`.
These are the slow-path pre-Ampere fallback GPUs (no `ldmatrix` at all), so keeping the emulation correct outweighs the extra shuffle traffic. Happy to add an inline comment making this explicit.
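The argument above can be checked with a small host-side simulation. This is only a sketch in plain Python, not the Mojo kernel: `shuffle_idx` here is a hypothetical list-based stand-in for the warp shuffle, and the `row_words` contents are made-up tags that identify each word's origin.

```python
WARP_SIZE = 32

def shuffle_idx(values, src_lanes):
    """Simulated warp shuffle: lane i receives the operand as evaluated
    by lane src_lanes[i] (values[k] is what lane k passed in)."""
    return [values[src_lanes[i]] for i in range(WARP_SIZE)]

# Hypothetical data: lane L holds one 8x8 b16 row as four 32-bit words.
# Each word is tagged with its (lane, word) origin so mix-ups are visible.
row_words = [[lane * 4 + w for w in range(4)] for lane in range(WARP_SIZE)]

# Matrix 0 of the non-transposed load: lane i reads from lane i // 4.
src = [lane // 4 for lane in range(WARP_SIZE)]

# What each receiving lane needs: word (lane % 4) of the source lane's row.
wanted = [row_words[src[lane]][lane % 4] for lane in range(WARP_SIZE)]

# Single shuffle: every lane publishes row_words[lane][lane % 4], so the
# receiver gets the word picked by the SOURCE lane's index, not its own.
single = shuffle_idx([row_words[l][l % 4] for l in range(WARP_SIZE)], src)

# Four shuffles: broadcast every word, then each lane selects locally.
broadcast = [
    shuffle_idx([row_words[l][w] for l in range(WARP_SIZE)], src)
    for w in range(4)
]
four = [broadcast[lane % 4][lane] for lane in range(WARP_SIZE)]

mismatches = sum(1 for a, b in zip(single, wanted) if a != b)
```

In this model the four-shuffle variant reproduces `wanted` for every lane, while the single-shuffle variant is only right for lanes where `lane % 4` happens to equal `(lane // 4) % 4`.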
@parameter
for mtx in range(num_registers):
    var src = UInt32(mtx * 8 + lane // 4)
    regs[mtx] = _ld_matrix_broadcast_row(row_words, src)[lane % 4]
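The indexing in the loop above can be sanity-checked on the host. The following is a hedged Python model, not the PR's code: it assumes lane `m * 8 + r` has loaded row `r` of matrix `m` as four 32-bit words (low 16-bit cell first), and it compares the result with the non-transposed m8n8 fragment layout as I read it (thread `T` holds row `T // 4`, columns `2 * (T % 4)` and `2 * (T % 4) + 1`). All helper names are hypothetical.

```python
WARP_SIZE = 32

def pack(lo, hi):
    """Pack two 16-bit cells into one 32-bit word, low cell first."""
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)

def make_matrices(count):
    """Build `count` 8x8 matrices whose cells are unique 16-bit IDs."""
    return [[[m * 64 + r * 8 + c for c in range(8)] for r in range(8)]
            for m in range(count)]

def emulate_ld_matrix(matrices):
    """Model of the non-transposed fallback for x1/x2/x4."""
    count = len(matrices)
    # Assumed load step: lane (m * 8 + r) holds row r of matrix m.
    row_words = [[0, 0, 0, 0] for _ in range(WARP_SIZE)]
    for m in range(count):
        for r in range(8):
            row = matrices[m][r]
            row_words[m * 8 + r] = [pack(row[2 * w], row[2 * w + 1])
                                    for w in range(4)]
    # Shuffle step: same indexing as the snippet above.
    regs = []
    for lane in range(WARP_SIZE):
        lane_regs = []
        for mtx in range(count):
            src = mtx * 8 + lane // 4
            lane_regs.append(row_words[src][lane % 4])
        regs.append(lane_regs)
    return regs

def reference_fragments(matrices):
    """Expected registers under the assumed m8n8 fragment layout."""
    return [[pack(m[t // 4][2 * (t % 4)], m[t // 4][2 * (t % 4) + 1])
             for m in matrices]
            for t in range(WARP_SIZE)]
```

Comparing `emulate_ld_matrix(make_matrices(n))` with `reference_fragments(make_matrices(n))` for `n` in 1, 2, 4 mirrors the x1/x2/x4 validation mentioned in the commit message.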
var w0 = _ld_matrix_broadcast_row(
    row_words, UInt32(mtx * 8 + base_row)
)[word_idx]
var w1 = _ld_matrix_broadcast_row(
    row_words, UInt32(mtx * 8 + base_row + 1)
)[word_idx]
var e0 = (w0 >> shift) & 0xFFFF
var e1 = (w1 >> shift) & 0xFFFF
regs[mtx] = e0 | (e1 << 16)
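The transposed packing can be modeled the same way. The diff excerpt does not show how `base_row`, `word_idx` and `shift` are computed, so the definitions below are my assumptions, chosen so that thread `T` ends up with cells `M[2 * (T % 4)][T // 4]` and `M[2 * (T % 4) + 1][T // 4]`, which is how I read the transposed m8n8 mapping. A plain-Python sketch with hypothetical helper names:

```python
WARP_SIZE = 32

def pack(lo, hi):
    """Pack two 16-bit cells into one 32-bit word, low cell first."""
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)

def make_matrices(count):
    """Build `count` 8x8 matrices whose cells are unique 16-bit IDs."""
    return [[[m * 64 + r * 8 + c for c in range(8)] for r in range(8)]
            for m in range(count)]

def emulate_ld_matrix_trans(matrices):
    """Model of the transposed fallback (index formulas are assumed)."""
    count = len(matrices)
    row_words = [[0, 0, 0, 0] for _ in range(WARP_SIZE)]
    for m in range(count):
        for r in range(8):
            row = matrices[m][r]
            row_words[m * 8 + r] = [pack(row[2 * w], row[2 * w + 1])
                                    for w in range(4)]
    regs = []
    for lane in range(WARP_SIZE):
        base_row = 2 * (lane % 4)   # assumed: first of two source rows
        col = lane // 4             # assumed: source column for this lane
        word_idx = col // 2         # 32-bit word containing that column
        shift = 16 * (col % 2)      # low or high half of the word
        lane_regs = []
        for mtx in range(count):
            w0 = row_words[mtx * 8 + base_row][word_idx]
            w1 = row_words[mtx * 8 + base_row + 1][word_idx]
            e0 = (w0 >> shift) & 0xFFFF
            e1 = (w1 >> shift) & 0xFFFF
            lane_regs.append(e0 | (e1 << 16))
        regs.append(lane_regs)
    return regs

def reference_fragments_trans(matrices):
    """Expected registers under the assumed transposed layout."""
    return [[pack(m[2 * (t % 4)][t // 4], m[2 * (t % 4) + 1][t // 4])
             for m in matrices]
            for t in range(WARP_SIZE)]
```

Because each output register mixes cells from two different source rows, two broadcasts per register are needed here, which is consistent with the `w0`/`w1` pair in the snippet.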
@always_inline
def _attention_uses_tensor_cores(info: GPUInfo) -> Bool:
    """Returns whether the flash-attention tensor-core path is usable on `info`.

    Pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75) lack the bf16/tf32
    tensor cores and `ldmatrix` codegen support that the flash-attention
    kernels rely on, so they must fall back to the naive attention kernel
    (`mha_gpu_naive`). See issue #6653.
    """
    return has_amd_gpu_accelerator() or info.compute >= A100.compute
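For readers who want to see which targets land on which path, here is a toy Python model of the predicate. It is a sketch under assumptions: the `GPUInfo` dataclass, the `vendor` field, `AMPERE_COMPUTE` and `select_attention_kernel` are hypothetical stand-ins, and the real code checks `has_amd_gpu_accelerator()` and `A100.compute` at compile time rather than at runtime.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPUInfo:
    """Hypothetical stand-in for the real GPUInfo."""
    vendor: str      # "nvidia" or "amd" (assumed field)
    compute: float   # compute capability, e.g. 7.5 for Turing

AMPERE_COMPUTE = 8.0  # stands in for A100.compute

def attention_mma_path_supported(info: GPUInfo) -> bool:
    """True when the accelerated MMA flash-attention path may be used."""
    return info.vendor == "amd" or info.compute >= AMPERE_COMPUTE

def select_attention_kernel(info: GPUInfo) -> str:
    """Return the name of the kernel this model would dispatch to."""
    if attention_mma_path_supported(info):
        return "flash_attention"
    return "mha_gpu_naive"

t4 = GPUInfo(vendor="nvidia", compute=7.5)
v100 = GPUInfo(vendor="nvidia", compute=7.0)
a100 = GPUInfo(vendor="nvidia", compute=8.0)
```

In this model `t4` and `v100` select `mha_gpu_naive`, while `a100` and any AMD target keep the flash-attention path.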
Renamed to `_attention_mma_path_supported`.
BEGIN_PUBLIC
[Stdlib][GPU] Use `comptime for` in ldmatrix fallback
The stdlib is built with `-Werror`, so the deprecated `@parameter for` in the pre-Ampere `ld_matrix` emulation broke the build. Switch to `comptime for` and type the b16 masks explicitly as `UInt32`.
END_PUBLIC
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
BEGIN_PUBLIC
[Kernels][GPU] Rename attention MMA-path predicate
Rename `_attention_uses_tensor_cores` to `_attention_mma_path_supported`: the predicate also returns True for AMD GPUs (which use matrix cores, not NVIDIA tensor cores), so the new name better reflects "the accelerated MMA flash path is supported".
END_PUBLIC
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
BEGIN_PUBLIC
[Stdlib][GPU] Add pre-Ampere exp2/tanh fallbacks for half precision
The `ex2.approx.f16`/`.bf16` and all `tanh.approx.*` PTX instructions require
PTX ISA 7.0+, which the backend does not emit for pre-Ampere NVIDIA GPUs
(Volta sm_70, Turing sm_75). This broke compilation of elementwise kernels
(e.g. softmax, gelu) when targeting these GPUs.
On pre-Ampere targets:
- `exp2` for `float16`/`bfloat16` now computes via `ex2.approx.ftz.f32`
(valid on all NVIDIA targets) in float32.
- `tanh` computes from `exp2` as `1 - 2 / (exp2(2*log2(e)*x) + 1)`, since
even `tanh.approx.f32` requires PTX ISA 7.0+.
Ampere and newer are unchanged. Validated on a Tesla T4 (sm_75): exp2 and
tanh produce correct results for float16 and bfloat16.
END_PUBLIC
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
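The `tanh` identity quoted in this commit message is easy to check numerically. The sketch below uses plain Python doubles rather than float32 PTX, so it only illustrates the algebra and the saturation behaviour; `tanh_via_exp2` is a hypothetical name, and the `OverflowError` handling is my stand-in for hardware `exp2` saturating to infinity.

```python
import math

TWO_LOG2E = 2.0 * math.log2(math.e)  # about 2.8853900817779269

def tanh_via_exp2(x: float) -> float:
    """tanh(x) computed as 1 - 2 / (exp2(2 * log2(e) * x) + 1)."""
    try:
        e2x = 2.0 ** (x * TWO_LOG2E)   # equals e ** (2 * x)
    except OverflowError:
        # Python raises where hardware exp2 would return +inf.
        e2x = math.inf
    return 1.0 - 2.0 / (e2x + 1.0)

samples = [-3.0, -0.5, 0.0, 0.25, 1.0, 4.0]
max_err = max(abs(tanh_via_exp2(x) - math.tanh(x)) for x in samples)
```

For large positive `x` the denominator grows without bound and the result tends to 1; for large negative `x` the power underflows to 0 and the result is exactly -1, so this form never produces an `inf / inf`.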
✅ End-to-end validated on a Tesla T4 (sm_75)
Built the patched MAX from source. Building from source surfaced one more pre-Ampere codegen gap, now fixed and included in this PR:
So the full set of changes needed to run bf16 LLMs on Turing:
!sync
# The `1 - 2/(...)` form stays finite for large |x|. See issue #6653.
comptime two_log2e = 2.8853900817779269  # 2 * log2(e)
var e2x = exp2(x.cast[DType.float32]() * two_log2e)
return (1.0 - 2.0 / (e2x + 1.0)).cast[dtype]()
Can we just fall through to the CPU implementation here, and avoid needing another implementation?
> Can we just fall through to the CPU implementation here, and avoid needing another implementation?

@BradLarson I see you've added the `imported-internally` label. Do you want me to fix this anyway?
I synced it internally to kick off a CI run. I'll re-sync as changes are made, which will re-evaluate on internal CI. Don't take that as a blocker to continue to iterate on this. This particular comment is one I'm passing along from an internal reviewer.
        vector_constraints="=r,r",
    ](x)
else:
    # Pre-Ampere (Volta sm_70, Turing sm_75): `ex2.approx.f16`
Technically, per the PTX notes (https://docs.nvidia.com/cuda/parallel-thread-execution/), `ex2.approx.f16` works on sm_75 (Turing) and newer, but `ex2.approx.bf16` requires sm_90 or newer.
Per review feedback, drop the bespoke `exp2`-based tanh fallback for pre-Ampere NVIDIA (Volta sm_70, Turing sm_75) and instead fall through to the existing generic polynomial approximation. The generic path is arithmetic-only (no `tanh.approx`/`exp2` PTX intrinsics), so it codegens on every target, and reuses the same approximation already used elsewhere instead of maintaining a second implementation. The NVIDIA `tanh.approx` block is now gated on `_is_sm_8x_or_newer()`.
BEGIN_PUBLIC
[Stdlib][GPU] Use the generic tanh polynomial on pre-Ampere NVIDIA GPUs
Drop the bespoke exp2-based tanh fallback for pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75) and fall through to the existing generic polynomial approximation, which is arithmetic-only and codegens on every target. See issue modular#6653.
END_PUBLIC
Signed-off-by: Manuel Smith <[email protected]>
Summary
Fixes #6653
`max serve` aborts with `LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16` when serving common bf16 LLMs (Qwen3, Gemma 3, Llama 3.2) on NVIDIA Turing GPUs (e.g. Tesla T4 / `g4dn` instances).
Probing a T4 (sm_75) directly shows there are two distinct blockers, not just `ldmatrix`:
[Blocker table lost in extraction; surviving cell fragments: `ldmatrix.m8n8.x4.b16`; "`ldmatrix` PTX selects"; "`mma.*.bf16` requires sm_80)"; `mha_gpu_naive`]
Since the models are bf16 and MAX routes both matmul and attention through the tensor-core path, fixing `ldmatrix` alone is insufficient: the bf16 `mma` is a hardware wall.
The fix routes pre-Ampere NVIDIA to the existing non-tensor-core kernels, and adds a portable `ldmatrix` fallback.
Changes
- `[Stdlib][GPU]` `mma.mojo`: software-emulate `ld_matrix` on pre-Ampere NVIDIA (Volta sm_70, Turing sm_75) using shared-memory loads + warp shuffles.
- `[Kernels][GPU]` route NVIDIA GPUs with compute capability `< 8.0`:
  - `_matmul_gpu` → naive SIMT matmul kernel.
  - `flash_attention` → `mha_gpu_naive`.
Ampere and newer are completely unaffected (the new paths are behind `comptime` gates on `compute < A100.compute`).
Validation on a Tesla T4 (sm_75)
Qwen3-0.6B.
`ld_matrix` compiles and matches the documented `ldmatrix.m8n8` fragment layout for non-transposed and transposed loads (validated against a known shared-memory pattern).
(RTX2060, compute 7.5 < 8.0).
Assisted by AI