msaelices · 2026-06-09T07:27:53Z

Summary

Fixes #6653
`max serve` aborts with `LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16` when serving common bf16 LLMs (Qwen3, Gemma 3, Llama 3.2) on NVIDIA Turing GPUs (e.g. Tesla T4 / g4dn instances).

Probing a T4 (sm_75) directly shows there are two distinct blockers, not just ldmatrix:

| Op on sm_75 | Result | Cause |
|---|---|---|
| `ldmatrix.m8n8.x4.b16` | fails | backend emits PTX ISA < 6.5 for sm_75; neither the NVVM intrinsic nor raw `ldmatrix` PTX selects |
| `mma.*.bf16` | fails | hardware: Turing tensor cores have no bf16/tf32 (`requires sm_80`) |
| naive SIMT matmul / `mha_gpu_naive` | works | scalar FMA, no tensor cores |

Since the models are bf16 and MAX routes both matmul and attention through the tensor-core path, fixing ldmatrix alone is insufficient: the missing bf16 mma is a hardware wall.

The fix routes pre-Ampere NVIDIA to the existing non-tensor-core kernels, and adds a portable ldmatrix fallback.

Changes

[Stdlib][GPU] mma.mojo: software-emulate ld_matrix on pre-Ampere NVIDIA (Volta sm_70, Turing sm_75) using shared-memory loads + warp shuffles.
[Kernels][GPU] route NVIDIA GPUs with compute capability < 8.0:
- _matmul_gpu → naive SIMT matmul kernel.
- flash_attention → mha_gpu_naive.

Ampere and newer are completely unaffected (the new paths are behind comptime gates on compute < A100.compute).

Validation on a Tesla T4 (sm_75)

- Reproduced the crash with a minimal kernel and with Qwen3-0.6B.
- The in-tree emulated ld_matrix compiles and matches the documented ldmatrix.m8n8 fragment layout for non-transposed and transposed loads (validated against a known shared-memory pattern).
- The naive bf16 matmul produces correct results on the T4.
- The dispatch predicate selects the naive paths on the T4 (RTX2060, compute 7.5 < 8.0).

Assisted by AI

BEGIN_PUBLIC
[Stdlib][GPU] Add pre-Ampere ldmatrix software fallback

`ld_matrix` lowers to the `llvm.nvvm.ldmatrix.sync.aligned.m8n8` NVVM intrinsic, which fails instruction selection on pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75); even raw `ldmatrix` PTX requires a newer ISA than is emitted for those targets. This made `max serve` abort with `LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.ldmatrix...x4.b16` when serving common LLMs on Turing GPUs such as the NVIDIA T4.

Emulate the instruction's data movement on these targets using portable shared-memory loads and warp shuffles. `ldmatrix.m8n8.b16` is pure 16-bit cell movement, so the emulation operates on 32-bit words and is independent of the element dtype; both the plain and `transpose` forms are covered. The native intrinsic path is unchanged for sm_80 and newer.

Validated on a Tesla T4 (sm_75): the emulated fragment layout matches the documented ldmatrix m8n8 mapping for the x1/x2/x4 variants and the transposed load.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>

…ernels

BEGIN_PUBLIC
[Kernels][GPU] Route pre-Ampere NVIDIA matmul and attention to SIMT kernels

Turing (sm_75) and Volta (sm_70) tensor cores do not support bf16/tf32 MMA, so the tensor-core matmul and flash-attention kernels cannot run there: they emit bf16 `mma` (unavailable in hardware) and `ldmatrix`. As a result common bf16 LLMs (Qwen3, Gemma 3, Llama 3.2) aborted during model compilation on these GPUs.

Route pre-Ampere NVIDIA GPUs (compute capability < 8.0) to the existing non-tensor-core fallbacks:
- `_matmul_gpu` dispatches to the naive SIMT matmul kernel.
- `flash_attention` dispatches to `mha_gpu_naive`.

Ampere and newer keep the tensor-core paths unchanged.

Verified on a Tesla T4 (sm_75) that the dispatch predicate selects the naive paths and that a bf16 matmul through the SIMT kernel produces correct results.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support/fallbacks for pre-Ampere NVIDIA GPUs by emulating ld_matrix where needed and preventing selection of tensor-core-dependent kernels (flash-attention / matmul) that won’t codegen or run correctly on sm_70/sm_75.

Changes:

- Emulate pre-Ampere ld_matrix via shared-memory loads + warp shuffles in std/gpu/compute/mma.mojo.
- Gate flash-attention tensor-core path behind a new GPU capability predicate to avoid pre-Ampere selection.
- Route pre-Ampere NVIDIA matmul/gemv dispatch to the SIMT naive kernel.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| mojo/stdlib/std/gpu/compute/mma.mojo | Adds pre-Ampere NV `ld_matrix` emulation using warp shuffles. |
| max/kernels/src/nn/attention/gpu/mha.mojo | Adds tensor-core capability gate to avoid flash-attention on pre-Ampere NV GPUs. |
| max/kernels/src/linalg/matmul/gpu/init.mojo | Adds explicit pre-Ampere NV fallback to naive SIMT matmul and refactors naive dispatch into a helper. |

msaelices · 2026-06-09T13:08:35Z

+    return SIMD[DType.uint32, 4](
+        shuffle_idx(row[0], src),
+        shuffle_idx(row[1], src),
+        shuffle_idx(row[2], src),
+        shuffle_idx(row[3], src),
+    )


The 4-word broadcast is intentional and a single-word shuffle would be incorrect here.

In the non-transpose path each receiving lane needs word lane % 4 of the source lane's row, and lane % 4 is per-lane-variable. shuffle_idx(var, src) returns the value of var as evaluated by lane src — so shuffle_idx(row_words[lane % 4], src) would return src's row_words[src % 4], not the receiving lane's row_words[lane % 4]. Because the word index differs across lanes, the source lane must expose all four words and each lane selects its own locally; that's what the 4 shuffles do. The transpose path (line 356/371) has the same constraint — word_idx is also derived from lane.

These are the slow-path pre-Ampere fallback GPUs (no ldmatrix at all), so keeping the emulation correct outweighs the extra shuffle traffic. Happy to add an inline comment making this explicit.
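To make the argument above concrete, here is a small host-side Python model of a 32-lane warp. It is only a sketch: `shuffle_idx` below is a toy stand-in for the GPU primitive (each lane reads a value as held by its source lane), the `(row, word)` tuples are illustrative tags rather than real data, and only matrix 0 of the non-transposed case is modeled.

```python
WARP_SIZE = 32


def shuffle_idx(per_lane_values, src_lanes):
    """Toy warp shuffle: lane l receives the value held by lane src_lanes[l]."""
    return [per_lane_values[src_lanes[lane]] for lane in range(WARP_SIZE)]


# Lane r holds one row as four 32-bit words; word w of lane r is tagged (r, w).
row_words = [[(lane, w) for w in range(4)] for lane in range(WARP_SIZE)]

# Non-transposed load of matrix 0: lane l reads from the lane holding row l // 4.
src = [lane // 4 for lane in range(WARP_SIZE)]

# What every lane must end up with: word (lane % 4) of row (lane // 4).
expected = [(lane // 4, lane % 4) for lane in range(WARP_SIZE)]

# Single shuffle: the word index is evaluated on the SOURCE lane, so the
# receiver gets word (src % 4) instead of word (lane % 4).
sender_selected = [row_words[lane][lane % 4] for lane in range(WARP_SIZE)]
single = shuffle_idx(sender_selected, src)

# Four shuffles: broadcast every word of the source row, then index locally.
broadcast = [
    shuffle_idx([row_words[lane][w] for lane in range(WARP_SIZE)], src)
    for w in range(4)
]
four = [broadcast[lane % 4][lane] for lane in range(WARP_SIZE)]

print("single shuffle correct:", single == expected)
print("four shuffles correct: ", four == expected)
```

In this model lane 6 needs word 2 of row 1, but the single-shuffle variant hands it word 1 of row 1, because the index was computed by lane 1.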

+            @parameter
+            for mtx in range(num_registers):
+                var src = UInt32(mtx * 8 + lane // 4)
+                regs[mtx] = _ld_matrix_broadcast_row(row_words, src)[lane % 4]


+                var w0 = _ld_matrix_broadcast_row(
+                    row_words, UInt32(mtx * 8 + base_row)
+                )[word_idx]
+                var w1 = _ld_matrix_broadcast_row(
+                    row_words, UInt32(mtx * 8 + base_row + 1)
+                )[word_idx]


+                var e0 = (w0 >> shift) & 0xFFFF
+                var e1 = (w1 >> shift) & 0xFFFF
+                regs[mtx] = e0 | (e1 << 16)
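The half-word packing in the transposed path can be checked on the host with a plain Python model. This is a sketch under assumptions: the derivations of `base_row`, `word_idx`, and `shift` are not visible in the excerpt above, so the ones below are inferred from the documented `ldmatrix ... .trans` fragment layout (each lane receives two vertically adjacent elements of one column), and rows are read directly rather than through shuffles.

```python
def pack_rows(matrix):
    """Pack each row of an 8x8 matrix of 16-bit values into four 32-bit words.

    The lower-index element of each pair occupies the low 16 bits.
    """
    return [
        [matrix[r][2 * w] | (matrix[r][2 * w + 1] << 16) for w in range(4)]
        for r in range(8)
    ]


def transposed_fragment(row_words, lane):
    """Register that `lane` would hold after a transposed 8x8 b16 load."""
    col = lane // 4            # assumed: column this lane reads from
    base_row = 2 * (lane % 4)  # assumed: first of two consecutive rows
    word_idx = col // 2        # 32-bit word containing that column
    shift = 16 * (col % 2)     # low or high half of that word
    w0 = row_words[base_row][word_idx]
    w1 = row_words[base_row + 1][word_idx]
    e0 = (w0 >> shift) & 0xFFFF
    e1 = (w1 >> shift) & 0xFFFF
    return e0 | (e1 << 16)


# Known pattern: element (r, c) holds the value r * 8 + c.
matrix = [[r * 8 + c for c in range(8)] for r in range(8)]
rows = pack_rows(matrix)
fragments = [transposed_fragment(rows, lane) for lane in range(32)]
print(hex(fragments[5]))
```

With this pattern, lane 5 ends up holding elements (2, 1) and (3, 1), i.e. 17 in the low half and 25 in the high half.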


msaelices · 2026-06-09T13:07:35Z

+@always_inline
+def _attention_uses_tensor_cores(info: GPUInfo) -> Bool:
+    """Returns whether the flash-attention tensor-core path is usable on `info`.
+
+    Pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75) lack the bf16/tf32
+    tensor cores and `ldmatrix` codegen support that the flash-attention
+    kernels rely on, so they must fall back to the naive attention kernel
+    (`mha_gpu_naive`). See issue #6653.
+    """
+    return has_amd_gpu_accelerator() or info.compute >= A100.compute


Renamed to _attention_mma_path_supported
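For readers skimming the thread, the effect of the predicate can be sketched in Python. Everything here is illustrative: the `GPUInfo` dataclass, its `vendor` field, and `select_attention_kernel` are hypothetical stand-ins, not the MAX API; the only parts taken from the diff are the "AMD, or compute >= 8.0" condition and the two kernel names.

```python
from dataclasses import dataclass

A100_COMPUTE = 8.0  # compute capability of Ampere A100, the gate used in the diff


@dataclass(frozen=True)
class GPUInfo:
    """Hypothetical stand-in for the real GPU description type."""
    vendor: str     # "nvidia" or "amd" (illustrative field)
    compute: float  # compute capability, e.g. 7.5 for Turing


def attention_mma_path_supported(info: GPUInfo) -> bool:
    """Mirror of the predicate: AMD always, NVIDIA only from Ampere onward."""
    return info.vendor == "amd" or info.compute >= A100_COMPUTE


def select_attention_kernel(info: GPUInfo) -> str:
    """Pick the accelerated path when supported, else the naive fallback."""
    if attention_mma_path_supported(info):
        return "flash_attention"
    return "mha_gpu_naive"


for name, info in [
    ("V100", GPUInfo("nvidia", 7.0)),
    ("T4", GPUInfo("nvidia", 7.5)),
    ("A100", GPUInfo("nvidia", 8.0)),
]:
    print(name, "->", select_attention_kernel(info))
```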

BEGIN_PUBLIC
[Stdlib][GPU] Use comptime for in ldmatrix fallback

The stdlib is built with `-Werror`, so the deprecated `@parameter for` in the pre-Ampere `ld_matrix` emulation broke the build. Switch to `comptime for` and type the b16 masks explicitly as `UInt32`.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>

BEGIN_PUBLIC
[Kernels][GPU] Rename attention MMA-path predicate

Rename `_attention_uses_tensor_cores` to `_attention_mma_path_supported`: the predicate also returns True for AMD GPUs (which use matrix cores, not NVIDIA tensor cores), so the new name better reflects "the accelerated MMA flash path is supported".
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>

BEGIN_PUBLIC
[Stdlib][GPU] Add pre-Ampere exp2/tanh fallbacks for half precision

The `ex2.approx.f16`/`.bf16` and all `tanh.approx.*` PTX instructions require PTX ISA 7.0+, which the backend does not emit for pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75). This broke compilation of elementwise kernels (e.g. softmax, gelu) when targeting these GPUs.

On pre-Ampere targets:
- `exp2` for `float16`/`bfloat16` now computes via `ex2.approx.ftz.f32` (valid on all NVIDIA targets) in float32.
- `tanh` computes from `exp2` as `1 - 2 / (exp2(2*log2(e)*x) + 1)`, since even `tanh.approx.f32` requires PTX ISA 7.0+.

Ampere and newer are unchanged.

Validated on a Tesla T4 (sm_75): exp2 and tanh produce correct results for float16 and bfloat16.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
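The `tanh` identity in this commit message can be sanity-checked on the host. The Python sketch below is not the Mojo code: it works in float64, uses an `exp2` helper that saturates to infinity on overflow to mimic IEEE hardware behaviour (an assumption of this sketch), and ignores the float32 cast and the accuracy of `ex2.approx`. The `naive_ratio` function is included only for contrast.

```python
import math

TWO_LOG2E = 2.8853900817779269  # 2 * log2(e)


def exp2(x: float) -> float:
    """2**x that saturates to +inf on overflow instead of raising."""
    try:
        return 2.0 ** x
    except OverflowError:
        return math.inf


def tanh_via_exp2(x: float) -> float:
    """tanh(x) = 1 - 2 / (e^(2x) + 1), with e^(2x) written as exp2(2*log2(e)*x)."""
    e2x = exp2(x * TWO_LOG2E)
    return 1.0 - 2.0 / (e2x + 1.0)


def naive_ratio(x: float) -> float:
    """Textbook (e^(2x) - 1) / (e^(2x) + 1); becomes inf/inf for large x."""
    e2x = exp2(x * TWO_LOG2E)
    return (e2x - 1.0) / (e2x + 1.0)


for x in (-1000.0, -0.5, 0.0, 0.5, 1000.0):
    print(x, tanh_via_exp2(x), math.tanh(x))
```

The `1 - 2 / (...)` form saturates cleanly to 1 and -1 at the extremes, whereas the ratio form yields NaN once `exp2` overflows.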

msaelices · 2026-06-09T09:14:06Z

✅ End-to-end validated on a Tesla T4 (sm_75)

Built the patched MAX from source (./bazelw run //max/python/max/entrypoints:pipelines -- generate) and ran google/gemma-3-1b-it on a real T4 — it now produces coherent output with no ldmatrix / bf16-mma / ptxas crash:

$ ... generate --model google/gemma-3-1b-it --prompt "The capital of France is"
The capital of France is Paris.
Do you want to know more about Paris?

Eval throughput (token-generation): 16.0 tokens/s
Prompt eval throughput (context-encoding): 6.7 tokens/s

Building from source surfaced one more pre-Ampere codegen gap, now fixed and included in this PR:

std/math exp2/tanh for float16/bfloat16 emitted ex2.approx.f16/tanh.approx.*, which require PTX ISA 7.0+ (not emitted for sm_70/sm_75). They now compute via ex2.approx.ftz.f32 in float32 on pre-Ampere. (tanh is derived from exp2 since even tanh.approx.f32 needs PTX 7.0+.)

So the full set of changes needed to run bf16 LLMs on Turing:

- ld_matrix software emulation (shared-mem loads + warp shuffles).
- matmul → naive SIMT kernel for pre-Ampere.
- flash-attention → mha_gpu_naive for pre-Ampere.
- exp2/tanh half-precision → float32 fallback for pre-Ampere.

Note: the local build also needed libxml2.so.2 present for the toolchain's ld.lld (an environment quirk on the test box, not a code change).

BradLarson · 2026-06-09T15:21:36Z

!sync

BradLarson · 2026-06-09T15:42:03Z

+            # The `1 - 2/(...)` form stays finite for large |x|. See issue #6653.
+            comptime two_log2e = 2.8853900817779269  # 2 * log2(e)
+            var e2x = exp2(x.cast[DType.float32]() * two_log2e)
+            return (1.0 - 2.0 / (e2x + 1.0)).cast[dtype]()


Can we just fall through to the CPU implementation here, and avoid needing another implementation?

> Can we just fall through to the CPU implementation here, and avoid needing another implementation?

@BradLarson I see you've added the imported-internally label. Do you want me to fix this anyway?

I synced it internally to kick off a CI run. I'll re-sync as changes are made, which will re-evaluate on internal CI. Don't take that as a blocker to continue to iterate on this. This particular comment is one I'm passing along from an internal reviewer.

fixed: msaelices@7d1d789

BradLarson · 2026-06-09T15:47:17Z

-                vector_constraints="=r,r",
-            ](x)
+            else:
+                # Pre-Ampere (Volta sm_70, Turing sm_75): `ex2.approx.f16`


Technically, per the PTX documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/), `ex2.approx.f16` works on sm_75 (Turing) and newer, but `ex2.approx.bf16` requires sm_90 or newer.

Per review feedback, drop the bespoke `exp2`-based tanh fallback for pre-Ampere NVIDIA (Volta sm_70, Turing sm_75) and instead fall through to the existing generic polynomial approximation. The generic path is arithmetic-only (no `tanh.approx`/`exp2` PTX intrinsics), so it codegens on every target, and reuses the same approximation already used elsewhere instead of maintaining a second implementation. The NVIDIA `tanh.approx` block is now gated on `_is_sm_8x_or_newer()`.

BEGIN_PUBLIC
[Stdlib][GPU] Use the generic tanh polynomial on pre-Ampere NVIDIA GPUs

Drop the bespoke exp2-based tanh fallback for pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75) and fall through to the existing generic polynomial approximation, which is arithmetic-only and codegens on every target. See issue modular#6653.
END_PUBLIC

Signed-off-by: Manuel Smith <[email protected]>

msaelices and others added 2 commits June 9, 2026 07:11

Copilot AI review requested due to automatic review settings June 9, 2026 07:27

msaelices requested review from a team as code owners June 9, 2026 07:27

github-actions Bot added the mojo-stdlib (Tag for issues related to standard library) and waiting-on-review labels Jun 9, 2026

Copilot AI reviewed Jun 9, 2026

msaelices marked this pull request as draft June 9, 2026 07:30

msaelices and others added 4 commits June 9, 2026 09:31

Merge branch 'main' into turing-sm75-matmul-support

5d71e0a

Merge branch 'main' into turing-sm75-matmul-support

b5b78b0

msaelices marked this pull request as ready for review June 9, 2026 13:12

modular-automation Bot assigned BradLarson Jun 9, 2026

modular-automation Bot removed the waiting-on-review label Jun 9, 2026

modularbot added the imported-internally label (Signals that a given pull request has been imported internally.) Jun 9, 2026

BradLarson reviewed Jun 9, 2026

github-actions Bot added the waiting-on-review label Jun 9, 2026

msaelices requested a review from BradLarson June 10, 2026 07:40

msaelices mentioned this pull request Jun 11, 2026

[Feature Request] NVFP4 dequant fallback for pre-Blackwell NVIDIA GPUs (serve Gemma 4 NVFP4 on Ampere) #6667

Open

[Kernels][GPU] Support pre-Ampere NVIDIA GPUs (Turing/Volta) for bf16 LLMs #6659

msaelices wants to merge 8 commits into modular:main from msaelices:turing-sm75-matmul-support

4 participants
