[Kernels][GPU] Support pre-Ampere NVIDIA GPUs (Turing/Volta) for bf16 LLMs #6659

Open
msaelices wants to merge 8 commits into modular:main from msaelices:turing-sm75-matmul-support

msaelices commented Jun 9, 2026

Contributor

Summary

Fixes #6653
max serve aborts with LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.ldmatrix.sync.aligned.m8n8.x4.b16
when serving common bf16 LLMs (Qwen3, Gemma 3, Llama 3.2) on NVIDIA Turing GPUs (e.g. Tesla T4 / g4dn instances).

Probing a T4 (sm_75) directly shows there are two distinct blockers, not just ldmatrix:

| Op on sm_75 | Result | Cause |
| --- | --- | --- |
| `ldmatrix.m8n8.x4.b16` | fails | backend emits PTX ISA < 6.5 for sm_75; neither the NVVM intrinsic nor raw `ldmatrix` PTX selects |
| `mma.*.bf16` | fails | hardware: Turing tensor cores have no bf16/tf32 (requires sm_80) |
| naive SIMT matmul / `mha_gpu_naive` | works | scalar FMA, no tensor cores |

Since the models are bf16 and MAX routes both matmul and attention through the tensor-core path, fixing ldmatrix alone is insufficient: the bf16 mma is a hardware wall.

The fix routes pre-Ampere NVIDIA to the existing non-tensor-core kernels, and adds a portable ldmatrix fallback.

Changes

  • [Stdlib][GPU] mma.mojo: software-emulate ld_matrix on pre-Ampere NVIDIA (Volta sm_70, Turing sm_75) using shared-memory loads + warp shuffles.
  • [Kernels][GPU] route NVIDIA GPUs with compute capability < 8.0:
    • _matmul_gpu → naive SIMT matmul kernel.
    • flash_attention → mha_gpu_naive.

Ampere and newer are completely unaffected (the new paths are behind comptime gates on compute < A100.compute).
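
The routing rule described above can be sketched on the host side as follows. This is an illustrative Python model, not the Mojo code in the PR: `GpuInfo`, `is_pre_ampere_nvidia`, `select_kernels`, and the kernel labels are hypothetical names chosen for this example. Only the threshold (compute capability 8.0) and the two fallback targets come from the description.

```python
from dataclasses import dataclass

# Compute capability of the A100 (sm_80), the threshold named in the PR.
AMPERE_COMPUTE = 8.0


@dataclass(frozen=True)
class GpuInfo:
    """Hypothetical stand-in for the GPU description used by the dispatch."""

    vendor: str     # "nvidia" or "amd"
    compute: float  # NVIDIA compute capability, e.g. 7.5 for Turing


def is_pre_ampere_nvidia(gpu: GpuInfo) -> bool:
    """True for NVIDIA GPUs below sm_80 (e.g. Volta sm_70, Turing sm_75)."""
    return gpu.vendor == "nvidia" and gpu.compute < AMPERE_COMPUTE


def select_kernels(gpu: GpuInfo) -> dict:
    """Pick non-tensor-core kernels on pre-Ampere NVIDIA, accelerated ones otherwise."""
    if is_pre_ampere_nvidia(gpu):
        return {"matmul": "naive_simt", "attention": "mha_gpu_naive"}
    return {"matmul": "tensor_core", "attention": "flash_attention"}


t4 = GpuInfo(vendor="nvidia", compute=7.5)
a100 = GpuInfo(vendor="nvidia", compute=8.0)
print(select_kernels(t4))
print(select_kernels(a100))
```

In the PR itself the equivalent check is a comptime gate, so on Ampere and newer the fallback branch is never compiled in; the runtime `if` here only models the decision.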

Validation on a Tesla T4 (sm_75)

  • Reproduced the crash with a minimal kernel and with Qwen3-0.6B.
  • The in-tree emulated ld_matrix compiles and matches the documented
    ldmatrix.m8n8 fragment layout for non-transposed and transposed loads
    (validated against a known shared-memory pattern).
  • The naive bf16 matmul produces correct results on the T4.
  • The dispatch predicate selects the naive paths on the T4
    (RTX2060, compute 7.5 < 8.0).

Assisted by AI

msaelices and others added 2 commits June 9, 2026 07:11
BEGIN_PUBLIC
[Stdlib][GPU] Add pre-Ampere ldmatrix software fallback

`ld_matrix` lowers to the `llvm.nvvm.ldmatrix.sync.aligned.m8n8` NVVM
intrinsic, which fails instruction selection on pre-Ampere NVIDIA GPUs
(Volta sm_70, Turing sm_75); even raw `ldmatrix` PTX requires a newer ISA
than is emitted for those targets. This made `max serve` abort with
`LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.ldmatrix...x4.b16` when
serving common LLMs on Turing GPUs such as the NVIDIA T4.

Emulate the instruction's data movement on these targets using portable
shared-memory loads and warp shuffles. `ldmatrix.m8n8.b16` is pure 16-bit
cell movement, so the emulation operates on 32-bit words and is independent
of the element dtype; both the plain and `transpose` forms are covered. The
native intrinsic path is unchanged for sm_80 and newer.

Validated on a Tesla T4 (sm_75): the emulated fragment layout matches the
documented ldmatrix m8n8 mapping for the x1/x2/x4 variants and the
transposed load.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>

BEGIN_PUBLIC
[Kernels][GPU] Route pre-Ampere NVIDIA matmul and attention to SIMT kernels

Turing (sm_75) and Volta (sm_70) tensor cores do not support bf16/tf32 MMA,
so the tensor-core matmul and flash-attention kernels cannot run there: they
emit bf16 `mma` (unavailable in hardware) and `ldmatrix`. As a result common
bf16 LLMs (Qwen3, Gemma 3, Llama 3.2) aborted during model compilation on
these GPUs.

Route pre-Ampere NVIDIA GPUs (compute capability < 8.0) to the existing
non-tensor-core fallbacks:

  - `_matmul_gpu` dispatches to the naive SIMT matmul kernel.
  - `flash_attention` dispatches to `mha_gpu_naive`.

Ampere and newer keep the tensor-core paths unchanged. Verified on a Tesla
T4 (sm_75) that the dispatch predicate selects the naive paths and that a
bf16 matmul through the SIMT kernel produces correct results.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
Copilot AI review requested due to automatic review settings June 9, 2026 07:27
@msaelices msaelices requested review from a team as code owners June 9, 2026 07:27
@github-actions github-actions Bot added mojo-stdlib Tag for issues related to standard library waiting-on-review labels Jun 9, 2026

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support/fallbacks for pre-Ampere NVIDIA GPUs by emulating ld_matrix where needed and preventing selection of tensor-core-dependent kernels (flash-attention / matmul) that won’t codegen or run correctly on sm_70/sm_75.

Changes:

  • Emulate pre-Ampere ld_matrix via shared-memory loads + warp shuffles in std/gpu/compute/mma.mojo.
  • Gate flash-attention tensor-core path behind a new GPU capability predicate to avoid pre-Ampere selection.
  • Route pre-Ampere NVIDIA matmul/gemv dispatch to the SIMT naive kernel.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `mojo/stdlib/std/gpu/compute/mma.mojo` | Adds pre-Ampere NV ld_matrix emulation using warp shuffles. |
| `max/kernels/src/nn/attention/gpu/mha.mojo` | Adds tensor-core capability gate to avoid flash-attention on pre-Ampere NV GPUs. |
| `max/kernels/src/linalg/matmul/gpu/__init__.mojo` | Adds explicit pre-Ampere NV fallback to naive SIMT matmul and refactors naive dispatch into a helper. |

Comment on lines +254 to +259
return SIMD[DType.uint32, 4](
    shuffle_idx(row[0], src),
    shuffle_idx(row[1], src),
    shuffle_idx(row[2], src),
    shuffle_idx(row[3], src),
)

Contributor Author

done

Contributor Author

The 4-word broadcast is intentional and a single-word shuffle would be incorrect here.

In the non-transpose path each receiving lane needs word lane % 4 of the source lane's row, and lane % 4 is per-lane-variable. shuffle_idx(var, src) returns the value of var as evaluated by lane src — so shuffle_idx(row_words[lane % 4], src) would return src's row_words[src % 4], not the receiving lane's row_words[lane % 4]. Because the word index differs across lanes, the source lane must expose all four words and each lane selects its own locally; that's what the 4 shuffles do. The transpose path (line 356/371) has the same constraint — word_idx is also derived from lane.

These are the slow-path pre-Ampere fallback GPUs (no ldmatrix at all), so keeping the emulation correct outweighs the extra shuffle traffic. Happy to add an inline comment making this explicit.
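
The argument above can be checked with a small host-side simulation. The sketch below is a Python model of one warp, not the Mojo implementation: `shuffle_idx` here is a list-based stand-in for the warp shuffle, and `word`, `ld_matrix_broadcast`, `ld_matrix_single_shuffle`, and `expected_fragment` are hypothetical helpers. It assumes the documented non-transposed `ldmatrix.m8n8` layout, in which lane `L` receives word `L % 4` of row `L // 4` of each matrix, and lane `t` supplies row `t % 8` of matrix `t // 8`.

```python
WARP_SIZE = 32
WORDS_PER_ROW = 4  # one 8-element b16 row = four 32-bit words


def word(mtx, row, w):
    """Encode (matrix, row, word index) as a readable tag instead of real data."""
    return mtx * 100 + row * 10 + w


# Per-lane state: lane t has loaded the four words of row t % 8 of matrix t // 8.
row_words = [
    [word(t // 8, t % 8, w) for w in range(WORDS_PER_ROW)]
    for t in range(WARP_SIZE)
]


def shuffle_idx(per_lane_value, per_lane_src):
    """Model of a warp shuffle: each lane reads the value held by its source lane."""
    return [per_lane_value[per_lane_src[lane]] for lane in range(WARP_SIZE)]


def ld_matrix_broadcast(mtx):
    """Expose all four words of the source row, then let each lane pick its own."""
    src = [mtx * 8 + lane // 4 for lane in range(WARP_SIZE)]
    exposed = [
        shuffle_idx([row_words[t][w] for t in range(WARP_SIZE)], src)
        for w in range(WORDS_PER_ROW)
    ]
    return [exposed[lane % 4][lane] for lane in range(WARP_SIZE)]


def ld_matrix_single_shuffle(mtx):
    """Incorrect variant: the word index is evaluated by the source lane."""
    src = [mtx * 8 + lane // 4 for lane in range(WARP_SIZE)]
    picked_by_owner = [row_words[t][t % 4] for t in range(WARP_SIZE)]
    return shuffle_idx(picked_by_owner, src)


def expected_fragment(mtx):
    """Documented layout: lane L holds word L % 4 of row L // 4."""
    return [word(mtx, lane // 4, lane % 4) for lane in range(WARP_SIZE)]


for mtx in range(4):
    assert ld_matrix_broadcast(mtx) == expected_fragment(mtx)
print(ld_matrix_single_shuffle(0) == expected_fragment(0))
```

In the single-shuffle variant, lane 1 receives word 0 of row 0 rather than word 1, because the word index was computed from the source lane's id. That is the failure mode the reply describes.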

@parameter
for mtx in range(num_registers):
    var src = UInt32(mtx * 8 + lane // 4)
    regs[mtx] = _ld_matrix_broadcast_row(row_words, src)[lane % 4]

Comment on lines +369 to +374
var w0 = _ld_matrix_broadcast_row(
    row_words, UInt32(mtx * 8 + base_row)
)[word_idx]
var w1 = _ld_matrix_broadcast_row(
    row_words, UInt32(mtx * 8 + base_row + 1)
)[word_idx]
Comment thread mojo/stdlib/std/gpu/compute/mma.mojo Outdated
Comment on lines +375 to +377
var e0 = (w0 >> shift) & 0xFFFF
var e1 = (w1 >> shift) & 0xFFFF
regs[mtx] = e0 | (e1 << 16)
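
The extract-and-repack step in the transposed path can be modelled on the host for a single 8x8 matrix. This Python sketch is illustrative only. The review excerpt does not show how `base_row`, `word_idx`, and `shift` are defined, so the definitions below are assumptions derived from the documented `ldmatrix ... .trans` layout, in which lane `L` receives elements `(2 * (L % 4), L // 4)` and `(2 * (L % 4) + 1, L // 4)`. `pack_row` and `transposed_register` are hypothetical helpers.

```python
def pack_row(elems):
    """Pack eight 16-bit elements into four 32-bit words (even element in the low half)."""
    return [elems[2 * w] | (elems[2 * w + 1] << 16) for w in range(4)]


def transposed_register(rows_words, lane):
    """Build one lane's register for a transposed 8x8 b16 load."""
    col = lane // 4             # column of the source matrix this lane reads
    base_row = 2 * (lane % 4)   # first of the two consecutive source rows
    word_idx = col // 2         # 32-bit word that contains that column
    shift = 16 * (col % 2)      # select the low or high 16-bit half
    w0 = rows_words[base_row][word_idx]
    w1 = rows_words[base_row + 1][word_idx]
    e0 = (w0 >> shift) & 0xFFFF
    e1 = (w1 >> shift) & 0xFFFF
    return e0 | (e1 << 16)


# Source matrix with distinct 16-bit entries, stored row-major as packed words.
matrix = [[r * 8 + c for c in range(8)] for r in range(8)]
rows_words = [pack_row(row) for row in matrix]

# Reference: transpose first, then apply the non-transposed fragment layout.
transposed = [[matrix[r][c] for r in range(8)] for c in range(8)]
reference = [pack_row(transposed[lane // 4])[lane % 4] for lane in range(32)]

result = [transposed_register(rows_words, lane) for lane in range(32)]
print(result == reference)
```

In the emulation, `w0` and `w1` come from other lanes via the four-word broadcast rather than from a local array; the bit manipulation is the same.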
Comment on lines +282 to +291
@always_inline
def _attention_uses_tensor_cores(info: GPUInfo) -> Bool:
    """Returns whether the flash-attention tensor-core path is usable on `info`.

    Pre-Ampere NVIDIA GPUs (Volta sm_70, Turing sm_75) lack the bf16/tf32
    tensor cores and `ldmatrix` codegen support that the flash-attention
    kernels rely on, so they must fall back to the naive attention kernel
    (`mha_gpu_naive`). See issue #6653.
    """
    return has_amd_gpu_accelerator() or info.compute >= A100.compute

msaelices commented Jun 9, 2026

Contributor Author

Renamed to _attention_mma_path_supported

@msaelices msaelices marked this pull request as draft June 9, 2026 07:30
msaelices and others added 4 commits June 9, 2026 09:31
BEGIN_PUBLIC
[Stdlib][GPU] Use comptime for in ldmatrix fallback

The stdlib is built with `-Werror`, so the deprecated `@parameter for` in the
pre-Ampere `ld_matrix` emulation broke the build. Switch to `comptime for` and
type the b16 masks explicitly as `UInt32`.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
BEGIN_PUBLIC
[Kernels][GPU] Rename attention MMA-path predicate

Rename `_attention_uses_tensor_cores` to `_attention_mma_path_supported`: the
predicate also returns True for AMD GPUs (which use matrix cores, not NVIDIA
tensor cores), so the new name better reflects "the accelerated MMA flash path
is supported".
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
BEGIN_PUBLIC
[Stdlib][GPU] Add pre-Ampere exp2/tanh fallbacks for half precision

The `ex2.approx.f16`/`.bf16` and all `tanh.approx.*` PTX instructions require
PTX ISA 7.0+, which the backend does not emit for pre-Ampere NVIDIA GPUs
(Volta sm_70, Turing sm_75). This broke compilation of elementwise kernels
(e.g. softmax, gelu) when targeting these GPUs.

On pre-Ampere targets:
  - `exp2` for `float16`/`bfloat16` now computes via `ex2.approx.ftz.f32`
    (valid on all NVIDIA targets) in float32.
  - `tanh` computes from `exp2` as `1 - 2 / (exp2(2*log2(e)*x) + 1)`, since
    even `tanh.approx.f32` requires PTX ISA 7.0+.

Ampere and newer are unchanged. Validated on a Tesla T4 (sm_75): exp2 and
tanh produce correct results for float16 and bfloat16.
END_PUBLIC

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Manuel Smith <[email protected]>
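
The tanh identity in the commit message above follows from tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) = 1 - 2 / (e^(2x) + 1), with e^(2x) = 2^(2·log2(e)·x). The Python sketch below checks that identity numerically in double precision; it does not model the float32 `ex2.approx.ftz.f32` instruction. `exp2_saturating` and `tanh_via_exp2` are hypothetical helper names, and the saturation to infinity is added only because Python raises on overflow, where GPU arithmetic would return infinity.

```python
import math

TWO_LOG2E = 2.0 * math.log2(math.e)  # approximately 2.88539008177792


def exp2_saturating(y):
    """2**y, returning +inf on overflow as floating-point hardware would."""
    try:
        return 2.0 ** y
    except OverflowError:
        return math.inf


def tanh_via_exp2(x):
    """tanh(x) computed as 1 - 2 / (2**(2*log2(e)*x) + 1)."""
    e2x = exp2_saturating(TWO_LOG2E * x)
    return 1.0 - 2.0 / (e2x + 1.0)


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_via_exp2(x) - math.tanh(x)) < 1e-12

# The form stays finite at the extremes: inf gives 1 - 0, and 0 gives 1 - 2.
print(tanh_via_exp2(1000.0), tanh_via_exp2(-1000.0))
```

The alternative form (e2x - 1) / (e2x + 1) would evaluate to inf / inf, which is NaN, for large positive x.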
@msaelices

Contributor Author

✅ End-to-end validated on a Tesla T4 (sm_75)

Built the patched MAX from source (./bazelw run //max/python/max/entrypoints:pipelines -- generate) and ran google/gemma-3-1b-it on a real T4 — it now produces coherent output with no ldmatrix / bf16-mma / ptxas crash:

$ ... generate --model google/gemma-3-1b-it --prompt "The capital of France is"
The capital of France is Paris.
Do you want to know more about Paris?

Eval throughput (token-generation): 16.0 tokens/s
Prompt eval throughput (context-encoding): 6.7 tokens/s

Building from source surfaced one more pre-Ampere codegen gap, now fixed and included in this PR:

  • std/math exp2/tanh for float16/bfloat16 emitted ex2.approx.f16/tanh.approx.*, which require PTX ISA 7.0+ (not emitted for sm_70/sm_75). They now compute via ex2.approx.ftz.f32 in float32 on pre-Ampere. (tanh is derived from exp2 since even tanh.approx.f32 needs PTX 7.0+.)

So the full set of changes needed to run bf16 LLMs on Turing:

  1. ld_matrix software emulation (shared-mem loads + warp shuffles).
  2. matmul → naive SIMT kernel for pre-Ampere.
  3. flash-attention → mha_gpu_naive for pre-Ampere.
  4. exp2/tanh half-precision → float32 fallback for pre-Ampere.

Note: the local build also needed libxml2.so.2 present for the toolchain's ld.lld (an environment quirk on the test box, not a code change).

@msaelices msaelices marked this pull request as ready for review June 9, 2026 13:12
@BradLarson

Member

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Jun 9, 2026
Comment thread mojo/stdlib/std/math/math.mojo Outdated
# The `1 - 2/(...)` form stays finite for large |x|. See issue #6653.
comptime two_log2e = 2.8853900817779269 # 2 * log2(e)
var e2x = exp2(x.cast[DType.float32]() * two_log2e)
return (1.0 - 2.0 / (e2x + 1.0)).cast[dtype]()

Member

Can we just fall through to the CPU implementation here, and avoid needing another implementation?

Contributor Author

> Can we just fall through to the CPU implementation here, and avoid needing another implementation?

@BradLarson I've seen you have added the imported-internally label. Do you want me to fix this anyway?

Member

I synced it internally to kick off a CI run. I'll re-sync as changes are made, which will re-evaluate on internal CI. Don't take that as a blocker to continue to iterate on this. This particular comment is one I'm passing along from an internal reviewer.

        vector_constraints="=r,r",
    ](x)
else:
    # Pre-Ampere (Volta sm_70, Turing sm_75): `ex2.approx.f16`

Member

Technically, according to the PTX notes (https://docs.nvidia.com/cuda/parallel-thread-execution/), ex2.approx.f16 works on sm_75 (Turing) and newer, but ex2.approx.bf16 requires sm_90 or newer.

Per review feedback, drop the bespoke `exp2`-based tanh fallback for pre-Ampere
NVIDIA (Volta sm_70, Turing sm_75) and instead fall through to the existing
generic polynomial approximation. The generic path is arithmetic-only (no
`tanh.approx`/`exp2` PTX intrinsics), so it codegens on every target, and reuses
the same approximation already used elsewhere instead of maintaining a second
implementation. The NVIDIA `tanh.approx` block is now gated on
`_is_sm_8x_or_newer()`.

BEGIN_PUBLIC
[Stdlib][GPU] Use the generic tanh polynomial on pre-Ampere NVIDIA GPUs

Drop the bespoke exp2-based tanh fallback for pre-Ampere NVIDIA GPUs (Volta
sm_70, Turing sm_75) and fall through to the existing generic polynomial
approximation, which is arithmetic-only and codegens on every target. See
issue modular#6653.
END_PUBLIC

Signed-off-by: Manuel Smith <[email protected]>
Development

Successfully merging this pull request may close these issues.

[BUG]: max serve LLVM error when serving in nvidia T4 chips

4 participants