CUDA: refactor FA support/selection code #15454
Conversation
Not crashing is definitely correct, but I agree that it is usually not good to enable FA when it requires a fallback to the CPU. I think we can add a new FA mode called "auto" or similar that behaves as follows: FA would only be enabled when it is supported, and trying to use V-cache quantization when FA is not supported would result in an error rather than a fallback to the CPU with terrible performance.
Sounds good. So for the CLI interface, something like …?
Essentially yes, but the logic can be more granular than that. Flash attention can be enabled at a per-layer level, so if you have device A that supports FA and device B that doesn't, you can still use FA in the layers offloaded to device A.
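A rough sketch of what that per-layer decision could look like under an "auto" mode; `fa_mode` and `layer_uses_fa` are hypothetical names for illustration, not the actual llama.cpp/ggml API:

```cpp
// Hypothetical sketch, not the actual llama.cpp/ggml API: fa_mode and
// layer_uses_fa are illustrative names only.
enum class fa_mode { off, on, autodetect };

// Decide whether flash attention should be used for one layer, given whether
// the device that layer is offloaded to supports FA.
static bool layer_uses_fa(fa_mode mode, bool device_supports_fa) {
    switch (mode) {
        case fa_mode::off:
            return false;
        case fa_mode::on:
            // With a quantized V cache, "unsupported" becomes an error
            // instead of a silent fallback to the CPU.
            return device_supports_fa;
        case fa_mode::autodetect:
            // Enable FA only for layers whose device actually supports it.
            return device_supports_fa;
    }
    return false;
}
```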
Can I just double-check: K-cache quantization can still be used independently of FA on/off, and only V-cache quantization requires FA, correct?
Yes.
I should also clarify: when I was talking about a "functional change" I meant it in the ggml backend context. llama.cpp still checks whether FA is enabled when the V cache is quantized and raises an error if it's not.
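For reference, the llama.cpp-side check being described amounts to something like the following sketch; `validate_cache_types` is a hypothetical helper, not the real llama.cpp code:

```cpp
#include "ggml.h"

// Sketch of the check described above: a quantized V cache is only allowed
// when flash attention is enabled.
static bool validate_cache_types(enum ggml_type type_v, bool flash_attn) {
    if (ggml_is_quantized(type_v) && !flash_attn) {
        // llama.cpp raises an error here instead of silently falling back.
        return false;
    }
    return true;
}
```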
I noticed that the rocWMMA selection code seems to have been removed in this PR, and I don't see it referenced anywhere else. Is that intentional?
rocWMMA support is handled via …
Worth a tweet, a documentation note, or something similar on what llama-server should set for -fa once this is complete.
This PR refactors and deduplicates the CUDA code for determining which kernel to run. One of the possible return values of `ggml_cuda_get_best_fattn_kernel` is that there is no suitable kernel, and this is reused for determining whether the CUDA backend can run the ggml op at all. This fixes issues with e.g. Stories 260k, which crashed due to unexpected head sizes; with the new code, all head sizes that are not explicitly listed are treated as unsupported.

One functional change vs. master is that trying to use a quantized KV cache without `GGML_CUDA_FA_ALL_QUANTS` no longer results in a crash but instead falls back to the CPU code. This made the code simpler, but I'm not 100% sure whether it's the right thing to do in terms of usability.
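For readers skimming the diff, the selection pattern described above boils down to something like this sketch. Only the name `ggml_cuda_get_best_fattn_kernel` comes from the PR description; the enum values, the simplified signature, and the listed head sizes are placeholders rather than the real selection logic:

```cpp
#include <cstdint>

// Illustrative sketch only: the enum values, the simplified signature, and the
// head-size list are placeholders, not the real selection logic from this PR.
enum best_fattn_kernel {
    BEST_FATTN_KERNEL_NONE, // no suitable kernel -> the op is unsupported
    BEST_FATTN_KERNEL_VEC,
    BEST_FATTN_KERNEL_TILE,
    BEST_FATTN_KERNEL_MMA,
};

static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(int64_t head_size /*, device, shapes, ... */) {
    switch (head_size) {
        // Head sizes that are not explicitly listed fall through to NONE,
        // which is what fixes the crash on unexpected head sizes.
        case 64: case 80: case 96: case 112: case 128: case 256:
            return BEST_FATTN_KERNEL_MMA; // the real code also weighs arch, batch size, etc.
        default:
            return BEST_FATTN_KERNEL_NONE;
    }
}

// The same return value doubles as the backend's "can we run this op?" answer
// (cuda_backend_supports_fattn is a hypothetical name).
static bool cuda_backend_supports_fattn(int64_t head_size) {
    return ggml_cuda_get_best_fattn_kernel(head_size) != BEST_FATTN_KERNEL_NONE;
}
```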