
Conversation

JohannesGaessler
Collaborator

This PR refactors and deduplicates the CUDA code for determining which kernel to run. One of the possible return values of ggml_cuda_get_best_fattn_kernel is that there is no suitable kernel; this is reused for determining whether the CUDA backend can run the ggml op. This PR fixes issues with e.g. Stories 260k, which crashed due to unexpected head sizes: with the new code, all head sizes that are not explicitly listed are treated as unsupported.

One functional change vs. master is that trying to use a quantized KV cache without GGML_CUDA_FA_ALL_QUANTS no longer results in a crash but instead falls back to the CPU code. This made the code simpler but I'm not 100% sure whether it's the right thing to do in terms of usability.
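A minimal sketch of the selection pattern described above, assuming a simplified enum. Apart from ggml_cuda_get_best_fattn_kernel itself, the enum values, the head-size list, and the supported-op helper are illustrative placeholders, not necessarily the names used in the PR:

#include "ggml.h"

// Sketch: one function both picks the kernel and signals "unsupported".
enum best_fattn_kernel {
    BEST_FATTN_KERNEL_NONE, // no suitable kernel -> the CUDA backend cannot run the op
    BEST_FATTN_KERNEL_TILE,
    BEST_FATTN_KERNEL_VEC,
    BEST_FATTN_KERNEL_MMA,
};

static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const ggml_tensor * dst) {
    (void) device; // device-specific selection omitted in this sketch

    const ggml_tensor * q = dst->src[0];
    switch (q->ne[0]) {            // head size
        case 64: case 80: case 96:
        case 112: case 128: case 256:
            break;                 // explicitly listed head sizes (illustrative subset)
        default:
            return BEST_FATTN_KERNEL_NONE; // anything not listed is treated as unsupported
    }
    // ... pick the fastest kernel for this device and tensor shape ...
    return BEST_FATTN_KERNEL_MMA;
}

// The same function answers the supports_op question for FLASH_ATTN_EXT:
static bool fattn_supported(const int device, const ggml_tensor * dst) {
    return ggml_cuda_get_best_fattn_kernel(device, dst) != BEST_FATTN_KERNEL_NONE;
}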

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 20, 2025
@slaren
Member

slaren commented Aug 20, 2025

One functional change vs. master is that trying to use a quantized KV cache without GGML_CUDA_FA_ALL_QUANTS no longer results in a crash but instead falls back to the CPU code. This made the code simpler but I'm not 100% sure whether it's the right thing to do in terms of usability.

Not crashing is definitely correct, but I agree that it is usually not good to enable FA when it requires a fallback to the CPU. I think we can add a new FA mode called "auto" or similar that behaves as follows:

  • If FA is supported, then it is enabled
  • If FA is not supported, but the settings can run without FA, then it is disabled
  • If FA is not supported, but FA is required due to V-cache quantization, then it would return an error

So FA would only be enabled when it is supported, and trying to use V quantization when FA is not supported would result in an error rather than a fallback to the CPU with terrible performance.
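A rough sketch of the "auto" resolution described above; the struct and function names are hypothetical, not existing llama.cpp API:

// Hypothetical resolution of an "auto" flash-attention setting.
struct fa_result {
    bool enabled; // run with flash attention
    bool error;   // configuration cannot be satisfied
};

static fa_result resolve_fa_auto(const bool fa_supported, const bool v_cache_quantized) {
    if (fa_supported) {
        return { true, false };  // supported -> enable
    }
    if (!v_cache_quantized) {
        return { false, false }; // not supported, but not required either -> disable
    }
    return { false, true };      // required by V-cache quantization but unsupported -> error
}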

@JohannesGaessler
Collaborator Author

Sounds good. So for the CLI interface, something like -fa 0, -fa 1, and -fa -1? And for -fa -1, create a graph, check whether all FA ggml ops are supported, and then set to either 0 or 1?
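A rough sketch of what that -fa -1 check could look like; the helper itself is hypothetical, while ggml_graph_n_nodes, ggml_graph_node, ggml_backend_supports_op, and GGML_OP_FLASH_ATTN_EXT are existing ggml API:

#include "ggml.h"
#include "ggml-backend.h"

// Walk a built graph and check every FLASH_ATTN_EXT node against the backend.
static bool all_fattn_ops_supported(ggml_backend_t backend, struct ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op != GGML_OP_FLASH_ATTN_EXT) {
            continue;
        }
        if (!ggml_backend_supports_op(backend, node)) {
            return false; // at least one FA op unsupported -> resolve -fa -1 to 0
        }
    }
    return true; // all FA ops supported -> resolve -fa -1 to 1
}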

@slaren
Member

slaren commented Aug 20, 2025

Essentially yes, but the logic can be more granular than that. Flash attention can be enabled at a per-layer level, so if you have device A that supports FA, and device B that doesn't, you can still use FA in the layers offloaded to device A.
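A sketch of that per-layer granularity, assuming one assigned backend and one representative FLASH_ATTN_EXT node per layer; the function and its inputs are illustrative, not existing llama.cpp code:

#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Decide flash attention independently for each layer based on the backend
// its attention is offloaded to (device A may use FA while device B does not).
static std::vector<bool> decide_fa_per_layer(
        const std::vector<ggml_backend_t> & layer_backend,     // backend assigned to each layer
        const std::vector<ggml_tensor *>  & layer_fattn_node) { // representative FA op per layer
    std::vector<bool> use_fa(layer_backend.size());
    for (size_t il = 0; il < layer_backend.size(); ++il) {
        use_fa[il] = ggml_backend_supports_op(layer_backend[il], layer_fattn_node[il]);
    }
    return use_fa;
}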

@JohannesGaessler JohannesGaessler merged commit 13aeb7a into ggml-org:master Aug 20, 2025
47 checks passed
@LostRuins
Collaborator

One functional change vs. master is that trying to use a quantized KV cache without GGML_CUDA_FA_ALL_QUANTS no longer results in a crash but instead falls back to the CPU code

So FA would only be enabled when it is supported, and trying to use V quantization when not supported would result in an error, rather than fallback to CPU with terrible performance.

Can I just double check: K-cache quantization can still be used independently of FA on/off, and only V-cache quantization requires it, correct?

@JohannesGaessler
Collaborator Author

Yes.

@JohannesGaessler
Collaborator Author

I should also clarify: when I was talking about a "functional change" I meant it in the ggml backend context. llama.cpp still checks whether FA is enabled when the V cache is quantized and raises an error if it's not.
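For context, the llama.cpp-side check is roughly of this shape; this is a simplified sketch, not the literal code:

#include <cstdio>
#include "ggml.h"

// Simplified: reject a quantized V cache unless flash attention is enabled.
static bool validate_cache_types(const ggml_type type_v, const bool flash_attn) {
    if (!flash_attn && ggml_is_quantized(type_v)) {
        fprintf(stderr, "V cache quantization requires flash_attn\n");
        return false;
    }
    return true;
}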

@LostRuins
Collaborator

I noticed that the rocWMMA selection code seems to have been removed in this PR, and I don't see it referenced anywhere else.

https://github.com/ggml-org/llama.cpp/pull/15454/files#diff-0f4ae5d475b6fe0b22b0dc2afda5ffbe8d690feb23f161fa8c84cbbca0473649L283-L286

#if defined(GGML_HIP_ROCWMMA_FATTN)
    if (GGML_CUDA_CC_IS_AMD(cc) && fp16_mma_available(cc)) {
        ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
        return;
    }
#endif // defined(GGML_HIP_ROCWMMA_FATTN)

Is that intentional?

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Aug 25, 2025

rocWMMA support is handled via fp16_mma_available in common.cuh.

@ericcurtin
Collaborator

Sounds good. So for the CLI interface, something like -fa 0, -fa 1, and -fa -1? And for -fa -1, create a graph, check whether all FA ggml ops are supported, and then set to either 0 or 1?

Worth a tweet or a documentation note on what llama-server users should set for -fa once this is completed...

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 2, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 3, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 9, 2025