kv-cache : add SWA support #13194
It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp. |
Yes this is what I was thinking about for months now. There is no better solution than to disable context caching in this case. An alternative solution is to allow user to choose one of the 2: either a proper SWA cache (good for memory) or allocate full (good for reusing cache)
I'm feeling 50/50 here. One of the biggest use cases would be to process a large and diverse set of documents locally. In this case, the user may never reuse the cache because each new request is a new document.
The way I am approaching it is to have the "KV cells" information maintained separately for the non-SWA and SWA layers. This way, upon each KV cache commit (see #12799), we can do a pass over the SWA cells and automatically remove those that have position The rest of the logic is the same - it just operates on both set of cells. For example, |
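The pruning pass described in this comment could look roughly like the sketch below. The exact cut-off condition is cut short in the comment, so the rule used here (drop a cell once it is `n_swa` or more positions behind the newest position of its sequence) is an assumption, and `kv_cell_sketch` and `prune_swa_cells` are hypothetical names, not llama.cpp API.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified cell: one token position owned by one sequence.
// pos < 0 marks an empty cell.
struct kv_cell_sketch {
    int32_t pos    = -1;
    int32_t seq_id = -1;
};

// Assumed rule: after a commit, a cell of the SWA cache is dropped once its
// position is n_swa or more behind the newest position of its sequence.
// Returns the number of cells that were freed.
inline int prune_swa_cells(std::vector<kv_cell_sketch> & cells, int32_t n_swa) {
    // first pass: newest position per sequence
    std::map<int32_t, int32_t> pos_max;
    for (const auto & c : cells) {
        if (c.pos < 0) {
            continue;
        }
        auto it = pos_max.find(c.seq_id);
        if (it == pos_max.end() || c.pos > it->second) {
            pos_max[c.seq_id] = c.pos;
        }
    }

    // second pass: free everything that fell out of the window
    int n_freed = 0;
    for (auto & c : cells) {
        if (c.pos < 0) {
            continue;
        }
        if (pos_max[c.seq_id] - c.pos >= n_swa) {
            c = kv_cell_sketch{}; // mark the cell as empty
            ++n_freed;
        }
    }
    return n_freed;
}
```

The non-SWA cells would simply never go through this pass, which is why the two sets of cells are tracked separately.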
My experience with the Gemma models in the context of Elo HeLLM has been that they required a disproportionate amount of computational resources to run benchmarks. The reason is that I was able to fit comparatively fewer parallel slots on 1 or 2 GPUs and my throughput was lower as a consequence. At least for my use case I value low memory usage for the context more than I value prompt caching because I have O(10000) short prompts and I'm bottlenecked mostly by generation throughput. |
Continuing thinking about the logic for when to discard tokens from the cache, it's indeed tricky and not very clear how to do. For example, when doing speculative decoding, we can submit a draft batch with |
I second slaren's opinion. As far as I know, vllm also doesn't support iSWA, while hf transformers and ollama do. vllm is geared toward the multi-user server use case; I suppose that's why they don't support it. Ideally, it should be implemented as a switch that lets the user choose which one to use. By default, iSWA should be on for llama-cli but off for llama-server.
Yes I was thinking about this too, I think it can be a bit complicated to manage this case, but totally possible. We can let user specify how many tokens are allocated in the sliding layers. For example, given We can further let And finally, we may need to add an API to return the furthest |
I'd +1 the ability to allow the user to switch. Some use-cases benefit greatly from the prefix caching (example: on Metal systems with 48GB of RAM/VRAM, where pp is much slower than non-Metal pp and we have plenty of VRAM anyway) so allowing the user to choose would be optimal. |
Is llama.cpp single-user mode the most used case because that's what the user base prefers, or is it like that because the server performance goes down a lot with more than 3 users? (#10860) We are really thankful for all the work you main contributors do on this project, but please do not fall into this « self-fulfilling prophecy » trap.
I personally use llama.cpp for server use (with multiple users). |
According to the Gemma3 paper, interleaved Sliding Window Attention reduces KV Cache memory usage by 1/5, so it would be much easier to run as right now KV Cache size is much heavier than comparable models. If the drawback is the absence of prompt caching, then indeed it would make sense to give the option to the user and let them decide on a per use case basis. I think for cases where you use RAG/Vector DB it would prove to be very useful as prompt caching does not work when beginning of the context changes anyway. I would personally agree with Johannes here, faster token generation thanks to SWA would be more useful for me as well since I'm using vector DB. So for the use cases short prompts/RAG it would make a lot of sense. For simple chat use cases without any RAG, prompt caching would probably make it faster overall compared to SWA and no prompt cache. Overall, I think having the option would be a great addition to llama.cpp. If it helps, Ollama implemented iSWA support for Gemma 3, since the project is pretty similar to llama.cpp, perhaps it's useful to get a rough idea on how to implement it (although Ollama is a different coding language): https://github.com/ollama/ollama/blob/2fec73eef6e9482f606f185ebb2ae4f75ad1a37c/model/models/gemma3/model_text.go#L190 I've been thinking, does Ollama support prompt caching? Since Gemma 3 SWA is supported in Ollama, how did they handle it? |
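To make the memory argument concrete, here is a rough estimate of KV cache size with and without a separate, smaller cache for the SWA layers. This is only a sketch: the function names are hypothetical, the formula ignores padding and quantization block sizes, and any layer counts or dimensions plugged in are illustrative rather than taken from a specific model.

```cpp
#include <cstdint>

// Bytes for K and V across n_layer layers, each holding n_cells cells of
// n_embd_kv elements, with bytes_per_elem bytes per element.
inline uint64_t kv_bytes_sketch(uint64_t n_layer, uint64_t n_cells,
                                uint64_t n_embd_kv, uint64_t bytes_per_elem) {
    return 2 * n_layer * n_cells * n_embd_kv * bytes_per_elem; // 2 = K + V
}

// Same estimate when the SWA layers get their own cache with n_cells_swa
// cells, while the remaining layers keep the full context n_ctx.
inline uint64_t kv_bytes_iswa_sketch(uint64_t n_layer_full, uint64_t n_layer_swa,
                                     uint64_t n_ctx, uint64_t n_cells_swa,
                                     uint64_t n_embd_kv, uint64_t bytes_per_elem) {
    return kv_bytes_sketch(n_layer_full, n_ctx,       n_embd_kv, bytes_per_elem)
         + kv_bytes_sketch(n_layer_swa,  n_cells_swa, n_embd_kv, bytes_per_elem);
}
```

With an illustrative split of 8 full-attention layers and 40 SWA layers, a 10240-token context, 3072 SWA cells, 2048 KV elements per cell and 2 bytes per element, `kv_bytes_sketch(48, 10240, 2048, 2)` gives 4026531840 bytes, while `kv_bytes_iswa_sketch(8, 40, 10240, 3072, 2048, 2)` gives 1677721600 bytes. The saving grows with the context length, because only the full-attention layers scale with it.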
Some people recently mentioned concerns with this PR - I think caching is quite important for a subset of users who don't have GPUs and run purely on CPU. They are fine spending the initial minutes or more ingesting a large initial prompt, which they then reuse for many future turns. Generation speed itself is usable, but the inability to cache would be crippling for such users.
Both the old cache (i.e. more memory usage, but with advanced caching supported) and the new cache (less memory with just last-prefix caching) will be supported. Still figuring the implementation details - will likely be supported via a flag or a parameter. |
Thanks for all the feedback in this discussion. This branch should be ready for testing - I've listed some important use cases that need to be exercised. If something does not work, please let me know - at the moment I've done very little testing, so there could be some issues remaining. I will soon write up a detailed summary of the changes and the approach taken. And after that will add some comments to the code and open the PR for review. Regarding the parameter for controlling the size of the SWA cache - for now I haven't introduced it because some initial tests show that Gemma 3 remains coherent even when it "forgets" the local SWA cache - likely thanks to the data in the non-SWA cache. So I am thinking about giving this approach a try because it keeps the UX simple (i.e. we won't have to add new parameter and handle the use cases where context editing is not possible). If we determine that this breaks some important use cases, we can add the parameter - the |
To people who have the bandwidth to test models, FYI Cohere 2 arch includes R7B which is much smaller than Command-A |
Does this mean in the current implementation the model isn't executed correctly? |
FWIW, Gemma 3 worked better for me on main with Q8 cache quantization than on this branch + unquantized kv cache. |
@andportnoy It's evaluated correctly, as long as you don't use context shift, cache reuse or branching from old states. Do you do any of that in your tests? Can you provide a repro? Edit: Also don't change 2 things at the same time when testing. Use the same KV cache type, so we can rule out differences that are not relevant to the changes in this branch. |
Pushed tentative proposal:
Now the computation should always be exact when using The idea is this to be a temporary solution until we are able to automatically reprocess what is necessary. |
At the moment it seems that there is no way to disable SWA cache. It might be good now to add an option to disable SWA in |
I added When the flag is llama.cpp/tools/server/server.cpp Lines 3203 to 3210 in c699abc
And in the future we will try to do this automatically. In any case, if the user app does something wrong in this mode, they will get many warnings in the logs: llama.cpp/src/llama-kv-cache.cpp Lines 703 to 707 in c699abc
The In |
Something that doesn't look right to me in the current design of the KV cache is that the result of find_slot or set_full, and the data for commit and restore, is part of the state of the KV cache object itself. The result of find_slot or set_full could be returned in an object, which then could be committed or not, but these functions shouldn't change the state of the KV cache. I think that would make the code easier to understand since too much state in an object can make it very hard to keep track of what's happening.
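A minimal sketch of the shape this suggestion points at: the slot search is `const` and returns a small result object, and only an explicit `commit` mutates the cache. All names here (`slot_info_sketch`, `kv_cache_sketch`) are hypothetical and the cell bookkeeping is reduced to a used/free flag; this is not how llama.cpp implements it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result of a slot search: where a batch would be placed.
struct slot_info_sketch {
    bool     found;    // false if no suitable slot exists
    uint32_t head;     // index of the first cell of the slot
    uint32_t n_tokens; // number of consecutive cells in the slot
};

class kv_cache_sketch {
public:
    explicit kv_cache_sketch(uint32_t n_cells) : used(n_cells, false) {}

    // Const search: looks for n_tokens consecutive free cells and reports
    // them without modifying the cache.
    slot_info_sketch find_slot(uint32_t n_tokens) const {
        uint32_t run = 0;
        for (uint32_t i = 0; i < used.size(); ++i) {
            run = used[i] ? 0 : run + 1;
            if (n_tokens > 0 && run == n_tokens) {
                return slot_info_sketch{true, i + 1 - n_tokens, n_tokens};
            }
        }
        return slot_info_sketch{false, 0, 0};
    }

    // The only mutating step: apply a result the caller decided to keep.
    void commit(const slot_info_sketch & slot) {
        if (!slot.found) {
            return;
        }
        for (uint32_t i = 0; i < slot.n_tokens; ++i) {
            used[slot.head + i] = true;
        }
    }

    uint32_t n_used() const {
        uint32_t n = 0;
        for (bool u : used) {
            n += u ? 1 : 0;
        }
        return n;
    }

private:
    std::vector<bool> used;
};
```

With this split there is nothing to restore when a decode fails: the caller just drops the result object instead of committing it.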
src/llama-graph.cpp
if (wo_b) {
    //cb(cur, "kqv_wo", il);
}
I think this was already here before, but should be removed regardless.
src/llama-kv-cache.cpp
bufs.emplace_back(buf);
}

{
const size_t memory_size_k = size_k_bytes();
const size_t memory_size_v = size_v_bytes();

LLAMA_LOG_INFO("%s: KV self size = %7.2f MiB, K (%s): %7.2f MiB, V (%s): %7.2f MiB\n", __func__,
(float)(memory_size_k + memory_size_v) / (1024.0f * 1024.0f),
LLAMA_LOG_INFO("%s: size = %7.2f (%6d cells, %3d layers) MiB, K (%s): %7.2f MiB, V (%s): %7.2f MiB\n", __func__,
LLAMA_LOG_INFO("%s: size = %7.2f (%6d cells, %3d layers) MiB, K (%s): %7.2f MiB, V (%s): %7.2f MiB\n", __func__,
LLAMA_LOG_INFO("%s: size = %7.2f MiB (%6d cells, %3d layers), K (%s): %7.2f MiB, V (%s): %7.2f MiB\n", __func__,
ggml-ci
Agree. I will continue some refactoring around the KV cache (sketched a list of changes in the OP above) and will submit a set of PRs to improve the implementation. |
From my understanding, using full SWA cache should use less memory, right ? Well on my tests it doesn't.
llama_perf_sampler_print: sampling time = 17,73 ms / 10947 runs ( 0,00 ms per token, 617393,27 tokens per second)
llama_perf_context_print: load time = 1340,00 ms
llama_perf_context_print: prompt eval time = 1149,49 ms / 10715 tokens ( 0,11 ms per token, 9321,55 tokens per second)
llama_perf_context_print: eval time = 2502,93 ms / 231 runs ( 10,84 ms per token, 92,29 tokens per second)
llama_perf_sampler_print: sampling time = 6,37 ms / 10819 runs ( 0,00 ms per token, 1698430,14 tokens per second)
llama_perf_context_print: load time = 1282,71 ms
llama_perf_context_print: prompt eval time = 1271,69 ms / 10715 tokens ( 0,12 ms per token, 8425,78 tokens per second)
llama_perf_context_print: eval time = 1021,83 ms / 103 runs ( 9,92 ms per token, 100,80 tokens per second)
llama_perf_context_print: total time = 2362,03 ms / 10818 tokens
https://huggingface.co/bartowski/c4ai-command-r7b-12-2024-GGUF
Could you clarify? From the numbers you posted, the SWA cache uses 2GB less than the non-SWA (i.e. |
You've got it the other way around. Without the CLI-flag, SWA is enabled by default, so if you pass the CLI flag --swa-full it disables iSWA leading to higher memory usage. I was pretty confused about this too. Contrary to Slaren's good suggestion, the SWA is enabled by default which disables KV Cache shifting. I don't think this is a good idea. Default should be the old way with context shifting enabled, while a CLI flag like --SWA should enable iSWA for experienced users who know they would have to sacrifice context shifting for it. That would be much less confusing and more practical. |
I think the naming is just a bit confusing for non-technical people.
Oh, my bad, thanks for the quick answers ! |
Gave it a quick test spin, memory used for KV Cache decreased from 3 GB to 960 MB on Gemma 3 12B with 10K context. Awesome job!! Gemma is finally an efficient model. |
Working great, 12K context: Before: 2914.00 MiB |
For non-technical users, when would you want to use --swa-full and use the old cache? |
There seems to be some issues running Llama 4 Maverick after this change. The server crashes at below location
Full back trace
Looks like llm_build_llama_iswa in llama-model.cpp can't handle the layers where there's no experts, such as the first layer of Llama 4 Maverick. Below code passed 0 into ggml_tensor * moe_out = build_moe_ffn(ffn_inp_normed,
model.layers[il].ffn_gate_inp,
model.layers[il].ffn_up_exps,
model.layers[il].ffn_gate_exps,
model.layers[il].ffn_down_exps,
nullptr,
n_expert, n_expert_used,
LLM_FFN_SILU, false,
false, 0.0,
LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID,
il); Going back to stack frame of
|
@hjc4869 Could you propose a fix - it should be simple, but I don't have the model downloaded to do a test. Just see the logic from before this PR and apply it to the new |
I can confirm the issue with L4 Maverick. Even |
Reverting some of the changes solved my issue, though I'm not sure if all these are necessary. diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 057f1fc17..bc51602c1 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -4803,7 +4803,22 @@ struct llm_build_llama_iswa : public llm_graph_context {
ggml_tensor * ffn_inp = ggml_add(ctx0, cur, inpSA);
cb(ffn_inp, "ffn_inp", il);
- {
+ // feed-forward network (non-MoE)
+ if (model.layers[il].ffn_gate_inp == nullptr) {
+ cur = build_norm(ffn_inp,
+ model.layers[il].ffn_norm, NULL,
+ LLM_NORM_RMS, il);
+ cb(cur, "ffn_norm", il);
+
+ cur = build_ffn(cur,
+ model.layers[il].ffn_up, model.layers[il].ffn_up_b, NULL,
+ model.layers[il].ffn_gate, model.layers[il].ffn_gate_b, NULL,
+ model.layers[il].ffn_down, model.layers[il].ffn_down_b, NULL,
+ NULL,
+ LLM_FFN_SILU, LLM_FFN_PAR, il);
+ cb(cur, "ffn_out", il);
+
+ } else if (arch == LLM_ARCH_LLAMA4) {
// llama4 MoE
ggml_tensor * ffn_inp_normed = build_norm(ffn_inp,
model.layers[il].ffn_norm, NULL,
@@ -4833,6 +4848,25 @@ struct llm_build_llama_iswa : public llm_graph_context {
cur = ggml_add(ctx0, moe_out, shexp_out);
cb(cur, "ffn_moe_out_merged", il);
+ } else {
+ // MoE branch
+ cur = build_norm(ffn_inp,
+ model.layers[il].ffn_norm, NULL,
+ LLM_NORM_RMS, il);
+ cb(cur, "ffn_norm", il);
+
+ cur = build_moe_ffn(cur,
+ model.layers[il].ffn_gate_inp,
+ model.layers[il].ffn_up_exps,
+ model.layers[il].ffn_gate_exps,
+ model.layers[il].ffn_down_exps,
+ nullptr,
+ n_expert, n_expert_used,
+ LLM_FFN_SILU, true,
+ false, 0.0,
+ LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
+ il);
+ cb(cur, "ffn_moe_out", il);
}
cur = ggml_add(ctx0, cur, ffn_inp); |
Can you confirm that #13663 fixes it? |
Yes I can confirm this one fixed my issue. |
Simple question: will these changes come with sample code, like "passkeys" or "llama-cli", that shows the improvements of using SWA and the changes in code (e.g. KV updates)? Also, SWA stands for "Sliding Window Attention", right? Do you have a reference paper to learn more about it?
Overview
Add class llama_kv_cache_unified_iswa for interleaved SWA attention support.

The implementation internally utilizes 2 instances of the existing llama_kv_cache_unified - one for the non-SWA and one for the SWA layers of the model. To achieve that, the llama_kv_cache_unified implementation is updated to be able to cache a subset of the model's layers (instead of always caching all layers as it is on master). The 2 internal caches behave almost in exactly the same way with 2 main differences: [...]

The size of the SWA cache is computed as: [...]

This way we can store the cache data for the last n_swa tokens for all sequences and we also have room to evaluate a new batch of tokens with size up to n_batch.

Note that advanced cache operations such as removing tokens or shifting their positions are not possible when using SWA cache, because token information becomes lost when the window slides. For such cases, we can "fallback" to the old implementation by expanding the SWA cache size to the full context and disabling the SWA token pruning. This of course would lead to more memory usage. See the swa_full flag for more info.

The new llama_kv_cache_unified_iswa can be used for non-SWA models with n_swa = n_ctx_train.
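The sizing rule can be sketched as follows. The exact formula is missing from the text above, so this sketch assumes it from the description: room for the last n_swa tokens of each of n_seq_max sequences plus one batch of up to n_batch tokens. The rounding to a multiple of `pad`, the clamp to n_ctx and the function names are further assumptions, not taken from the llama.cpp source.

```cpp
#include <cstdint>

// Round x up to the next multiple of pad (pad > 0).
inline uint32_t pad_up(uint32_t x, uint32_t pad) {
    return ((x + pad - 1) / pad) * pad;
}

// Assumed sizing rule for the SWA cache:
//  - swa_full:  fall back to the full context size (no pruning)
//  - otherwise: n_swa tokens per sequence plus one batch, padded and
//               never larger than the full context
inline uint32_t swa_cache_size_sketch(uint32_t n_ctx, uint32_t n_swa,
                                      uint32_t n_seq_max, uint32_t n_batch,
                                      uint32_t pad, bool swa_full) {
    if (swa_full) {
        return n_ctx;
    }
    const uint32_t want = pad_up(n_swa*n_seq_max + n_batch, pad);
    return want < n_ctx ? want : n_ctx;
}
```

For example, with n_ctx = 32768, n_swa = 1024, a single sequence, n_batch = 2048 and a padding of 256, this gives 3072 cells for the SWA layers instead of 32768.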
Main changes
- [...] llama-graph to llama-kv-cache
- [...] llama-graph to llama-kv-cache
- [...] build_attn_mha() are now not permuted
- [...] enum hparams.swa_type to support chunked and non-chunked SWA (remove hparams.n_attn_chunk)
- [...] class llama_kv_cache_unified_iswa - new iSWA cache that internally utilizes 2 standard llama_kv_cache_unified instances
- [...] llama_kv_cache_unified implementation more private and polish the interface
- [...] llm_build_llama_iswa()
- llama-server now respects llama_kv_self_can_shift(ctx)
- llama_decode now attempts to do a defrag if it fails to fit the input batch in the cache
- llama_decode now correctly restores the cache state in all cases
- [...] --swa-full
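For the chunked and non-chunked SWA types mentioned in the list above, the attention mask could be decided along these lines. This is a sketch of one common reading of the two window shapes, not code from this PR: the enum, the function name and the explicit causal check are assumptions.

```cpp
#include <cstdint>

// Hypothetical stand-in for the SWA type stored in the hparams.
enum swa_type_sketch {
    SWA_NONE,     // full attention
    SWA_STANDARD, // window of the last n_swa positions
    SWA_CHUNKED,  // attention restricted to the current chunk of n_swa positions
};

// Returns true if the key at position p0 must be hidden from the query at
// position p1 (n_swa > 0 is expected for the two SWA types).
inline bool is_masked_swa_sketch(swa_type_sketch type, int32_t n_swa,
                                 int32_t p0, int32_t p1) {
    if (p0 > p1) {
        return true; // causal: never attend to the future
    }
    switch (type) {
        case SWA_NONE:
            return false;
        case SWA_STANDARD:
            // the window covers positions (p1 - n_swa, p1]
            return p1 - p0 >= n_swa;
        case SWA_CHUNKED:
            // the window starts at the beginning of the chunk containing p1
            return p0 < (p1 / n_swa) * n_swa;
    }
    return false;
}
```

Any cell that is masked for every future query of its sequence is a candidate for removal from the SWA cache, which is what ties the mask shape to the pruning logic.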
API changes
- llama_context_params - add bool swa_full
TODO
- [...] llama_kv_cache_unified_iswa::commit()
- [...] n_seq_max and n_batch to the KV cache and utilize it to determine SWA cache size
- [...] llama-server check for llama_kv_self_can_shift
Testing
Any help with testing the following scenarios and reporting the results is highly appreciated:
Next PRs
- [...] llama_kv_cache_view API (not useful, can be replaced with internal debugging functions)
- [...] struct kv_cells and simplify logic with modifying the cells
- [...] llama_kv_cache logic to allow SWA cache with size n_swa + n_ubatch
- [...] llama_decode distinguish return code when we are sure that even after defrag there is no space available
- [...] llama_context_params
- [...] llama_kv_cache::set_full()
- [...] llama_kv_cache to not maintain the batching state (kv-cache : add SWA support #13194 (review))
- [...] template <bool SWA> llm_build_llama()
outdated
This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention (SWA) in order to reduce the memory usage for models such as Gemma3.
However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.
The reason we cannot do context caching with SWA enabled is because when the window slides, we "forget" the old KV stuff and there is no way to recover it without recomputing it. This means, no prefix cache in llama-server (ok, just last-prefix caching works), no context shift, no context reuse, etc. So I am having some doubts if this is really worth supporting.

Any thoughts?
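Why only last-prefix caching survives can be illustrated with a small check. The sketch assumes a standard (non-chunked) window in which a query at position p sees keys at positions p - n_swa + 1 through p, and it assumes the caller knows the smallest position still present in the SWA cache (`pos_min_swa`). The helper name and its all-or-nothing policy are hypothetical, not the server's actual logic.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: a new prompt shares its first n_common tokens with the
// cached sequence. Returns how many of those tokens can be reused, given that
// the SWA cache only holds positions >= pos_min_swa.
inline int32_t n_reusable_prefix_sketch(int32_t n_common, int32_t n_swa,
                                        int32_t pos_min_swa) {
    // The first new token sits at position n_common; in the SWA layers it
    // needs every key from this position onwards to still be cached.
    const int32_t first_needed = std::max<int32_t>(0, n_common - n_swa + 1);

    // If the window has already slid past first_needed, the missing keys can
    // only be recovered by recomputing the prompt from the start.
    return pos_min_swa <= first_needed ? n_common : 0;
}
```

With n_swa = 1024 and a cached sequence of 5000 tokens (so the SWA cache starts at position 3976), continuing from the full 5000-token prefix is possible, while branching from token 2000 is not and the prompt has to be reprocessed.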