kv-cache : add SWA support #13194

ggerganov · 2025-04-29T18:08:56Z

Overview

Add class llama_kv_cache_unified_iswa for interleaved SWA attention support.

The implementation internally utilizes 2 instances of the existing llama_kv_cache_unified - one for the non-SWA and one for the SWA layers of the model. To achieve that, the llama_kv_cache_unified implementation is updated to be able to cache a subset of the model's layers (instead of always caching all layers as it is on master). The 2 internal caches behave almost in exactly the same way with 2 main differences:

The SWA cache is much smaller
The SWA cache automatically "forgets/prunes" old tokens upon successful commit (i.e. successful batch decode)

The size of the SWA cache is computed as:

PAD(n_swa*n_seq_max + n_batch)

This way we can store the cache data for the last n_swa tokens for all sequences and we also have room to evaluate a new batch of tokens with size up to n_batch.

Note that advanced cache operations such as removing tokens or shifting their positions are not possible when using SWA cache, because token information becomes lost when the window slides. For such cases, we can "fallback" to the old implementation by expanding the SWA cache size to the full context and disabling the SWA token pruning. This of course would lead to more memory usage. See the swa_full flag for more info.

The new llama_kv_cache_unified_iswa can be used for non-SWA models with n_swa = n_ctx_train.

Main changes

Move KV cache store and view logic from llama-graph to llama-kv-cache
Move KV cache mask creation logic from llama-graph to llama-kv-cache
The inputs to build_attn_mha() are now not permuted

The QKV self-attention code is now more harmonious:

  const llama_kv_cache_unified * kv_self = static_cast<const llama_kv_cache_unified *>(memory);

  // store to KV cache
  {
      ggml_build_forward_expand(gf, kv_self->cpy_k(ctx0, k_cur, il));
      ggml_build_forward_expand(gf, kv_self->cpy_v(ctx0, v_cur, il));
  }

  const auto & kq_mask = inp->get_kq_mask();

  ggml_tensor * q = q_cur;
  ggml_tensor * k = kv_self->get_k(ctx0, il);
  ggml_tensor * v = kv_self->get_v(ctx0, il);

  ggml_tensor * cur = build_attn_mha(gf, q, k, v, kq_b, kq_mask, v_mla, kq_scale);
  cb(cur, "kqv_out", il);

Add enum hparams.swa_type to support chunked and non-chunked SWA (remove hparams.n_attn_chunk)
Add class llama_kv_cache_unified_iswa - new iSWA cache that internally utilizes 2 standard llama_kv_cache_unified instances
Make the llama_kv_cache_unified implementation more private and polish the interface
Move the Llama 4 build function to a new llm_build_llama_iswa()
llama-server now respects llama_kv_self_can_shift(ctx)
The llama_decode now attempts to do a defrag if it fails to fit the input batch in the cache
The llama_decode now correctly restores the cache state in all cases
Examples can fallback to full-size SWA cache with --swa-full

API changes

Update llama_context_params - add bool swa_full

TODO

Cut-off old SWA tokens in llama_kv_cache_unified_iswa::commit()
Pass n_seq_max and n_batch to the KV cache and utilize it to determine SWA cache size
Allow KV shift when SWA window size is big enough
Add limits to batch size based on SWA window
llama-server check for llama_kv_self_can_shift
Add context parameter for switching between small and large SWA cache (kv-cache : add SWA support #13194 (comment))

Testing

Any help with testing the following scenarios and reporting the results are highly appreciated:

Next PRs

outdated

This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention (SWA) in order to reduce the memory usage for models such as Gemma3.

However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

The reason we cannot do context caching with SWA enabled is because when the window slides, we "forget" the old KV stuff and there is no way to recover it without recomputing it. This means, no prefix cache in llama-server (ok, just last-prefix caching works), no context shift, no context reuse, etc. So I am having some doubts if this is really worth supporting.

Any thoughts?

slaren · 2025-04-29T18:14:58Z

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

ngxson · 2025-04-29T18:40:53Z

However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

Yes this is what I was thinking about for months now. There is no better solution than to disable context caching in this case.

An alternative solution is to allow user to choose one of the 2: either a proper SWA cache (good for memory) or allocate full (good for reusing cache)

So I am having some doubts if this is really worth supporting.

I'm feeling 50/50 here. One of the biggest use case would be to process large and diverse set of documents locally. In this case, user may never reuse the cache because each new request is a new document

ggerganov · 2025-04-29T19:19:06Z

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

The way I am approaching it is to have the "KV cells" information maintained separately for the non-SWA and SWA layers. This way, upon each KV cache commit (see #12799), we can do a pass over the SWA cells and automatically remove those that have position pos < pos_max(seq_id) - n_swa. Note that such tokens are only pruned from the SWA cells, while they remain in the non-SWA cells. When constructing the KQ mask for the graph, we use the non-SWA cells to construct the kq_mask and the SWA cells to construct the kq_mask_swa.

The rest of the logic is the same - it just operates on both set of cells. For example, find_slot searches in both the non-SWA and SWA cells.

JohannesGaessler · 2025-04-29T20:54:59Z

My experience with the Gemma models in the context of Elo HeLLM has been that they required a disproportionate amount of computational resources to run benchmarks. The reason is that I was able to fit comparatively fewer parallel slots on 1 or 2 GPUs and my throughput was lower as a consequence. At least for my use case I value low memory usage for the context more than I value prompt caching because I have O(10000) short prompts and I'm bottlenecked mostly by generation throughput.

ggerganov · 2025-04-30T05:18:31Z

Continuing thinking about the logic for when to discard tokens from the cache, it's indeed tricky and not very clear how to do. For example, when doing speculative decoding, we can submit a draft batch with D tokens to the target model. If we apply the pruning logic from my previous comment strictly, then this would cause to "forget" D-1 of the oldest tokens in the SWA layers, which depending if the draft gets rejected would be problematic. This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.

ymcki · 2025-04-30T07:40:25Z

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

I second slaren's opinion. As far as I know, vllm also doesn't support iSWA while hf transformers and ollama does. vllm is geared toward multi-user server use case. I suppose that's why they don't support it.

Ideally, it should be implemented as a switch to let user choose which one to use. By default, iSWA should be on for llama-cli but off for llama-server.

ngxson · 2025-04-30T09:35:20Z

This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.

Yes I was thinking about this too, I think it can be a bit complicated to manage this case, but totally possible.

We can let user specify how many tokens are allocated in the sliding layers. For example, given n_swa=512, if llama_context is created with n_ctx=4096 and n_ctx_swa=1024, this will allow user to rollback until n_past - (1024 - 512)

We can further let n_ctx_swa = n_ctx * scale by default to make it transparent to end-user, with scale=0.5 by default for example. If scale=-1 then n_ctx_swa=n_swa

And finally, we may need to add an API to return the furthest n_past that user can rollback to, maybe something like llama_kv_self_get_minimum_pos ?

isaac-mcfadyen · 2025-04-30T14:53:02Z

I'd +1 the ability to allow the user to switch.

Some use-cases benefit greatly from the prefix caching (example: on Metal systems with 48GB of RAM/VRAM, where pp is much slower than non-Metal pp and we have plenty of VRAM anyway) so allowing the user to choose would be optimal.

ExtReMLapin · 2025-05-01T04:15:46Z

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

Is llama.cpp single user mode the most used case because that’s what the user base prefer or is it like that because the server performance goes down a lot with more than 3 users ? (#10860 )

We are really thankful of all the work you main contributors do on this project, but please do not fall in this « self-fulfilling prophecy » trap.

aviallon · 2025-05-01T07:36:28Z

I personally use llama.cpp for server use (with multiple users).
I wonder if we could do something hybrid between iSWA and what is currently done.
I wonder if partial kV cache offload could work, with iSWA on the accelerator, and slower cache on RAM.

Dampfinchen · 2025-05-03T12:45:51Z

According to the Gemma3 paper, interleaved Sliding Window Attention reduces KV Cache memory usage by 1/5, so it would be much easier to run as right now KV Cache size is much heavier than comparable models.

If the drawback is the absence of prompt caching, then indeed it would make sense to give the option to the user and let them decide on a per use case basis. I think for cases where you use RAG/Vector DB it would prove to be very useful as prompt caching does not work when beginning of the context changes anyway. I would personally agree with Johannes here, faster token generation thanks to SWA would be more useful for me as well since I'm using vector DB.

So for the use cases short prompts/RAG it would make a lot of sense. For simple chat use cases without any RAG, prompt caching would probably make it faster overall compared to SWA and no prompt cache. Overall, I think having the option would be a great addition to llama.cpp.

If it helps, Ollama implemented iSWA support for Gemma 3, since the project is pretty similar to llama.cpp, perhaps it's useful to get a rough idea on how to implement it (although Ollama is a different coding language): https://github.com/ollama/ollama/blob/2fec73eef6e9482f606f185ebb2ae4f75ad1a37c/model/models/gemma3/model_text.go#L190

I've been thinking, does Ollama support prompt caching? Since Gemma 3 SWA is supported in Ollama, how did they handle it?

LostRuins · 2025-05-10T00:30:05Z

Some people recently mentioned concerns with this PR - I think caching is quite important for a subset of users who don't have GPUs and run purely CPU only.

They are fine spending initial minutes or more ingesting a large initial prompts which they then reuse for many future turns - generation speed itself is usable, but the inability to cache would be crippling for such users.

ggerganov · 2025-05-10T04:29:42Z

Both the old cache (i.e. more memory usage, but with advanced caching supported) and the new cache (less memory with just last-prefix caching) will be supported. Still figuring the implementation details - will likely be supported via a flag or a parameter.

ggerganov · 2025-05-11T16:12:17Z

Thanks for all the feedback in this discussion. This branch should be ready for testing - I've listed some important use cases that need to be exercised. If something does not work, please let me know - at the moment I've done very little testing, so there could be some issues remaining.

I will soon write up a detailed summary of the changes and the approach taken. And after that will add some comments to the code and open the PR for review.

Regarding the parameter for controlling the size of the SWA cache - for now I haven't introduced it because some initial tests show that Gemma 3 remains coherent even when it "forgets" the local SWA cache - likely thanks to the data in the non-SWA cache. So I am thinking about giving this approach a try because it keeps the UX simple (i.e. we won't have to add new parameter and handle the use cases where context editing is not possible). If we determine that this breaks some important use cases, we can add the parameter - the libllama change is simple and the behavior would basically fallback to what currently happens on master.

ExtReMLapin · 2025-05-11T17:23:52Z

To people who have the bandwidth to test models, FYI Cohere 2 arch includes R7B which is much smaller than Command-A

andportnoy · 2025-05-11T18:09:19Z

for now I haven't introduced it because some initial tests show that Gemma 3 remains coherent even when it "forgets" the local SWA cache

Does this mean in the current implementation the model isn't executed correctly?

andportnoy · 2025-05-11T18:45:42Z

FWIW, Gemma 3 worked better for me on main with Q8 cache quantization than on this branch + unquantized kv cache.

ggerganov · 2025-05-11T18:53:53Z

@andportnoy It's evaluated correctly, as long as you don't use context shift, cache reuse or branching from old states. Do you do any of that in your tests? Can you provide a repro?

Edit: Also don't change 2 things at the same time when testing. Use the same KV cache type, so we can rule out differences that are not relevant to the changes in this branch.

ngxson · 2025-05-20T08:23:10Z

I think it's just the naming a bit confused for non-technical people. --swa-full means "allocate full memory for SWA" but it could be misinterpreted as "use SWA for all layers". I think we can add a simpler an alias like -no-swa

ExtReMLapin · 2025-05-20T08:42:17Z

Oh, my bad, thanks for the quick answers !

Dampfinchen · 2025-05-20T08:58:16Z

Gave it a quick test spin, memory used for KV Cache decreased from 3 GB to 960 MB on Gemma 3 12B with 10K context. Awesome job!! Gemma is finally an efficient model.

RodriMora · 2025-05-20T11:33:09Z

Working great, 12K context:

Before: 2914.00 MiB
After: 624.00 MiB

chigkim · 2025-05-20T11:58:25Z

For non-technical users, when would you want to use --swa-full and use the old cache?
Could someone provide any example use case for why someone might want to go back?
Thanks!

hjc4869 · 2025-05-20T14:39:22Z

There seems to be some issues running Llama 4 Maverick after this change. The server crashes at below location

Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
ggml_can_mul_mat (t0=0x0, t1=0x55556ba0bb00) at /home/david/Development/llama.cpp/ggml/src/ggml.c:2728
2728        return (t0->ne[0]           == t1->ne[0])  &&

Full back trace

(gdb) bt
#0  ggml_can_mul_mat (t0=0x0, t1=0x55556ba0bb00) at /home/david/Development/llama.cpp/ggml/src/ggml.c:2728
#1  ggml_mul_mat (ctx=0x555568b84db0, a=0x0, b=0x55556ba0bb00) at /home/david/Development/llama.cpp/ggml/src/ggml.c:2737
#2  0x00007ffff7de7fd5 in llm_graph_context::build_lora_mm (this=0x555569275780, w=0x0, cur=0x55556ba0bb00)
    at /home/david/Development/llama.cpp/src/llama-graph.cpp:476
#3  0x00007ffff7de8a9c in llm_graph_context::build_moe_ffn (this=0x555568b84db0, cur=0x55556ba0bb00, gate_inp=0x0, up_exps=0x0, gate_exps=0x0, 
    down_exps=0x0, exp_probs_b=0x0, n_expert=128, n_expert_used=1, type_op=LLM_FFN_SILU, norm_w=<optimized out>, scale_w=<optimized out>, 
    w_scale=<error reading variable: Value cannot be represented as integer of 8 bytes.>, gating_op=LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID, il=0)
    at /home/david/Development/llama.cpp/src/llama-graph.cpp:709
#4  0x00007ffff7e3769e in llm_build_llama_iswa::llm_build_llama_iswa (this=0x555569275780, model=..., params=..., gf=0x55556b804a30)
    at /home/david/Development/llama.cpp/src/llama-model.cpp:4813
#5  0x00007ffff7e31883 in std::make_unique<llm_build_llama_iswa, llama_model const&, llm_graph_params const&, ggml_cgraph*&> (
    __args=@0x7fffffff8d28: 0x55556b804a30, __args=@0x7fffffff8d28: 0x55556b804a30, __args=@0x7fffffff8d28: 0x55556b804a30)
    at /usr/lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/unique_ptr.h:1077
#6  0x00007ffff7e3031e in llama_model::build_graph (this=0x55555ab14a20, params=..., gf=0x55556b804a30, type=LLM_GRAPH_TYPE_DEFAULT)
    at /home/david/Development/llama.cpp/src/llama-model.cpp:13267
#7  0x00007ffff7dbf776 in llama_context::graph_build (this=this@entry=0x555569974ca0, ctx=<optimized out>, gf=0x55556ba0bb00, ubatch=..., 
    gtype=gtype@entry=LLM_GRAPH_TYPE_DEFAULT) at /home/david/Development/llama.cpp/src/llama-context.cpp:1240
#8  0x00007ffff7dbebbb in llama_context::llama_context (this=0x555569974ca0, model=..., params=...)
    at /home/david/Development/llama.cpp/src/llama-context.cpp:292
#9  0x00007ffff7dc5031 in llama_init_from_model (model=0x55555ab14a20, params=...) at /home/david/Development/llama.cpp/src/llama-context.cpp:2131
#10 0x00005555557a1422 in common_init_from_params (params=...) at /home/david/Development/llama.cpp/common/common.cpp:925
#11 0x0000555555611211 in server_context::load_model (this=this@entry=0x7fffffffc010, params=...)
    at /home/david/Development/llama.cpp/tools/server/server.cpp:1912
#12 0x00005555555e849d in main (argc=<optimized out>, argv=<optimized out>) at /home/david/Development/llama.cpp/tools/server/server.cpp:4820

Looks like llm_build_llama_iswa in llama-model.cpp can't handle the layers where there's no experts, such as the first layer of Llama 4 Maverick. Below code passed 0 into build_moe_ffn

                ggml_tensor * moe_out = build_moe_ffn(ffn_inp_normed,
                        model.layers[il].ffn_gate_inp,
                        model.layers[il].ffn_up_exps,
                        model.layers[il].ffn_gate_exps,
                        model.layers[il].ffn_down_exps,
                        nullptr,
                        n_expert, n_expert_used,
                        LLM_FFN_SILU, false,
                        false, 0.0,
                        LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID,
                        il);

Going back to stack frame of llm_build_llama_iswa:

(gdb) print il
$1 = 0

ggerganov · 2025-05-20T14:48:30Z

@hjc4869 Could you propose a fix - it should be simple, but I don't have the model downloaded to do a test. Just see the logic from before this PR and apply it to the new llm_build_llama_iswa().

stduhpf · 2025-05-20T15:04:25Z

I can confirm the issue with L4 Maverick. Even --swa-full isn't enough to work around it sadly.

hjc4869 · 2025-05-20T15:08:49Z

Reverting some of the changes solved my issue, though I'm not sure if all these are necessary.

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 057f1fc17..bc51602c1 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -4803,7 +4803,22 @@ struct llm_build_llama_iswa : public llm_graph_context {
             ggml_tensor * ffn_inp = ggml_add(ctx0, cur, inpSA);
             cb(ffn_inp, "ffn_inp", il);
 
-            {
+            // feed-forward network (non-MoE)
+            if (model.layers[il].ffn_gate_inp == nullptr) {
+                cur = build_norm(ffn_inp,
+                        model.layers[il].ffn_norm, NULL,
+                        LLM_NORM_RMS, il);
+                cb(cur, "ffn_norm", il);
+
+                cur = build_ffn(cur,
+                        model.layers[il].ffn_up,   model.layers[il].ffn_up_b,   NULL,
+                        model.layers[il].ffn_gate, model.layers[il].ffn_gate_b, NULL,
+                        model.layers[il].ffn_down, model.layers[il].ffn_down_b, NULL,
+                        NULL,
+                        LLM_FFN_SILU, LLM_FFN_PAR, il);
+                cb(cur, "ffn_out", il);
+
+            } else if (arch == LLM_ARCH_LLAMA4) {
                 // llama4 MoE
                 ggml_tensor * ffn_inp_normed = build_norm(ffn_inp,
                         model.layers[il].ffn_norm, NULL,
@@ -4833,6 +4848,25 @@ struct llm_build_llama_iswa : public llm_graph_context {
 
                 cur = ggml_add(ctx0, moe_out, shexp_out);
                 cb(cur, "ffn_moe_out_merged", il);
+            } else {
+                // MoE branch
+                cur = build_norm(ffn_inp,
+                        model.layers[il].ffn_norm, NULL,
+                        LLM_NORM_RMS, il);
+                cb(cur, "ffn_norm", il);
+
+                cur = build_moe_ffn(cur,
+                        model.layers[il].ffn_gate_inp,
+                        model.layers[il].ffn_up_exps,
+                        model.layers[il].ffn_gate_exps,
+                        model.layers[il].ffn_down_exps,
+                        nullptr,
+                        n_expert, n_expert_used,
+                        LLM_FFN_SILU, true,
+                        false, 0.0,
+                        LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
+                        il);
+                cb(cur, "ffn_moe_out", il);
             }
 
             cur = ggml_add(ctx0, cur, ffn_inp);

ggerganov · 2025-05-20T16:08:47Z

Can you confirm that #13663 fixes it?

hjc4869 · 2025-05-20T16:16:22Z

Yes I can confirm this one fixed my issue.

rhvall · 2025-05-21T06:58:13Z

Simple question: These changes will have a sample code like "passkeys" or "llama-cli" that shows the improvements of using SWI and the changes in code (ex. kv updates)?

Also, SWI stands for "Sliding Window Attention", right?? Do you have a reference paper to learn mode about it.

* kv-cache : prepare for SWA ggml-ci * kv-cache : initial iSWA implementation ggml-ci * kv-cache : rework error recovery logic ggml-ci * models : fix Phi-3 SWA parameters ggml-ci * model : adjust Granite to rope factor changes ggml-ci * server : check if context can do shifts ggml-ci * iswa : for now, always enable shifts (experiment) ggml-ci * kv-cache : simplify SWA logic ggml-ci * kv-cache : apply defrag when we fail to find slots for the batch ggml-ci * llama : update docs about llama_decode ggml-ci * kv-cache : update warning logs when no space for the batch is available ggml-ci * llama : add llama_kv_self_seq_pos_min() * kv-cache : keep track of partial SWA computes and print warnings * server : disallow use cases involving partial SWA context ggml-ci * llama : add param to control SWA cache size ggml-ci * minor : clean-up ggml-ci

k4ss4n · 2025-06-07T01:23:25Z

I'm currently debugging a minimal example of model load in c++ and unless

ctx_params.swa_full = false;

is set I always get

ggml_backend_cuda_buffer_type_alloc_buffer: allocating [*exact* size of available VRAM] MiB on device 0: cudaMalloc failed: out of memory

no matter context size, gpu layers etc. When forced to cpu only it runs, but still allocates [exact size of available VRAM] RAM(!) and way more than running with llama-server.

I guessed the reason after comparing my apps output with the llama-server output that runs fine with the same parameters(context size, gpu layers etc) as my app uses. The different output included a link to this topic. I'm using a HIP/ROCm enabled linux environment. I don't know how to debug it and wanted to leave this in case someone else has the same problem. Please disregard if I'm missing something obvious.

ddh0 · 2025-06-07T02:00:29Z

@k4ss4n you should probably open a new issue, and provide the code you're using as well as the full log output. feel free to ping me and I can take a look to check for anything obviously wrong.

k4ss4n · 2025-06-07T02:31:40Z

@ddh0

        llama_context_params ctx_params = llama_context_default_params();

        ctx_params.n_ctx = 50000;
        ctx_params.n_threads = 12;
        ctx_params.rope_freq_scale = 0.9;
        ctx_params.swa_full = false; //otherwise allocates all VRAM in a single block

        llama_model_params model_params = llama_model_default_params();
        model_params.n_gpu_layers = 50;

        m_model = llama_model_load_from_file(<MODEL PATH>, model_params);
        m_llama_context = llama_init_from_model(m_model, ctx_params);

This works as long as ctx_param.swa_full = false on my end. If removed it shows the described behavior. Thank you for taking a look.

edit: I suspect it might be connected to the model in use which is https://huggingface.co/unsloth/gemma-3-12b-it-GGUF Q4_K_M quantization and its default params?

ddh0 · 2025-06-07T03:01:11Z

You should probably open a new issue for this

ggerganov force-pushed the gg/llama-kv-cache-v6 branch from e37f112 to 7e4b545 Compare April 30, 2025 07:22

compilade mentioned this pull request May 1, 2025

llama : support Jamba hybrid Transformer-Mamba models #7531

Draft

17 tasks

ggerganov force-pushed the gg/llama-kv-cache-v6 branch 2 times, most recently from 58115a2 to 7e79a42 Compare May 2, 2025 13:02

Base automatically changed from gg/llama-kv-cache-v6 to master May 2, 2025 14:48

ggerganov force-pushed the gg/swa branch 3 times, most recently from 1c69466 to 1e10743 Compare May 9, 2025 12:15

ggerganov force-pushed the gg/swa branch from 9438c70 to 4feadaa Compare May 10, 2025 15:48

ggerganov mentioned this pull request May 11, 2025

llama : Support llama 4 text-only #12791

Merged

3 tasks

github-actions bot added examples server labels May 11, 2025

This was referenced May 20, 2025

changelog : libllama API #9289

Open

llama : remove llama_kv_cache_view API + remove deprecated #13653

Merged

ggerganov mentioned this pull request May 20, 2025

model : fix llama4 graph #13663

Merged

xunjieliu mentioned this pull request May 21, 2025

Reddit News Daily 2025-05-21 xunjieliu/reddit-daily-news#81

Open

rhvall mentioned this pull request May 21, 2025

hparams : support models for which all layers use SWA #13682

Merged

This was referenced May 22, 2025

kv-cache : rework kv_cell #13706

Merged

Misc. bug: Eval bug: Repetitive Output After Certain Token Count When Using -np > 1 in llama.cpp (Ver. b5468) #13733

Closed

0cc4m mentioned this pull request May 24, 2025

Move GLM4 f32 attention fix to the correct function #13750

Merged

ggerganov mentioned this pull request May 25, 2025

kv-cache : refactor + add llama_memory_state_i #13746

Merged

5 tasks

spliznork mentioned this pull request Jun 2, 2025

Misc. bug: Using draft model with Gemma producing error "get_logits_ith: invalid logits id 0" #13963

Closed

k4ss4n mentioned this pull request Jun 11, 2025

Misc. bug: "llama_context_params::swa_full = true" causes very large RAM/VRAM usage #14123

Open

vbooka1 mentioned this pull request Jun 19, 2025

Eval bug: Command-A forces full-prompt re-processing due to lack of cache data #14157

Closed

ggerganov mentioned this pull request Jun 22, 2025

examples : fix is_first logic for tokenization #14329

Merged

kv-cache : add SWA support #13194

kv-cache : add SWA support #13194

Conversation

ggerganov commented Apr 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Main changes

API changes

TODO

Testing

Next PRs

Uh oh!

slaren commented Apr 29, 2025

Uh oh!

ngxson commented Apr 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov commented Apr 29, 2025

Uh oh!

JohannesGaessler commented Apr 29, 2025

Uh oh!

ggerganov commented Apr 30, 2025

Uh oh!

ymcki commented Apr 30, 2025

Uh oh!

ngxson commented Apr 30, 2025

Uh oh!

isaac-mcfadyen commented Apr 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ExtReMLapin commented May 1, 2025

Uh oh!

aviallon commented May 1, 2025

Uh oh!

Dampfinchen commented May 3, 2025

Uh oh!

LostRuins commented May 10, 2025

Uh oh!

ggerganov commented May 10, 2025

Uh oh!

ggerganov commented May 11, 2025

Uh oh!

ExtReMLapin commented May 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andportnoy commented May 11, 2025

Uh oh!

andportnoy commented May 11, 2025

Uh oh!

ggerganov commented May 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ExtReMLapin commented May 20, 2025

Uh oh!

Dampfinchen commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RodriMora commented May 20, 2025

Uh oh!

chigkim commented May 20, 2025

Uh oh!

hjc4869 commented May 20, 2025

Uh oh!

ggerganov commented May 20, 2025

Uh oh!

stduhpf commented May 20, 2025

Uh oh!

hjc4869 commented May 20, 2025

Uh oh!

ggerganov commented May 20, 2025

Uh oh!

hjc4869 commented May 20, 2025

Uh oh!

rhvall commented May 21, 2025

Uh oh!

k4ss4n commented Jun 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ggerganov commented Apr 29, 2025 •

edited

Loading

ngxson commented Apr 29, 2025 •

edited

Loading

isaac-mcfadyen commented Apr 30, 2025 •

edited

Loading

ExtReMLapin commented May 11, 2025 •

edited

Loading

ggerganov commented May 11, 2025 •

edited

Loading

ngxson commented May 20, 2025 •

edited

Loading

Dampfinchen commented May 20, 2025 •

edited

Loading

k4ss4n commented Jun 7, 2025 •

edited

Loading

k4ss4n commented Jun 7, 2025 •

edited

Loading