context : allow cache-less context for embeddings #13108
Conversation
Force-pushed from 58115a2 to 7e79a42.
I'll work on rebasing and merging this next - it should be a good improvement for embedding models by reducing the allocated memory during inference.
Force-pushed from 4f0ea9b to c14ee72.
Thanks for your awesome work, Georgi.
    // extract the rerank score - a single float per sequence
    auto & embd_seq_out = embd_seq;

    for (uint32_t s = 0; s < ubatch.n_seqs; ++s) {
        const llama_seq_id seq_id = ubatch.seq_id[s][0];
        if (embd_seq_out.find(seq_id) != embd_seq_out.end()) {
            continue;
        }
        embd_seq_out[seq_id].resize(1);
        ggml_backend_tensor_get_async(backend_embd, t_embd, embd_seq_out[seq_id].data(), (seq_id)*sizeof(float), sizeof(float));
    }
} break;
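For reference, the per-sequence score written here surfaces on the caller side through the regular embeddings API. A minimal, hedged read-back sketch follows; the helper get_rerank_score is hypothetical, and the exact accessor behavior should be checked against the llama.h in use:

```cpp
#include "llama.h"

// Sketch only: after llama_encode()/llama_decode() with rank pooling,
// the rerank score is exposed as a 1-element sequence "embedding".
static float get_rerank_score(llama_context * ctx, llama_seq_id seq_id) {
    const float * out = llama_get_embeddings_seq(ctx, seq_id);
    return out != NULL ? out[0] : 0.0f;
}
```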
@ggerganov shouldn't you include a documentation change stating that rank pooling is now supported?
It was already supported before this change when using llama_decode(). Now llama_encode() also supports it. Not sure we need to document - do you have something in mind where to add docs about this?
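For readers following along, here is a minimal, hedged sketch of a rank-pooling context setup. The helper make_rerank_ctx is hypothetical; the field and function names (cparams.embeddings, LLAMA_POOLING_TYPE_RANK, llama_init_from_model) follow the llama.h of this period and may differ across versions, and the query/document prompt format expected by a particular reranker model is not shown.

```cpp
#include "llama.h"

// Hedged sketch: create a context configured for reranking.
// With rank pooling, llama_encode() (and, before this PR, llama_decode())
// produces a single relevance score per sequence instead of token logits.
static llama_context * make_rerank_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                     // request pooled outputs
    cparams.pooling_type = LLAMA_POOLING_TYPE_RANK;  // one score per sequence
    return llama_init_from_model(model, cparams);
}
```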
* origin/master: (39 commits)
  server : vision support via libmtmd (ggml-org#12898)
  sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
  metal : optimize MoE for large batches (ggml-org#13388)
  CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
  llama : do not crash if there is no CPU backend (ggml-org#13395)
  CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
  imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
  llama-run: add support for downloading models from ModelScope (ggml-org#13370)
  mtmd : fix batch_view for m-rope (ggml-org#13397)
  llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
  rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
  vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
  server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
  ci : limit write permission to only the release step + fixes (ggml-org#13392)
  mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
  server : (webui) fix a very small misalignment (ggml-org#13387)
  server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
  convert : support rope_scaling type and rope_type (ggml-org#13349)
  mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
  context : allow cache-less context for embeddings (ggml-org#13108)
  ...
target #12799
There is no need to create a KV cache when using embedding models such as BERT. This saves memory compared to master.

API Changes

- The llama_encode() method is now the recommended way to compute embeddings and rerank scores.
- llama_decode() can still be used to compute embeddings as before.
- llama_decode() falls back to llama_encode() and prints a warning.

In short, whenever the KV cache is not needed, use llama_encode(); otherwise, use llama_decode(). The changes are backwards compatible.
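To illustrate the recommended path, below is a minimal, hedged sketch of computing a pooled embedding through llama_encode() on such a cache-less context. It is a sketch under assumptions rather than the reference implementation: the API names used (llama_model_load_from_file, llama_init_from_model, llama_model_get_vocab, llama_tokenize, llama_batch_get_one, llama_get_embeddings_seq) follow the llama.h of this period and may differ in other versions, and error handling is kept minimal. The embedding example shipped with the repository remains the authoritative reference.

```cpp
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

// Hedged sketch: compute a pooled sentence embedding with llama_encode().
// With an embedding model such as BERT, no KV cache is created for this
// context after the PR.
int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.gguf> <text>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == NULL) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                     // we want embeddings, not logits
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;  // one pooled vector per sequence

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) {
        llama_model_free(model);
        return 1;
    }

    // tokenize the input text (all tokens go into sequence 0)
    const llama_vocab * vocab = llama_model_get_vocab(model);
    std::vector<llama_token> tokens(llama_n_ctx(ctx));
    const int n_tok = llama_tokenize(vocab, argv[2], (int32_t) strlen(argv[2]),
                                     tokens.data(), (int32_t) tokens.size(),
                                     /*add_special=*/true, /*parse_special=*/false);
    if (n_tok < 0) {
        return 1;
    }
    tokens.resize(n_tok);

    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());

    // recommended path after this PR: llama_encode() - no KV cache required
    if (llama_encode(ctx, batch) != 0) {
        return 1;
    }

    // with pooling enabled, one embedding vector is produced per sequence
    const int     n_embd = llama_model_n_embd(model);
    const float * embd   = llama_get_embeddings_seq(ctx, 0);
    if (embd != NULL) {
        printf("embedding[0..2] = %f %f %f (n_embd = %d)\n", embd[0], embd[1], embd[2], n_embd);
    }

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();

    return 0;
}
```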