LMCache: smart caching for LLM inference.
LMCache is a project that stores KV caches on the CPU, on disk, or even in specialized memory accessible via NIXL. Essentially, it's a tool that turns one-time prefill computations into reusable blocks, saving time and compute.
In a chatbot, users often reuse the same system prompt or dialogue history. Normally the model would recompute all of it every time, but LMCache simply loads the ready-made cache. Offloading KV caches frees the GPU for new work and can cut TTFT (time to first token) by up to 10x.
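For a feel of how this plugs in, here is a minimal sketch of CPU offloading through vLLM's KV connector, modeled on the CPU-offload example pattern in the repo. The environment variable names, config fields, model name, and sizes are illustrative assumptions and may differ between versions.

```python
# Minimal sketch: CPU offloading of KV caches via vLLM's KV connector.
# Variable names and sizes are illustrative; check the repo examples
# for the exact settings in your version.
import os

os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV chunk (assumed default)
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the CPU backend
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # CPU cache budget in GB

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # route KV blocks through LMCache
        kv_role="kv_both",                  # this instance both stores and loads caches
    ),
    gpu_memory_utilization=0.8,
)

# A long shared system prompt is prefilled once; later requests that start
# with the same prefix hit the CPU cache instead of recomputing it.
system_prompt = "You are a helpful assistant. " * 200
outputs = llm.generate(
    [system_prompt + "Summarize the benefits of KV cache reuse."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```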
🟡LMCache is flexible.
Caches can not only be offloaded but also shared between different LLM instances. Simply put, if two users hit different copies of the model with the same request at the same time, the system won't duplicate the work: the result of one prefill becomes available to everyone. This even works for partial prefix matches, when only part of the input overlaps.
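Sharing boils down to pointing several engines at the same LMCache backend. The sketch below is purely illustrative: the LMCACHE_REMOTE_URL variable, the lm:// scheme, and the serializer name are assumptions in the spirit of the project's documented config style, not guaranteed values.

```python
# Illustrative sketch: several vLLM instances sharing one LMCache backend.
# LMCACHE_REMOTE_URL, the lm:// scheme, and the serializer name are
# assumptions; check the LMCache docs for the exact names in your version.
import os

# Every instance points at the same cache server. Whichever instance
# prefills a prefix first publishes its KV blocks; the others load them
# instead of recomputing.
os.environ["LMCACHE_REMOTE_URL"] = "lm://cache-server:65432"
os.environ["LMCACHE_REMOTE_SERDE"] = "naive"

from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",  # each instance can both publish and reuse caches
    ),
)
```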
🟡LMCache supports disaggregated prefill.
Prefill and decode, which usually run on the same GPU, can be separated: the first stage runs on compute-heavy nodes, the second on nodes optimized for generation. In distributed deployments this raises throughput.
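Conceptually, the split is two engine configurations with different KV roles. The role names below follow vLLM's disaggregated-prefill examples; the exact LMCache/NIXL connector settings for a real deployment are assumptions, so check the repo's examples.

```python
# Sketch of the prefill/decode split through vLLM's KV transfer config.
# Role names follow vLLM's disaggregated-prefill examples; the exact
# LMCache/NIXL settings for production are assumptions.
from vllm import LLM
from vllm.config import KVTransferConfig

def build_engine(role: str) -> LLM:
    # role is "kv_producer" on prefill nodes and "kv_consumer" on decode nodes;
    # in practice each engine runs as its own process on its own machine.
    return LLM(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model choice
        kv_transfer_config=KVTransferConfig(
            kv_connector="LMCacheConnectorV1",
            kv_role=role,
        ),
    )

# Prefill node: computes KV caches and hands them off.
# prefill_engine = build_engine("kv_producer")
# Decode node: picks up the prefilled caches and generates tokens.
# decode_engine = build_engine("kv_consumer")
```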
The developers' own tests show that on real workloads latency drops by 3-10x, and GPU cycles are no longer burned on repeated computation.
The project is tightly integrated with vLLM, and the repository ships a large set of examples, documentation, and installation and configuration tips.
⚠️ There's also a KV cache calculator: choose a model, its data type, and a token count to estimate how much VRAM can be saved.
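For intuition on what such a calculator reports: KV cache size is roughly 2 (keys and values) x layers x KV heads x head dim x bytes per element x tokens. A quick sketch with illustrative Llama-3-8B-style parameters:

```python
# Back-of-the-envelope KV cache size, mirroring what a KV cache calculator does.
# Parameters below are illustrative (Llama-3-8B-style: 32 layers, 8 KV heads
# with GQA, head dim 128, fp16); plug in your own model's config.
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # 2x for keys and values, per layer, per KV head, per head dimension.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

for tokens in (2_000, 32_000, 128_000):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache")

# Roughly 0.24 GiB at 2k tokens, 3.9 GiB at 32k, 15.6 GiB at 128k:
# whatever gets offloaded to CPU or disk is VRAM the GPU gets back.
```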
📌Licensing: Apache 2.0 License.
🖥 Github: https://lnkd.in/g6Kvxu-M
#AI #ML #LLM #LMCache #KVCache #Github