kvcached (KV cache daemon) is a KV cache library for LLM serving/training on shared GPUs. By bringing an OS-style virtual memory abstraction to LLM systems, it enables elastic and demand-driven KV cache allocation, improving GPU utilization under dynamic workloads.
kvcached achieves this by decoupling GPU virtual addressing from physical memory allocation for KV caches. It allows serving engines to reserve only virtual memory at startup and back it with physical GPU memory later, when the cache is actively used. This decoupling enables on-demand allocation and flexible sharing, improving GPU memory utilization under dynamic and mixed workloads. Check out more details in the blog.
- Elastic KV cache: allocate and reclaim KV memory dynamically to match live load.
- GPU virtual memory: decouple logical KV from physical GPU memory via runtime mapping.
- Memory control CLI: enforce memory limits with the kvcached CLI.
- Frontend router and sleep mode: route requests to target models and put idle models to sleep.
- Mainstream serving engines: integrate with SGLang and vLLM.
See concrete examples here: kvcached/examples.
The following simple example shows how kvcached enables an unmodified vLLM engine to run with dynamically allocated memory.
kvcached enables dynamic memory sharing, allowing multiple LLMs to share the same GPU memory elastically. In contrast, existing serving engines must statically reserve GPU memory at startup.
This benchmark shows the performance benefits of kvcached when serving three Llama-3.1-8B models on an A100-80G GPU under workloads with intermittent peaks. kvcached achieves a 2-28x TTFT reduction compared to existing serving engines. This performance gain translates into significant cost savings for LLM serving: without kvcached, systems must provision more GPUs to reach the same performance.
Details can be found in benchmarks/bench_latency_benefit.
- Python (tested with 3.9 - 3.12)
- SGLang (tested with v0.5.3) or vLLM (tested with v0.11.0)
kvcached can be installed as a plugin into an existing SGLang or vLLM environment.
pip install kvcached --no-build-isolation
# under the project root folder
pip install -e . --no-build-isolation --no-cache-dir
python tools/dev_copy_pth.py
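To confirm the installation, a quick import check is usually enough (a minimal sanity check; it assumes the package is importable as kvcached):
# verify that the kvcached package is importable from your environment
python -c "import kvcached"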
We provide dockers with kvcached pre-installed on top of the original engine images:
docker pull ghcr.io/ovg-project/kvcached-sglang:latest # kvcached-v0.1.1-sglang-v0.5.3
docker pull ghcr.io/ovg-project/kvcached-vllm:latest # kvcached-v0.1.1-vllm-v0.11.0
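A typical way to start a container from one of these images (a sketch using standard Docker flags; the port and Hugging Face cache mount are illustrative and should be adjusted to your setup):
docker run --gpus all -it --rm \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    ghcr.io/ovg-project/kvcached-sglang:latest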
We also provide an all-in-one docker for developers:
docker pull ghcr.io/ovg-project/kvcached-dev:latest
More instructions can be found here. GB200 dockers are on the way.
kvcached can be enabled by setting the following environment variables:
export ENABLE_KVCACHED=true
export KVCACHED_AUTOPATCH=1
# IPC name for memory stats (optional)
export KVCACHED_IPC_NAME=[SGLANG|VLLM]
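These variables can also be set inline for a single run; for example, a minimal sketch that launches an unmodified vLLM server with kvcached enabled (model and port are illustrative):
ENABLE_KVCACHED=true KVCACHED_AUTOPATCH=1 \
    vllm serve meta-llama/Llama-3.2-1B --no-enable-prefix-caching --port 12346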
If you are using the engine-specific dockers, you can test kvcached by running the original engines' benchmark scripts. For example:
# for sglang
python -m sglang.launch_server --model meta-llama/Llama-3.2-1B --disable-radix-cache --port 30000
python -m sglang.bench_serving --backend sglang-oai --model meta-llama/Llama-3.2-1B --dataset-name sharegpt --request-rate 10 --num-prompts 1000 --port 30000
# for vllm
vllm serve meta-llama/Llama-3.2-1B --disable-log-requests --no-enable-prefix-caching --port=12346
vllm bench serve --model meta-llama/Llama-3.2-1B --request-rate 10 --num-prompts 1000 --port 12346
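To observe the demand-driven allocation while a benchmark runs, you can watch GPU memory usage from another terminal (standard NVIDIA tooling, not specific to kvcached); usage should grow under load and shrink as requests drain:
# refresh GPU memory usage every second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv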
If you installed kvcached from source, you can also do the following:
cd benchmarks/simple_bench
./start_server.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
# Wait until the LLM server is ready (see the health-check sketch below)
./start_client.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
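One way to wait for server readiness is to poll its health endpoint (a sketch assuming a default localhost setup; adjust the port to match the launched server):
# poll until the server responds to health checks
until curl -sf http://localhost:30000/health > /dev/null; do sleep 2; done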
The benchmark scripts automatically set ENABLE_KVCACHED=true. Please refer to each script for instructions on how to run inference with kvcached.
The latest roadmap is also tracked in issue #125.
- Engine integration
  - SGLang and vLLM
  - Ollama (in progress)
  - llama.cpp and LMStudio
- Features
  - Tensor parallelism
  - Prefix caching
  - KV cache offloading to host memory
  - More attention types (sliding window attention, linear attention, vision encoder, etc.)
- Performance optimizations
  - Contiguous KV tensor layout
  - Physical memory management
- Hardware
  - NVIDIA GPUs
  - AMD GPUs
We are grateful for and open to contributions and collaborations of any kind.
We use pre-commit to ensure a consistent coding style. You can set it up by running:
pip install pre-commit
pre-commit install
Before pushing your code, please run the following and make sure all checks pass:
pre-commit run --all-files
kvcached is developed by many contributors from the community. Feel free to contact us for contributions and collaborations.
Jiarong Xing ([email protected])
Yifan Qiao ([email protected])
Shan Yu ([email protected])
If you find kvcached useful, please cite our paper:
@article{xing2025towards,
  title={Towards Efficient and Practical GPU Multitasking in the Era of LLM},
  author={Xing, Jiarong and Qiao, Yifan and Mo, Simon and Cui, Xingqi and Sela, Gur-Eyal and Zhou, Yang and Gonzalez, Joseph and Stoica, Ion},
  journal={arXiv preprint arXiv:2508.08448},
  year={2025}
}

@article{yu2025prism,
  title={Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving},
  author={Yu, Shan and Xing, Jiarong and Qiao, Yifan and Ma, Mingyuan and Li, Yangmin and Wang, Yang and Yang, Shuo and Xie, Zhiqiang and Cao, Shiyi and Bao, Ke and others},
  journal={arXiv preprint arXiv:2505.04021},
  year={2025}
}