kvcached (KV cache daemon) is a KV cache library for LLM serving/training on shared GPUs. By bringing an OS-style virtual memory abstraction to LLM systems, it enables elastic and demand-driven KV cache allocation, improving GPU utilization under dynamic workloads.
kvcached achieves this by decoupling GPU virtual addressing from physical memory allocation for KV caches. It allows serving engines to reserve only virtual memory at startup and back it with physical GPU memory later, when the cache is actively used. This decoupling enables on-demand allocation and flexible sharing, improving GPU memory utilization under dynamic and mixed workloads. Check out more details in the blog.
- Elastic KV cache: allocate and reclaim KV memory dynamically to match live load.
- GPU virtual memory: decouple logical KV from physical GPU memory via runtime mapping.
- Memory control CLI: enforce memory limits with the kvcached CLI.
- Frontend router and sleep mode: route requests to target models and put idle models to sleep.
- Mainstream serving engines: integrate with SGLang and vLLM.
See concrete examples here: kvcached/examples.
The following simple example shows how kvcached enables an unmodified vLLM engine to run with dynamically allocated memory.
kvcached enables dynamic memory sharing, allowing multiple LLMs to share the same GPU memory elastically. In contrast, existing serving engines must statically reserve GPU memory at startup.
This benchmark shows the performance benefits of kvcached when serving three Llama-3.1-8B models on an A100-80G GPU under workloads with intermittent peaks. kvcached achieves a 2-28x TTFT reduction compared to existing serving engines. This performance gain translates into significant cost savings for LLM serving: without kvcached, systems must provision more GPUs to reach the same performance.
Details can be found in benchmarks/bench_latency_benefit.
- Python (tested with 3.9 - 3.12)
- SGLang (tested with v0.5.3) or vLLM (tested with v0.11.0)
kvcached can be installed as a plugin into an existing SGLang or vLLM environment.
pip install kvcached --no-build-isolation
# under the project root folder
pip install -e . --no-build-isolation --no-cache-dir
python tools/dev_copy_pth.py
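To confirm the installation, a quick import check is usually enough (a minimal sanity check; it assumes the package is importable as kvcached):
# verify that the kvcached package is importable from your environment
python -c "import kvcached"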
We provide dockers with kvcached pre-installed on top of the original engine images:
docker pull ghcr.io/ovg-project/kvcached-sglang:latest # kvcached-v0.1.1-sglang-v0.5.3
docker pull ghcr.io/ovg-project/kvcached-vllm:latest # kvcached-v0.1.1-vllm-v0.11.0
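A typical way to start a container from one of these images (a sketch using standard Docker flags; the port and Hugging Face cache mount are illustrative and should be adjusted to your setup):
docker run --gpus all -it --rm \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    ghcr.io/ovg-project/kvcached-sglang:latest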
We also provide an all-in-one docker for developers:
docker pull ghcr.io/ovg-project/kvcached-dev:latest
More instructions can be found here. GB200 dockers are on the way.
kvcached can be enabled by setting the following environment variables:
export ENABLE_KVCACHED=true
export KVCACHED_AUTOPATCH=1
# IPC name for memory stats (optional)
export KVCACHED_IPC_NAME=[SGLANG|VLLM]
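These variables can also be set inline for a single run; for example, a minimal sketch that launches an unmodified vLLM server with kvcached enabled (model and port are illustrative):
ENABLE_KVCACHED=true KVCACHED_AUTOPATCH=1 \
    vllm serve meta-llama/Llama-3.2-1B --no-enable-prefix-caching --port 12346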
If you are using the engine-specific dockers, you can test kvcached by running the original engines' benchmark scripts. For example:
# for sglang
python -m sglang.launch_server --model meta-llama/Llama-3.2-1B --disable-radix-cache --port 30000
python -m sglang.bench_serving --backend sglang-oai --model meta-llama/Llama-3.2-1B --dataset-name sharegpt --request-rate 10 --num-prompts 1000 --port 30000
# for vllm
vllm serve meta-llama/Llama-3.2-1B --disable-log-requests --no-enable-prefix-caching --port=12346
vllm bench serve --model meta-llama/Llama-3.2-1B --request-rate 10 --num-prompts 1000 --port 12346
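To observe the demand-driven allocation while a benchmark runs, you can watch GPU memory usage from another terminal (standard NVIDIA tooling, not specific to kvcached); usage should grow under load and shrink as requests drain:
# refresh GPU memory usage every second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv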
If you installed kvcached from source, you can also do the following:
cd benchmarks/simple_bench
./start_server.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
# Wait until the LLM server is ready (see the health-check sketch below)
./start_client.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
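One way to wait for server readiness is to poll its health endpoint (a sketch assuming a default localhost setup; adjust the port to match the launched server):
# poll until the server responds to health checks
until curl -sf http://localhost:30000/health > /dev/null; do sleep 2; done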
The benchmark scripts automatically set ENABLE_KVCACHED=true. Please refer to each script for instructions on how to run inference with kvcached.
The latest roadmap is also tracked in issue #125.
- Engine integration
  - SGLang and vLLM
  - Ollama (in progress)
  - llama.cpp and LMStudio
- Features
  - Tensor parallelism
  - Prefix caching
  - KV cache offloading to host memory
  - More attention types (sliding window attention, linear attention, vision encoder, etc.)
- Performance optimizations
  - Contiguous KV tensor layout
  - Physical memory management
- Hardware
  - NVIDIA GPUs
  - AMD GPUs
We are grateful for and open to contributions and collaborations of any kind.
We use pre-commit to ensure a consistent coding style. You can set it up by running:
pip install pre-commit
pre-commit install
Before pushing your code, please run the following and make sure all checks pass:
pre-commit run --all-files
kvcached is developed by many contributors from the community. Feel free to contact us for contributions and collaborations.
Jiarong Xing ([email protected])
Yifan Qiao ([email protected])
Shan Yu ([email protected])
If you find kvcached useful, please cite our paper:
@article{xing2025towards,
  title={Towards Efficient and Practical GPU Multitasking in the Era of LLM},
  author={Xing, Jiarong and Qiao, Yifan and Mo, Simon and Cui, Xingqi and Sela, Gur-Eyal and Zhou, Yang and Gonzalez, Joseph and Stoica, Ion},
  journal={arXiv preprint arXiv:2508.08448},
  year={2025}
}

@article{yu2025prism,
  title={Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving},
  author={Yu, Shan and Xing, Jiarong and Qiao, Yifan and Ma, Mingyuan and Li, Yangmin and Wang, Yang and Yang, Shuo and Xie, Zhiqiang and Cao, Shiyi and Bao, Ke and others},
  journal={arXiv preprint arXiv:2505.04021},
  year={2025}
}