kvcached

Make GPU Sharing Flexible and Easy

kvcached (KV cache daemon) is a KV cache library for LLM serving/training on shared GPUs. By bringing OS-style virtual memory abstraction to LLM systems, it enables elastic and demand-driven KV cache allocation, improving GPU utilization under dynamic workloads.

kvcached achieves this by decoupling GPU virtual addressing from physical memory allocation for KV caches. It allows serving engines to initially reserve virtual memory only and later back it with physical GPU memory when the cache is actively used. This decoupling enables on-demand allocation and flexible sharing, bringing better GPU memory utilization under dynamic and mixed workloads. Check out more details in the blog.
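
The idea behind this decoupling can be sketched in a few lines of Python. The class below is purely illustrative (ElasticKVCache and its methods are hypothetical, not kvcached's API, and Python bytearrays stand in for GPU physical pages managed at runtime): a large virtual range is reserved at startup, physical pages are mapped only when a region of the cache is first touched, and pages are released when reclaimed so other workloads can use them.

PAGE_SIZE = 2 * 1024 * 1024  # 2 MiB pages, a common GPU mapping granularity


class ElasticKVCache:
    """Hypothetical illustration of elastic KV cache memory; not kvcached's actual API."""

    def __init__(self, virtual_bytes: int):
        # Reserve a large virtual range up front; nothing physical is committed yet.
        self.virtual_bytes = virtual_bytes
        self.mapped_pages = {}  # page index -> physical backing (stand-in for GPU pages)

    def touch(self, offset: int) -> None:
        # Back the page containing `offset` with physical memory on first use.
        page = offset // PAGE_SIZE
        if page not in self.mapped_pages:
            self.mapped_pages[page] = bytearray(PAGE_SIZE)

    def reclaim(self, offset: int) -> None:
        # Return the physical page to the shared pool; the virtual range stays reserved.
        self.mapped_pages.pop(offset // PAGE_SIZE, None)

    @property
    def physical_bytes(self) -> int:
        return len(self.mapped_pages) * PAGE_SIZE


cache = ElasticKVCache(virtual_bytes=40 * 1024**3)  # reserve 40 GiB of virtual space
cache.touch(0)                                      # demand drives physical allocation
print(cache.physical_bytes)                         # 2 MiB mapped, not 40 GiB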

Key Features

  • Elastic KV cache: allocate and reclaim KV memory dynamically to match live load.
  • GPU virtual memory: decouple logical KV from physical GPU memory via runtime mapping.
  • Memory control CLI: enforce memory limits with kvcached CLI.
  • Frontend router and sleep mode: route requests to target models and put idle models to sleep.
  • Mainstream serving engines: integrates with SGLang and vLLM.

Example use cases

Multi‑LLM serving
kvcached allows multiple LLMs to share a GPU's memory elastically, enabling concurrent deployment without the rigid memory partitioning used today. This improves GPU utilization and saves serving costs.
Serverless LLM
By allocating KV cache only when needed, kvcached supports serverless deployments where models can spin up and down on demand.
Compound AI systems
kvcached makes compound AI systems practical on limited hardware by elastically allocating memory across specialized models in a pipeline (e.g., retrieval, reasoning, and summarization).
GPU workload colocation
kvcached allows LLM inference to coexist with other GPU workloads such as training jobs, fine-tuning, or vision models.

See concrete examples here: kvcached/examples.

kvcached in action

The following simple example shows how kvcached enables an unmodified vLLM engine to run with dynamically allocated memory.

[Demo animation: an unmodified vLLM engine serving with a dynamically allocated KV cache]

Performance: Multi-LLM serving

kvcached lets multiple LLMs share the same GPU memory elastically. In contrast, current serving engines must statically reserve GPU memory at startup.

This benchmark shows the performance benefits of kvcached when serving three Llama-3.1-8B models on an A100-80G GPU under workloads with intermittent peaks. kvcached achieves a 2-28x reduction in TTFT compared to existing serving engines. This performance gain translates into significant cost savings: without kvcached, systems must provision more GPUs to reach the same performance. Details can be found in benchmarks/bench_latency_benefit.

[Benchmark figures: mean TTFT and p99 TTFT]

Installation

Prerequisites

  • Python (tested with 3.9 - 3.12)
  • SGLang (tested with v0.5.3) or vLLM (tested with v0.11.0)

kvcached can be installed as a plugin in an existing SGLang or vLLM environment.

Install from PyPI

pip install kvcached --no-build-isolation

Install from source

# under the project root folder

pip install -e . --no-build-isolation --no-cache-dir
python tools/dev_copy_pth.py
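
After either install path, a quick sanity check (assuming the package is importable under the name kvcached) is:

import kvcached  # should succeed once the plugin is installed
print("kvcached loaded from", kvcached.__file__)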

Using Docker

We provide Docker images with kvcached pre-installed on top of the original engine images:

docker pull ghcr.io/ovg-project/kvcached-sglang:latest   # kvcached-v0.1.1-sglang-v0.5.3
docker pull ghcr.io/ovg-project/kvcached-vllm:latest     # kvcached-v0.1.1-vllm-v0.11.0

We also provide an all-in-one Docker image for developers:

docker pull ghcr.io/ovg-project/kvcached-dev:latest

More instructions can be found here. GB200 Docker images are on the way.

Testing

kvcached can be enabled by setting the following environment variables:

export ENABLE_KVCACHED=true
export KVCACHED_AUTOPATCH=1

# IPC name for memory stats (optional)
export KVCACHED_IPC_NAME=[SGLANG|VLLM]
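
If you launch the engine from a Python script instead of a shell, the same switches can be set programmatically. This is a minimal sketch using only the variables above; they typically need to be set before the engine initializes its KV cache:

import os

os.environ["ENABLE_KVCACHED"] = "true"
os.environ["KVCACHED_AUTOPATCH"] = "1"
os.environ["KVCACHED_IPC_NAME"] = "VLLM"  # optional; use "SGLANG" for SGLang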

If you are using the engine-specific dockers, you can test kvcached by running the original engines' benchmark scripts. For example:

# for sglang
python -m sglang.launch_server --model meta-llama/Llama-3.2-1B --disable-radix-cache --port 30000
python -m sglang.bench_serving --backend sglang-oai --model meta-llama/Llama-3.2-1B --dataset-name sharegpt --request-rate 10 --num-prompts 1000 --port 30000

# for vllm
vllm serve meta-llama/Llama-3.2-1B --disable-log-requests --no-enable-prefix-caching --port=12346
vllm bench serve --model meta-llama/Llama-3.2-1B --request-rate 10 --num-prompts 1000 --port 12346

If you installed kvcached from source, you can also run:

cd benchmarks/simple_bench
./start_server.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
# Wait until LLM server is ready
./start_client.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B

The benchmark scripts automatically set ENABLE_KVCACHED=true. Please refer to each script for instructions on how to run inference with kvcached.

Roadmap

The latest roadmap is also tracked in issue #125.

  • Engine integration
    • SGLang and vLLM
    • Ollama (in progress)
    • llama.cpp and LMStudio
  • Features
    • Tensor parallelism
    • Prefix caching
    • KV cache offloading to host memory
    • More attention types (sliding window attention, linear attention, vision encoder, etc.)
  • Performance optimizations
    • Contiguous KV tensor layout
    • Physical memory management
  • Hardware
    • NVIDIA GPUs
    • AMD GPUs

Contributing

We are grateful for and open to contributions and collaborations of any kind.

We use pre-commit to ensure a consistent coding style. You can set it up with:

pip install pre-commit
pre-commit install

Before pushing your code, please run the following command and make sure all checks pass.

pre-commit run --all-files

Contacts

kvcached is developed by many contributors from the community. Feel free to contact us for contributions and collaborations.

Jiarong Xing ([email protected])
Yifan Qiao ([email protected])
Shan Yu ([email protected])

Citation

If you find kvcached useful, please cite our paper:

@article{xing2025towards,
  title={Towards Efficient and Practical GPU Multitasking in the Era of LLM},
  author={Xing, Jiarong and Qiao, Yifan and Mo, Simon and Cui, Xingqi and Sela, Gur-Eyal and Zhou, Yang and Gonzalez, Joseph and Stoica, Ion},
  journal={arXiv preprint arXiv:2508.08448},
  year={2025}
}

@article{yu2025prism,
  title={Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving},
  author={Yu, Shan and Xing, Jiarong and Qiao, Yifan and Ma, Mingyuan and Li, Yangmin and Wang, Yang and Yang, Shuo and Xie, Zhiqiang and Cao, Shiyi and Bao, Ke and others},
  journal={arXiv preprint arXiv:2505.04021},
  year={2025}
}