About | Core Innovations | Architecture | Project Structure | Installation | Examples | SGLang | kvcached
🚀 Prism is a multi-LLM serving system that achieves over 2× cost savings and 3.3× higher SLO attainment through flexible GPU sharing.
Serving multiple large language models (LLMs) raises cost and performance challenges. Today's systems typically dedicate one GPU or a group of GPUs to a single model, which leads to low GPU utilization.
Prism tackles this challenge through flexible GPU sharing, enabling multiple models to share one or more GPUs via time-sharing or space-sharing. To meet latency service-level objectives (SLOs), it employs a scheduling algorithm that dynamically adjusts the sharing policy based on runtime workload patterns. Compared to existing systems, Prism delivers over 2× cost savings and a 3.3× improvement in SLO attainment.
Prism uses kvcached for flexible memory sharing and is implemented on top of SGLang.
Prism introduces two fundamental innovations:
🔧 Flexible Cross-Model Memory Coordination
- On-demand memory allocation: kvcached decouples virtual and physical GPU memory allocation, enabling dynamic memory redistribution across models without engine modifications (see the sketch after this list).
- Fast model activation: Prism supports warm starts through pre-initialized SGLang engines and parallel model weight loading, which together reduce model activation time to under 1.5 s (tested on models from 1B to 70B).
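The memory model behind this can be illustrated with a small, self-contained sketch. This is not the kvcached API; the class names and page size are illustrative, and real allocations back reserved virtual address ranges with physical GPU pages rather than plain counters.

# Conceptual sketch only -- NOT the kvcached API. Names and the page size are
# illustrative; real allocations back reserved virtual address ranges with
# physical GPU pages on demand instead of plain counters.

PAGE_TOKENS = 256  # tokens covered by one physical page (illustrative)

class PhysicalBudget:
    """Shared pool of physical pages on one GPU."""

    def __init__(self, total_pages: int):
        self.free_pages = total_pages

    def commit(self, n: int) -> bool:
        if n > self.free_pages:
            return False  # the local scheduler would queue or evict here
        self.free_pages -= n
        return True

    def release(self, n: int) -> None:
        self.free_pages += n

class ElasticKVPool:
    """Per-model pool: large virtual reservation, physical pages on demand."""

    def __init__(self, budget: PhysicalBudget, virtual_capacity_tokens: int):
        self.budget = budget
        self.virtual_capacity = virtual_capacity_tokens  # reserved, not committed
        self.committed_pages = 0

    def grow(self, new_tokens: int) -> bool:
        needed = -(-new_tokens // PAGE_TOKENS)  # ceil division
        if not self.budget.commit(needed):
            return False
        self.committed_pages += needed
        return True

    def shrink(self, finished_tokens: int) -> None:
        freed = min(finished_tokens // PAGE_TOKENS, self.committed_pages)
        self.committed_pages -= freed
        self.budget.release(freed)

# Two colocated models share one physical budget but keep separate virtual
# reservations, so physical memory flows to whichever model is busy.
gpu = PhysicalBudget(total_pages=1000)
model_a = ElasticKVPool(gpu, virtual_capacity_tokens=1_000_000)
model_b = ElasticKVPool(gpu, virtual_capacity_tokens=1_000_000)
model_a.grow(200_000)    # model_a bursts and takes most physical pages
model_a.shrink(150_000)  # its load subsides; pages return to the shared budget
model_b.grow(100_000)    # model_b grows into the freed pages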
📊 Two-Level Demand-Aware Scheduling
- Global scheduler: Places models across GPUs to balance load for better performance.
- Local scheduler: Coordinates memory allocation among colocated models using priority-based admission control (a simplified sketch of both levels follows this list).
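The sketch below is illustrative only; the actual policies live under python/sglang/multi_model/scheduling/ and additionally account for SLOs, request rates, and migration costs.

# Illustrative sketch of the two scheduling levels; not the real algorithms.
from dataclasses import dataclass

@dataclass
class GPU:
    gpu_id: int
    free_memory_gb: float
    load: float = 0.0  # e.g., aggregate request rate on this GPU

@dataclass
class ModelDemand:
    name: str
    memory_gb: float
    priority: int  # higher value = more urgent

def global_place(model: ModelDemand, gpus: list[GPU]) -> GPU | None:
    """Global scheduler: put the model on the least-loaded GPU that fits."""
    candidates = [g for g in gpus if g.free_memory_gb >= model.memory_gb]
    if not candidates:
        return None  # no fit: a real policy would evict, queue, or scale out
    target = min(candidates, key=lambda g: g.load)
    target.free_memory_gb -= model.memory_gb
    return target

def local_admit(requests: list[ModelDemand], free_gb: float) -> list[ModelDemand]:
    """Local scheduler: priority-based admission of colocated models' memory asks."""
    admitted = []
    for req in sorted(requests, key=lambda r: r.priority, reverse=True):
        if req.memory_gb <= free_gb:
            free_gb -= req.memory_gb
            admitted.append(req)  # lower-priority asks wait until memory frees up
    return admitted

gpus = [GPU(0, free_memory_gb=60.0, load=0.3), GPU(1, free_memory_gb=40.0, load=0.1)]
placement = global_place(ModelDemand("model_1", memory_gb=20.0, priority=1), gpus)
admitted = local_admit(
    [ModelDemand("model_1", 15.0, priority=2), ModelDemand("model_2", 30.0, priority=1)],
    free_gb=35.0,
)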
Prism enhances SGLang with flexible GPU sharing capabilities through a unified multi-component architecture:
The key modifications to SGLang are organized as follows:
Multi-LLM Serving with Two-Level Workload-aware Scheduling
python/sglang/
├── launch_multi_model_server.py # Main entry point for multi-model server
└── multi_model/ # Complete multi-model serving implementation
├── scheduling/
│ ├── policy/ # Global scheduling algorithms
│ ├── gpu/ # GPU scheduling & resource monitoring
│ └── ... # Additional scheduling components
├── endpoint.py # Multi-model API endpoints
├── engine.py # Multi-model engine coordination
├── model_service.py # Model lifecycle management
├── multi_model_server.py # Core server implementation
├── request_handler.py # Request routing and processing
└── ... # Additional server infrastructure
Enhanced SGLang Runtime with Elastic LLM Engine
python/sglang/srt/
├── managers/
│ ├── scheduler.py # 🔧 Enhanced with multi-model scheduling
│ └── ... # Other enhanced managers
├── model_executor/ # 🔧 Worker pool & execution enhancements
├── mem_cache/ # 🔧 Memory pool & elastic allocation
├── server_args.py # 🔧 Multi-model server arguments
└── ... # Additional runtime modifications
Benchmarking & Evaluation
benchmark/multi-model/
├── benchmark.py # Multi-model workload benchmarking
├── trace.py # Synthetic & real-world trace generation
├── model_configs/ # Various model configuration setups
└── ... # Additional benchmarking tools & code
For detailed installation instructions and benchmarking setup, please refer to install.md.
Prism offers three deployment modes, each building upon the previous with enhanced capabilities:
Colocate LLMs with Static Memory Allocation
Launch server with static memory allocation:
# Navigate to benchmark directory
cd benchmark/multi-model
# Start server with static configuration
python3 -m sglang.launch_multi_model_server \
--model-config-file ./model_configs/1_gpu_2_model_colocate_static.json \
--host 127.0.0.1 \
--port 30000 \
--disable-cuda-graph \
--disable-radix-cache \
--load-format dummy \
    --log-file server-logs/static.log

Run synthetic trace benchmark:
python3 benchmark.py \
--base-url http://127.0.0.1:30000 \
--num-models 2 \
--model-paths model_1 model_2 \
--exp-name static_baseline \
--req-rate 10 \
    --seed 42
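Once the server is running, individual requests can also be sent directly. The snippet below is a sketch: the /generate endpoint and the "model" routing field are assumptions; see python/sglang/multi_model/endpoint.py and request_handler.py for the actual request schema.

# Sketch of a direct request to the multi-model server. The /generate endpoint
# and the "model" routing field are assumptions; check endpoint.py and
# request_handler.py for the actual schema.
import requests

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "model": "model_1",  # assumed routing field for the target model
        "text": "Explain GPU time-sharing in one sentence.",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json())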
Colocate LLMs with Elastic Memory Allocation

Launch server with Prism's elastic memory management:
# Start server with elastic kvcached
python3 -m sglang.launch_multi_model_server \
--model-config-file ./model_configs/1_gpu_2_model_colocate_elastic.json \
--host 127.0.0.1 \
--port 30001 \
--disable-cuda-graph \
--disable-radix-cache \
--enable-elastic-memory \
--use-kvcached-v0 \
    --log-file server-logs/elastic.log

Run with model switching:
python3 benchmark.py \
--base-url http://127.0.0.1:30001 \
--num-models 2 \
--model-paths model_1 model_2 \
--exp-name elastic_memory \
--enable-elastic-memory \
--req-rate 10 \
    --seed 42

Flexible Time and Space Sharing (Full Prism)
Launch server with complete Prism system:
# Start server with full Prism capabilities
python3 -m sglang.launch_multi_model_server \
--model-config-file ./model_configs/8_gpu_18_model_our.json \
--port 30002 \
--disable-cuda-graph \
--disable-radix-cache \
    --enable-cpu-share-memory \
--enable-elastic-memory \
--use-kvcached-v0 \
--max-mem-usage 67.28 \
--enable-gpu-scheduler \
--enable-controller \
--policy simple-global \
--enable-model-service \
--enable-worker-pool \
--workers-per-gpu 4 \
--num-model-service-workers 4 \
--num-gpus 8 \
    --log-file server-logs/workerpool.log

Run large-scale benchmark:
python3 benchmark.py \
--base-url http://127.0.0.1:30002 \
--num-models 18 \
--num-gpus 8 \
--exp-name prism_full \
--e2e-benchmark \
--real-trace ./real_trace.pkl \
--time-scale 1 \
    --replication 1

Model placement configuration
Prism launches LLMs based on an initial placement file: a JSON file that specifies, for each model, its name, weight path, tensor-parallel size (tp_size), the GPU IDs on which it should be placed, whether it starts active (on), and the maximum size of its memory pool (max_memory_pool_size).
Below are some examples of the initial model placements.
Colocate two models on GPU 0:
[
{
"model_name": "model_1",
"model_path": "meta-llama/Llama-3.2-1B",
"tp_size": 1,
"init_placements": [{
"gpu_ids": [0],
"on": true,
"max_memory_pool_size": 15
}]
},
{
"model_name": "model_2",
"model_path": "meta-llama/Llama-3.2-3B",
"tp_size": 1,
"init_placements": [{
"gpu_ids": [0],
"on": false,
"max_memory_pool_size": 15
}]
}
]

Load a 70B model across 4 GPUs:
[
{
"model_name": "large_model",
"model_path": "meta-llama/Llama-3.3-70B-Instruct",
"tp_size": 4,
"init_placements": [{
"gpu_ids": [0, 1, 2, 3],
"on": true,
"max_memory_pool_size": 10
}]
}
]

For more configuration examples, see benchmark/multi-model/model_configs/.
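As a quick sanity check, a placement file can be loaded and inspected with plain Python; the field names below are exactly those used in the examples above (the default path is illustrative).

# Quick inspection of a placement file; field names match the examples above.
import json
import sys

# Path is illustrative -- pass any config from benchmark/multi-model/model_configs/.
path = sys.argv[1] if len(sys.argv) > 1 else "model_configs/1_gpu_2_model_colocate_static.json"

with open(path) as f:
    models = json.load(f)

for m in models:
    for p in m["init_placements"]:
        print(
            f"{m['model_name']}: path={m['model_path']} tp={m['tp_size']} "
            f"gpus={p['gpu_ids']} on={p['on']} pool={p['max_memory_pool_size']}"
        )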
If you find Prism useful for your research, please cite our paper:
@misc{yu2025prismunleashinggpusharing,
title={Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving},
author={Shan Yu and Jiarong Xing and Yifan Qiao and Mingyuan Ma and Yangmin Li and Yang Wang and Shuo Yang and Zhiqiang Xie and Shiyi Cao and Ke Bao and Ion Stoica and Harry Xu and Ying Sheng},
year={2025},
eprint={2505.04021},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2505.04021}
}