
Prism Logo



Cost-Efficient Multi-LLM Inference

About | Core Innovations | Architecture | Project Structure | Installation | Examples | SGLang | kvcached

🚀 Prism is a multi-LLM serving system that achieves >2× cost savings and 3.3× higher SLO attainment through flexible GPU sharing.

About

Serving multiple large language models (LLMs) raises both cost and performance challenges. Today's systems typically dedicate one GPU, or a group of GPUs, to a single model, which leads to low GPU utilization.

Prism tackles this challenge through flexible GPU sharing, enabling multiple models to share one or more GPUs via time-sharing or space-sharing. To meet latency service-level objectives (SLOs), it employs a scheduling algorithm that dynamically adjusts the sharing policy based on runtime workload patterns. Compared to existing systems, Prism delivers over 2× cost savings and a 3.3× improvement in SLO attainment.

Prism uses kvcached for flexible GPU memory sharing and is implemented on top of SGLang.

Core Innovations

Prism introduces two fundamental innovations:

🔧 Flexible Cross-Model Memory Coordination

  • On-demand memory allocation: kvcached decouples virtual and physical GPU memory allocation, enabling dynamic memory redistribution across models without engine modifications (a conceptual sketch follows this list).
  • Fast model activation: Prism supports warm starts through pre-initialized SGLang engines and parallel model weight loading. Together, these reduce model activation time to under 1.5 s for models tested from 1B to 70B parameters.
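
To make the first point concrete, here is a conceptual sketch (hypothetical names, not kvcached's actual API) of a KV-cache pool that reserves a large virtual region per model but maps physical GPU pages only as they are touched, so an idle model holds almost no physical memory:

# Conceptual sketch of on-demand KV-cache allocation. Hypothetical API,
# not the real kvcached interface.
from dataclasses import dataclass, field

PAGE_SIZE_BYTES = 2 * 1024 * 1024  # assume 2 MiB physical pages

@dataclass
class ElasticKVPool:
    virtual_capacity_bytes: int              # large per-model virtual reservation
    mapped_pages: set = field(default_factory=set)

    def ensure_mapped(self, offset, nbytes):
        """Map the physical pages backing [offset, offset + nbytes) on first use."""
        first = offset // PAGE_SIZE_BYTES
        last = (offset + nbytes - 1) // PAGE_SIZE_BYTES
        for page in range(first, last + 1):
            self.mapped_pages.add(page)      # real system: map a physical GPU page here

    def release(self):
        """Return physical pages to the shared pool; keep the virtual range."""
        self.mapped_pages.clear()            # real system: unmap physical pages only

# Two colocated models share physical memory; only the active one maps pages.
pool_a = ElasticKVPool(virtual_capacity_bytes=16 * 1024**3)
pool_b = ElasticKVPool(virtual_capacity_bytes=16 * 1024**3)
pool_a.ensure_mapped(0, 8 * 1024**2)         # model A is serving traffic
pool_b.release()                             # model B is idle and frees its pages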

📊 Two-Level Demand-Aware Scheduling

  • Global scheduler: Places models across GPUs to balance load and improve overall performance.
  • Local scheduler: Coordinates memory allocation among colocated models using priority-based admission control (see the sketch after this list).
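
As a rough illustration of the local scheduler's role, the following sketch shows one way priority-based admission control over a shared memory budget can look. It is a simplified stand-in, not Prism's actual algorithm (see multi_model/scheduling/ for the real implementation):

# Simplified illustration of priority-based admission control over a shared
# GPU memory budget; not Prism's actual local scheduler.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Admission:
    priority: int                            # lower value = higher priority (tighter SLO)
    model: str = field(compare=False)
    kv_bytes_needed: int = field(compare=False)

class LocalAdmissionController:
    def __init__(self, memory_budget_bytes):
        self.free_bytes = memory_budget_bytes
        self.waiting = []                    # min-heap of Admission, ordered by priority

    def submit(self, req):
        """Admit immediately if memory allows; otherwise queue by priority."""
        if req.kv_bytes_needed <= self.free_bytes:
            self.free_bytes -= req.kv_bytes_needed
            return True
        heapq.heappush(self.waiting, req)
        return False

    def on_memory_freed(self, nbytes):
        """Re-admit queued requests, highest priority first, as memory is returned."""
        self.free_bytes += nbytes
        admitted = []
        while self.waiting and self.waiting[0].kv_bytes_needed <= self.free_bytes:
            req = heapq.heappop(self.waiting)
            self.free_bytes -= req.kv_bytes_needed
            admitted.append(req)
        return admitted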

Architecture

Prism enhances SGLang with flexible GPU sharing capabilities through a unified multi-component architecture:

Prism Architecture

Project Structure

Prism extends SGLang with comprehensive multi-model serving capabilities. The key modifications include:

Multi-LLM Serving with Two-Level Workload-aware Scheduling
python/sglang/
├── launch_multi_model_server.py    # Main entry point for multi-model server
└── multi_model/                    # Complete multi-model serving implementation
    ├── scheduling/
    │   ├── policy/                  # Global scheduling algorithms
    │   ├── gpu/                     # GPU scheduling & resource monitoring
    │   └── ...                      # Additional scheduling components
    ├── endpoint.py                  # Multi-model API endpoints
    ├── engine.py                    # Multi-model engine coordination
    ├── model_service.py             # Model lifecycle management
    ├── multi_model_server.py        # Core server implementation
    ├── request_handler.py           # Request routing and processing
    └── ...                          # Additional server infrastructure
Enhanced SGLang Runtime with Elastic LLM Engine
python/sglang/srt/
├── managers/
│   ├── scheduler.py                 # 🔧 Enhanced with multi-model scheduling
│   └── ...                          # Other enhanced managers
├── model_executor/                  # 🔧 Worker pool & execution enhancements
├── mem_cache/                       # 🔧 Memory pool & elastic allocation
├── server_args.py                   # 🔧 Multi-model server arguments
└── ...                              # Additional runtime modifications
Benchmarking & Evaluation
benchmark/multi-model/
├── benchmark.py                     # Multi-model workload benchmarking
├── trace.py                         # Synthetic & real-world trace generation
├── model_configs/                    # Various model configuration setups
└── ...                              # Additional benchmarking tools & code

Installation

For detailed installation instructions and benchmarking setup, please refer to install.md.

Examples

Prism offers three deployment modes, each building upon the previous with enhanced capabilities:

Colocate LLMs with Static Memory Allocation

Launch server with static memory allocation:

# Navigate to benchmark directory
cd benchmark/multi-model

# Start server with static configuration
python3 -m sglang.launch_multi_model_server \
  --model-config-file ./model_configs/1_gpu_2_model_colocate_static.json \
  --host 127.0.0.1 \
  --port 30000 \
  --disable-cuda-graph \
  --disable-radix-cache \
  --load-format dummy \
  --log-file server-logs/static.log

Run synthetic trace benchmark:

python3 benchmark.py \
  --base-url http://127.0.0.1:30000 \
  --num-models 2 \
  --model-paths model_1 model_2 \
  --exp-name static_baseline \
  --req-rate 10 \
  --seed 42
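
Once the server is up, requests can also be sent from any HTTP client. Prism's actual request schema is defined in python/sglang/multi_model/endpoint.py; the snippet below assumes an SGLang-style /generate endpoint extended with a "model" field for routing, which may differ from the real interface. With --load-format dummy the weights are random, so the generated text is not meaningful.

# Hypothetical client example; the actual multi-model request schema lives in
# python/sglang/multi_model/endpoint.py and may differ from this sketch.
import requests

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "model": "model_1",                                   # name from the placement config
        "text": "Explain GPU time-sharing in one sentence.",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.0},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
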
Colocate LLMs with Elastic Memory Allocation

Launch server with Prism's elastic memory management:

# Start server with elastic kvcached
python3 -m sglang.launch_multi_model_server \
  --model-config-file ./model_configs/1_gpu_2_model_colocate_elastic.json \
  --host 127.0.0.1 \
  --port 30001 \
  --disable-cuda-graph \
  --disable-radix-cache \
  --enable-elastic-memory \
  --use-kvcached-v0 \
  --log-file server-logs/elastic.log

Run with model switching:

python3 benchmark.py \
  --base-url http://127.0.0.1:30001 \
  --num-models 2 \
  --model-paths model_1 model_2 \
  --exp-name elastic_memory \
  --enable-elastic-memory \
  --req-rate 10 \
  --seed 42
Flexible Time and Space Sharing (Full Prism)

Launch server with complete Prism system:

# Start server with full Prism capabilities
python3 -m sglang.launch_multi_model_server \
  --model-config-file ./model_configs/8_gpu_18_model_our.json \
  --port 30002 \
  --disable-cuda-graph \
  --disable-radix-cache \
  --enable-cpu-share-memory \
  --enable-elastic-memory \
  --use-kvcached-v0 \
  --max-mem-usage 67.28 \
  --enable-gpu-scheduler \
  --enable-controller \
  --policy simple-global \
  --enable-model-service \
  --enable-worker-pool \
  --workers-per-gpu 4 \
  --num-model-service-workers 4 \
  --num-gpus 8 \
  --log-file server-logs/workerpool.log

Run large-scale benchmark:

python3 benchmark.py \
  --base-url http://127.0.0.1:30002 \
  --num-models 18 \
  --num-gpus 8 \
  --exp-name prism_full \
  --e2e-benchmark \
  --real-trace ./real_trace.pkl \
  --time-scale 1 \
  --replication 1
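
SLO attainment, the headline metric reported above, is simply the fraction of requests that finish within their latency SLO. benchmark.py produces its own report; the sketch below only shows the computation on a hypothetical list of per-request latencies:

# Illustrative SLO-attainment computation; latency values here are made up.
def slo_attainment(latencies_s, slo_s):
    """Fraction of requests that finish within the latency SLO."""
    if not latencies_s:
        return 0.0
    return sum(1 for lat in latencies_s if lat <= slo_s) / len(latencies_s)

latencies = [0.8, 1.2, 2.5, 0.9, 4.1]   # hypothetical end-to-end latencies in seconds
print(f"SLO attainment at a 2 s SLO: {slo_attainment(latencies, 2.0):.0%}")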

Configuration Guide

Model placement configuration

Prism launches LLMs according to an initial placement file: a JSON file that specifies each model's name, path, tensor-parallel size (tp_size), and the GPU IDs on which it should be placed.

Below are some examples of the initial model placements.

Colocate two models on GPU 0:

[
  {
    "model_name": "model_1",
    "model_path": "meta-llama/Llama-3.2-1B",
    "tp_size": 1,
    "init_placements": [{
      "gpu_ids": [0],
      "on": true,
      "max_memory_pool_size": 15
    }]
  },
  {
    "model_name": "model_2", 
    "model_path": "meta-llama/Llama-3.2-3B",
    "tp_size": 1,
    "init_placements": [{
      "gpu_ids": [0],
      "on": false,
      "max_memory_pool_size": 15
    }]
  }
]

Load 70B model across 4 GPUs:

[
  {
    "model_name": "large_model",
    "model_path": "meta-llama/Llama-3.3-70B-Instruct", 
    "tp_size": 4,
    "init_placements": [{
      "gpu_ids": [0, 1, 2, 3],
      "on": true,
      "max_memory_pool_size": 10
    }]
  }
]

For more configuration examples, see benchmark/multi-model/model_configs/.
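
For larger sweeps it can be convenient to generate placement files programmatically. The helper below is a convenience sketch (not part of Prism) that emits JSON with the same fields used in the examples above:

# Convenience sketch for generating a placement file with the fields shown above;
# not part of Prism itself.
import json

def placement(model_name, model_path, gpu_ids, on=True, tp_size=1, max_pool_size=15):
    return {
        "model_name": model_name,
        "model_path": model_path,
        "tp_size": tp_size,
        "init_placements": [{
            "gpu_ids": gpu_ids,
            "on": on,                                  # active at startup, as in the examples
            "max_memory_pool_size": max_pool_size,
        }],
    }

config = [
    placement("model_1", "meta-llama/Llama-3.2-1B", [0], on=True),
    placement("model_2", "meta-llama/Llama-3.2-3B", [0], on=False),
]

with open("generated_colocate.json", "w") as f:
    json.dump(config, f, indent=2)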

Citation

If you find Prism useful for your research, please cite our paper:

@misc{yu2025prismunleashinggpusharing,
  title={Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving},
  author={Shan Yu and Jiarong Xing and Yifan Qiao and Mingyuan Ma and Yangmin Li and Yang Wang and Shuo Yang and Zhiqiang Xie and Shiyi Cao and Ke Bao and Ion Stoica and Harry Xu and Ying Sheng},
  year={2025},
  eprint={2505.04021},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2505.04021}
}
