
Launch multiple api servers for dp > 1 #3414

Open · wants to merge 1 commit into main from launch-multi-servers

Conversation

@RunningLeon (Collaborator) commented Apr 10, 2025

Motivation

Support launching multiple API servers for dp > 1.

Usage

Example: two nodes with ep=16 and dp=16 (i.e., 8 DP ranks per node).

Step 1: Launch the proxy server on the master node

lmdeploy serve proxy --server-port 23333 --server-name 172.16.4.52

Step 2: Launch API servers on the master node

LMDEPLOY_DP_MASTER_ADDR=172.16.4.52 \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --ep 16 \
    --dp 16  \
    --proxy-url http://172.16.4.52:23333 \
    --nnodes 2 \
    --node-rank 0

Step 3: Launch API servers on the slave node

LMDEPLOY_DP_MASTER_ADDR=172.16.4.52 \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --ep 16 \
    --dp 16  \
    --proxy-url http://172.16.4.52:23333 \
    --nnodes 2 \
    --node-rank 1
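
The two launch commands differ only in --node-rank, so they can be written once with a per-node rank variable. A minimal sketch, assuming a placeholder shell variable NODE_RANK (set to 0 on the master node and 1 on the slave node; it is not read by lmdeploy itself):

# NODE_RANK is a hypothetical per-node variable: export NODE_RANK=0 on the master node, 1 on the slave node
LMDEPLOY_DP_MASTER_ADDR=172.16.4.52 \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --ep 16 \
    --dp 16 \
    --proxy-url http://172.16.4.52:23333 \
    --nnodes 2 \
    --node-rank "${NODE_RANK}"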

Step 4: Query the proxy server

curl http://172.16.4.52:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Hello! How are you?"}]
  }'
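
The same endpoint can also be queried with streaming output; a sketch of the streaming variant of the request above, assuming the servers behind the proxy honor the OpenAI-style "stream" field:

curl http://172.16.4.52:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "stream": true
  }'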

Modification

Add support for launching multiple API server instances per node when dp > 1. The DP ranks rendezvous through the LMDEPLOY_DP_MASTER_ADDR / LMDEPLOY_DP_MASTER_PORT environment variables, and each server registers itself with the proxy server given by --proxy-url, which then dispatches incoming requests.

BC-breaking (Optional)

None

Use cases (Optional)

Serving large MoE models such as DeepSeek-V3 with data/expert parallelism spread over multiple nodes, exposed behind a single proxy endpoint as shown in the Usage section above.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@CUHKSZzxy (Collaborator) commented:

Failed with tp=1, dp=32, ep=32. Error info:

2025-04-11 04:15:17,172 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=0, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,172 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:17,173 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=3, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,173 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:17,219 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=4, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,219 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:17,239 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=2, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,239 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:17,251 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=1, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,251 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:17,290 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=6, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,290 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:17,290 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=5, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,290 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:17,295 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=7, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-11 04:15:17,296 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-04-11 04:15:18,445 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:18,461 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:19,281 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:19,284 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:19,327 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:19,341 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:19,350 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:19,367 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='deepseek-v3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-04-11 04:15:19,476 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:19,476 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
2025-04-11 04:15:19,477 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:19,477 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
2025-04-11 04:15:20,287 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:20,287 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
2025-04-11 04:15:20,331 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:20,331 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
2025-04-11 04:15:20,386 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:20,386 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
2025-04-11 04:15:20,467 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:20,468 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
2025-04-11 04:15:20,470 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:20,470 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
2025-04-11 04:15:20,483 - lmdeploy - ERROR - base.py:53 - RuntimeError: No CUDA GPUs are available
2025-04-11 04:15:20,483 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
HINT:    Please open http://10.130.8.145:8000 in a browser for detailed api usage!!!

I have checked CUDA as follows, and nvidia-smi works well (screenshot: cuda_check).
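
As an additional sanity check, it may help to confirm that the GPUs are visible from the exact environment (shell/container) that launches the api_server, not only from an interactive session; a minimal sketch:

# run these in the same environment that runs `lmdeploy serve api_server`
nvidia-smi -L                                                      # GPUs seen by the driver
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"       # an empty string hides all GPUs
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"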

Steps to reproduce:

Proxy server

lmdeploy serve proxy --server-name xxx --server-port 8000 --strategy "min_expected_latency" --log-level INFO

API servers

# node0
LMDEPLOY_DP_MASTER_ADDR=xxx \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --tp 1 \
    --dp 32 \
    --ep 32 \
    --proxy-url http://xxx:8000 \
    --nnodes 4 \
    --node-rank 0 --log-level INFO

# node1
LMDEPLOY_DP_MASTER_ADDR=xxx \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --tp 1 \
    --dp 32 \
    --ep 32 \
    --proxy-url http://xxx:8000 \
    --nnodes 4 \
    --node-rank 1 --log-level INFO

# node2
LMDEPLOY_DP_MASTER_ADDR=xxx \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --tp 1 \
    --dp 32 \
    --ep 32 \
    --proxy-url http://xxx:8000 \
    --nnodes 4 \
    --node-rank 2 --log-level INFO

# node3
LMDEPLOY_DP_MASTER_ADDR=xxx \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --tp 1 \
    --dp 32 \
    --ep 32 \
    --proxy-url http://xxx:8000 \
    --nnodes 4 \
    --node-rank 3 --log-level INFO

@RunningLeon (Collaborator, Author) commented:

> Failed with tp=1, dp=32, ep=32. Error info: …

@CUHKSZzxy Can you try tp=32, as it is for all nodes?
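
For concreteness, a sketch of the node-0 launch with that suggestion applied; only --tp changes relative to the reproduce steps above (the other nodes would change --node-rank accordingly):

# node0, with the suggested tp value
LMDEPLOY_DP_MASTER_ADDR=xxx \
LMDEPLOY_DP_MASTER_PORT=29555 \
lmdeploy serve api_server \
    deepseek-ai/DeepSeek-V3 \
    --backend pytorch \
    --tp 32 \
    --dp 32 \
    --ep 32 \
    --proxy-url http://xxx:8000 \
    --nnodes 4 \
    --node-rank 0 --log-level INFO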

@RunningLeon changed the title from "Launch multiple api servers for dp > 1" to "[WIP]: Launch multiple api servers for dp > 1" on Apr 15, 2025
@RunningLeon force-pushed the launch-multi-servers branch from 7f47b4f to e28f8dc on April 22, 2025 12:39
@RunningLeon changed the title from "[WIP]: Launch multiple api servers for dp > 1" to "Launch multiple api servers for dp > 1" on Apr 22, 2025
@RunningLeon requested a review from CUHKSZzxy and then removed the request on April 25, 2025 07:40