LMDeploy Distserve #3304
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,5 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from lmdeploy.disagg.config import MigrationBackend | ||
from lmdeploy.disagg.config import EngineRole, MigrationProtocol | ||
from lmdeploy.disagg.config import EngineRole, MigrationBackend | ||
from lmdeploy.utils import get_max_batch_size | ||
|
||
from .cli import CLI | ||
|
@@ -131,18 +130,14 @@ def add_parser_api_server(): | |
default='Hybrid', | ||
choices=['Hybrid', 'Prefill', 'Decode'], | ||
help='Hybrid for Non-Disaggregated Engine;' | ||
'Prefill for Disaggregated Prefill Engine;' | ||
'Decode fro Disaggregated Decode Engine;') | ||
'Prefill for Disaggregated Prefill Engine;' | ||
'Decode for Disaggregated Decode Engine;') | ||
parser.add_argument('--migration-backend', | ||
type=str, | ||
default='DLSlime', | ||
choices=['DLSlime', 'Mooncake', 'InfiniStore'], | ||
help='kvcache migration management backend when PD disaggregation') | ||
parser.add_argument('--available-nics', | ||
type=str, | ||
nargs="+", | ||
default=None, | ||
help='available-nics') | ||
parser.add_argument('--available-nics', type=str, nargs='+', default=None, help='available-nics') | ||
# common args | ||
ArgumentHelper.backend(parser) | ||
ArgumentHelper.log_level(parser) | ||
|
@@ -260,9 +255,7 @@ def add_parser_proxy(): | |
choices=['Ethernet', 'IB'], | ||
Review comment: Is |
||
default='Ethernet', | ||
help='RDMA Link Type') | ||
parser.add_argument('--disable-gdr', | ||
action="store_true", | ||
help='with GPU Direct Memory Access') | ||
parser.add_argument('--disable-gdr', action='store_true', help='with GPU Direct Memory Access') | ||
|
||
ArgumentHelper.api_keys(parser) | ||
ArgumentHelper.ssl(parser) | ||
|
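The arguments touched in this hunk can be exercised in isolation. The following is a minimal sketch, assuming a stand-alone `argparse` parser rather than lmdeploy's real CLI helpers; `build_parser` and the NIC names `mlx5_0` / `mlx5_1` are hypothetical and only illustrate how `nargs='+'` and `choices` behave.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser; the real one is assembled inside lmdeploy's CLI code.
    parser = argparse.ArgumentParser(prog='api_server_sketch')
    parser.add_argument('--role', type=str, default='Hybrid',
                        choices=['Hybrid', 'Prefill', 'Decode'])
    parser.add_argument('--migration-backend', type=str, default='DLSlime',
                        choices=['DLSlime', 'Mooncake', 'InfiniStore'])
    # nargs='+' collects one or more values into a list; None means "not given".
    parser.add_argument('--available-nics', type=str, nargs='+', default=None)
    return parser


# Hypothetical NIC names, used only to show the resulting list.
args = build_parser().parse_args(
    ['--role', 'Prefill', '--available-nics', 'mlx5_0', 'mlx5_1'])
defaults = build_parser().parse_args([])
print(args.role, args.migration_backend, args.available_nics)
```

Leaving `--available-nics` unset yields `None`, which lets downstream code decide whether to fall back to automatic discovery.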
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,12 @@ | ||
# LMDeploy-DistServe | ||
|
||
## Key Components | ||
|
||
1. **Router Service**: Coordinates between prefill/decode engines | ||
4. **Migration Manager**: Facilitates high-performance memory sharing | ||
2. **Migration Manager**: Facilitates high-performance memory sharing | ||
|
||
## Installation | ||
|
||
``` | ||
# Inference Engine | ||
pip install lmdeploy[all] >= 0.7.0 | ||
|
@@ -14,10 +16,12 @@ pip install dlslime==0.0.1.post2 | |
``` | ||
|
||
## Quick Start | ||
|
||
### 1. Configure Endpoints | ||
|
||
First deploy your prefill and decode engines. | ||
|
||
``` shell | ||
```shell | ||
# Prefill Engine | ||
CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333 --role Prefill --tp 2 --cache-block-seq 32 | ||
# Decode Engine | ||
|
@@ -26,7 +30,7 @@ CUDA_VISIBLE_DEVICES=2,3 lmdeploy serve api_server internlm/internlm2_5-7b-chat | |
|
||
### 2. Launch Router Service | ||
|
||
``` shell | ||
```shell | ||
lmdeploy serve proxy | ||
--server-name 10.130.8.139 | ||
Review comment: We'd better not specify a real IP in the user guide.
Review comment: The default |
||
--server-port 5000 | ||
|
@@ -50,22 +54,27 @@ curl -X POST "http://localhost:5000/v1/completions" \ | |
|
||
### RDMA Connection Failed: | ||
|
||
``` bash | ||
```bash | ||
ibstatus # Verify IB device status | ||
ibv_devinfo # Check device capabilities | ||
``` | ||
|
||
### Check NVSHMEM configuration: | ||
|
||
Make sure to verify NVSHMEM installation. | ||
Review comment: Could you kindly provide the checking method or related URL links? |
||
|
||
## Fault tolerance | ||
|
||
### CacheFree Issue | ||
|
||
When the Decode Engine completes migration, it sends a FreeCache request to the Prefill Engine. However, if the connection fails or the Decode Engine encounters an exception, Cache Free may fail, leading to memory leaks. Future improvements may include: | ||
|
||
- Exception monitoring in the Proxy to automatically release unreferenced memory. | ||
- Adding a timeout mechanism to force cache release if a response is delayed. | ||
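
The timeout idea in the last bullet could look roughly like the sketch below. It is not part of this PR: `PrefillCacheSketch`, `hold`, `on_free_cache` and `release_when_done` are hypothetical names, and the bookkeeping is a plain dictionary standing in for the real cache manager. The only point is that the Prefill side waits for FreeCache with `asyncio.wait_for` and releases the blocks either way.

```python
import asyncio


class PrefillCacheSketch:
    """Hypothetical stand-in for the Prefill Engine's cache bookkeeping."""

    def __init__(self, free_timeout_s: float):
        self.free_timeout_s = free_timeout_s
        self.blocks = {}   # session_id -> cache blocks held for migration
        self._freed = {}   # session_id -> event set when FreeCache arrives

    def hold(self, session_id: int, blocks: list) -> None:
        self.blocks[session_id] = blocks
        self._freed[session_id] = asyncio.Event()

    def on_free_cache(self, session_id: int) -> None:
        # Called when the Decode Engine's FreeCache request arrives.
        event = self._freed.get(session_id)
        if event is not None:
            event.set()

    async def release_when_done(self, session_id: int) -> bool:
        """Release on FreeCache, or forcibly once the timeout expires.

        Returns True if the release was forced by the timeout.
        """
        forced = False
        try:
            await asyncio.wait_for(self._freed[session_id].wait(),
                                   self.free_timeout_s)
        except asyncio.TimeoutError:
            forced = True
        self.blocks.pop(session_id, None)
        self._freed.pop(session_id, None)
        return forced


async def demo():
    cache = PrefillCacheSketch(free_timeout_s=0.05)
    cache.hold(1, ['block-a'])
    cache.hold(2, ['block-b'])
    cache.on_free_cache(1)                        # session 1 is acknowledged
    forced_1 = await cache.release_when_done(1)
    forced_2 = await cache.release_when_done(2)   # session 2 never is
    return forced_1, forced_2, cache.blocks


result = asyncio.run(demo())
```

A forced release trades a possible premature free (if the Decode Engine is merely slow) for bounded memory use, so the timeout would need to be chosen well above the expected migration time.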
| ||
| ||
|
||
### ConnectionPool Issue | ||
|
||
Currently, if the Proxy disconnects, the connection pool must be warmed up again. A future enhancement could involve: | ||
|
||
A dedicated connection pool management server (e.g., using Raft-based tools like ETCD, as mentioned in Mooncake) to improve connection discovery and avoid repeated warmups. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,24 @@ | ||
from typing import Dict | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from lmdeploy.logger import get_logger | ||
|
||
logger = get_logger("lmdeploy") | ||
|
||
logger = get_logger('lmdeploy') | ||
|
||
try: | ||
logger.debug("Registering DLSlime Backend") | ||
logger.debug('Registering DLSlime Backend') | ||
from .dlslime import DLSlimeBackend | ||
except ImportError as e: | ||
logger.debug("Disable DLSlime Backend") | ||
except ImportError: | ||
logger.warning('Disable DLSlime Backend') | ||
|
||
try: | ||
logger.debug("Registering Mooncake Backend") | ||
logger.debug('Registering Mooncake Backend') | ||
from .mooncake import MooncakeBackend | ||
except ImportError as e: | ||
logger.debug("Disable Mooncake Backend") | ||
|
||
except ImportError: | ||
logger.warning('Disable Mooncake Backend') | ||
|
||
try: | ||
logger.debug("Registering InfiniStoreBackend Backend") | ||
logger.debug('Registering InfiniStoreBackend Backend') | ||
from .infinistore import InfiniStoreBackend | ||
except ImportError as e: | ||
logger.debug("Disable InfiniStoreBackend Backend") | ||
except ImportError: | ||
logger.warning('Disable InfiniStoreBackend Backend') | ||
|
||
__all__ = [DLSlimeBackend, MooncakeBackend, InfiniStoreBackend] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,4 @@ | ||
from lmdeploy.disagg.config import MigrationBackend | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from mmengine.registry import Registry | ||
|
||
|
||
MIGRATION_BACKENDS = {} | ||
|
||
|
||
def register_migration_backend(backend_name: MigrationBackend): | ||
def register(cls): | ||
MIGRATION_BACKENDS[backend_name] = cls | ||
return cls | ||
|
||
return register | ||
MIGRATION_BACKENDS = Registry('migration_backend', locations=['lmdeploy.disagg.backend.backend']) |
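
For readers unfamiliar with the pattern, both the removed `register_migration_backend` decorator and the new registry map a backend name to a class at import time. The sketch below shows that idea with a tiny stand-in class; `MiniRegistry` and `DLSlimeBackendSketch` are hypothetical, the enum values are assumed, and this is not mmengine's `Registry` implementation.

```python
from enum import Enum, auto


class MigrationBackend(Enum):
    # Mirrors the member names used in the diff; the values are assumed.
    DLSlime = auto()
    Mooncake = auto()
    InfiniStore = auto()


class MiniRegistry:
    """Tiny name -> class registry (a stand-in, not mmengine's Registry)."""

    def __init__(self, name: str):
        self.name = name
        self._modules = {}

    def register_module(self, name: str):
        def decorator(cls):
            if name in self._modules:
                raise KeyError(f'{name} is already registered in {self.name}')
            self._modules[name] = cls
            return cls
        return decorator

    def get(self, name: str):
        return self._modules.get(name)


MIGRATION_BACKENDS = MiniRegistry('migration_backend')


@MIGRATION_BACKENDS.register_module(MigrationBackend.DLSlime.name)
class DLSlimeBackendSketch:
    pass


backend_cls = MIGRATION_BACKENDS.get('DLSlime')
missing_cls = MIGRATION_BACKENDS.get('Mooncake')
```

Registering under the enum member's `.name` means lookups use plain strings such as `'DLSlime'`, which matches the values accepted by `--migration-backend`.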
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,32 +1,21 @@ | ||
from typing import Dict | ||
|
||
# Copyright (c) OpenMMLab. All rights reserved. | ||
import asyncio | ||
|
||
from lmdeploy.logger import get_logger | ||
|
||
from lmdeploy.disagg.request import DistServeConnectionRequest | ||
from lmdeploy.disagg.messages import ( | ||
DistServeRegisterMRMessage, | ||
MigrationAssignment | ||
) | ||
|
||
from lmdeploy.disagg.backend.base import MigrationBackendImpl | ||
from lmdeploy.disagg.backend.backend import register_migration_backend | ||
|
||
from lmdeploy.disagg.config import ( | ||
DistServeEngineConfig, | ||
MigrationBackend, | ||
MigrationProtocol | ||
) | ||
from lmdeploy.disagg.request import DistServeInitRequest | ||
from typing import Dict | ||
|
||
from dlslime import RDMAEndpoint, available_nic | ||
|
||
from lmdeploy.disagg.backend.backend import MIGRATION_BACKENDS | ||
from lmdeploy.disagg.backend.base import MigrationBackendImpl | ||
from lmdeploy.disagg.config import DistServeEngineConfig, MigrationBackend, MigrationProtocol | ||
from lmdeploy.disagg.messages import DistServeRegisterMRMessage, MigrationAssignment | ||
from lmdeploy.disagg.request import DistServeConnectionRequest, DistServeInitRequest | ||
from lmdeploy.logger import get_logger | ||
|
||
logger = get_logger("lmdeploy") | ||
logger = get_logger('lmdeploy') | ||
|
||
|
||
class DLSlimeMigrationManagement: | ||
|
||
def __init__(self, init_request: DistServeInitRequest): | ||
self.rank = init_request.rank | ||
self.local_engine_config: DistServeEngineConfig = init_request.local_engine_config | ||
|
@@ -39,47 +28,43 @@ def __init__(self, init_request: DistServeInitRequest): | |
if init_request.rdma_config: | ||
nics = self.local_engine_config.available_nics or available_nic() | ||
device_name = nics[self.rank % len(nics)] | ||
logger.info(f"use device {device_name} for kv migration") | ||
self.endpoint[MigrationProtocol.RDMA] = RDMAEndpoint( | ||
device_name=device_name, | ||
ib_port=1, | ||
link_type=init_request.rdma_config.link_type.name | ||
) | ||
logger.info(f'use device {device_name} for kv migration') | ||
self.endpoint[MigrationProtocol.RDMA] = RDMAEndpoint(device_name=device_name, | ||
ib_port=1, | ||
link_type=init_request.rdma_config.link_type.name) | ||
if init_request.nvlink_init_request: | ||
raise NotImplementedError | ||
if init_request.tcp_init_request: | ||
raise NotImplementedError | ||
|
||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
self.endpoint[register_mr_request.protocol].register_memory_region( | ||
register_mr_request.mr_key, | ||
register_mr_request.addr, | ||
register_mr_request.length | ||
) | ||
self.endpoint[register_mr_request.protocol].register_memory_region(register_mr_request.mr_key, | ||
register_mr_request.addr, | ||
register_mr_request.length) | ||
|
||
def connect_to(self, connect_request: DistServeConnectionRequest): | ||
self.endpoint[connect_request.protocol].connect_to(connect_request.remote_endpoint_info) | ||
|
||
async def p2p_migrate(self, assignment: MigrationAssignment): | ||
max_batch = 4096 + 2048 | ||
Review comment: What do the two magic numbers represent? |
||
for i in range(0, len(assignment.target_offset), max_batch): | ||
await asyncio.wait_for(self.endpoint[assignment.protocol].read_batch_async( | ||
assignment.mr_key, | ||
assignment.target_offset[i: i+max_batch], | ||
assignment.source_offset[i: i+max_batch], | ||
assignment.length | ||
), 15) | ||
await asyncio.wait_for( | ||
self.endpoint[assignment.protocol].read_batch_async(assignment.mr_key, | ||
assignment.target_offset[i:i + max_batch], | ||
assignment.source_offset[i:i + max_batch], | ||
assignment.length), 15) | ||
|
||
|
||
@register_migration_backend(MigrationBackend.DLSlime) | ||
@MIGRATION_BACKENDS.register_module(MigrationBackend.DLSlime.name) | ||
class DLSlimeBackend(MigrationBackendImpl): | ||
|
||
def __init__(self): | ||
self.links: Dict[int, DLSlimeMigrationManagement] = {} | ||
|
||
def p2p_initialize(self, init_request: DistServeInitRequest): | ||
self.links[init_request.remote_engine_id] = DLSlimeMigrationManagement(init_request) | ||
|
||
def register_memory_region(self, register_mr_request:DistServeRegisterMRMessage): | ||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
self.links[register_mr_request.remote_engine_id].register_memory_region(register_mr_request) | ||
|
||
def endpoint_info(self, remote_engine_id: int, protocol: MigrationProtocol): | ||
|
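Regarding the magic-number question on `p2p_migrate`: one option is to give the literals names and make them parameters. The sketch below assumes the names `MAX_BATCH` and `BATCH_TIMEOUT_S` (the diff does not say what `4096 + 2048` or `15` stand for) and uses a hypothetical `FakeEndpoint` that records batch sizes instead of performing RDMA reads.

```python
import asyncio

# Assumed names for the two literals; their meaning is not stated in the diff.
MAX_BATCH = 4096 + 2048
BATCH_TIMEOUT_S = 15


class FakeEndpoint:
    """Hypothetical endpoint that records batch sizes instead of reading."""

    def __init__(self):
        self.batch_sizes = []

    async def read_batch_async(self, mr_key, target_offsets, source_offsets, length):
        self.batch_sizes.append(len(target_offsets))
        await asyncio.sleep(0)


async def migrate(endpoint, mr_key, target_offsets, source_offsets, length,
                  max_batch=MAX_BATCH, timeout_s=BATCH_TIMEOUT_S):
    # Slice both offset lists with the same window so pairs stay aligned.
    for i in range(0, len(target_offsets), max_batch):
        await asyncio.wait_for(
            endpoint.read_batch_async(mr_key,
                                      target_offsets[i:i + max_batch],
                                      source_offsets[i:i + max_batch],
                                      length),
            timeout_s)


endpoint = FakeEndpoint()
offsets = list(range(10))
asyncio.run(migrate(endpoint, 'kv', offsets, offsets, 64, max_batch=4))
print(endpoint.batch_sizes)
```

Note that the timeout applies to each batch separately, so the total migration time is bounded by the number of batches times `timeout_s`, not by `timeout_s` alone.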
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,18 @@ | ||
from lmdeploy.disagg.messages import ( | ||
DistServeRegisterMRMessage, | ||
MigrationAssignment | ||
) | ||
|
||
from lmdeploy.disagg.backend.backend import register_migration_backend | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from lmdeploy.disagg.backend.backend import MIGRATION_BACKENDS | ||
from lmdeploy.disagg.backend.base import MigrationBackendImpl | ||
from lmdeploy.disagg.config import MigrationProtocol | ||
from lmdeploy.disagg.request import ( | ||
DistServeInitRequest, | ||
DistServeConnectionRequest | ||
) | ||
from lmdeploy.disagg.config import MigrationBackend | ||
from lmdeploy.disagg.config import MigrationBackend, MigrationProtocol | ||
from lmdeploy.disagg.messages import DistServeRegisterMRMessage, MigrationAssignment | ||
from lmdeploy.disagg.request import DistServeConnectionRequest, DistServeInitRequest | ||
|
||
|
||
@register_migration_backend(MigrationBackend.InfiniStore) | ||
@MIGRATION_BACKENDS.register_module(MigrationBackend.InfiniStore.name) | ||
class InfiniStoreBackend(MigrationBackendImpl): | ||
|
||
def p2p_initialize(self, init_request: DistServeInitRequest): | ||
Review comment: If this backend is not supported, we can remove it from the choices in the CLI arguments. |
||
raise NotImplementedError | ||
|
||
def register_memory_region(self, register_mr_request:DistServeRegisterMRMessage): | ||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
raise NotImplementedError | ||
|
||
def endpoint_info(self, remote_engine_id: int, protocol: MigrationProtocol): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,20 +1,18 @@ | ||
from lmdeploy.disagg.backend.backend import register_migration_backend | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from lmdeploy.disagg.backend.backend import MIGRATION_BACKENDS | ||
from lmdeploy.disagg.backend.base import MigrationBackendImpl | ||
from lmdeploy.disagg.config import MigrationProtocol | ||
from lmdeploy.disagg.config import MigrationBackend, MigrationProtocol | ||
from lmdeploy.disagg.messages import DistServeRegisterMRMessage, MigrationAssignment | ||
from lmdeploy.disagg.request import ( | ||
DistServeInitRequest, | ||
DistServeConnectionRequest | ||
) | ||
from lmdeploy.disagg.config import MigrationBackend | ||
from lmdeploy.disagg.request import DistServeConnectionRequest, DistServeInitRequest | ||
|
||
|
||
@register_migration_backend(MigrationBackend.Mooncake) | ||
@MIGRATION_BACKENDS.register_module(MigrationBackend.Mooncake.name) | ||
class MooncakeBackend(MigrationBackendImpl): | ||
|
||
def p2p_initialize(self, init_request: DistServeInitRequest): | ||
Review comment: If this backend is not supported, we can remove it from the choices in the CLI arguments.
Review comment: @RunningLeon Let's keep it. @JimyMa will work with the Mooncake team to support it.
Review comment: LGTM |
||
raise NotImplementedError | ||
|
||
def register_memory_region(self, register_mr_request:DistServeRegisterMRMessage): | ||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
raise NotImplementedError | ||
|
||
def endpoint_info(self, remote_engine_id: int, protocol: MigrationProtocol): | ||
|
Review comment: Can we get `available-nics` somehow instead of passing it manually?
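
On this question, the DLSlime hunk already selects the device with `self.local_engine_config.available_nics or available_nic()` followed by `nics[self.rank % len(nics)]`, so the flag appears to act as an optional override. The sketch below reproduces only that selection logic; `pick_nic`, `fake_discover` and the NIC names are hypothetical, with `fake_discover` standing in for dlslime's `available_nic()`.

```python
from typing import Callable, List, Optional


def pick_nic(rank: int, configured: Optional[List[str]],
             discover: Callable[[], List[str]]) -> str:
    # Prefer the user-supplied list; otherwise fall back to discovery.
    nics = configured or discover()
    if not nics:
        raise RuntimeError('no NIC available for kv migration')
    # Spread ranks over the NICs round-robin.
    return nics[rank % len(nics)]


def fake_discover() -> List[str]:
    # Hypothetical stand-in for dlslime's available_nic().
    return ['nic0', 'nic1']


auto_choices = [pick_nic(rank, None, fake_discover) for rank in range(4)]
manual_choices = [pick_nic(rank, ['nic7'], fake_discover) for rank in range(2)]
print(auto_choices, manual_choices)
```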