Releases: ebetica/DeepEP
DeepEP wheel with comm stream + cuda expert counts
Pre-built wheel for use in the aa-api pixi environment (PyTorch 2.8.0+cu128, Python 3.12, CUDA 12.8).
New in this release
- use_default_stream_as_comm_stream: New Buffer.__init__ option to reuse the default CUDA stream for communication instead of allocating a separate one from the pool. When enabled, all stream_wait synchronization between comm and compute streams is skipped (since they're the same stream).
- num_recv_tokens_per_expert_as_cuda: New kwarg on dispatch() / intranode_dispatch() / internode_dispatch(). When True, returns num_recv_tokens_per_expert as a CUDA int32 tensor (via cudaMemcpyAsync on the comm stream) instead of a Python list[int], avoiding a CPU→Python roundtrip.
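Code that previously consumed num_recv_tokens_per_expert as a list[int] may now receive a tensor. A minimal sketch of one way to accept both forms follows; the helper name counts_to_list and the FakeTensor stand-in are hypothetical and not part of DeepEP. It relies only on the tensor exposing .tolist(), which torch.Tensor does.

```python
from typing import Any


def counts_to_list(counts: Any) -> list[int]:
    """Return per-expert receive counts as a plain list of ints.

    Accepts either the legacy Python list[int] or a tensor-like object
    exposing .tolist(). For a CUDA tensor, .tolist() copies to the host
    and blocks until the data is ready, so keep this off the hot path.
    """
    if hasattr(counts, "tolist"):
        counts = counts.tolist()
    return [int(c) for c in counts]


class FakeTensor:
    """Hypothetical stand-in for a tensor so the sketch runs without a GPU."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


legacy = counts_to_list([3, 0, 5])
from_tensor = counts_to_list(FakeTensor([3, 0, 5]))
print(legacy == from_tensor)
```

Converting back to a list gives up the benefit of the CUDA return path, so a helper like this probably only suits logging or debugging.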
Key changes from etongit/DeepEP v1.2.1
- Relocatable RPATH: Uses $ORIGIN/nvidia/nvshmem/lib instead of a hardcoded build path, so the wheel works in any environment with nvidia-nvshmem-cu12 installed via pip.
- nvshmem dependency: nvidia-nvshmem-cu12>=3.5.19 declared in pyproject.toml so pip pulls the correct version.
- NVSHMEM 3.5.19: Required for CoreWeave IB device naming (ibpX instead of mlx5_X). Set NVSHMEM_HCA_PREFIX=ibp at runtime.
- Commit 48bd800 adds a CUDA device init and sync inside of buffer.py.
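To make the RPATH and runtime setting concrete, here is a rough sketch of how $ORIGIN expands and how the HCA prefix could be set from Python. The function names and the install path are hypothetical. It assumes the variable has to be in the environment before NVSHMEM initializes, which is worth verifying against the NVSHMEM documentation.

```python
import os
import posixpath
from pathlib import PurePosixPath
from typing import MutableMapping, Optional

RPATH_ENTRY = "$ORIGIN/nvidia/nvshmem/lib"  # value quoted in the release notes


def resolve_rpath(shared_object: str, rpath: str = RPATH_ENTRY) -> PurePosixPath:
    """Expand $ORIGIN to the directory that contains the shared object."""
    origin = posixpath.dirname(shared_object)
    return PurePosixPath(posixpath.normpath(rpath.replace("$ORIGIN", origin)))


def set_hca_prefix(
    env: Optional[MutableMapping[str, str]] = None, prefix: str = "ibp"
) -> MutableMapping[str, str]:
    """Set NVSHMEM_HCA_PREFIX unless the caller already chose a value."""
    env = os.environ if env is None else env
    env.setdefault("NVSHMEM_HCA_PREFIX", prefix)
    return env


# Hypothetical install location, for illustration only.
so_path = "/env/lib/python3.12/site-packages/deep_ep_cpp.so"
print(resolve_rpath(so_path))
print(set_hca_prefix({})["NVSHMEM_HCA_PREFIX"])
```

Because $ORIGIN is resolved relative to the extension module itself, the lookup lands in the nvidia/nvshmem/lib directory next to it in site-packages, wherever the environment lives.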
Build environment
Built on a compute node with GPU access using the aa-api pixi env's Python and PyTorch:
srun --nodes=1 --ntasks=1 --gres=gpu:1 bash -c '
cd /mnt/main0/home/zlin/code/DeepEP
export PIXI_ENV=/mnt/main0/home/zlin/code/evos/aa-api/.pixi/envs/default
unset CXX CC CFLAGS CXXFLAGS CPPFLAGS LDFLAGS
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:/usr/local/bin:/usr/bin:$PIXI_ENV/bin:$PATH
rm -rf dist/ build/ *.egg-info
python setup.py bdist_wheel
'
Key requirements:
- Must build on a GPU node (needs libcuda.so)
- Must unset CXX/CC/FLAGS (pixi conda compilers can't handle CUDA)
- Must use the target env's Python/torch to match ABI (torch 2.8.0 ≠ torch 2.10.0)
- Python 3.12, PyTorch 2.8.0+cu128, CUDA 12.8
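A quick pre-install check against these requirements might look like the sketch below. The function names are hypothetical, and the expected torch version string is taken from the notes above. At runtime the torch string would come from torch.__version__ and the interpreter version from sys.version_info.

```python
import sys


def parse_wheel_tags(filename: str) -> dict[str, str]:
    """Split a wheel filename into name, version, and compatibility tags."""
    parts = filename.removesuffix(".whl").split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def check_env(filename, python_version, torch_version,
              expected_torch="2.8.0+cu128") -> list[str]:
    """Return a list of mismatches; an empty list means the env looks OK."""
    problems = []
    tags = parse_wheel_tags(filename)
    interpreter = f"cp{python_version[0]}{python_version[1]}"
    if tags["python"] != interpreter:
        problems.append(f"wheel is {tags['python']}, interpreter is {interpreter}")
    if torch_version != expected_torch:
        problems.append(f"torch is {torch_version}, wheel expects {expected_torch}")
    return problems


wheel = "deep_ep-1.2.1+536a37a-cp312-cp312-linux_x86_64.whl"
# The torch version is passed in by hand here so the sketch runs without torch.
print(check_env(wheel, sys.version_info[:2], "2.8.0+cu128"))
```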
Usage in pixi.toml
deep-ep = { url = "https://github.com/ebetica/DeepEP/releases/download/v1.2.1-fix2/deep_ep-1.2.1+536a37a-cp312-cp312-linux_x86_64.whl" }
nvidia-nvshmem-cu12 = "==3.5.19"
libnvshmem3 = "==3.5.19"
DeepEP wheel with comm stream + cuda expert counts
Pre-built wheel for use in the aa-api pixi environment (PyTorch 2.8.0+cu128, Python 3.12, CUDA 12.8).
New in this release
- use_default_stream_as_comm_stream: New Buffer.__init__ option to reuse the default CUDA stream for communication instead of allocating a separate one from the pool. When enabled, all stream_wait synchronization between comm and compute streams is skipped (since they're the same stream).
- num_recv_tokens_per_expert_as_cuda: New kwarg on dispatch() / intranode_dispatch() / internode_dispatch(). When True, returns num_recv_tokens_per_expert as a CUDA int32 tensor (via cudaMemcpyAsync on the comm stream) instead of a Python list[int], avoiding a CPU→Python roundtrip.
Key changes from etongit/DeepEP v1.2.1
- Relocatable RPATH: Uses $ORIGIN/nvidia/nvshmem/lib instead of a hardcoded build path, so the wheel works in any environment with nvidia-nvshmem-cu12 installed via pip.
- nvshmem dependency: nvidia-nvshmem-cu12>=3.5.19 declared in pyproject.toml so pip pulls the correct version.
- NVSHMEM 3.5.19: Required for CoreWeave IB device naming (ibpX instead of mlx5_X). Set NVSHMEM_HCA_PREFIX=ibp at runtime.
- Commit 48bd800 adds a CUDA device init and sync inside of buffer.py.
Build environment
Built on a compute node with GPU access using the aa-api pixi env's Python and PyTorch:
srun --nodes=1 --ntasks=1 --gres=gpu:1 bash -c '
cd /mnt/main0/home/zlin/code/DeepEP
export PIXI_ENV=/mnt/main0/home/zlin/code/evos/aa-api/.pixi/envs/default
unset CXX CC CFLAGS CXXFLAGS CPPFLAGS LDFLAGS
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:/usr/local/bin:/usr/bin:$PIXI_ENV/bin:$PATH
rm -rf dist/ build/ *.egg-info
python setup.py bdist_wheel
'
Key requirements:
- Must build on a GPU node (needs libcuda.so)
- Must unset CXX/CC/FLAGS (pixi conda compilers can't handle CUDA)
- Must use the target env's Python/torch to match ABI (torch 2.8.0 ≠ torch 2.10.0)
- Python 3.12, PyTorch 2.8.0+cu128, CUDA 12.8
Usage in pixi.toml
deep-ep = { url = "https://github.com/ebetica/DeepEP/releases/download/v1.2.2/deep_ep-1.2.1+ab2bb8b-cp312-cp312-linux_x86_64.whl" }
nvidia-nvshmem-cu12 = "==3.5.19"
libnvshmem3 = "==3.5.19"
DeepEP wheel with relocatable RPATH
Pre-built wheel for use in the aa-api pixi environment (PyTorch 2.8.0+cu128, Python 3.12, CUDA 12.8).
Key changes from etongit/DeepEP v1.2.1:
- Relocatable RPATH: Uses $ORIGIN/nvidia/nvshmem/lib instead of a hardcoded build path, so the wheel works in any environment with nvidia-nvshmem-cu12 installed via pip.
- nvshmem dependency: nvidia-nvshmem-cu12>=3.5.19 declared in pyproject.toml so pip pulls the correct version.
- NVSHMEM 3.5.19: Required for CoreWeave IB device naming (ibpX instead of mlx5_X). Set NVSHMEM_HCA_PREFIX=ibp at runtime.
- Commit 48bd800 adds a CUDA device init and sync inside of buffer.py.
Build environment
Built on a compute node with GPU access using the aa-api pixi env's Python and PyTorch:
srun --nodes=1 --ntasks=1 --gres=gpu:1 bash -c '
cd /mnt/main0/home/zlin/code/DeepEP
export PIXI_ENV=/mnt/main0/home/zlin/code/evos/aa-api/.pixi/envs/default
unset CXX CC CFLAGS CXXFLAGS CPPFLAGS LDFLAGS
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:/usr/local/bin:/usr/bin:$PIXI_ENV/bin:$PATH
rm -rf dist/ build/ *.egg-info
python setup.py bdist_wheel
'
Key requirements:
- Must build on a GPU node (needs libcuda.so)
- Must unset CXX/CC/FLAGS (pixi conda compilers can't handle CUDA)
- Must use the target env's Python/torch to match ABI (torch 2.8.0 ≠ torch 2.10.0)
- Python 3.12, PyTorch 2.8.0+cu128, CUDA 12.8
Usage in pixi.toml
deep-ep = { url = "https://github.com/ebetica/DeepEP/releases/download/v1.2.1-fix/deep_ep-1.2.1+bdd0d6f-cp312-cp312-linux_x86_64.whl" }
nvidia-nvshmem-cu12 = "==3.5.19"
libnvshmem3 = "==3.5.19"