Releases: ebetica/DeepEP

DeepEP wheel with comm stream + cuda expert counts

26 Feb 16:55

Pre-built wheel for use in the aa-api pixi environment (PyTorch 2.8.0+cu128, Python 3.12, CUDA 12.8).

New in this release

  • use_default_stream_as_comm_stream: New Buffer.__init__ option to reuse the default CUDA stream for communication instead of allocating a separate one from the pool. When enabled, all stream_wait synchronization between comm and compute streams is skipped (since they're the same stream).
  • num_recv_tokens_per_expert_as_cuda: New kwarg on dispatch() / intranode_dispatch() / internode_dispatch(). When True, returns num_recv_tokens_per_expert as a CUDA int32 tensor (via cudaMemcpyAsync on the comm stream) instead of a Python list[int], avoiding a CPU→Python roundtrip.
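Because num_recv_tokens_per_expert_as_cuda changes the type of num_recv_tokens_per_expert, downstream code that assumed a list[int] may need to accept both forms. The following is a minimal sketch, assuming the CUDA form behaves like a PyTorch tensor and exposes .tolist(); expert_counts_as_list is a hypothetical helper and not part of DeepEP.

```python
from typing import Any, List


def expert_counts_as_list(counts: Any) -> List[int]:
    """Return per-expert receive counts as a plain list of ints.

    Hypothetical helper (not part of DeepEP). Accepts either the default
    Python list or a tensor-like object that exposes .tolist().
    """
    if isinstance(counts, (list, tuple)):
        return [int(c) for c in counts]
    if hasattr(counts, "tolist"):
        # For a CUDA tensor, .tolist() copies to the host and synchronizes,
        # so keep this off the hot path (logging, assertions, tests).
        return [int(c) for c in counts.tolist()]
    raise TypeError(f"unsupported counts type: {type(counts).__name__}")
```

Calling this on the CUDA form reintroduces the host synchronization the option is meant to avoid, so it is probably best kept to debugging and tests while the hot path consumes the tensor directly.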

Key changes from etongit/DeepEP v1.2.1

  • Relocatable RPATH: Uses $ORIGIN/nvidia/nvshmem/lib instead of hardcoded build path, so the wheel works in any environment with nvidia-nvshmem-cu12 installed via pip.
  • nvshmem dependency: nvidia-nvshmem-cu12>=3.5.19 declared in pyproject.toml so pip pulls the correct version.
  • NVSHMEM 3.5.19: Required for CoreWeave IB device naming (ibpX instead of mlx5_X). Set NVSHMEM_HCA_PREFIX=ibp at runtime.
  • Commit 48bd800 adds a CUDA device init and sync inside buffer.py.
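To illustrate why the $ORIGIN-based RPATH is relocatable: the dynamic linker expands $ORIGIN to the directory containing the shared object being loaded, so the NVSHMEM search path follows the extension module wherever the environment is installed. A small sketch of that expansion follows; the extension filename and environment path in the usage note are assumptions for illustration, not taken from the wheel.

```python
from pathlib import PurePosixPath


def resolve_rpath(shared_object: str, rpath_entry: str) -> str:
    """Expand $ORIGIN in an RPATH entry the way the dynamic linker does.

    $ORIGIN stands for the directory that contains the shared object,
    so the result depends only on where the .so file is installed.
    """
    origin = str(PurePosixPath(shared_object).parent)
    expanded = rpath_entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    return str(PurePosixPath(expanded))
```

For example, with a hypothetical extension at /env/lib/python3.12/site-packages/deep_ep_cpp.so, resolve_rpath(..., "$ORIGIN/nvidia/nvshmem/lib") yields /env/lib/python3.12/site-packages/nvidia/nvshmem/lib, which is where the pip-installed nvidia-nvshmem-cu12 package would be expected to place its libraries.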

Build environment

Built on a compute node with GPU access using the aa-api pixi env's Python and PyTorch:

srun --nodes=1 --ntasks=1 --gres=gpu:1 bash -c '
cd /mnt/main0/home/zlin/code/DeepEP
export PIXI_ENV=/mnt/main0/home/zlin/code/evos/aa-api/.pixi/envs/default
unset CXX CC CFLAGS CXXFLAGS CPPFLAGS LDFLAGS
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:/usr/local/bin:/usr/bin:$PIXI_ENV/bin:$PATH
rm -rf dist/ build/ *.egg-info
python setup.py bdist_wheel
'

Key requirements:

  • Must build on a GPU node (needs libcuda.so)
  • Must unset CXX, CC, and the *FLAGS variables (CFLAGS, CXXFLAGS, CPPFLAGS, LDFLAGS); the pixi conda compilers can't handle CUDA
  • Must use the target env's Python/torch to match ABI (torch 2.8.0 ≠ torch 2.10.0)
  • Python 3.12, PyTorch 2.8.0+cu128, CUDA 12.8
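As a partial guard for the ABI requirement above, the wheel filename can be checked against the running interpreter before installing. This is a sketch under stated assumptions: parse_wheel_filename and matches_interpreter are hypothetical helpers, the parser handles only filenames without a build tag, and the filename encodes only the Python and platform tags, so the torch version still has to be verified separately.

```python
import sys
from typing import NamedTuple, Sequence


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split name-version-python-abi-platform.whl into its five fields."""
    base = filename.rsplit("/", 1)[-1]  # also accepts a full URL or path
    if not base.endswith(".whl"):
        raise ValueError(f"not a wheel: {base}")
    parts = base[: -len(".whl")].split("-")
    if len(parts) != 5:
        # Wheels with an optional build tag have six fields; not handled here.
        raise ValueError(f"unexpected wheel filename layout: {base}")
    return WheelTags(*parts)


def matches_interpreter(tags: WheelTags,
                        version_info: Sequence[int] = sys.version_info) -> bool:
    """True if the wheel's CPython tag matches the given interpreter version."""
    return tags.python_tag == f"cp{version_info[0]}{version_info[1]}"
```

Run with the target environment's Python, matches_interpreter(parse_wheel_filename(url)) gives a quick signal that a cp312 wheel is not being installed into a different interpreter.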

Usage in pixi.toml

deep-ep = { url = "https://github.com/ebetica/DeepEP/releases/download/v1.2.1-fix2/deep_ep-1.2.1+536a37a-cp312-cp312-linux_x86_64.whl" }
nvidia-nvshmem-cu12 = "==3.5.19"
libnvshmem3 = "==3.5.19"

DeepEP wheel with comm stream + cuda expert counts

17 Feb 18:57

Pre-built wheel for use in the aa-api pixi environment (PyTorch 2.8.0+cu128, Python 3.12, CUDA 12.8).

New in this release

  • use_default_stream_as_comm_stream: New Buffer.__init__ option to reuse the default CUDA stream for communication instead of allocating a separate one from the pool. When enabled, all stream_wait synchronization between comm and compute streams is skipped (since they're the same stream).
  • num_recv_tokens_per_expert_as_cuda: New kwarg on dispatch() / intranode_dispatch() / internode_dispatch(). When True, returns num_recv_tokens_per_expert as a CUDA int32 tensor (via cudaMemcpyAsync on the comm stream) instead of a Python list[int], avoiding a CPU→Python roundtrip.

Key changes from etongit/DeepEP v1.2.1

  • Relocatable RPATH: Uses $ORIGIN/nvidia/nvshmem/lib instead of hardcoded build path, so the wheel works in any environment with nvidia-nvshmem-cu12 installed via pip.
  • nvshmem dependency: nvidia-nvshmem-cu12>=3.5.19 declared in pyproject.toml so pip pulls the correct version.
  • NVSHMEM 3.5.19: Required for CoreWeave IB device naming (ibpX instead of mlx5_X). Set NVSHMEM_HCA_PREFIX=ibp at runtime.
  • Commit 48bd800 adds a CUDA device init and sync inside buffer.py.
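The NVSHMEM_HCA_PREFIX setting can also be applied from Python before NVSHMEM is initialized. The sketch below assumes the InfiniBand devices are listed under /sys/class/infiniband and that all of them share one naming scheme; detect_hca_prefix and configure_nvshmem_env are hypothetical helpers, not part of DeepEP or NVSHMEM.

```python
import os
from typing import Iterable, Optional


def detect_hca_prefix(device_names: Iterable[str]) -> Optional[str]:
    """Return "ibp" if every device uses ibpX naming, otherwise None."""
    names = list(device_names)
    if names and all(name.startswith("ibp") for name in names):
        return "ibp"
    return None  # e.g. mlx5_X naming: leave NVSHMEM's default untouched


def configure_nvshmem_env(sysfs_dir: str = "/sys/class/infiniband") -> None:
    """Set NVSHMEM_HCA_PREFIX when ibpX naming is detected.

    Call this before any code that initializes NVSHMEM. An existing
    value in the environment is left unchanged.
    """
    try:
        names = os.listdir(sysfs_dir)
    except OSError:
        names = []  # directory missing: no InfiniBand devices visible
    prefix = detect_hca_prefix(names)
    if prefix is not None:
        os.environ.setdefault("NVSHMEM_HCA_PREFIX", prefix)
```

Using setdefault keeps an explicit value exported by the job launcher authoritative, which seems the safer choice on clusters with mixed naming.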

Build environment

Built on a compute node with GPU access using the aa-api pixi env's Python and PyTorch:

srun --nodes=1 --ntasks=1 --gres=gpu:1 bash -c '
cd /mnt/main0/home/zlin/code/DeepEP
export PIXI_ENV=/mnt/main0/home/zlin/code/evos/aa-api/.pixi/envs/default
unset CXX CC CFLAGS CXXFLAGS CPPFLAGS LDFLAGS
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:/usr/local/bin:/usr/bin:$PIXI_ENV/bin:$PATH
rm -rf dist/ build/ *.egg-info
python setup.py bdist_wheel
'

Key requirements:

  • Must build on a GPU node (needs libcuda.so)
  • Must unset CXX, CC, and the *FLAGS variables (CFLAGS, CXXFLAGS, CPPFLAGS, LDFLAGS); the pixi conda compilers can't handle CUDA
  • Must use the target env's Python/torch to match ABI (torch 2.8.0 ≠ torch 2.10.0)
  • Python 3.12, PyTorch 2.8.0+cu128, CUDA 12.8

Usage in pixi.toml

deep-ep = { url = "https://github.com/ebetica/DeepEP/releases/download/v1.2.2/deep_ep-1.2.1+ab2bb8b-cp312-cp312-linux_x86_64.whl" }
nvidia-nvshmem-cu12 = "==3.5.19"
libnvshmem3 = "==3.5.19"

DeepEP wheel with relocatable RPATH

07 Feb 01:25

Pre-built wheel for use in the aa-api pixi environment (PyTorch 2.8.0+cu128, Python 3.12, CUDA 12.8).

Key changes from etongit/DeepEP v1.2.1

  • Relocatable RPATH: Uses $ORIGIN/nvidia/nvshmem/lib instead of hardcoded build path, so the wheel works in any environment with nvidia-nvshmem-cu12 installed via pip.
  • nvshmem dependency: nvidia-nvshmem-cu12>=3.5.19 declared in pyproject.toml so pip pulls the correct version.
  • NVSHMEM 3.5.19: Required for CoreWeave IB device naming (ibpX instead of mlx5_X). Set NVSHMEM_HCA_PREFIX=ibp at runtime.
  • Commit 48bd800 adds a CUDA device init and sync inside buffer.py.

Build environment

Built on a compute node with GPU access using the aa-api pixi env's Python and PyTorch:

srun --nodes=1 --ntasks=1 --gres=gpu:1 bash -c '
cd /mnt/main0/home/zlin/code/DeepEP
export PIXI_ENV=/mnt/main0/home/zlin/code/evos/aa-api/.pixi/envs/default
unset CXX CC CFLAGS CXXFLAGS CPPFLAGS LDFLAGS
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=/usr/local/cuda-12.8/bin:/usr/local/bin:/usr/bin:$PIXI_ENV/bin:$PATH
rm -rf dist/ build/ *.egg-info
python setup.py bdist_wheel
'

Key requirements:

  • Must build on a GPU node (needs libcuda.so)
  • Must unset CXX, CC, and the *FLAGS variables (CFLAGS, CXXFLAGS, CPPFLAGS, LDFLAGS); the pixi conda compilers can't handle CUDA
  • Must use the target env's Python/torch to match ABI (torch 2.8.0 ≠ torch 2.10.0)
  • Python 3.12, PyTorch 2.8.0+cu128, CUDA 12.8

Usage in pixi.toml

deep-ep = { url = "https://github.com/ebetica/DeepEP/releases/download/v1.2.1-fix/deep_ep-1.2.1+bdd0d6f-cp312-cp312-linux_x86_64.whl" }
nvidia-nvshmem-cu12 = "==3.5.19"
libnvshmem3 = "==3.5.19"