📄 Paper | 📚 Documentation | Discord | WeChat Group | 🇨🇳 中文
VCCL is a collective communication library for GPUs. It provides communication primitives such as all-reduce, all-gather, reduce, broadcast, reduce-scatter, and general send/recv. It is compatible with PCIe, NVLink, and NVSwitch, and supports cross-node communication via InfiniBand Verbs or TCP/IP sockets. It can be used in single-node/multi-node, multi-process (e.g., MPI), or single-process applications.
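Since VCCL is developed on top of NCCL 2.26.6, the standard NCCL API should apply unchanged; the sketch below shows a minimal single-process all-reduce across all visible GPUs (standard NCCL 2.x calls; error checking elided).

```c
// Minimal single-process all-reduce sketch using the NCCL-compatible API
// that VCCL inherits (standard NCCL 2.x calls; error checking elided).
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_DEV 8

int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);
  if (nDev > MAX_DEV) nDev = MAX_DEV;

  ncclComm_t comms[MAX_DEV];
  cudaStream_t streams[MAX_DEV];
  float *sendbuf[MAX_DEV], *recvbuf[MAX_DEV];
  const size_t count = 1 << 20;  // elements per GPU

  // One device buffer pair and one stream per GPU.
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Create one communicator per visible GPU in a single call.
  ncclCommInitAll(comms, nDev, NULL);

  // Group the per-GPU calls so they execute as one collective.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  printf("all-reduce done on %d GPU(s)\n", nDev);
  return 0;
}
```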
VCCL redefines the GPU communication experience with three core capabilities: High Efficiency, High Availability, and High Visibility.
**High Efficiency**
Inspired by the DPDK design philosophy, VCCL introduces a “DPDK-Like P2P” high-performance scheduling mechanism, ensuring that GPUs remain fully utilized.
In the early days of high-speed networking on CPUs, achieving 10Gbps network performance was nearly impossible due to kernel stack overhead (multiple memory copies, interrupt handling inefficiencies). DPDK solved this by leveraging hugepage memory + zero-copy and moving the data path from kernel space to user space.
Similarly, current CUDA still faces limitations in communication/computation scheduling and API granularity (public sources note that ~20 of the 132 SMs on the H800 GPU are reserved for communication). VCCL adopts an analogous optimization strategy: offloading communication tasks from the GPU CUDA stack to the CPU side, combined with zero-copy and global load balancing across pipeline-parallel (PP) workflows.
In training dense models with hundreds of billions of parameters, our internal benchmarks show that cluster-wide training compute efficiency improves by ~2% over state-of-the-art baselines (more about using zero-copy for training). Note: the SM-Free mode currently does not support fault tolerance or telemetry; this is planned as future work.
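At the API level the pattern is unchanged: communication is still enqueued on CUDA streams, so a pipeline stage can overlap its neighbor exchange with compute issued on another stream. Below is a hedged sketch of that pattern; `pipeline_exchange`, `peer_prev`, and `peer_next` are illustrative names, not VCCL APIs.

```c
// Illustrative pipeline-parallel neighbor exchange. ncclSend/ncclRecv are
// standard NCCL point-to-point calls; the surrounding names are placeholders.
#include <nccl.h>
#include <cuda_runtime.h>

void pipeline_exchange(ncclComm_t comm, cudaStream_t comm_stream,
                       const float *send_acts, float *recv_acts,
                       size_t count, int peer_prev, int peer_next) {
  // Grouping the send and recv lets both progress without deadlock.
  ncclGroupStart();
  ncclSend(send_acts, count, ncclFloat, peer_next, comm, comm_stream);
  ncclRecv(recv_acts, count, ncclFloat, peer_prev, comm, comm_stream);
  ncclGroupEnd();
  // Kernels enqueued on a different stream overlap with this transfer;
  // with CPU-side scheduling the transfer itself should not occupy SMs.
}
```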
**High Availability**
Provides a lightweight local-recovery fault-tolerance mechanism that handles NIC failures and switch faults without significantly increasing system overhead. Concretely, when a link failure occurs, VCCL migrates the traffic within one iteration by creating a backup QP. Once link integrity is re-established, VCCL seamlessly restores traffic to the primary QP. In practice, this reduces overall training interruption rates by over 50% (more about fault tolerance).
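The mechanism can be pictured with a rough sketch; the `vccl_conn` structure and callbacks below are hypothetical illustrations of the idea, not VCCL's internal API. Each connection pre-creates a backup QP, a failed work completion triggers migration to it, and traffic falls back once the primary link recovers.

```c
// Hypothetical failover sketch (not VCCL's actual internals): a connection
// keeps a pre-created backup queue pair and repoints traffic on a CQ error.
#include <infiniband/verbs.h>
#include <stdbool.h>

struct vccl_conn {                 /* hypothetical connection state */
  struct ibv_qp *primary_qp;
  struct ibv_qp *backup_qp;        /* created up front on another path */
  struct ibv_qp *active_qp;
  bool primary_healthy;
};

/* Invoked when polling the CQ returns a failed work completion. */
static void on_completion_error(struct vccl_conn *c, struct ibv_wc *wc) {
  if (wc->status != IBV_WC_SUCCESS && c->active_qp == c->primary_qp) {
    c->primary_healthy = false;
    c->active_qp = c->backup_qp;   /* migrate traffic within one iteration */
    /* ...re-post the in-flight work requests on the backup QP... */
  }
}

/* Invoked when the primary link is detected healthy again. */
static void on_link_restored(struct vccl_conn *c) {
  c->primary_healthy = true;
  c->active_qp = c->primary_qp;    /* seamlessly fall back to the primary */
}
```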
**High Visibility**
Offers microsecond-level sliding-window flow telemetry, enabling efficient bottleneck localization and congestion detection for performance tuning (more about flow telemetry).
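As a rough illustration of what sliding-window flow telemetry means, the sketch below attributes bytes to fixed microsecond windows in a ring of counters; it is a concept sketch only, not VCCL's telemetry implementation.

```c
// Illustrative microsecond-granularity sliding-window byte counter.
// A concept sketch only; not VCCL's actual telemetry implementation.
#include <stdint.h>
#include <time.h>

#define WINDOWS   1024   /* ring of per-window counters */
#define WINDOW_US 100    /* 100 microseconds per window */

struct flow_stats {
  uint64_t bytes[WINDOWS];
  uint64_t base_us;      /* timestamp corresponding to window 0 */
};

static uint64_t now_us(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000ull + (uint64_t)ts.tv_nsec / 1000;
}

/* Attribute n bytes of traffic to the current time window. */
static void record_bytes(struct flow_stats *s, uint64_t n) {
  uint64_t w = (now_us() - s->base_us) / WINDOW_US;
  s->bytes[w % WINDOWS] += n;  /* a real impl would reset stale slots */
}
```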
For more information about VCCL and how to use it, please refer to the VCCL documentation.
Note: Currently, only source builds are supported.
```shell
$ git clone https://github.com/sii-research/VCCL.git
$ cd VCCL
$ make -j src.build
```

If CUDA is not installed under /usr/local/cuda:

```shell
$ make src.build CUDA_HOME=<path to cuda install>
```

Build artifacts are placed in the build/ directory (can be customized via BUILDDIR).
By default, VCCL compiles for all supported architectures. To speed up builds and reduce binary size, redefine NVCC_GENCODE (in makefiles/common.mk) to include only your target architecture(s):
```shell
# Example: build only for Hopper architecture (H100/H200, sm_90)
$ make -j80 src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90"
```

To install VCCL on your system, build a package and install it as root:
```shell
# Debian/Ubuntu
sudo apt install -y build-essential devscripts debhelper fakeroot
make pkg.debian.build
ls build/pkg/deb/

# RedHat/CentOS
sudo yum install -y rpm-build rpmdevtools
make pkg.redhat.build
ls build/pkg/rpm/

# OS-agnostic tarball
make pkg.txz.build
ls build/pkg/txz/
```

Tests for VCCL are maintained separately at NVIDIA NCCL Tests.
```shell
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
```
- This project is developed based on nccl_2.26.6-1 and retains upstream copyright and license information in the relevant files.
- See the LICENSE file for detailed terms.
- Thanks to the open-source community (including but not limited to NCCL and nccl-tests) for their outstanding work.