RCCL

ROCm Communication Collectives Library

Note: The published documentation is available at RCCL in an organized, easy-to-read format that includes a table of contents and search functionality. The documentation source files reside in the rccl/docs folder in this repository. As with all ROCm projects, the documentation is open source. For more information, see Contribute to ROCm documentation.

Introduction

RCCL (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. There is also initial support for direct GPU-to-GPU send and receive operations. It has been optimized to achieve high bandwidth on platforms using PCIe and xGMI, as well as over networks using InfiniBand Verbs or TCP/IP sockets. RCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency. For best performance, small operations can be either batched into larger operations or aggregated through the API.

Requirements

ROCm supported GPUs
ROCm stack installed on the system (HIP runtime & HIP-Clang)

Quickstart RCCL Build

RCCL directly depends on HIP runtime plus the HIP-Clang compiler, which are part of the ROCm software stack. For ROCm installation instructions, see https://github.com/ROCm/ROCm.

The root of this repository has a helper script install.sh to build and install RCCL with a single command. It hard-codes configurations that can be specified through invoking cmake directly, but it's a great way to get started quickly and can serve as an example of how to build/install RCCL.

To build the library using the install script:

./install.sh

For more info on build options/flags when using the install script, use ./install.sh --help

./install.sh --help
RCCL build & installation helper script
 Options:
       --address-sanitizer     Build with address sanitizer enabled
       --amdgpu_targets        Only compile for specified GPU architecture(s). For multiple targets, separate by ';' (builds for all supported GPU architectures by default)
       --cmake-options         Pass additional CMake options (e.g. --cmake-options "-DFOO=BAR -DBAZ=ON")
       --debug                 Build debug library
       --debug-fast            Build debug library with lto optimization disabled (fast build times)
    -d|--dependencies          Install RCCL dependencies
       --disable-colltrace     Build without collective trace
       --disable-roctx         Build without ROCTX logging
       --disable-warp-speed    Disable WARP_SPEED kernel optimizations
       --dump-asm              Disassemble code and dump assembly with inline code
    -c|--enable-code-coverage  Enable code coverage
       --enable_backtrace      Build with custom backtrace support
       --enable-mpi-tests      Enable MPI-based tests (requires --debug and MPI installation; set MPI_PATH if not in /opt/ompi)
    -f|--fast                  Quick-build RCCL (local gpu arch only, no backtrace, and collective trace support)
       --force-reduce-pipeline Force reduce_copy sw pipeline to be used for every reduce-based collectives and datatypes
       --generate-sym-kernels  Generate symmetric memory kernels (default: OFF)
    -h|--help                  Prints this help message
    -i|--install               Install RCCL library (see --prefix argument below)
    -j|--jobs                  Specify how many parallel compilation jobs to run ($nproc by default)
       --kernel-resource-use   Dump GPU kernel resource usage (e.g., VGPRs, scratch, spill) at link stage
    -l|--local_gpu_only        Only compile for local GPU architecture
       --log-trace             Build with log trace enabled (i.e. NCCL_DEBUG=TRACE)
       --no_clean              Don't delete files if they already exist
       --npkit-enable          Compile with npkit enabled
       --openmp-test-enable    Enable OpenMP in rccl unit tests
    -p|--package_build         Build RCCL package
       --prefix                Specify custom directory to install RCCL to (default: `/opt/rocm`)
    -q|--quiet-warnings        Suppress majority of compiler warnings (not recommended)
       --rocshmem              Build with rocSHMEM support (for GDA AllToAll)
       --run_tests_all         Run all rccl unit tests (must be built already)
    -r|--run_tests_quick       Run small subset of rccl unit tests (must be built already)
       --static                Build RCCL as a static library instead of shared library
    -t|--tests_build           Build rccl unit tests, but do not run
       --time-trace            Plot the build time of RCCL (requires `ninja-build` package installed on the system)
       --verbose               Show compile commands

  Available RCCL-specific CMake options for --cmake-options:
    -DBUILD_EXT_EXAMPLES=ON               Build ext-{net,tuner,profiler} example plugins (default: OFF)
    -DDWORDX4_INTRINSICS=OFF              Disable dwordx4 intrinsics (default: ON)
    -DENABLE_COMPRESS=OFF                 Disable GPU code compression (default: ON)
    -DENABLE_IFC=ON                       Enable indirect function call (default: OFF)
    -DFAULT_INJECTION=OFF                 Disable fault injection (default: ON)
    -DPROFILE=ON                          Enable profiling (default: OFF)
    -DRCCL_ROCPROFILER_REGISTER=OFF       Disable rocprofiler-register support (default: ON)
    -DTIMETRACE=ON                        Enable time-trace during compilation (default: OFF)

  Environment variables:
    ONLY_FUNCS                 Build only specified collective functions (debug builds only).
                               Restricts GPU kernel generation to the listed collectives, significantly
                               reducing build time during development. Use '|' to separate multiple functions.
                               Example: ONLY_FUNCS="AllReduce|SendRecv" ./install.sh --debug -t
                               Available: AllReduce, Broadcast, Reduce, AllGather, ReduceScatter,
                                          AlltoAllPivot, SendRecv, AlltoAllGda, AlltoAllvGda
                               Advanced: Specify algo, protocol, redop, and type per collective.
                                 ONLY_FUNCS="AllReduce RING SIMPLE Sum f32|SendRecv"
    ROCSHMEM_INSTALL_DIR       Path to a pre-built rocSHMEM installation (skips building from source)

By default, RCCL builds for all GPU targets defined in DEFAULT_GPUS in CMakeLists.txt. To build only for specific GPU(s), which can reduce build time, pass --amdgpu_targets a semicolon-separated list of targets. Quote the list so that the shell does not interpret ; as a command separator.
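As a minimal sketch of the quoting concern, the snippet below joins architecture names with `;` and prints the resulting command instead of running it. The `join_targets` helper is hypothetical (it is not part of install.sh), and `gfx90a` and `gfx942` are only example architectures; substitute the ones present on your system.

```shell
# Hypothetical helper (not part of install.sh): join GPU architecture
# names with ';' as expected by --amdgpu_targets.
join_targets() {
  local IFS=';'
  printf '%s\n' "$*"
}

# Example architectures only; adjust for your GPUs.
targets="$(join_targets gfx90a gfx942)"

# Print the command with the list quoted, so ';' is not treated as a
# command separator when the line is pasted into a shell.
echo ./install.sh --amdgpu_targets "\"$targets\""
```

Printing the command first makes it easy to check the quoting before starting a long build.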

Manual build

To build the library using CMake:

$ git clone --recursive https://github.com/ROCm/rccl.git
$ cd rccl
$ mkdir build
$ cd build
$ cmake ..
$ make -j 16      # Or some other suitable number of parallel jobs

If you have already cloned, you can check out the remaining git submodules manually. rocSHMEM is not a submodule; to build RCCL with rocSHMEM from CMake, set ROCSHMEM_INSTALL_DIR or ROCSHMEM_SOURCE_DIR as described under rocSHMEM support below.

$ git submodule update --init --recursive --depth=1

You may substitute an installation path of your own choosing by passing CMAKE_INSTALL_PREFIX. For example:

$ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install -DCMAKE_BUILD_TYPE=Release ..

Note: Ensure that rocm-cmake is installed (for example, with apt install rocm-cmake).

To build and install the RCCL package:

Assuming you have already cloned this repository and built the library as shown in the previous section:

$ cd rccl/build
$ make package
$ sudo dpkg -i *.deb

RCCL package install requires sudo/root access because it installs under /opt/rocm/. This is an optional step as RCCL can instead be used directly by including the path containing librccl.so.
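One way to use a locally built library without installing the package is to put its directory on the dynamic loader search path. The sketch below assumes a Linux system and that librccl.so sits in the build directory; `RCCL_BUILD_DIR` is a hypothetical variable introduced here for illustration, not something RCCL reads.

```shell
# Assumption: librccl.so was built into ./build; override with the
# hypothetical RCCL_BUILD_DIR variable if your build lives elsewhere.
rccl_build_dir="${RCCL_BUILD_DIR:-$PWD/build}"

# Prepend the directory to LD_LIBRARY_PATH, keeping any existing entries.
export LD_LIBRARY_PATH="$rccl_build_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

The setting applies only to the current shell session and the processes started from it.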

Docker build

Refer to docker/README.md

Tests

There are rccl unit tests implemented with the Googletest framework in RCCL. The rccl unit tests require Googletest 1.10 or higher to build and execute properly (installed with the -d option to install.sh). To invoke the rccl unit tests, go to the build folder, then the test subfolder, and execute the appropriate rccl unit test executable(s).

rccl unit test names are now of the format:

CollectiveCall.[Type of test]

To filter rccl unit tests, set the filtering environment variables and pass the --gtest_filter command-line flag, for example:

UT_DATATYPES=ncclBfloat16 UT_REDOPS=prod ./rccl-UnitTests --gtest_filter="AllReduce.C*"

This runs only the AllReduce correctness tests, restricted to the bfloat16 datatype (ncclBfloat16) and the prod reduction operation. A list of available filtering environment variables appears at the top of every run. See "Running a Subset of the Tests" at https://google.github.io/googletest/advanced.html#running-a-subset-of-the-tests for more information on how to form more advanced filters.
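Because test names follow the CollectiveCall.[Type of test] format, a filter for several collectives can be assembled from their names. The sketch below uses a hypothetical helper, `gtest_filter_for`, and assumes that `Broadcast` is a valid test suite name in the same way `AllReduce` is. Googletest separates multiple patterns with `:`.

```shell
# Hypothetical helper: build a --gtest_filter value that selects every
# test type for each collective name given as an argument.
gtest_filter_for() {
  filter=""
  for coll in "$@"; do
    # Append "<name>.*", inserting ':' between patterns.
    filter="${filter:+$filter:}${coll}.*"
  done
  printf '%s\n' "$filter"
}

# Print (rather than run) the resulting command line.
echo ./rccl-UnitTests --gtest_filter="\"$(gtest_filter_for AllReduce Broadcast)\""
```

The environment variables described above can still be set on the same command line to narrow the run further.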

There are also other performance and error-checking tests for RCCL. These are maintained separately at https://github.com/ROCm/rccl-tests. See the rccl-tests README for more information on how to build and run those tests.

rocSHMEM support

RCCL can use rocSHMEM's GPU Direct Async (GDA) backend to accelerate the AllToAll collective on supported multi-node setups. This is the only collective that currently uses rocSHMEM GDA inside RCCL.

Please consult the rocSHMEM documentation to see which NICs and drivers are required for GDA alltoall support.

Building with rocSHMEM

Using the install script:
```
./install.sh --rocshmem
```
By default (without ROCSHMEM_INSTALL_DIR), the script creates a sparse git worktree of the mono-repo at a pinned commit and passes that rocSHMEM tree to CMake as ROCSHMEM_SOURCE_DIR, so RCCL builds rocSHMEM via CMake ExternalProject. To use an already-built rocSHMEM instead, set ROCSHMEM_INSTALL_DIR to its install prefix before running the script.

Manual CMake (without install.sh)

You need InfiniBand Verbs development libraries on the system (libibverbs; e.g. rdma-core / libibverbs-dev on Debian/Ubuntu). Then enable rocSHMEM and supply either a pre-built install prefix or a path to the rocSHMEM CMake source tree (the directory that contains rocSHMEM’s top-level CMakeLists.txt, e.g. projects/rocshmem in the rocm-systems mono-repo):
```
# Option A — link against an existing rocSHMEM installation
cmake -DENABLE_ROCSHMEM=ON -DROCSHMEM_INSTALL_DIR=/path/to/rocshmem/prefix ..

# Option B — build rocSHMEM from source as part of the RCCL build
cmake -DENABLE_ROCSHMEM=ON -DROCSHMEM_SOURCE_DIR=/path/to/rocshmem/source ..
```
If neither ROCSHMEM_INSTALL_DIR (with a successful find_package(rocshmem_static)) nor ROCSHMEM_SOURCE_DIR is set, configuration fails with an error directing you to set ROCSHMEM_SOURCE_DIR (or use install.sh --rocshmem).
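The choice between the two options can be scripted. The sketch below is a hypothetical wrapper, not part of RCCL: it reads ROCSHMEM_INSTALL_DIR and ROCSHMEM_SOURCE_DIR from the environment and prints the matching CMake arguments. Preferring the install prefix when both are set is an assumption of this sketch.

```shell
# Hypothetical helper: pick rocSHMEM-related CMake arguments from the
# environment, or fail when neither location is provided.
rocshmem_cmake_args() {
  if [ -n "${ROCSHMEM_INSTALL_DIR:-}" ]; then
    # Option A: link against an existing rocSHMEM installation.
    printf '%s\n' "-DENABLE_ROCSHMEM=ON -DROCSHMEM_INSTALL_DIR=$ROCSHMEM_INSTALL_DIR"
  elif [ -n "${ROCSHMEM_SOURCE_DIR:-}" ]; then
    # Option B: build rocSHMEM from source as part of the RCCL build.
    printf '%s\n' "-DENABLE_ROCSHMEM=ON -DROCSHMEM_SOURCE_DIR=$ROCSHMEM_SOURCE_DIR"
  else
    echo "set ROCSHMEM_INSTALL_DIR or ROCSHMEM_SOURCE_DIR" >&2
    return 1
  fi
}

# Demonstrate in a subshell with a placeholder path, printing the command
# instead of running it.
( ROCSHMEM_INSTALL_DIR=/path/to/rocshmem/prefix
  echo cmake "$(rocshmem_cmake_args)" .. )
```

Failing early in the wrapper mirrors the configuration error described above, but surfaces it before CMake is invoked.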

Runtime behavior

The following environment variables control rocSHMEM usage at runtime. Both have defaults, so setting them is optional:

RCCL_ROCSHMEM_ENABLE (default: 1): Set to 0 to disable rocSHMEM usage in RCCL.
RCCL_ROCSHMEM_THRESHOLD (default: 262144 bytes): Maximum AllToAll message size (in bytes) for which the GDA path is used. The GDA path is only considered when this value is ≤ 1 MiB (1048576); larger thresholds fall back to the standard AllToAll implementation.

The GDA AllToAll path is selected only when all of the following hold: rocSHMEM is enabled at build and runtime, the GPU architecture is gfx942 (e.g. MI300X), the job is multi-node with 8 GPUs per node, and the AllToAll message size is ≤ RCCL_ROCSHMEM_THRESHOLD.
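The conditions above can be written out as a single predicate. The function below is illustrative only: its name and argument order are hypothetical, it is not RCCL's internal selection code, and it treats "message size" as one byte count without distinguishing per-rank from total size.

```shell
# Illustrative predicate (hypothetical, not RCCL code).
# Args: built_with_rocshmem(0/1) runtime_enable(0/1) gpu_arch
#       node_count gpus_per_node message_bytes threshold_bytes
gda_alltoall_selected() {
  built=$1; enabled=$2; arch=$3; nodes=$4; gpus=$5; bytes=$6; threshold=$7
  [ "$built" -eq 1 ] && [ "$enabled" -eq 1 ] &&   # enabled at build and runtime
  [ "$arch" = "gfx942" ] &&                       # supported architecture
  [ "$nodes" -gt 1 ] && [ "$gpus" -eq 8 ] &&      # multi-node, 8 GPUs per node
  [ "$threshold" -le 1048576 ] &&                 # threshold itself at most 1 MiB
  [ "$bytes" -le "$threshold" ]                   # message fits the threshold
}

# Use the documented defaults when the variables are unset.
threshold="${RCCL_ROCSHMEM_THRESHOLD:-262144}"
enable="${RCCL_ROCSHMEM_ENABLE:-1}"

if gda_alltoall_selected 1 "$enable" gfx942 2 8 65536 "$threshold"; then
  echo "GDA AllToAll path expected"
else
  echo "standard AllToAll path expected"
fi
```

With the defaults, a 64 KiB AllToAll on two gfx942 nodes with 8 GPUs each satisfies every condition, while a single-node job or a threshold above 1 MiB does not.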

Library and API Documentation

Please refer to the RCCL Documentation Site for current documentation.

How to build documentation

Run the steps below to build documentation locally.

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
