Hello NVBit team, we are trying to use NVBit to trace NCCL kernels in a single-process / multi-context scenario. We had some success in the past using an older version of NCCL together with the GROUP launch mode (we will miss cudaLaunchCooperativeKernelMultiDevice, which is gone as of CUDA 13.0). We build all_reduce_perf with the following steps, then run it with 2 GPUs (we are using 2 H100s) at 8 bytes:
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90"
cd ..
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make -j NCCL_HOME=/nccl/nccl/build/
TOOL_VERBOSE=1 LD_PRELOAD=nvbit_release/tools/instr_count/instr_count.so ./all_reduce_perf -b 8 -e 8 -f 2 -g 2
we see an unending, repeating stream of instructions scrolling by, and the run never completes. (With TOOL_VERBOSE off, we naturally see no output at all.)
However, if we try the MPI version, so that one NVBit instance is spawned per GPU:
make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ NCCL_HOME=/root/nccl/nccl/build/
mpirun -np 2 -N 2 -x TOOL_VERBOSE=1 -x CUDA_INJECTION64_PATH=nvbit_release/tools/instr_count/instr_count.so ./all_reduce_perf -b 8 -e 8 -f 2 -g 1
we do see it finish eventually, producing two sets of instruction counts (one per process, since each process has its own NVBit instance attached).
If we wanted to support single-process / multi-context workloads, are there any tips or recommendations on what to try? Perhaps we are overlooking something here. Is there any way to replicate the behavior of the old cudaLaunchCooperativeKernelMultiDevice with recent NVBit releases?