NCCL - Single Process Multiple Context issue #159

@cesar-avalos3

Description

@cesar-avalos3

Hello NVBit team, we are trying to use NVBit to trace NCCL kernels in a single-process / multiple-context scenario. We had some success in the past using an older version of NCCL alongside the GROUP launch mode (we will miss cudaLaunchCooperativeKernelMultiDevice, gone with CUDA 13.0). If we follow these steps to build all_reduce_perf and run it with 2 GPUs (we are using 2 H100s) at 8 bytes:

git clone https://github.com/NVIDIA/nccl.git
make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90"
git clone https://github.com/NVIDIA/nccl-tests.git
make -j NCCL_HOME=/nccl/nccl/build/
TOOL_VERBOSE=1 LD_PRELOAD=nvbit_release/tools/instr_count/instr_count.so ./all_reduce_perf -b 8 -e 8 -f 2 -g 2

we see an unending, repeating stream of instructions scroll past, and the run never finishes. (With TOOL_VERBOSE off, we naturally see no output at all.)

However, if we try the MPI version, so that one NVBit instance is spawned per process (one per GPU),

make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ NCCL_HOME=/root/nccl/nccl/build/
mpirun -np 2 -N 2 -x TOOL_VERBOSE=1 -x CUDA_INJECTION64_PATH=nvbit_release/tools/instr_count/instr_count.so ./all_reduce_perf -b 8 -e 8 -f 2 -g 1

we do see it finish eventually, resulting in two sets of instruction counts (because there are two NVBit instances running, one attached per process).

If we wanted to support single-process / multiple-context workloads, are there any tips or recommendations on what to try? Perhaps we are overlooking something here. Is there any way to replicate the behavior of the old cudaLaunchCooperativeKernelMultiDevice with recent NVBit releases?
