Hello NVBit team, we are trying to use NVBit to trace NCCL kernels in a single-process / multi-context scenario. We had some success in the past using an older version of NCCL together with the GROUP launch mode (we will miss cudaLaunchCooperativeKernelMultiDevice, which is gone as of CUDA 13.0). We build all_reduce_perf with the following steps, then run it with 2 GPUs (we are using 2 H100s) at 8 bytes:
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90"
cd ..
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make -j NCCL_HOME=/nccl/nccl/build/
TOOL_VERBOSE=1 LD_PRELOAD=nvbit_release/tools/instr_count/instr_count.so ./all_reduce_perf -b 8 -e 8 -f 2 -g 2
we see an unending, repeating stream of instructions scrolling by, and the run never completes. (With TOOL_VERBOSE off, we naturally see no output at all.)
However, if we try the MPI version, so that one NVBit instance is spawned per GPU:
make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ NCCL_HOME=/root/nccl/nccl/build/
mpirun -np 2 -N 2 -x TOOL_VERBOSE=1 -x CUDA_INJECTION64_PATH=nvbit_release/tools/instr_count/instr_count.so ./all_reduce_perf -b 8 -e 8 -f 2 -g 1
we do see it finish eventually, producing two sets of instruction counts (one per process, since each process has its own NVBit instance attached).
If we wanted to support single-process / multi-context workloads, are there any tips or recommendations on what to try? Perhaps we are overlooking something here. Is there any way to replicate the behavior of the old cudaLaunchCooperativeKernelMultiDevice with recent NVBit releases?