Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

README.md

General information

nccl_allocator is a module that enables ncclMemAlloc1 to be used within PyTorch for faster NCCL NVLS collective communications. It is mainly based on CUDAPluggableAllocator. The context manager nccl_allocator.nccl_mem(enabled=True) is used as a switch between cudaMalloc and ncclMemAlloc (if enabled=True it will use cudaMalloc).

Example usage:

Here is a minimalistic example:

import os
import torch
import torch.distributed as dist
import apex.contrib.nccl_allocator as nccl_allocator

rank = int(os.getenv("RANK"))
local_rank = int(os.getenv("LOCAL_RANK"))
world_size = int(os.getenv("WORLD_SIZE"))

nccl_allocator.init()

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

with nccl_allocator.nccl_mem():
	a = torch.ones(1024 * 1024 * 2, device="cuda")
dist.all_reduce(a)

torch.cuda.synchronize()

Please visit apex/contrib/examples/nccl_allocator for more examples.

IMPORTANT

There are several strict requirements:

  • PyTorch must include PR #112850
  • NCCL v2.19.4 and newer
  • NCCL NVLS requires CUDA Driver 530 and newer (tested on 535)

Footnotes

  1. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html