[c10d] use allocator trace callbacks for NCCL PG register (#112850)
Summary:
We need to register all cache segments allocated by the cache allocator, so that NCCL can apply zero-copy algorithms in collective and point-to-point operations.
How to track and register all cache segments:
- The PG attaches a register hook and a deregister hook to the cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When a SEGMENT_ALLOC entry is tracked, the register hook registers the segment with the PG's communicators on the same device. Similarly, when a SEGMENT_FREE entry is tracked, the deregister hook deregisters the segment before cudaFree.
- When a new NCCL communicator is created, it dumps the snapshot from the cache allocator to register all existing cache segments at once.
- When an NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator.
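The three rules above can be illustrated with a small, pure-Python toy model. This is only a sketch of the bookkeeping, not the c10d or NCCL implementation: `ToyAllocator`, `ToyComm`, `ToyProcessGroup`, and all of their methods are hypothetical names invented for illustration, and only the `SEGMENT_ALLOC` / `SEGMENT_FREE` action names come from the description above.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical segment record: (device index, base address, size in bytes).
Segment = Tuple[int, int, int]


class ToyAllocator:
    """Stand-in for a caching allocator that emits trace actions."""

    def __init__(self) -> None:
        self.segments: Set[Segment] = set()
        self.trackers: List[Callable[[str, Segment], None]] = []

    def attach_tracker(self, callback: Callable[[str, Segment], None]) -> None:
        self.trackers.append(callback)

    def alloc_segment(self, device: int, addr: int, size: int) -> Segment:
        seg = (device, addr, size)
        self.segments.add(seg)
        self._emit("SEGMENT_ALLOC", seg)
        return seg

    def free_segment(self, seg: Segment) -> None:
        # Callbacks run before the segment is released, mirroring
        # "deregistration before cudaFree" in the description.
        self._emit("SEGMENT_FREE", seg)
        self.segments.remove(seg)

    def snapshot(self) -> List[Segment]:
        return sorted(self.segments)

    def _emit(self, action: str, seg: Segment) -> None:
        for callback in self.trackers:
            callback(action, seg)


class ToyComm:
    """Stand-in for a communicator bound to one device."""

    def __init__(self, device: int) -> None:
        self.device = device
        self.registered: Set[Segment] = set()


class ToyProcessGroup:
    def __init__(self, allocator: ToyAllocator) -> None:
        self.allocator = allocator
        self.comms: List[ToyComm] = []
        allocator.attach_tracker(self._on_trace)

    def _on_trace(self, action: str, seg: Segment) -> None:
        # Only communicators on the segment's device are affected.
        for comm in self.comms:
            if comm.device != seg[0]:
                continue
            if action == "SEGMENT_ALLOC":
                comm.registered.add(seg)
            elif action == "SEGMENT_FREE":
                comm.registered.discard(seg)

    def create_comm(self, device: int) -> ToyComm:
        comm = ToyComm(device)
        # A new communicator registers every existing segment at once.
        for seg in self.allocator.snapshot():
            if seg[0] == device:
                comm.registered.add(seg)
        self.comms.append(comm)
        return comm

    def abort_comm(self, comm: ToyComm) -> None:
        # Aborting drops everything this communicator registered.
        comm.registered.clear()
        self.comms.remove(comm)


alloc = ToyAllocator()
early = alloc.alloc_segment(0, 0x1000, 2097152)   # exists before the comm
pg = ToyProcessGroup(alloc)
comm = pg.create_comm(0)                          # picks up `early` via snapshot
late = alloc.alloc_segment(0, 0x2000, 2097152)    # picked up via the hook
other = alloc.alloc_segment(1, 0x3000, 2097152)   # different device, ignored
alloc.free_segment(early)                         # deregistered via the hook
```

In this toy run, `comm` ends up holding only the `late` segment: `early` was registered from the snapshot and then dropped on free, and `other` lives on a different device. Calling `pg.abort_comm(comm)` afterwards would empty `comm.registered`.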
Test Plan:
```
NCCL_CTRAN_REGISTER=1 NCCL_ALLGATHER_ALGO=ctran:direct NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL pytest test/distributed/test_c10d_nccl.py -vsk test_tensor_register_hook
```
```
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.4.3, pluggy-1.3.0 -- /home/msi/conda/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/data/users/msi/git/pytorch/.hypothesis/examples'))
rootdir: /data/users/msi/git/pytorch
configfile: pytest.ini
plugins: typeguard-3.0.2, hypothesis-6.88.1
collecting ... collected 153 items / 152 deselected / 1 selected
test/distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_tensor_register_hook NCCL version 2.18.3meta-exp git-bc6c420+cuda11.8
2023-11-03T11:05:30-0700 devgpu001:3050625:3052473 [1] 1210737 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211154 NCCL INFO CTRAN-MAPPER: register buffer 0x7fc49a400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211276 NCCL INFO AllGather: opCount 0 sendbuff 0x7fc49a400000 recvbuff 0x7fc49a400200 count 8 datatype 0 op 0 root 0 comm 0x55063ff0 commHash 5314677976282377676 [nranks=2] stream 0x54809370
2023-11-03T11:05:30-0700 devgpu001:3050624:3051241 [0] 4572989 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573408 NCCL INFO CTRAN-MAPPER: register buffer 0x7f0fec400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573527 NCCL INFO AllGather: opCount 0 sendbuff 0x7f0fec400000 recvbuff 0x7f0fec400200 count 8 datatype 0 op 0 root 0 comm 0x41a01590 commHash 5314677976282377676 [nranks=2] stream 0x4118d470
PASSED [13.2144s]
====================== 1 passed, 152 deselected in 18.02s ======================
```
## Facebook
Performance study with xlformer 150b model on 16 nodes:
https://docs.google.com/document/d/1YJe1yplTb4IE2TtpYiuHTCTOZHJ10OxLm4JZXX95wfE/edit?usp=sharing
Reviewed By: wconstab
Differential Revision: D50726970