[c10d] use allocator trace callbacks for NCCL PG register#112850
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112850
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit e0270ca with merge base 9c1fb2c ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D50726970 |
f35b398 to
492b37e
Compare
|
This pull request was exported from Phabricator. Differential Revision: D50726970 |
492b37e to
9edfff5
Compare
|
This pull request was exported from Phabricator. Differential Revision: D50726970 |
…2850) Summary: We need to register all cache segments allocated by allocator, so that NCCL can apply zero copy algorithms at collective and point-to-point operations. How to track and register all cache segments: - It registers a register and a deregister hook to cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook will register to the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree. - When a new NCCL communicator is created, it dumps the snapspot from cache allocator to register all existing cache segments at once. - When a NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator Test Plan: ``` NCCL_CTRAN_REGISTER=1 NCCL_ALLGATHER_ALGO=ctran:direct NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL pytest test/distributed/test_c10d_nccl.py -vsk test_tensor_register_hook ``` ``` ============================= test session starts ============================== platform linux -- Python 3.10.13, pytest-7.4.3, pluggy-1.3.0 -- /home/msi/conda/bin/python cachedir: .pytest_cache hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/data/users/msi/git/pytorch/.hypothesis/examples')) rootdir: /data/users/msi/git/pytorch configfile: pytest.ini plugins: typeguard-3.0.2, hypothesis-6.88.1 collecting ... collected 153 items / 152 deselected / 1 selected test/distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_tensor_register_hook NCCL version 2.18.3meta-exp git-bc6c420+cuda11.8 2023-11-03T11:05:30-0700 devgpu001:3050625:3052473 [1] 1210737 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled 2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211154 NCCL INFO CTRAN-MAPPER: register buffer 0x7fc49a400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0) 2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211276 NCCL INFO AllGather: opCount 0 sendbuff 0x7fc49a400000 recvbuff 0x7fc49a400200 count 8 datatype 0 op 0 root 0 comm 0x55063ff0 commHash 5314677976282377676 [nranks=2] stream 0x54809370 2023-11-03T11:05:30-0700 devgpu001:3050624:3051241 [0] 4572989 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled 2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573408 NCCL INFO CTRAN-MAPPER: register buffer 0x7f0fec400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0) 2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573527 NCCL INFO AllGather: opCount 0 sendbuff 0x7f0fec400000 recvbuff 0x7f0fec400200 count 8 datatype 0 op 0 root 0 comm 0x41a01590 commHash 5314677976282377676 [nranks=2] stream 0x4118d470 PASSED [13.2144s] ====================== 1 passed, 152 deselected in 18.02s ====================== ``` ## Facebook Performance study with xlformer 150b model on 16 nodes: https://docs.google.com/document/d/1YJe1yplTb4IE2TtpYiuHTCTOZHJ10OxLm4JZXX95wfE/edit?usp=sharing Reviewed By: wconstab Differential Revision: D50726970
9edfff5 to
e0270ca
Compare
|
This pull request was exported from Phabricator. Differential Revision: D50726970 |
|
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…2850) Summary: We need to register all cache segments allocated by allocator, so that NCCL can apply zero copy algorithms at collective and point-to-point operations. How to track and register all cache segments: - It registers a register and a deregister hook to cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook will register to the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree. - When a new NCCL communicator is created, it dumps the snapspot from cache allocator to register all existing cache segments at once. - When a NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator Test Plan: See test in D50726971 Reviewed By: wconstab Differential Revision: D50726970 Pull Request resolved: pytorch#112850 Approved by: https://github.com/wconstab
…2850) Summary: We need to register all cache segments allocated by allocator, so that NCCL can apply zero copy algorithms at collective and point-to-point operations. How to track and register all cache segments: - It registers a register and a deregister hook to cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook will register to the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree. - When a new NCCL communicator is created, it dumps the snapspot from cache allocator to register all existing cache segments at once. - When a NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator Test Plan: See test in D50726971 Reviewed By: wconstab Differential Revision: D50726970 Pull Request resolved: pytorch#112850 Approved by: https://github.com/wconstab
Summary:
We need to register all cache segments allocated by allocator, so that NCCL can apply zero copy algorithms at collective and point-to-point operations.
How to track and register all cache segments:
Test Plan: See test in D50726971
Reviewed By: wconstab
Differential Revision: D50726970