Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Implement reference counting for shared IPC CUDA tensors#16854

Closed
VitalyFedyunin wants to merge 50 commits into
pytorch:masterfrom
VitalyFedyunin:cuda_ipc_deallocation_share_counters_file
Closed

Implement reference counting for shared IPC CUDA tensors#16854
VitalyFedyunin wants to merge 50 commits into
pytorch:masterfrom
VitalyFedyunin:cuda_ipc_deallocation_share_counters_file

Conversation

@VitalyFedyunin
Copy link
Copy Markdown
Contributor

This is to fix #16141 and similar issues.

The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.

@ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress.

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented Feb 8, 2019

What was the original performance problem, in the end?

@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented Feb 8, 2019

Test failures are real:

Feb 07 23:43:10 test_cuda (__main__.TestMultiprocessing) ... terminate called after throwing an instance of 'c10::Error'
Feb 07 23:43:10   what():  unable to open shared memory object </torch_5993_2546988397> in read-write mode (THMapAllocator at /var/lib/jenkins/workspace/aten/src/TH/THAllocator.cpp:234)
Feb 07 23:43:10 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7fbcfc569e2a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Feb 07 23:43:10 frame #1: THMapAllocator::THMapAllocator(WithFd, char const*, int, int, unsigned long) + 0x64e (0x7fbcfd22e78e in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #2: THRefcountedMapAllocator::THRefcountedMapAllocator(char const*, int, unsigned long) + 0x2f (0x7fbcfd22f55f in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #3: THRefcountedMapAllocator::makeDataPtr(char const*, int, unsigned long, unsigned long*) + 0x4b (0x7fbcfd22f5eb in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #4: <unknown function> + 0x4d3b96 (0x7fbd08bd0b96 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Feb 07 23:43:10 frame #5: THCIpcDeleter::~THCIpcDeleter() + 0xa5 (0x7fbd00ab9085 in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
Feb 07 23:43:10 frame #6: deleteTHCIpcDeleter(void*) + 0xe (0x7fbd00ab90ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
Feb 07 23:43:10 frame #7: c10::TensorImpl::release_resources() + 0x61 (0x7fbcfc55ebc1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Feb 07 23:43:10 frame #8: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fbcfbc104be in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented Feb 8, 2019

I still don't see a Note explaining the general strategy in the PR ;) For example, the most critical information to add is under what circumstances collect gets run, since that affects the overall performance of this scheme. I shouldn't have to read the code to figure it out!

EDIT: Sorry, I didn't see your note about documentation being in progress :)

Comment thread c10/core/StorageImpl.h
Comment thread torch/csrc/CudaIPCTypes.h Outdated
CudaIPCSentData(std::string handle, int64_t offset, int64_t* counter_ptr)
: handle(handle), offset(offset), counter_ptr(counter_ptr){};
~CudaIPCSentData();
int64_t get();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely appreciate a doc here

Comment thread torch/csrc/CudaIPCTypes.h Outdated
Comment thread torch/csrc/CudaIPCTypes.h Outdated
Comment thread torch/csrc/CudaIPCTypes.h Outdated
@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented Feb 8, 2019

Needs tests

Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin VitalyFedyunin force-pushed the cuda_ipc_deallocation_share_counters_file branch from 3ba803c to e146141 Compare February 14, 2019 18:46
@ezyang ezyang changed the title Implement reference counting for shared ICP CUDA tensors Implement reference counting for shared IPC CUDA tensors Feb 14, 2019
Comment thread c10/core/StorageImpl.h Outdated
Comment thread c10/cuda/CUDACachingAllocator.cpp Outdated
Comment thread test/test_multiprocessing.py Outdated
Comment thread test/test_multiprocessing.py Outdated
Comment thread test/test_multiprocessing.py Outdated
Comment thread test/test_multiprocessing.py Outdated
Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this shouldn't go in the torch/csrc/cuda folder. I'm not too familiar with how the build works here, but it seems worth looking into, or maybe asking @zdevito about

Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message can be even better if it offers some information about what this means, and advice about how to remediate the situation. A link to more detailed docs is often good enough.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost assuredly failing lint

Comment thread torch/csrc/CudaIPCTypes.h
Comment thread torch/csrc/CudaIPCTypes.cpp Outdated
Comment thread torch/cuda/__init__.py Outdated
@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented Mar 21, 2019

OK, finished reviewing the new stuff. Note that you want to make the dev docs discoverable. The best way to do it is to cite them from the relevant code, so that when people are reading the code they know where to go to get the info. WE use the convention Note [Blah blah] and See Note [Blah blah] to do these cross references.

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Copy link
Copy Markdown
Contributor Author

@pytorchbot retest this please

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Copy link
Copy Markdown
Contributor Author

@pytorchbot retest this please

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@VitalyFedyunin merged this pull request in 5653a91.

facebook-github-bot pushed a commit that referenced this pull request May 26, 2019
…ed (#19904)

Summary:
The mp notes are not updated after #16854. (The torch.multiprocessing page is.)
Pull Request resolved: #19904

Differential Revision: D15509661

Pulled By: soumith

fbshipit-source-id: 7c11e14a6c804498dda3adbf19710e63e6a564a0
@neerajprad neerajprad mentioned this pull request Jul 12, 2019
6 tasks
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
This is to fix pytorch#16141 and similar issues.

The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.

ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress.
Pull Request resolved: pytorch#16854

Differential Revision: D13994490

Pulled By: VitalyFedyunin

fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ed (pytorch#19904)

Summary:
The mp notes are not updated after pytorch#16854. (The torch.multiprocessing page is.)
Pull Request resolved: pytorch#19904

Differential Revision: D15509661

Pulled By: soumith

fbshipit-source-id: 7c11e14a6c804498dda3adbf19710e63e6a564a0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[multiprocessing] does not play well with distributions in GPU

5 participants