colesbury · 2019-02-14T18:31:47Z

NVIDIA changed the CUDA allocation behavior on Pascal GPUs. The
page size increased from 1MB to 2MB and allocations larger than 1MB
are now always page-aligned. Previously, allocations larger than 1MB
were aligned to 128KB boundaries.

This interacted poorly with the caching allocator. The remaining
memory in a page could only be filled by small cudaMalloc calls, but
the caching allocator never cudaMalloc's a chunk smaller than 1MB.
This behavior could also cause a large discrepancy between the memory
usage reported by nvidia-smi and the memory usage reported by
PyTorch, because nvidia-smi counts a partially used page as "full",
while PyTorch only counts the actual memory requested.

This PR makes a few changes to the caching allocator to better support
Pascal and Volta GPUs:

 - All cudaMalloc calls are now multiples of 2MB (the page size)
 - Requests between 1MB and 10MB allocate (and split) a 20MB block to
   reduce space wasted by rounding
 - Small requests are now packed into 2MB blocks (instead of 1MB)
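
The three rules above can be sketched as a single size-selection function. This is a minimal sketch of the policy as described here, not the code from the PR: the constant and function names are hypothetical, and the handling of requests exactly at the 1MB and 10MB boundaries is an assumption.

```cpp
#include <cstddef>

constexpr std::size_t kMiB = 1024 * 1024;
constexpr std::size_t kSmallLimit = 1 * kMiB;    // requests up to this size count as "small"
constexpr std::size_t kSmallBlock = 2 * kMiB;    // small requests are packed into 2MB blocks
constexpr std::size_t kMediumLimit = 10 * kMiB;  // requests below this share a 20MB block
constexpr std::size_t kMediumBlock = 20 * kMiB;
constexpr std::size_t kPageSize = 2 * kMiB;      // larger requests round up to the page size

// Hypothetical helper: size passed to cudaMalloc for a request of `bytes` bytes.
// Every return value is a multiple of the 2MB page size.
constexpr std::size_t cuda_malloc_size(std::size_t bytes) {
  return bytes <= kSmallLimit ? kSmallBlock
       : bytes < kMediumLimit ? kMediumBlock
       : ((bytes + kPageSize - 1) / kPageSize) * kPageSize;
}

static_assert(cuda_malloc_size(512) == 2 * kMiB, "small request uses a 2MB block");
static_assert(cuda_malloc_size(5 * kMiB) == 20 * kMiB, "mid-size request uses a 20MB block");
static_assert(cuda_malloc_size(11 * kMiB) == 12 * kMiB, "large request rounds up to 2MB");
```

Splitting a shared 20MB block among several mid-size requests avoids paying up to 2MB of rounding slack on each of them individually.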

This improves Mask R-CNN memory usage by 10-20% in internal tests on
Volta GPUs. Maxwell performance seems to be largely unchanged, but
it's possible that some use cases suffer slightly.

facebook-github-bot

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Pull Request resolved: pytorch#17120
Differential Revision: D14301536
Pulled By: colesbury
fbshipit-source-id: a8282315ea8f7b8ca149b5066fdeaecd0d404edf

colesbury requested a review from gchanan February 14, 2019 18:31

fmassa mentioned this pull request Feb 14, 2019

Memory Usage is higher than other Pytorch implementation? facebookresearch/maskrcnn-benchmark#182

Open

colesbury force-pushed the allocator branch from ed8950a to 16dd76c Compare February 19, 2019 18:42

albanD reviewed Feb 21, 2019

Comment thread c10/cuda/CUDACachingAllocator.cpp Outdated

colesbury added 2 commits March 1, 2019 10:07

Update naming free list -> pool

836f588

colesbury force-pushed the allocator branch from b5734da to 836f588 Compare March 1, 2019 18:08

gchanan approved these changes Mar 1, 2019

gchanan reviewed Mar 1, 2019

Comment thread c10/cuda/CUDACachingAllocator.cpp Outdated

Use C++ static_cast

e1746d4

facebook-github-bot reviewed Mar 4, 2019

facebook-github-bot closed this in 079093a Mar 5, 2019

pytorchbot added the merged label Mar 5, 2019

soumith mentioned this pull request Mar 13, 2019

Cuda Out of Memory Error after few successful batches #15563

Closed

Improve caching allocator for Pascal and newer GPUs. #17120

colesbury wants to merge 3 commits into pytorch:master from colesbury:allocator

5 participants
