gpu mem pool strategy #11041
Force-pushed from fd64b96 to b8b942e.
| LOG(INFO) << "Using GPUPooledRoundedStorageManager."; | ||
| } else { | ||
| if (strategy != "Naive") { | ||
| LOG(INFO) << "Unknown memory pool strategy specified: " << strategy << "."; |
Force-pushed from bcba6e2 to de2a823.
Still no clue what's going wrong with this PR. Nothing in it is specific to Windows, and weirdly python2-GPU-win passes.
| private: | ||
| void DirectFreeNoLock(Storage::Handle handle) { | ||
| cudaError_t err = cudaFree(handle.dptr); | ||
| size_t size = handle.size + NDEV; |
Are you sure + NDEV is not needed any more? What if NDEV=32, min_chunk=33, and handle.size=30? The original code would allocate 62. The new code would allocate 33.
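The numbers in this question can be checked with a small sketch. It assumes the original code sizes an allocation as handle.size + NDEV (as in the diff line above) and that the new code takes the larger of handle.size and min_chunk, which is what the quoted figures imply. OldAllocSize and NewAllocSize are hypothetical helper names, not functions from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: the original sizing, `handle.size + NDEV`.
inline std::size_t OldAllocSize(std::size_t requested, std::size_t ndev) {
  return requested + ndev;
}

// Hypothetical helper: the assumed new sizing, clamped to a minimum chunk.
inline std::size_t NewAllocSize(std::size_t requested, std::size_t min_chunk) {
  return std::max(requested, min_chunk);
}

// Reproduces the reviewer's example: NDEV=32, min_chunk=33, handle.size=30.
inline void CheckReviewExample() {
  assert(OldAllocSize(30, 32) == 62);
  assert(NewAllocSize(30, 33) == 33);
}
```

Under these assumptions the two sizings differ by 29 bytes for this request, which is the gap the reviewer is asking about.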
| */ | ||
| GPUPooledStorageManager() { | ||
| reserve_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_RESERVE", 5); | ||
| min_chunk_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_MIN_CHUNK", 4096); |
page size instead of min chunk?
| void ReleaseAll(); | ||
| // used memory | ||
| size_t used_memory_ = 0; | ||
| size_t used_memory_ = 0, min_chunk_; |
| private: | ||
| #if __SIZEOF_SIZE_T__ == __SIZEOF_LONG__ | ||
| #if defined(__clang__) || defined(__GNUC__) |
Does this need to be so complicated? You just need to take the highest bit and shift it left by 1 if it's smaller than size.
This is called finding the MSB. See https://www.google.com/search?ei=__UNW-DMG6iF0wLqyr4g&q=how+to+find+most+significant+bit+in+c&oq=take+highest+bit&gs_l=psy-ab.1.0.0i71k1l8.0.0.0.4417.0.0.0.0.0.0.0.0..0.0....0...1c..64.psy-ab..0.0.0....0.LUbIFjlZyeU
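A minimal sketch of the approach described in this comment: find the most significant set bit, then shift it left by one if it is smaller than the requested size. HighestBit and RoundUpPow2 are hypothetical names chosen for illustration. This is not the code from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Returns the value of the most significant set bit of s, i.e. the largest
// power of two that is <= s. Returns 0 when s is 0.
inline std::size_t HighestBit(std::size_t s) {
  if (s == 0) return 0;
  std::size_t bit = 1;
  while (s >>= 1) bit <<= 1;
  return bit;
}

// Rounds s up to the nearest power of two. If s is already a power of two it
// is returned unchanged. Overflows to 0 if the result does not fit in size_t.
inline std::size_t RoundUpPow2(std::size_t s) {
  const std::size_t msb = HighestBit(s);
  return (msb < s) ? (msb << 1) : msb;
}
```

The loop runs once per bit of the input, so it is at most 64 iterations on a 64-bit platform.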
These builtins use hardware instructions when available.
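For comparison, here is a sketch of how a round-up log2 could use a compiler builtin under the same __clang__ / __GNUC__ guard that appears in the diff. Log2RoundUp is a hypothetical name, and the body is an illustration rather than the PR's implementation. __builtin_clzll is the GCC/Clang count-leading-zeros builtin, and its result is undefined for an argument of 0, so the sketch only calls it with a nonzero value.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Returns ceil(log2(s)) for s >= 1, and 0 for s == 0.
inline int Log2RoundUp(unsigned long long s) {
  if (s <= 1) return 0;
#if defined(__clang__) || defined(__GNUC__)
  // Number of bits needed to represent s - 1, via count-leading-zeros.
  // s - 1 is at least 1 here, so the builtin's argument is never 0.
  const int kBits = static_cast<int>(sizeof(unsigned long long) * CHAR_BIT);
  return kBits - __builtin_clzll(s - 1);
#else
  // Portable fallback: count the shifts needed to clear s - 1.
  int bits = 0;
  for (unsigned long long v = s - 1; v != 0; v >>= 1) ++bits;
  return bits;
#endif
}
```

Both branches compute the same value. The builtin path typically compiles to a single bit-scan or leading-zero-count instruction where the target has one.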
Is it really faster? It looks too complicated.
Also, the default implementation with pow and log is really slow.
I will change the default implementation to use bit shifting followed by a comparison.
I compared my current solution, the bit shifting, and static_cast<int>(std::ceil(std::log2(s))) with -O3 turned on, on my Mac (clang). The speed looks like the following:
Running 10000000 iters.
Addr width 64
It took me 0.00981569 seconds. result: 223222785
It took me 0.128623 seconds. result: 223222785
It took me 0.0801588 seconds. result: 223222785
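A harness along these lines could produce timings of this kind. It is a sketch, not the author's benchmark: it covers only the bit-shifting and std::log2 variants, and Log2Shift, Log2Math, TimeIt, and RunComparison are hypothetical names. It assumes the "result" value is a checksum of ceil(log2(i)) summed over i = 1 to the iteration count. That assumption is consistent with the log above, since this sum over 10000000 iterations is 223222785.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>

// ceil(log2(s)) by counting the bits of s - 1.
inline int Log2Shift(std::size_t s) {
  int result = 0;
  for (std::size_t v = (s > 0 ? s - 1 : 0); v != 0; v >>= 1) ++result;
  return result;
}

// ceil(log2(s)) through floating-point math, as quoted in the comment.
inline int Log2Math(std::size_t s) {
  return static_cast<int>(std::ceil(std::log2(static_cast<double>(s))));
}

// Calls f(1..iters), stores the sum of results in *checksum so the loop is
// not optimized away, and returns the elapsed time in seconds.
template <typename F>
double TimeIt(F f, std::size_t iters, unsigned long* checksum) {
  const auto start = std::chrono::steady_clock::now();
  unsigned long sum = 0;
  for (std::size_t i = 1; i <= iters; ++i) sum += f(i);
  const auto stop = std::chrono::steady_clock::now();
  *checksum = sum;
  return std::chrono::duration<double>(stop - start).count();
}

// Prints one timing line per variant in the style of the log above.
inline void RunComparison(std::size_t iters) {
  unsigned long a = 0, b = 0;
  const double t_shift = TimeIt(Log2Shift, iters, &a);
  const double t_math = TimeIt(Log2Math, iters, &b);
  std::printf("Running %zu iters.\n", iters);
  std::printf("shift: %g seconds. result: %lu\n", t_shift, a);
  std::printf("log2:  %g seconds. result: %lu\n", t_math, b);
}
```

Matching checksums across variants, as in the log, are a quick sanity check that the implementations agree before their speeds are compared.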
Force-pushed from 0319b42 to 63aac3f.
I've simplified the implementation to exclude the optimizations using intrinsics and bit scans. They are backed up in https://github.com/szha/mxnet/tree/mem_strategy_backup
| blacklist = [ | ||
| 'Windows.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh', | ||
| 'Windows.h', 'intrin.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh', |
Force-pushed from e57bae9 to 9b39b72.
| TEST(GPUStorage, Round_GPU) { | ||
| if (mxnet::test::unitTestsWithCuda) { | ||
| putenv("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=20"); |
How long does this variable persist? It could have side effects on other tests
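On the persistence question: a variable set with putenv stays in the process environment until it is changed or removed, so later tests in the same process would see it. One way to contain that is a scoped guard, sketched below. ScopedEnv is a hypothetical helper, not part of the PR, and it relies on the POSIX setenv and unsetenv functions, so it would need a different implementation on Windows.

```cpp
#include <stdlib.h>

#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical RAII helper: sets an environment variable for the lifetime of
// the object and restores the previous state (value or absence) afterwards.
class ScopedEnv {
 public:
  ScopedEnv(const char* name, const char* value) : name_(name) {
    const char* old = std::getenv(name);
    had_old_ = (old != nullptr);
    if (had_old_) old_value_ = old;
    setenv(name, value, 1);  // 1 = overwrite any existing value
  }
  ~ScopedEnv() {
    if (had_old_) {
      setenv(name_.c_str(), old_value_.c_str(), 1);
    } else {
      unsetenv(name_.c_str());
    }
  }
  ScopedEnv(const ScopedEnv&) = delete;
  ScopedEnv& operator=(const ScopedEnv&) = delete;

 private:
  std::string name_;
  std::string old_value_;
  bool had_old_ = false;
};
```

A test would declare the guard at the top of its body, for example ScopedEnv guard("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF", "20");, and the variable would revert when the test returns. This only helps if the code under test reads the variable while the guard is alive.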
| #include <mxnet/storage.h> | ||
| #include <cstdio> | ||
| #include "test_util.h" | ||
| #include "storage/pooled_storage_manager.h" |
Duplicate import? I think it's already part of the storage namespace at mxnet/storage.h
Force-pushed from d0d8bf7 to 00086f1.
| from mxnet.test_utils import * | ||
| from common import setup_module, with_seed | ||
| from common import setup_module, with_seed, teardown |
Is it really necessary to import this in every single test? Looks a bit ugly tbh
Applying this change lets all tests within a module finish before the next ones start, which eliminates the case where side effects of tests in one module spill over into the next. In terms of testing practice, including a setup/teardown is common.
Yeah, but we're not actually using it in most files, right?
Ah in common.py :) But isn't it sufficient to import it there?
Unfortunately no. It is the same case as setup_module.
Force-pushed from 37ecc98 to 72b386f.
| size_t free, total; | ||
| cudaMemGetInfo(&free, &total); | ||
| if (free <= total * reserve_ / 100 || size > free - total * reserve_ / 100) | ||
| ReleaseAll(); |
What will happen to the storage handles currently pointing to some of the memory?
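The condition that triggers ReleaseAll in the diff above can be isolated as a pure function, which makes its two cases easier to see. ShouldReleaseAll is a hypothetical name, and the sketch covers only the check itself. What ReleaseAll does with outstanding handles depends on its implementation, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the check in the diff: release pooled memory when free memory is at
// or below the reserved share, or when the request does not fit in what is
// left above the reserve. reserve_percent corresponds to reserve_.
inline bool ShouldReleaseAll(std::size_t free_bytes, std::size_t total_bytes,
                             std::size_t reserve_percent, std::size_t size) {
  const std::size_t reserved = total_bytes * reserve_percent / 100;
  // The first test short-circuits, so the subtraction below only runs when
  // free_bytes > reserved and cannot wrap around.
  return free_bytes <= reserved || size > free_bytes - reserved;
}
```

The order of the two tests matters because both operands are unsigned: evaluating the subtraction first could wrap around when free memory is below the reserve.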
| std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU)); | ||
| int bucket = get_bucket(handle->size); | ||
| size_t size = get_size(bucket); | ||
| auto&& reuse_pool = memory_pool_[bucket]; |
Even if it's not an error (the rvalue reference will be deduced to a normal lvalue reference), it's better to write it explicitly as auto&.
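The deduction described in this comment can be confirmed at compile time. In the sketch below, a std::unordered_map of vectors is a hypothetical stand-in for memory_pool_, whose real type is not shown in the diff. The point is only that auto&& bound to the lvalue returned by operator[] has the same type as auto&.

```cpp
#include <cassert>
#include <type_traits>
#include <unordered_map>
#include <vector>

// Returns true if both references alias the same map element.
inline bool ForwardingReferenceAliasesElement() {
  // Hypothetical stand-in for memory_pool_, keyed by bucket.
  std::unordered_map<int, std::vector<void*>> memory_pool;

  auto&& via_forwarding = memory_pool[3];  // operator[] returns an lvalue
  auto& via_lvalue_ref = memory_pool[3];

  // Reference collapsing makes auto&& deduce to an lvalue reference here.
  static_assert(
      std::is_same<decltype(via_forwarding), std::vector<void*>&>::value,
      "auto&& bound to an lvalue is an lvalue reference");
  static_assert(
      std::is_same<decltype(via_lvalue_ref), std::vector<void*>&>::value,
      "auto& is an lvalue reference");

  via_forwarding.push_back(nullptr);
  return via_lvalue_ref.size() == 1;
}
```

Since the two forms behave identically in this position, auto& states the intent more directly, which is the reviewer's point.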
@szha should we document this new env variable or is it still experimental?
@ThomasDelteil I intended to have people experiment with this first.
* use nearest power of 2 for gpu memory pool sizes
* add linear
* add test
Description
adjust GPU memory pool strategy
Checklist
Changes
MXNET_GPU_MEM_POOL_TYPE="Round") for using nearest power of 2 size for better memory reuse
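One plausible reading of the rounding strategy, pieced together from the commit message ("use nearest power of 2", "add linear") and the MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF variable used in the test, is sketched below. Requests up to 2^cutoff are rounded up to a power of two, and larger requests are rounded up to a multiple of 2^cutoff. RoundedSize is a hypothetical name and the exact scheme in the PR may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a rounded pool size. `cutoff` plays the role of the
// linear cutoff exponent: power-of-two rounding up to 2^cutoff, then linear
// steps of 2^cutoff beyond it.
inline std::size_t RoundedSize(std::size_t size, int cutoff) {
  const std::size_t step = static_cast<std::size_t>(1) << cutoff;
  if (size > step) {
    // Linear region: round up to the next multiple of 2^cutoff.
    return (size + step - 1) / step * step;
  }
  // Power-of-two region: smallest power of two that is >= size.
  std::size_t rounded = 1;
  while (rounded < size) rounded <<= 1;
  return rounded;
}
```

With a cutoff of 4, for example, a request of 5 rounds to 8 while a request of 33 rounds to 48 rather than 64. Switching to linear steps for large sizes limits the waste that pure power-of-two rounding would cause there, while same-size buckets still make pooled chunks easier to reuse.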