szha · 2018-05-24T04:12:52Z

Description

adjust GPU memory pool strategy

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

add knob for minimum memory pool chunk size
add option (MXNET_GPU_MEM_POOL_TYPE="Round") for using nearest power of 2 size for better memory reuse

Comments

Fixes #10453 (Bug of CuDNN RNN with variable sequence length) when using MXNET_GPU_MEM_POOL_TYPE="Round". Before this change, a request could reuse a chunk in the memory pool only if the sizes matched exactly. For the workload in #10453, that required cudaMalloc for 55.45GB, whereas with rounding the cudaMalloc amount dropped to 1.32GB, while memory usage largely stayed the same. Rounding also helped speed up workloads, and improve their stability, when sizes vary and the model cannot be hybridized yet.
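To illustrate why rounding improves reuse, here is a minimal sketch, not the code in this PR. It assumes a toy pool in which every chunk is returned right after use; `RoundUpPow2` and `CountFreshAllocs` are hypothetical names. It counts how many fresh allocations (stand-ins for cudaMalloc calls) a sequence of slightly different sizes triggers with exact matching versus power-of-2 rounding.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: smallest power of 2 that is >= size (for size >= 1).
inline std::size_t RoundUpPow2(std::size_t size) {
  std::size_t rounded = 1;
  while (rounded < size) rounded <<= 1;
  return rounded;
}

// Simulates "allocate, use, return to pool" for each requested size and
// returns the number of fresh allocations the pool had to make.
inline std::size_t CountFreshAllocs(const std::vector<std::size_t>& sizes,
                                    bool round) {
  std::set<std::size_t> pooled;  // chunk sizes currently sitting in the pool
  std::size_t fresh = 0;
  for (std::size_t s : sizes) {
    const std::size_t key = round ? RoundUpPow2(s) : s;
    // A chunk is reusable only if one with the same key is already pooled.
    if (pooled.insert(key).second) ++fresh;
  }
  return fresh;
}
```

With sizes such as 1000, 1010, 1020 and 900 bytes, exact matching makes four fresh allocations, while rounding maps all four requests to a single 1024-byte chunk.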

eric-haibin-lin · 2018-05-25T20:17:25Z

+              LOG(INFO) << "Using GPUPooledRoundedStorageManager.";
+            } else {
+              if (strategy != "Naive") {
+                LOG(INFO) << "Unknown memory pool strategy specified: " << strategy << ".";


log(fatal)?

zhreshold · 2018-05-29T18:38:47Z

Still no clue what's going wrong with this PR. Nothing in it is specific to Windows, and weirdly python2-GPU-win passes.
I will try it on a local Windows PC.

piiswrong · 2018-05-30T00:47:23Z

 private:
  void DirectFreeNoLock(Storage::Handle handle) {
    cudaError_t err = cudaFree(handle.dptr);
-    size_t size = handle.size + NDEV;


are you sure + NDEV is not needed any more? what if NDEV=32 and min_chunk=33 and handle.size=30? Original code would allocate 62. New code would allocate 33

cc'd @ptrendx. My understanding on this was that there needs to be enough bytes to make sure that for 32 devices at least each device has 1 byte, for nccl scattering. Could you confirm, @ptrendx?

Yes, that is correct.
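The arithmetic in the question above can be written out as a small sketch. The function names are hypothetical, `NewSize` reflects only the reviewer's reading of the new code, and `PaddedSize` is one assumed way of keeping both guarantees, not necessarily what the PR ended up doing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t kNDev = 32;  // value of NDEV used in the review example

// Size as computed before the change: pad every request by NDEV bytes.
inline std::size_t OldSize(std::size_t requested) { return requested + kNDev; }

// Size as the reviewer reads the new code: only the minimum chunk applies.
inline std::size_t NewSize(std::size_t requested, std::size_t min_chunk) {
  return std::max(requested, min_chunk);
}

// Hypothetical variant: pad by NDEV first, then enforce the minimum chunk.
inline std::size_t PaddedSize(std::size_t requested, std::size_t min_chunk) {
  return std::max(requested + kNDev, min_chunk);
}
```

For handle.size=30 and min_chunk=33, `OldSize` gives 62 and `NewSize` gives 33, matching the numbers in the comment; `PaddedSize` gives 62.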

piiswrong · 2018-05-30T00:47:37Z

   */
  GPUPooledStorageManager() {
    reserve_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_RESERVE", 5);
+    min_chunk_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_MIN_CHUNK", 4096);


page size instead of min chunk?

piiswrong · 2018-05-30T00:47:49Z

  void ReleaseAll();
  // used memory
-  size_t used_memory_ = 0;
+  size_t used_memory_ = 0, min_chunk_;


piiswrong · 2018-05-30T00:54:25Z

+ private:
+#if __SIZEOF_SIZE_T__ == __SIZEOF_LONG__
+
+#if defined(__clang__) || defined(__GNUC__)


does this need to be so complicated? You just need to take the highest bit and shift left by 1 if it's smaller than size.

This is called finding the MSB (most significant bit). See https://www.google.com/search?ei=__UNW-DMG6iF0wLqyr4g&q=how+to+find+most+significant+bit+in+c&oq=take+highest+bit&gs_l=psy-ab.1.0.0i71k1l8.0.0.0.4417.0.0.0.0.0.0.0.0..0.0....0...1c..64.psy-ab..0.0.0....0.LUbIFjlZyeU

these builtins would utilize hardware instructions when available.

Is it really faster? It looks too complicated.

also the default implementation with pow and log is really slow

I will change the default implementation to use bit shifting and then do a comparison

I compared my current solution, the bit-shifting version, and static_cast<int>(std::ceil(std::log2(s))). With -O3 turned on, on my Mac (clang), the timings look like the following:

Running 10000000 iters. Addr width 64
It took me 0.00981569 seconds. result: 223222785
It took me 0.128623 seconds. result: 223222785
It took me 0.0801588 seconds. result: 223222785
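For readers following the thread, the three approaches being compared can be sketched as below. This is an illustrative reconstruction, not the code that was benchmarked: the function names are hypothetical, and the sketch assumes 64-bit inputs in the range 1 to 2^63.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bit shifting: double a power of 2 until it reaches s.
inline int Log2CeilShift(std::uint64_t s) {
  int bits = 0;
  std::uint64_t p = 1;
  while (p < s) {
    p <<= 1;
    ++bits;
  }
  return bits;
}

// Compiler builtin (count leading zeros), which can map to a hardware
// instruction; falls back to shifting on compilers without the builtin.
inline int Log2CeilBuiltin(std::uint64_t s) {
#if defined(__clang__) || defined(__GNUC__)
  if (s <= 1) return 0;  // __builtin_clzll(0) is undefined
  return 64 - __builtin_clzll(s - 1);
#else
  return Log2CeilShift(s);
#endif
}

// Floating point: simple, but a double cannot represent every 64-bit
// integer, so very large inputs may be rounded before the logarithm.
inline int Log2CeilFloat(std::uint64_t s) {
  return static_cast<int>(std::ceil(std::log2(static_cast<double>(s))));
}
```

Each function returns the bucket index b such that 2^b is the smallest power of 2 not less than s, so a rounded size would be `1 << b`.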

szha · 2018-06-06T20:28:24Z

I've simplified the implementation to exclude optimization using intrinsics and bit scans. They are backed up in https://github.com/szha/mxnet/tree/mem_strategy_backup

piiswrong · 2018-06-07T00:43:04Z


 blacklist = [
-    'Windows.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh',
+    'Windows.h', 'intrin.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh',


marcoabreu · 2018-06-09T07:48:55Z

+
+TEST(GPUStorage, Round_GPU) {
+  if (mxnet::test::unitTestsWithCuda) {
+    putenv("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=20");


How long does this variable persist? It could have side effects on other tests
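A variable set with putenv stays in the process environment until it is changed or the process exits, so later tests in the same binary would see it. One common way to limit the scope is a small RAII guard, sketched below. `ScopedEnv` is a hypothetical helper, not something in MXNet or its test utilities, and the sketch assumes POSIX `setenv`/`unsetenv`.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical RAII guard: sets an environment variable on construction and
// restores the previous state (old value or "unset") on destruction.
class ScopedEnv {
 public:
  ScopedEnv(const std::string& name, const std::string& value) : name_(name) {
    const char* old = std::getenv(name.c_str());
    had_old_ = (old != nullptr);
    if (had_old_) old_ = old;
    ::setenv(name.c_str(), value.c_str(), 1);  // 1 = overwrite existing value
  }
  ~ScopedEnv() {
    if (had_old_) {
      ::setenv(name_.c_str(), old_.c_str(), 1);
    } else {
      ::unsetenv(name_.c_str());
    }
  }
  ScopedEnv(const ScopedEnv&) = delete;
  ScopedEnv& operator=(const ScopedEnv&) = delete;

 private:
  std::string name_;
  std::string old_;
  bool had_old_ = false;
};
```

Inside a test body, `ScopedEnv guard("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF", "20");` would then apply only until the end of the enclosing scope.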

marcoabreu · 2018-06-09T07:51:20Z

 #include <mxnet/storage.h>
 #include <cstdio>
 #include "test_util.h"
+#include "storage/pooled_storage_manager.h"


Duplicate import? I think it's already part of the storage namespace at mxnet/storage.h

Didn't want to block

marcoabreu · 2018-06-11T16:24:15Z


 from mxnet.test_utils import *
-from common import setup_module, with_seed
+from common import setup_module, with_seed, teardown


Is it really necessary to import this in every single test? Looks a bit ugly tbh

Applying this change allows all tests within a module to finish before moving on to the next, which eliminates the case where side effects of tests in one module spill over into the next. In terms of testing practice, including a setup/teardown is common.

Yeah, but we're not actually using it in most files, right?

Ah in common.py :) But isn't it sufficient to import it there?

unfortunately no. it is the same case as setup_module

lebeg · 2018-06-12T08:52:26Z

+    size_t free, total;
+    cudaMemGetInfo(&free, &total);
+    if (free <= total * reserve_ / 100 || size > free - total * reserve_ / 100)
+      ReleaseAll();


What will happen to the storage handles currently pointing to some of the memory?
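The condition in the diff above can be isolated as a pure function to make its two parts easier to read. This is only a sketch of the check itself, with hypothetical names, and it does not address what happens to live handles.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the condition in the diff: returns true when the pool would be
// released before allocating `size` bytes. `reserve_percent` plays the role
// of reserve_ (a percentage of total device memory to keep free).
inline bool ShouldReleaseAll(std::size_t free_bytes, std::size_t total_bytes,
                             std::size_t reserve_percent, std::size_t size) {
  const std::size_t reserved = total_bytes * reserve_percent / 100;
  // The first test must come first: size_t is unsigned, so computing
  // free_bytes - reserved when free_bytes < reserved would wrap around.
  return free_bytes <= reserved || size > free_bytes - reserved;
}
```

With total=100 and a 5% reserve, a request is allowed without releasing the pool only if it fits in the free memory above the 5 reserved bytes.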

lebeg · 2018-06-12T08:53:54Z

+  std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));
+  int bucket = get_bucket(handle->size);
+  size_t size = get_size(bucket);
+  auto&& reuse_pool = memory_pool_[bucket];


Even though it's not an error (the forwarding reference auto&& will be deduced to a normal lvalue reference here), it's better to write it explicitly as auto&
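The deduction mentioned in this comment can be checked at compile time. The sketch below uses a stand-in map type, since the actual type of memory_pool_ is not shown in the diff; `DeductionDemo` is a hypothetical name.

```cpp
#include <cassert>
#include <type_traits>
#include <unordered_map>
#include <vector>

inline bool DeductionDemo() {
  // Stand-in for memory_pool_; the real element type is an assumption.
  std::unordered_map<int, std::vector<void*>> memory_pool;

  auto&& reuse_pool = memory_pool[0];   // operator[] yields an lvalue
  auto& explicit_ref = memory_pool[0];  // same binding, intent stated directly

  // Both declarations have exactly the same type: an lvalue reference.
  static_assert(
      std::is_same<decltype(reuse_pool), std::vector<void*>&>::value,
      "auto&& bound to an lvalue deduces an lvalue reference");
  static_assert(
      std::is_same<decltype(explicit_ref), std::vector<void*>&>::value,
      "auto& is an lvalue reference");

  reuse_pool.push_back(nullptr);
  // Both names alias the same vector stored in the map.
  return &reuse_pool == &explicit_ref && explicit_ref.size() == 1;
}
```

Since the two forms behave identically here, the choice is about readability: `auto&` tells the reader that an existing object is being aliased.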

ThomasDelteil · 2018-06-24T04:41:09Z

@szha should we document this new env variable or is it still experimental?

szha · 2018-06-25T00:21:38Z

@ThomasDelteil I intended to have people experiment with this first.

* use nearest power of 2 for gpu memory pool sizes
* add linear
* add test

szha force-pushed the mem_strategy branch 10 times, most recently from fd64b96 to b8b942e Compare May 25, 2018 03:10

szha changed the title ~~[WIP] gpu mem pool strategy~~ gpu mem pool strategy May 25, 2018

eric-haibin-lin reviewed May 25, 2018

szha force-pushed the mem_strategy branch 2 times, most recently from bcba6e2 to de2a823 Compare May 25, 2018 21:26

piiswrong suggested changes May 30, 2018

szha force-pushed the mem_strategy branch 10 times, most recently from 0319b42 to 63aac3f Compare June 4, 2018 20:39

piiswrong suggested changes Jun 7, 2018

szha force-pushed the mem_strategy branch from 8d72d62 to f3e053b Compare June 7, 2018 03:09

piiswrong approved these changes Jun 7, 2018

szha force-pushed the mem_strategy branch 2 times, most recently from e57bae9 to 9b39b72 Compare June 8, 2018 22:11

marcoabreu previously requested changes Jun 9, 2018

szha force-pushed the mem_strategy branch 2 times, most recently from d0d8bf7 to 00086f1 Compare June 11, 2018 02:17

marcoabreu reviewed Jun 11, 2018

szha force-pushed the mem_strategy branch 6 times, most recently from 37ecc98 to 72b386f Compare June 12, 2018 03:08

lebeg reviewed Jun 12, 2018

szha force-pushed the mem_strategy branch from 72b386f to 590ffbc Compare June 12, 2018 16:43

leezu mentioned this pull request Jun 12, 2018

Add embedding training model dmlc/gluon-nlp#136

Merged

4 tasks

szha force-pushed the mem_strategy branch from 590ffbc to 7e0f2c1 Compare June 12, 2018 22:04

Sheng Zha added 2 commits June 12, 2018 18:04

use nearest power of 2 for gpu memory pool sizes

542f382

add linear

e6f3f56

szha force-pushed the mem_strategy branch from 7e0f2c1 to e7943aa Compare June 13, 2018 04:34

add test

e7943aa

szha merged commit bf26886 into apache:master Jun 14, 2018

leezu mentioned this pull request Jun 22, 2018

Word embeddings update dmlc/gluon-nlp#159

Merged

9 tasks

szha deleted the mem_strategy branch June 25, 2018 00:21

zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018

gpu mem pool strategy (apache#11041)

d281019

* use nearest power of 2 for gpu memory pool sizes
* add linear
* add test

This was referenced Aug 17, 2018

Memory optimization in GLUON #12226

Open

Does memonger work for gluon to save memory? #10382

Closed

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

gpu mem pool strategy (apache#11041)

1d6f107

* use nearest power of 2 for gpu memory pool sizes
* add linear
* add test

8 participants