
azrael417 · 2023-06-22T07:45:46Z

This PR fixes the single-node group batch norm in APEX so that it works with CUDA 12.2 and RTC.


Aidyn-A · 2023-06-22T08:00:26Z

@eqy, @rmhaskarnvidia please review this PR and/or suggest someone to review. I will also take a look, but I am not familiar with cudnn-frontend.

eqy

Took a first look, but given the size of this PR I believe @crcrpar should get the final say

eqy · 2023-06-22T19:15:40Z

-            strideA[d] = strideA[d + 1] * dimA[d + 1];
-        }
-        strideA[0] = strideA[2] * dimA[2];
+void generateStrides(const int64_t* dimA, int64_t* strideA, int64_t nbDims, cudnnTensorFormat_t filterFormat) {


Is it possible to simply use the Tensor's existing .strides() rather than relying on another helper function? AFAIK it would respect the NHWC vs. NCHW convention.

This is how it was implemented before, though; I based the new implementation on the old one. We could of course use .strides() and pass the resulting strides to the routine. Do we want to change that?
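For readers following this thread, here is a minimal sketch of what a stride helper of this kind computes for a 4-D tensor. It is not the code from the PR: `Layout` and `make_strides` are hypothetical names, and `Layout` only stands in for `cudnnTensorFormat_t` so the example builds without cuDNN. Dimensions are assumed to be given in logical N, C, H, W order.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for cudnnTensorFormat_t, so the sketch needs no cuDNN headers.
enum class Layout { NCHW, NHWC };

// dims are in logical N, C, H, W order. The returned strides use the same
// order and are expressed in elements, not bytes.
std::array<int64_t, 4> make_strides(const std::array<int64_t, 4>& dims, Layout layout) {
  const int64_t C = dims[1], H = dims[2], W = dims[3];
  if (layout == Layout::NCHW) {
    // W varies fastest, then H, then C, then N.
    return {C * H * W, H * W, W, 1};
  }
  // NHWC: C varies fastest, then W, then H, then N.
  return {H * W * C, 1, W * C, C};
}
```

For a tensor of shape (2, 3, 4, 5) this yields (60, 20, 5, 1) for NCHW and (60, 1, 15, 3) for NHWC. The NHWC values are the strides a dense channels-last tensor of that shape would report, which is why reading them from the tensor directly is a plausible alternative to a helper.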

eqy · 2023-06-22T19:16:51Z

+  auto tensor_create = [&tensor_stride, &tensorDims](cudnnDataType_t type,
+  int64_t id) {
+    return cudnn_frontend::TensorBuilder()
+      .setDim(4, tensorDims)


Similarly, can we use the existing .sizes() instead of creating another tensorDims array?

Same thing here: this is how it was implemented previously. We could generate all those shapes and strides in cudnn_gbn and pass them to the planning function.

eqy · 2023-06-22T19:18:57Z

+    auto plan = run_batch_norm_forward(tensorDims, perChannelDims, epsilonDims, peerDims, CUDNN_DATA_HALF);
+    gbn_plan_cache.insert(std::make_pair(fv, plan));
+  }
+


It looks like some of the code makes assumptions about the input tensor(s)' memory layout. If so, there should be checks like is_contiguous(at::MemoryFormat::ChannelsLast).

This is done in the Python frontend. That check is here
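As an illustration of what a layout check like this verifies, the sketch below tests whether a set of strides describes a dense NHWC (channels-last) layout. It is plain C++ with no libtorch dependency, and `is_channels_last_dense` is a hypothetical name, not a function from APEX or PyTorch. It is also stricter than a real tensor-library check is likely to be, because it does not special-case dimensions of size 1.

```cpp
#include <array>
#include <cstdint>

// Returns true when `strides` (in elements, N/C/H/W order) describe a dense
// NHWC layout for `dims` (also in N/C/H/W order).
// Simplification: size-1 dimensions get no special treatment.
bool is_channels_last_dense(const std::array<int64_t, 4>& dims,
                            const std::array<int64_t, 4>& strides) {
  const int64_t C = dims[1], H = dims[2], W = dims[3];
  return strides[1] == 1          // channels are adjacent in memory
      && strides[3] == C          // one step in W skips C elements
      && strides[2] == W * C      // one step in H skips a full row
      && strides[0] == H * W * C; // one step in N skips a full image
}
```

A guard of this shape on the C++ side would reject an NCHW-contiguous input even when a caller bypasses the Python frontend.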

eqy · 2023-06-22T19:20:18Z

+      .setDim(4, tensorDims)
+        .setStrides(4, tensor_stride)
+          .setId(id)
+            .setAlignment(16)


Manually setting alignment without checking the actual tensor address seems dangerous.

The existing code has that too; this is just a refactor of code that is already present:

https://github.com/NVIDIA/apex/blob/master/apex/contrib/csrc/cudnn_gbn/norm_sample.cpp

This is somewhat urgent since it fixes a showstopper bug for MLPerf HPC 3.0. I am fine with rewriting this, but I want to move fast. Is there an example of how this should be done?

Yes, e.g., https://github.com/pytorch/pytorch/blob/004ff536e87c9586064fb49c4e581f185f3a9d47/aten/src/ATen/native/cudnn/Conv_v8.cpp#L55
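A minimal sketch of the idea raised in this thread: derive the alignment from the actual data pointer instead of hard-coding 16. This is an illustration under assumptions, not a reproduction of the linked PyTorch code; `alignment_of` and its `max_alignment` parameter are hypothetical names.

```cpp
#include <cstdint>

// Returns the largest power-of-two alignment in bytes, capped at
// max_alignment, that the address of `ptr` satisfies.
int64_t alignment_of(const void* ptr, int64_t max_alignment = 16) {
  const auto address = reinterpret_cast<std::uintptr_t>(ptr);
  int64_t alignment = 1;
  // Keep doubling while the address is still a multiple of the next power of two.
  while (alignment < max_alignment &&
         address % (static_cast<std::uintptr_t>(alignment) * 2) == 0) {
    alignment *= 2;
  }
  return alignment;
}
```

The returned value could then be passed to `.setAlignment(...)` on the tensor builder in place of the literal 16. If plans are cached, the alignment would presumably also need to be part of the cache key, since a plan built for one alignment may not be valid for a less aligned buffer.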

crcrpar

Wouldn't this require changes to https://github.com/NVIDIA/apex/blob/30a7ad3974b32f7ce68cefabc38374fb4520a35e/apex/contrib/test/cudnn_gbn/test_cudnn_gbn_with_two_gpus.py?

eqy · 2023-07-01T03:30:17Z

@azrael417 we can defer addressing the issues I brought up to a later PR if @crcrpar is content to merge the fix given the urgency

crcrpar · 2023-07-02T00:45:57Z

rel: #1689

* fixing order of class instantiation and device extraction in mixed precision lamb
* this commit fixes the SGBN graph capture problem by caching the cudnn plan and re-using it
* disentangling the mplamb MR and SGBN MR
* cleaner caching

azrael417 added 3 commits June 21, 2023 23:40

fixing order of class instantiation and device extraction in mixed pr…

7316130

…ecision lamb

this commit fixes the SGBN graph capture problem by caching the cudnn…

dafe66a

… plan and re-using it

disentangling the mplamb MR and SGBN MR

3ad083d

eqy requested changes Jun 22, 2023


cleaner caching

96b961f

crcrpar reviewed Jun 28, 2023


crcrpar merged commit 8ffc901 into NVIDIA:master Jul 2, 2023
