Tkurth/sgbn fixes#1685
Conversation
|
@eqy, @rmhaskarnvidia please review this PR and/or suggest someone to review. I will also take a look, but I am not familiar with |
| strideA[d] = strideA[d + 1] * dimA[d + 1]; | ||
| } | ||
| strideA[0] = strideA[2] * dimA[2]; | ||
| void generateStrides(const int64_t* dimA, int64_t* strideA, int64_t nbDims, cudnnTensorFormat_t filterFormat) { |
There was a problem hiding this comment.
Is it possible to simply use the Tensor's existing .strides() rather than relying on another helper function? AFAIK it would respect the NHWC vs. NCHW convention.
There was a problem hiding this comment.
this is how it was implemented before though, I basically based the new implementation on the old one. We could of course use strides and pass the stride tensors to the routine. Do we want to change that?
| auto tensor_create = [&tensor_stride, &tensorDims](cudnnDataType_t type, | ||
| int64_t id) { | ||
| return cudnn_frontend::TensorBuilder() | ||
| .setDim(4, tensorDims) |
There was a problem hiding this comment.
Similarly, can we use the existing .sizes() instead of creating another tensorDims array?
There was a problem hiding this comment.
Same thing here, this is how this was implemented previously. We could generate all those shapes and strides in cudnn_gbn and pass them to the planning function.
| auto plan = run_batch_norm_forward(tensorDims, perChannelDims, epsilonDims, peerDims, CUDNN_DATA_HALF); | ||
| gbn_plan_cache.insert(std::make_pair(fv, plan)); | ||
| } | ||
|
|
There was a problem hiding this comment.
It looks like some of the code makes assumptions about the input tensor(s)' memory layout. If so, there should be checks like is_contiguous(at::MemoryFormat::ChannelsLast).
There was a problem hiding this comment.
This is done on the python frontend. That check is here
| .setDim(4, tensorDims) | ||
| .setStrides(4, tensor_stride) | ||
| .setId(id) | ||
| .setAlignment(16) |
There was a problem hiding this comment.
Manually setting alignment without checking the actual tensor address seems dangerous.
There was a problem hiding this comment.
the existing code has that too: this is just a refactor of the code which is already present:
https://github.com/NVIDIA/apex/blob/master/apex/contrib/csrc/cudnn_gbn/norm_sample.cpp
This is somewhat urgent since it is fixing a showstopper bug for mlperf hpc 3.0. I am fine with rewriting this but I want to move fast on this. Is there an example of how this should be done?
There was a problem hiding this comment.
crcrpar
left a comment
There was a problem hiding this comment.
wouldn't this require any changes to https://github.com/NVIDIA/apex/blob/30a7ad3974b32f7ce68cefabc38374fb4520a35e/apex/contrib/test/cudnn_gbn/test_cudnn_gbn_with_two_gpus.py?
|
@azrael417 we can defer addressing the issues I brought up to a later PR if @crcrpar is content to merge the fix given the urgency |
|
rel: #1689 |
* fixing order of class instantiation and device extraction in mixed precision lamb * this commit fixes the SGBN graph capture problem by caching the cudnn plan and re-using it * disentangling the mplamb MR and SGBN MR * cleaner caching
This PR fixes the single node group batch norm in APEX to work with cuda 12.2 and RTC.