Add kernels for FusedBatchNormGrad when is_training=False #12580
Can one of the admins verify this patch?
@ppwwyyxx, thanks for your PR! By analyzing the history of the files in this pull request, we identified @zhangyaobit, @tensorflower-gardener and @keveman to be potential reviewers.
zhangyaobit left a comment
Thanks for the nice contribution, Yuxin!
const Tensor& mean_input, const Tensor& variance_input,
T epsilon, Tensor* x_backprop_output,
Tensor* scale_backprop_output, Tensor* offset_backprop_output,
typename TTypes<T>::Vec scratch1, typename TTypes<T>::Vec scratch2) {
Are these two scratch allocations needed? Could you follow the Eigen implementation of FusedBatchNorm, where no temporary allocation is needed? (You can still use something like "Eigen::Tensor<T, 1, Eigen::RowMajor> mean(depth)", though.)
It seems that Eigen::Tensor<T, 1, Eigen::RowMajor> scratch1(depth) only allocates memory on the CPU. In a GPU kernel, using it ends in CUDA_ERROR_ILLEGAL_ADDRESS. Allocating through the OpKernelContext seems to be the standard device-agnostic way to allocate memory.
// Functor used by FusedBatchNormGradOp to do the computations when is_training=False.
// Both CPU and GPU will use this functor.
template <typename Device, typename T>
struct FusedBatchNormFreezeGrad {
Could you move this function to fused_batch_norm_op.cc?
If I understand the build process correctly, this functor needs to be instantiated in both fused_batch_norm_op.cc and fused_batch_norm_op.cu.cc, so that it is compiled into two kernels by gcc and nvcc respectively. It therefore has to live in a header file that both the .cc and the .cu.cc include. This seems to be what is done for other kernels as well (e.g. reverse_op).
grad_y = op.inputs[0]
x = op.inputs[1]
scale = op.inputs[2]
pop_mean = op.inputs[3]
I think you need inputs 3 and 4 of the FusedBatchNorm op here, not those of FusedBatchNormGrad. Could you forward the pop mean and pop variance to outputs 3 and 4 of FusedBatchNorm in the C++ code? That way you can also unify the two branches in "_FusedBatchNormGrad(op, *grad)".
pop_mean and pop_var are inputs to FusedBatchNormGrad as well. What is the reason not to use them directly like this?
Note that inputs 3 and 4 are reserve_space_1 and reserve_space_2, which are not the pop mean and pop var, but:
reserve_space_1: A 1D Tensor for the computed batch mean, to be reused in the gradient computation.
reserve_space_2: A 1D Tensor for the computed batch variance (inverted variance in the cuDNN case), to be used in the gradient computation.
REGISTER_OP("FusedBatchNormGrad")
.Input("y_backprop: T")
.Input("x: T")
.Input("scale: T")
.Input("reserve_space_1: T")
.Input("reserve_space_2: T")
.Output("x_backprop: T")
.Output("scale_backprop: T")
.Output("offset_backprop: T")
.Output("reserve_space_3: T")
.Output("reserve_space_4: T")
.Attr("T: {float}")
.Attr("epsilon: float = 0.0001")
.Attr("data_format: string = 'NHWC'")
.Attr("is_training: bool = true")
......
Ok, you did the forwarding in _FusedBatchNormGrad instead of on the C++ side. I think that is fine too. If you go this way, could you update the comments on reserve_space_1 and reserve_space_2 to say that they are the pop mean and variance when is_training is False?
Comments were updated.
T epsilon, Tensor* x_backprop_output,
Tensor* scale_backprop_output, Tensor* offset_backprop_output,
typename TTypes<T>::Vec scratch1, typename TTypes<T>::Vec scratch2) {
typename TTypes<T, 4>::ConstTensor out_backprop(y_backprop_input.tensor<T, 4>());

// db = out_backprop
// dg = out_backprop * ((x - m) * rsqrt(v + epsilon))
// dx = out_backprop * (gamma * rsqrt(v + epsilon))
Could you rename these so that they are consistent with the names used in the program?
.eval()
.reshape(one_by_depth)
.broadcast(rest_by_one));
scale_backprop.device(d) = scratch2 * scratch1;
Is what is implemented above equivalent to the Python implementation below?
grad_offset = reduce_sum(grad_y)
grad_scale = reduce_sum(grad_y*(x-pop_mean)*var_rsqrt)
grad_x = grad_y * scale * var_rsqrt
That looks equivalent to me
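For readers who want to check the equivalence themselves, here is a minimal NumPy sketch of the three formulas quoted above. It assumes NHWC layout and treats pop_mean and pop_var as constants; the function names frozen_batch_norm and frozen_batch_norm_grad are hypothetical and are not part of TensorFlow or of this PR.

```python
import numpy as np


def frozen_batch_norm(x, scale, offset, pop_mean, pop_var, epsilon=1e-4):
    """Forward pass with fixed statistics (hypothetical helper, NHWC)."""
    return scale * (x - pop_mean) / np.sqrt(pop_var + epsilon) + offset


def frozen_batch_norm_grad(grad_y, x, scale, pop_mean, pop_var, epsilon=1e-4):
    """Gradients w.r.t. x, scale and offset when is_training is False.

    Mirrors the pseudocode in the review comment:
      grad_offset = reduce_sum(grad_y)
      grad_scale  = reduce_sum(grad_y * (x - pop_mean) * var_rsqrt)
      grad_x      = grad_y * scale * var_rsqrt
    """
    var_rsqrt = 1.0 / np.sqrt(pop_var + epsilon)  # shape [C]
    reduce_axes = (0, 1, 2)  # reduce over N, H, W; keep the channel axis
    grad_offset = grad_y.sum(axis=reduce_axes)
    grad_scale = (grad_y * (x - pop_mean) * var_rsqrt).sum(axis=reduce_axes)
    grad_x = grad_y * scale * var_rsqrt  # per-channel factor broadcast over N, H, W
    return grad_x, grad_scale, grad_offset


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 3, 3, 4))
    grad_y = rng.normal(size=x.shape)
    scale = rng.normal(size=4)
    pop_mean = rng.normal(size=4)
    pop_var = rng.uniform(0.5, 2.0, size=4)  # variances must be positive
    dx, dscale, doffset = frozen_batch_norm_grad(grad_y, x, scale, pop_mean, pop_var)
    print(dx.shape, dscale.shape, doffset.shape)
```

Because the statistics are constants in this mode, the output is linear in x, scale and offset, so no term involving the derivative of the mean or variance appears. This is why the expressions are so much simpler than the training-mode gradient.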
x: A 4D Tensor for input data.
scale: A 1D Tensor for scaling factor, to scale the normalized x.
reserve_space_1: A 1D Tensor for the computed batch mean, to be reused
reserve_space_1: A 1D Tensor for the computed batch mean when is_training is True,
How about something like this?
A 1D Tensor for the computed batch mean when is_training is True, to be reused in the gradient computation; or the population mean when is_training is False, to be used in the second-order gradient computation.
And the same for reserve_space_2.
When is_training is False, pop_mean and pop_variance are needed for the first-order gradient computation as well.
Ah, ok, then "to be used in the first-order and second-order gradient computation."
epsilon_, x_backprop, scale_backprop, offset_backprop, tensor_format_);
if (is_training_) {
functor::FusedBatchNormGrad<Device, T>()(
context, y_backprop, x, scale, saved_mean, saved_maybe_inv_var,
This is a bit confusing. Rename to saved_mean_or_pop_mean and saved_maybe_inv_var_or_pop_var?
Did some renaming (in one existing kernel as well) to improve clarity.
Let's wait a bit to see if zheng-xq has any comments. (This PR may affect lots of users, so let's be extra careful :) ) Thanks!
Any updates?
Thanks for the patience, Yuxin! zheng-xq will respond soon.
/CC @zheng-xq with the
Jenkins, test this please.
// Functor used by FusedBatchNormGradOp to do the computations when is_training=False.
// Both CPU and GPU will use this functor.
template <typename Device, typename T>
struct FusedBatchNormFreezeGrad {
Should this be inside the functor namespace? Note this test failure: tensorflow/core/kernels/fused_batch_norm_op.cc:645:7: error: 'FusedBatchNormFreezeGrad' is not a member of 'tensorflow::functor' at https://ci.tensorflow.org/job/tensorflow-pull-requests-cpu-python3/6452/console
Looks like this header is only included when GPU is enabled. I'll fix it.
Ah, this is more complicated than I thought. It looks like I need to change the BUILD file somehow.
Jenkins, test this please.
Jenkins, test this please.
Please merge this change. Thanks!
#10857