Add kernels for FusedBatchNormGrad when is_training=False #12580

Merged
caisq merged 7 commits into tensorflow:master from ppwwyyxx:master on Sep 22, 2017

@ppwwyyxx
Contributor

@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@mention-bot

@ppwwyyxx, thanks for your PR! By analyzing the history of the files in this pull request, we identified @zhangyaobit, @tensorflower-gardener and @keveman to be potential reviewers.

@zhangyaobit zhangyaobit left a comment

Thanks for the nice contribution, Yuxin!

const Tensor& mean_input, const Tensor& variance_input,
T epsilon, Tensor* x_backprop_output,
Tensor* scale_backprop_output, Tensor* offset_backprop_output,
typename TTypes<T>::Vec scratch1, typename TTypes<T>::Vec scratch2) {

Are these two scratch allocations needed? Could you follow the Eigen implementation of FusedBatchNorm, where no temp allocation is needed (you can still use something like "Eigen::Tensor<T, 1, Eigen::RowMajor> mean(depth)" though)?

@ppwwyyxx (Contributor Author), Aug 29, 2017

Seems like Eigen::Tensor<T, 1, Eigen::RowMajor> scratch1(depth) only allocates memory on the CPU? In a GPU kernel, using this ends up with CUDA_ERROR_ILLEGAL_ADDRESS. Allocating through the OpKernelContext seems like the standard device-agnostic way to allocate memory.

Thanks, sounds good to me.

// Functor used by FusedBatchNormGradOp to do the computations when is_training=False.
// Both CPU and GPU will use this functor.
template <typename Device, typename T>
struct FusedBatchNormFreezeGrad {

Could you move this function to fused_batch_norm_op.cc?

@ppwwyyxx (Contributor Author), Aug 29, 2017

If I understand the build process correctly, this functor needs to be instantiated in both fused_batch_norm_op.cc and fused_batch_norm_op.cu.cc, to be compiled to two kernels by nvcc and gcc respectively. Therefore it has to be in a header file to be included by both .cc and .cu.cc. This seems like what's been done for other kernels as well (e.g. reverse_op).

Sounds good.

grad_y = op.inputs[0]
x = op.inputs[1]
scale = op.inputs[2]
pop_mean = op.inputs[3]

@zhangyaobit, Aug 29, 2017

I think here you will need inputs 3 and 4 of the FusedBatchNorm op instead of those of FusedBatchNormGrad. Could you forward the pop mean and pop variance to outputs 3 and 4 of FusedBatchNorm in the C++ code? This way you can also unify the two branches in "_FusedBatchNormGrad(op, *grad)".

@ppwwyyxx (Contributor Author)

pop_mean and pop_var are inputs to FusedBatchNormGrad as well. What's the reason not to use them directly like this?

Note that inputs 3 and 4 are reserve_space_1 and reserve_space_2, which are not pop mean and pop var, but:
reserve_space_1: A 1D Tensor for the computed batch mean, to be reused in the gradient computation.
reserve_space_2: A 1D Tensor for the computed batch variance (inverted variance in the cuDNN case), to be used in the gradient computation.

REGISTER_OP("FusedBatchNormGrad")
.Input("y_backprop: T")
.Input("x: T")
.Input("scale: T")
.Input("reserve_space_1: T")
.Input("reserve_space_2: T")
.Output("x_backprop: T")
.Output("scale_backprop: T")
.Output("offset_backprop: T")
.Output("reserve_space_3: T")
.Output("reserve_space_4: T")
.Attr("T: {float}")
.Attr("epsilon: float = 0.0001")
.Attr("data_format: string = 'NHWC'")
.Attr("is_training: bool = true")
......

Ok, you did the forwarding in _FusedBatchNormGrad, instead of on the C++ side. I think that is fine too. If you go this way, could you update the comments on reserve_space_1 and reserve_space_2 to say that they are the pop mean and variance when is_training is False?

@ppwwyyxx (Contributor Author)

Comments were updated.
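
The forwarding discussed in this thread can be sketched in plain Python. This is an illustrative sketch rather than the code from the PR: `select_grad_statistics` and its list arguments are hypothetical stand-ins, and it assumes the forward FusedBatchNorm op takes (x, scale, offset, mean, variance) and returns (y, batch_mean, batch_variance, reserve_space_1, reserve_space_2).

```python
def select_grad_statistics(is_training, fwd_inputs, fwd_outputs):
    """Return the (mean, variance) pair to pass as reserve_space_1/2.

    Hypothetical helper: fwd_inputs and fwd_outputs are plain sequences
    standing in for the forward op's inputs and outputs.
    """
    if is_training:
        # Training: reuse the batch statistics saved by the forward pass.
        return fwd_outputs[3], fwd_outputs[4]
    # Inference: forward the population statistics the op was given.
    return fwd_inputs[3], fwd_inputs[4]


# Placeholder names instead of real tensors, to show which slot is picked.
fwd_inputs = ["x", "scale", "offset", "pop_mean", "pop_var"]
fwd_outputs = ["y", "batch_mean", "batch_var", "saved_mean", "saved_var"]
print(select_grad_statistics(True, fwd_inputs, fwd_outputs))
print(select_grad_statistics(False, fwd_inputs, fwd_outputs))
```

With this arrangement the gradient kernel sees the same two input slots in both modes, which is why the op documentation needs to say what those slots hold when is_training is False.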

T epsilon, Tensor* x_backprop_output,
Tensor* scale_backprop_output, Tensor* offset_backprop_output,
typename TTypes<T>::Vec scratch1, typename TTypes<T>::Vec scratch2) {
typename TTypes<T, 4>::ConstTensor out_backprop(y_backprop_input.tensor<T, 4>());

@zhangyaobit, Aug 30, 2017

Rename to y_backprop?


// db = out_backprop
// dg = out_backprop * ((x - m) * rsqrt(v + epsilon))
// dx = out_backprop * (gamma * rsqrt(v + epsilon))

Rename all names to be consistent with what is used in the program?

.eval()
.reshape(one_by_depth)
.broadcast(rest_by_one));
scale_backprop.device(d) = scratch2 * scratch1;

Is what is implemented above equivalent to the Python implementation below?
grad_offset = reduce_sum(grad_y)
grad_scale = reduce_sum(grad_y*(x-pop_mean)*var_rsqrt)
grad_x = grad_y * scale * var_rsqrt

@ppwwyyxx (Contributor Author)

That looks equivalent to me
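
For readers who want to check the three formulas above by hand, here is a minimal pure-Python sketch. It is not the kernel from this PR: the function names are hypothetical, the input is assumed to be flattened to rows with one value per channel, and `frozen_batch_norm` is included only so the gradients can be compared against finite differences.

```python
import math


def frozen_batch_norm(x, scale, offset, pop_mean, pop_var, epsilon=1e-4):
    """Forward pass with fixed statistics; x is a list of rows, one value per channel."""
    rsqrt = [1.0 / math.sqrt(v + epsilon) for v in pop_var]
    return [
        [scale[c] * (row[c] - pop_mean[c]) * rsqrt[c] + offset[c]
         for c in range(len(scale))]
        for row in x
    ]


def frozen_batch_norm_grad(grad_y, x, scale, pop_mean, pop_var, epsilon=1e-4):
    """Gradients w.r.t. x, scale and offset, following the three formulas above."""
    depth = len(scale)
    var_rsqrt = [1.0 / math.sqrt(v + epsilon) for v in pop_var]
    # grad_offset = reduce_sum(grad_y), reduced over everything but channels
    grad_offset = [sum(g[c] for g in grad_y) for c in range(depth)]
    # grad_scale = reduce_sum(grad_y * (x - pop_mean) * var_rsqrt)
    grad_scale = [
        sum(g[c] * (row[c] - pop_mean[c]) * var_rsqrt[c]
            for g, row in zip(grad_y, x))
        for c in range(depth)
    ]
    # grad_x = grad_y * scale * var_rsqrt (elementwise, no reduction)
    grad_x = [[g[c] * scale[c] * var_rsqrt[c] for c in range(depth)]
              for g in grad_y]
    return grad_x, grad_scale, grad_offset


# Two rows with two channels each (made-up values for illustration).
x = [[1.0, 2.0], [3.0, 6.0]]
grad_y = [[1.0, 1.0], [1.0, -1.0]]
scale, pop_mean, pop_var = [2.0, 0.5], [2.0, 4.0], [1.0, 4.0]
grad_x, grad_scale, grad_offset = frozen_batch_norm_grad(
    grad_y, x, scale, pop_mean, pop_var, epsilon=0.0)
print(grad_x, grad_scale, grad_offset)
```

Because the statistics are constants here, the mean and variance contribute no gradient terms of their own, which is why grad_x needs no reduction, unlike the training-mode gradient.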

Comment thread tensorflow/core/ops/nn_ops.cc Outdated
x: A 4D Tensor for input data.
scale: A 1D Tensor for scaling factor, to scale the normalized x.
reserve_space_1: A 1D Tensor for the computed batch mean, to be reused
reserve_space_1: A 1D Tensor for the computed batch mean when is_training is True,

How about something like this?

A 1D Tensor for the computed batch mean when is_training is True, to be reused in the gradient computation; or the population mean when is_training is False, to be used in the second-order gradient computation.

And the same for reserve_space_2.

@ppwwyyxx (Contributor Author)

When is_training is False, pop_mean and pop_variance are needed for the first-order gradient computation as well.

Ah, ok, then "to be used in the first-order and second-order gradient computation."

epsilon_, x_backprop, scale_backprop, offset_backprop, tensor_format_);
if (is_training_) {
functor::FusedBatchNormGrad<Device, T>()(
context, y_backprop, x, scale, saved_mean, saved_maybe_inv_var,

This is a bit confusing. Rename to saved_mean_or_pop_mean and saved_maybe_inv_var_or_pop_var?

@ppwwyyxx (Contributor Author)

Did some renaming (on one existing kernel as well) to improve clarity.

@zhangyaobit zhangyaobit left a comment

Nice. Thanks!

@zhangyaobit

Let's wait a bit to see if zheng-xq has any comments (This PR may affect lots of users, let's be extra careful :) ). Thanks!

@yifeif requested a review from zheng-xq on September 8, 2017 00:10
@yifeif added the "awaiting review" label on Sep 8, 2017

@ppwwyyxx (Contributor Author)

Any updates?

@zhangyaobit

Thanks for the patience, Yuxin! zheng-xq will respond soon.

@drpngx (Contributor) commented Sep 17, 2017

/CC @zheng-xq with the @ sign to trigger notification.

@drpngx (Contributor) commented Sep 17, 2017

Jenkins, test this please.

// Functor used by FusedBatchNormGradOp to do the computations when is_training=False.
// Both CPU and GPU will use this functor.
template <typename Device, typename T>
struct FusedBatchNormFreezeGrad {

Should this be inside functor namespace? Note this test failure: tensorflow/core/kernels/fused_batch_norm_op.cc:645:7: error: 'FusedBatchNormFreezeGrad' is not a member of 'tensorflow::functor'

at https://ci.tensorflow.org/job/tensorflow-pull-requests-cpu-python3/6452/console

@ppwwyyxx (Contributor Author)

Looks like this header is only included when GPU is enabled. I'll fix it.

@ppwwyyxx (Contributor Author)

Ah, this is more complicated than I thought. Looks like I need to change the BUILD file somehow.

@zhangyaobit

Jenkins, test this please.

@drpngx (Contributor) commented Sep 20, 2017

Jenkins, test this please.

@zhangyaobit removed the request for review from zheng-xq on September 21, 2017 23:46

@zhangyaobit

Please merge this change. Thanks!

Labels: awaiting review, cla: yes

9 participants