When fine-tuning networks trained with batch normalization (BN), we sometimes want to freeze the accumulated moving averages and normalize with them while still backpropagating gradients through the BN layer. Currently there is no way to do this with fused BN, because with is_training = False the layer gives erroneous gradients. Of course, we could instead use batch statistics from the new task to update the moving averages, but that is not an option when batch_size = 1.
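To make the requested behavior concrete, here is a minimal, dependency-free sketch of BN with frozen statistics for a single channel. It is not TensorFlow's or cuDNN's implementation; the function names `frozen_bn_forward` and `frozen_bn_backward` are hypothetical, and the `eps` default is an arbitrary assumption. Because the moving mean and variance are treated as constants, the layer reduces to a per-channel affine map, so the input gradient is just the upstream gradient scaled by `gamma / sqrt(var + eps)`.

```python
import math

def frozen_bn_forward(x, mean, var, gamma, beta, eps=1e-3):
    """Normalize one channel's values with fixed (moving-average) statistics."""
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (v - mean) * inv_std + beta for v in x]

def frozen_bn_backward(x, dy, mean, var, gamma, eps=1e-3):
    """Gradients of the forward pass above, treating mean and var as constants.

    No terms flow through the statistics, unlike training-mode BN where the
    batch mean and variance depend on x.
    """
    inv_std = 1.0 / math.sqrt(var + eps)
    dx = [g * gamma * inv_std for g in dy]
    dgamma = sum(g * (v - mean) * inv_std for g, v in zip(dy, x))
    dbeta = sum(dy)
    return dx, dgamma, dbeta
```

Nothing here depends on the batch size, so the same code applies when the list `x` holds a single value.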
I understand that, given the nature of the cuDNN kernel, such a feature might be hard to implement. A fused Batch Renorm layer could be a decent compromise, since it uses the moving averages during training as well as during inference.
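For context, a rough sketch of the Batch Renorm forward pass for one channel, again in plain Python. The function name `batch_renorm_forward` and the default values of `r_max`, `d_max`, and `eps` are assumptions for illustration, not any library's API. The correction factors `r` and `d` are treated as constants during backpropagation; while they stay inside their clipping ranges, the output matches normalization with the moving averages.

```python
import math

def _clip(v, lo, hi):
    """Clamp v to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))

def batch_renorm_forward(x, moving_mean, moving_var, gamma, beta,
                         r_max=3.0, d_max=5.0, eps=1e-3):
    """Batch Renorm training-mode forward pass for one channel (sketch)."""
    n = len(x)
    batch_mean = sum(x) / n
    batch_var = sum((v - batch_mean) ** 2 for v in x) / n
    batch_std = math.sqrt(batch_var + eps)
    moving_std = math.sqrt(moving_var + eps)
    # r and d map batch-normalized values onto moving-average-normalized ones.
    # In a real implementation no gradient flows through them.
    r = _clip(batch_std / moving_std, 1.0 / r_max, r_max)
    d = _clip((batch_mean - moving_mean) / moving_std, -d_max, d_max)
    return [gamma * ((v - batch_mean) / batch_std * r + d) + beta for v in x]
```

With a single-element batch, `v - batch_mean` is zero, so the output is determined by `d` alone, which is the moving-average normalization whenever `d` is not clipped.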