Hyperparameter Tuning and Batch Normalization

Constant Learning Rate

Until now, throughout the training process, the learning rate remained constant.

When the path taken by gradient descent approaches the minimum, it might overshoot the minimum simply because the update step is still large.

It then takes additional steps to come back toward the minimum.

With a constant learning rate, gradient descent therefore tends to oscillate when it is about to converge.

The training can start with a large learning rate, since the randomly initialized weights will be far from optimal. At later stages, the learning rate can be decreased to allow more fine-grained weight updates.
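
As a rough illustration of this idea, here is a minimal Python sketch of two common learning-rate decay schedules (the schedules and constants are illustrative choices, not part of the hands-on below):

def time_based_decay(initial_lr, decay_rate, epoch):
    # learning rate shrinks as 1 / (1 + decay_rate * epoch)
    return initial_lr / (1.0 + decay_rate * epoch)

def exponential_decay(initial_lr, decay_rate, epoch):
    # learning rate shrinks geometrically: lr0 * decay_rate ** epoch
    return initial_lr * decay_rate ** epoch

for epoch in [0, 10, 20, 30, 40]:
    print(epoch, time_based_decay(0.1, 0.05, epoch), exponential_decay(0.1, 0.95, epoch))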

Batch Normalization
From the previous course, Building Effective Deep Neural Network, you have seen how normalizing the inputs helps the network train faster.

This concept can be extended to each layer of the network by normalizing the output of the previous layer before feeding it to the next layer. This technique is known as batch normalization.

Covariate Shift
Covariate shift is one of the major problems addressed by batch normalization.

For example, if you have trained a network to identify human faces using grayscale images and then test the model on colored images, the network might not perform well, since there is a large difference in pixel values between the training and test data.

This problem is known as covariate shift: the input data distribution shifts, but the ground truth remains the same.

Internal Covariate Shift


Covariate shift can also happen between the layers of the network as data flows through them.

In mini-batch gradient descent, since each batch is made of a random set of samples, the current mini-batch might have a different distribution from the previous one.

This change in distribution is reflected in the outputs of the subsequent layers. In addition, when the parameters of the previous layers get updated, the input distribution of the current layer changes as well.

In batch normalization, each layer keeps its input distribution stable by normalizing the pre-activation values before applying the activation.

For the current mini-batch of size m, batch normalization computes:

μ = (1/m) Σ_i Z^(i)
σ² = (1/m) Σ_i (Z^(i) − μ)²
Z_norm^(i) = (Z^(i) − μ) / √(σ² + ε)
Z̃^(i) = γ · Z_norm^(i) + β
How Does it Work?

The equations shown above perform the following operations:

Calculate the mean μ of the mini-batch.

Calculate the variance σ² of the mini-batch.

Calculate Z_norm by subtracting the mean from Z and dividing by the standard deviation. A small number, epsilon (ε), is added inside the square root to prevent division by zero. The distribution now has zero mean and unit variance.

Calculate Z̃ by multiplying Z_norm with a scale γ and adding a shift β, and use Z̃ in place of Z as the input to the nonlinearity (e.g. ReLU). The two parameters γ and β are learned during training along with the parameters W and b.
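
To make these steps concrete, here is a minimal NumPy sketch of the batch-norm transformation applied to the pre-activations Z of one layer for a single mini-batch (illustrative only; the hands-on below uses TensorFlow's built-in tf.layers.batch_normalization instead). The shape convention (units, batch size) matches the one used later.

import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)        # mini-batch mean per unit
    var = np.var(Z, axis=1, keepdims=True)        # mini-batch variance per unit
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)    # zero mean, unit variance
    return gamma * Z_norm + beta                  # learned scale and shift

Z = np.random.randn(3, 8)        # 3 units, mini-batch of 8 samples
gamma = np.ones((3, 1))          # scale, initialized to ones
beta = np.zeros((3, 1))          # shift, initialized to zeros
Z_tilde = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1))      # approximately zero for each unit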

Parameters vs Hyperparameters

So far, you have come across several parameters and hyperparameters.

Parameters are the values learned by the network through gradient descent.

Weights W, biases b, and the scaling parameters γ and β (the ones you learned about in batch normalization) are parameters that you initialize and then leave for the network to learn.

Hyperparameters, on the other hand, cannot be learned from the data; they have to be tried with different values within some range until the model performs well.

Hyperparameters

Some of the important hyperparameters you have learned so far are:

learning rate

β, the parameter for gradient descent with momentum

number of nodes in each layer

number of layers

mini-batch size

β₁, β₂, and ε for the Adam optimizer

Selecting Hyperparameters

When training the model, you have to try various combinations of hyperparameters and come up with the set on which the model performs best.

Grid Search: In a grid search, you arrange the hyperparameter values in the form of a grid (matrix) and train a model for each combination.

In a real scenario, when trying out more than two hyperparameters, the grid becomes multidimensional. The main problem with this approach is that you end up training a large number of models, many of which achieve roughly the same accuracy.

This works well when the number of hyperparameters is small.

Random Search: In this approach, instead of iterating through every combination, you randomly select a limited number of combinations to train the model.

Though you do not iterate over all possible combinations, the chances of finding a near-optimal combination are still high.

This is helpful when you have a large number of hyperparameters to tune.
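
For illustration, here is a small Python sketch contrasting the two strategies over two hypothetical hyperparameters (the ranges and counts are made up for the example):

import itertools
import numpy as np

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [64, 128, 256]

# Grid search: train one model per combination (9 models here).
grid = list(itertools.product(learning_rates, batch_sizes))

# Random search: sample a limited number of combinations from the same ranges.
rng = np.random.RandomState(0)
random_trials = [(float(10 ** rng.uniform(-3, -1)), int(rng.choice(batch_sizes)))
                 for _ in range(4)]

print(len(grid), "grid search combinations")
print(len(random_trials), "random search combinations:", random_trials)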

Choosing Appropriate Scale

Hyperparameters like the number of nodes and the number of layers can be searched on a linear scale, since their range is small.

The model's performance is sensitive to small changes in values such as the learning rate α or the momentum parameter β, so searching for them on a linear scale would be a bad idea; a logarithmic scale works better.
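
A minimal sketch of searching on a logarithmic scale, assuming we want the learning rate α roughly between 0.0001 and 1 and the momentum parameter β between 0.9 and 0.999 (the ranges are illustrative):

import numpy as np

rng = np.random.RandomState(1)

r = rng.uniform(-4, 0, size=5)       # exponent sampled uniformly in [-4, 0]
learning_rates = 10 ** r             # learning rate between 1e-4 and 1

s = rng.uniform(-3, -1, size=5)      # exponent for 1 - beta
betas = 1 - 10 ** s                  # beta between 0.9 and 0.999

print(learning_rates)
print(betas)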

To summarize what you have learned so far:

GD with momentum: avoids too many oscillations in the path taken by gradient descent.

RMSProp: a technique to keep the step size balanced - it decreases the step size for large gradients and increases it for vanishing gradients.

Adam optimizer: an algorithm that combines the elements of momentum and RMSProp.

Batch normalization: prevents covariate shift by normalizing the inputs to each activation.

Hyperparameter tuning: methods to search for optimal hyperparameters.

Learning rate decay: how to control the learning rate to prevent large gradient steps during the later stages of training.

Welcome to the first hands-on!

In this hands-on you will build a deep neural network that integrates batch normalization.
You will also implement mini-batch gradient descent and L2 regularization to train your network.
Follow the instructions provided for each cell to write the code.
Run the cell below to import the necessary packages to read and visualize the data.

In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors

The data is provided as a file named 'data.csv'.

Using pandas, read the csv file and assign the resulting dataframe to the variable 'data'.
For example, if the file name is 'xyz.csv', read it as pd.read_csv('xyz.csv').
In [2]: data = pd.read_csv('../input/data (1).csv')
data.head()

Out[2]:
   feature1  feature2  target
0 -0.260842  0.965382     0.0
1  0.880000  0.000000     1.0
2 -0.942991 -0.332820     0.0
3  0.309017  0.951057     0.0
4 -0.691934 -0.543716     1.0

Extract the feature1 and feature2 values from the dataframe 'data' and assign them to the variable 'X'.
Extract the target variable 'target' and assign it to the variable 'y'.
Hint:
Use .values to extract values from the dataframe.

In [3]: X = data.loc[:, data.columns != 'target'].values
y = data['target'].values

Run the cell below to visualize the data in the x-y plane (the visualization code has been written for you).
The green spots correspond to target value 0 and the blue spots correspond to target value 1.

In [4]: colors=['green','blue']
cmap = matplotlib.colors.ListedColormap(colors)
#Plot the figure
plt.figure()
plt.title('Non-linearly separable classes')
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y, cmap=cmap,
s=25, edgecolor='k')
plt.show()
In [5]: from pandas.plotting import scatter_matrix
%matplotlib inline
color_wheel = {0: "#0392cf",
1: "#7bc043",
}

colors_mapped = data["target"].map(lambda x: color_wheel.get(x))

axes_matrix = scatter_matrix(data.loc[:, data.columns != 'target'], alpha=0.2,
                             figsize=(10, 10), color=colors_mapped)

In order to feed the network, the input has to be of shape (number of features, number of samples)
and the target should be of shape (1, number of samples).
Transpose X and assign it to the variable 'X_data'.
Reshape y to have shape (1, number of samples) and assign it to the variable 'y_data'.
In [6]: X_data = X.T
y_data = y.reshape(1, -1)

assert X_data.shape == (2, 1000)
assert y_data.shape == (1, 1000)

Define the network dimensions to have two input features, four hidden layers with 20 nodes each, and one output node in the final layer.

In [7]: layer_dims = [2, 20, 20, 20, 20, 1]

Import tensorflow as tf

In [8]: import tensorflow as tf

Define a function named placeholders to return two placeholders: one for the input data as A_0 and one for the output data as Y.

Set the datatype of the placeholders to float32.

Parameters - num_features
Returns - A_0 with shape (num_features, None) and Y with shape (1, None)

In [9]: def placeholders(num_features):
    A_0 = tf.placeholder(shape=[num_features, None], dtype=tf.float32)
    Y = tf.placeholder(shape=[1, None], dtype=tf.float32)
    return A_0, Y

Define a function named initialize_parameters_deep() to initialize the weights and biases for each layer.

Use tf.get_variable to initialize the weights and biases, and set the datatype to float32.

Make sure you use Xavier initialization for the weights and initialize the biases to zeros.
Parameters - layer_dims
Returns - dictionary of weights and biases
In [10]: def initialize_parameters_deep(layer_dims):
    tf.set_random_seed(1)
    L = len(layer_dims)
    parameters = {}
    for l in range(1, L):
        parameters['W' + str(l)] = tf.get_variable('W' + str(l),
                                                   shape=[layer_dims[l], layer_dims[l-1]],
                                                   dtype=tf.float32,
                                                   initializer=tf.contrib.layers.xavier_initializer())
        parameters['b' + str(l)] = tf.get_variable('b' + str(l),
                                                   shape=[layer_dims[l], 1],
                                                   dtype=tf.float32,
                                                   initializer=tf.zeros_initializer())
    return parameters

Define a function named linear_forward_prop() to define forward propagation for a given layer.

parameters: A_prev (output from the previous layer), W (weight matrix of the current layer), b (bias vector of the
current layer), activation (type of activation to be used for the output of the current layer)
returns: A (output of the current layer)
Use relu activation for the hidden layers; for the final output layer (activation 'sigmoid'), return the output
unactivated.
After computing the linear output Z, implement batch normalization before feeding it to the activation function; set
training = True and axis = 0.

In [11]: def linear_forward_prop(A_prev, W, b, activation):
    # linear step of forward propagation
    Z = tf.add(tf.matmul(W, A_prev), b)
    # implement batch normalization on Z before the activation
    Z = tf.layers.batch_normalization(inputs=Z, axis=0, training=True,
                                      gamma_initializer=tf.ones_initializer(),
                                      beta_initializer=tf.zeros_initializer())
    if activation == "sigmoid":
        A = Z            # return the unactivated output; the sigmoid is applied inside the loss
    elif activation == "relu":
        A = tf.nn.relu(Z)
    return A

Define forward propagation for the entire network as l_layer_forwardProp().

Parameters: A_0 (input data), parameters (dictionary of weights and biases)

returns: A (output of the final layer)
In [12]: def l_layer_forwardProp(A_0, parameters):
    A = A_0
    L = len(parameters) // 2
    for l in range(1, L):
        A_prev = A
        # call linear forward prop with relu activation for the hidden layers
        A = linear_forward_prop(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], activation='relu')
    # call linear forward prop with 'sigmoid' for the final layer (output returned unactivated)
    A = linear_forward_prop(A, parameters['W' + str(L)], parameters['b' + str(L)], activation='sigmoid')
    return A

Define the cost function.

parameters:
Z_final: output from the final layer
Y: actual output
parameters: dictionary of weights and biases
regularization: boolean
lambd: regularization parameter
First define the original cost using tensorflow's sigmoid_cross_entropy function.
If regularization == True, add the regularization term to the original cost.

In [13]: def final_cost(Z_final, Y, parameters, regularization=False, lambd=0):
    cost = tf.nn.sigmoid_cross_entropy_with_logits(logits=Z_final, labels=Y)
    if regularization:
        reg_term = 0
        L = len(parameters) // 2
        for l in range(1, L + 1):
            # add the L2 loss term for each weight matrix
            reg_term += tf.nn.l2_loss(parameters['W' + str(l)])
        cost = cost + (lambd / 2) * reg_term
    return tf.reduce_mean(cost)

Define the function to generate mini-batches.


In [14]: import numpy as np
def random_samples_minibatch(X, Y, batch_size, seed=1):
    np.random.seed(seed)
    m = X.shape[1]                        # number of samples
    num_batches = int(m / batch_size)     # number of batches derived from batch_size

    indices = np.random.permutation(m)    # generate random indices
    shuffle_X = X[:, indices]
    shuffle_Y = Y[:, indices]
    mini_batches = []

    # generate mini-batches
    for i in range(num_batches):
        X_batch = shuffle_X[:, i * batch_size:(i + 1) * batch_size]
        Y_batch = shuffle_Y[:, i * batch_size:(i + 1) * batch_size]
        assert X_batch.shape == (X.shape[0], batch_size)
        assert Y_batch.shape == (Y.shape[0], batch_size)
        mini_batches.append((X_batch, Y_batch))

    # generate a batch with the remaining samples
    if m % batch_size != 0:
        X_batch = shuffle_X[:, (num_batches * batch_size):]
        Y_batch = shuffle_Y[:, (num_batches * batch_size):]
        mini_batches.append((X_batch, Y_batch))
    return mini_batches

Define the model to train the network using mini-batches.

parameters:
X_train, Y_train: input and target data
layer_dims: network configuration
learning_rate
num_iter: number of epochs
mini_batch_size: number of samples in each mini-batch
returns: dictionary of trained parameters
In [15]: def model_with_minibatch(X_train, Y_train, layer_dims, learning_rate, num_iter, mini_batch_size):
    tf.reset_default_graph()
    num_features, num_samples = X_train.shape

    # call placeholders() to initialize placeholders A_0 and Y
    A_0, Y = placeholders(num_features)
    # initialize weights and biases using initialize_parameters_deep()
    parameters = initialize_parameters_deep(layer_dims)
    # call l_layer_forwardProp() to define the final output
    Z_final = l_layer_forwardProp(A_0, parameters)
    # call final_cost() with regularization set to True
    cost = final_cost(Z_final, Y, parameters, regularization=True)

    # use Adam optimization to train the network
    train_net = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.999).minimize(cost)

    seed = 1
    num_minibatches = int(num_samples / mini_batch_size)
    init = tf.global_variables_initializer()
    costs = []
    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(num_iter):
            epoch_cost = 0
            # call random_samples_minibatch() to return the mini-batches
            mini_batches = random_samples_minibatch(X_train, Y_train, mini_batch_size, seed)
            seed = seed + 1

            # perform gradient descent for each mini-batch
            for mini_batch in mini_batches:
                X_batch, Y_batch = mini_batch
                _, mini_batch_cost = sess.run([train_net, cost], feed_dict={A_0: X_batch, Y: Y_batch})
                epoch_cost += mini_batch_cost / num_minibatches

            if epoch % 2 == 0:
                costs.append(epoch_cost)
            if epoch % 100 == 0:
                print(epoch_cost)

        with open("output.txt", "w+") as file:
            file.write("%f" % epoch_cost)
        plt.ylim(0, 2)
        plt.xlabel("epochs (per 2)")
        plt.ylabel("cost")
        plt.plot(costs)
        plt.show()
        params = sess.run(parameters)
    return params

Train the model using the function defined above.

Use X_data and y_data as the training input, learning rate = 0.001, num_iter = 1000, and
mini-batch size = 256.
Assign the trained parameters to the variable 'parameters'.

In [16]: parameters = model_with_minibatch(X_data, y_data, layer_dims, learning_rate=0.001,
                                   num_iter=1000, mini_batch_size=256)

1.0600777665774028
0.3384200731913249
0.2255556285381317
0.1712941179672877
0.1367433468500773
0.10704284906387329
0.08675921087463698
0.06883981203039488
0.05524631341298421
0.04730481530229251

