Deep Learning
Tutorial 1
Q1. Recall the following plot of the number of stochastic gradient descent (SGD) iterations required to reach a
given loss, as a function of the batch size.
A. For small batch sizes, the number of iterations required to reach the target loss decreases as the batch size
increases. Why is that?
B. For large batch sizes, the number of iterations does not change much as the batch size is increased. Why is
that?
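As a rough intuition aid (not part of the question itself), the sketch below assumes a toy one-parameter least-squares problem and simply measures how the spread of the minibatch gradient estimate changes with the batch size B; the data, model, and batch sizes are hypothetical.

import numpy as np

# Toy illustration (not the model behind the plot): for a one-parameter
# least-squares problem, measure the spread of the minibatch gradient
# estimate for several batch sizes B.
rng = np.random.default_rng(0)
N = 10_000
x = rng.normal(size=N)               # synthetic 1-D inputs
y = 3.0 * x + rng.normal(size=N)     # targets with noise, true slope 3

w = 0.0                              # current parameter value
# gradient of 0.5 * (w*x - y)^2 w.r.t. w is (w*x - y) * x
for B in [1, 8, 64, 512]:
    estimates = []
    for _ in range(200):
        idx = rng.choice(N, size=B, replace=False)
        estimates.append(np.mean((w * x[idx] - y[idx]) * x[idx]))
    print(f"B={B:4d}  mean grad={np.mean(estimates):+.3f}  std={np.std(estimates):.3f}")
# The spread shrinks roughly as 1/sqrt(B): noisy gradients at small B mean more
# iterations to reach a given loss, while at large B the estimate is already
# close to the full-batch gradient, so extra averaging buys little.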
Q2. You are developing a model to classify between tigers and leopards. As input, you pass a flattened 2×2 image to each of the fully connected neural networks shown in the picture below and run backpropagation across multiple samples. There is one output: if the output value is >= 0.5, you classify the image as a tiger; otherwise, it is a leopard.
Based on the above description, answer the following questions.
A. Which neural network's structure will only allow it to classify images as tigers? Explain your answer.
B. For network 2, which activation layer can be removed without changing the numerical or classification
output of the neural network? Explain your answer.
1. Sigmoid activation between Hidden layer 1 and Hidden layer 2
2. Sigmoid activation between Hidden layer 2 and the Output layer
3. ReLU activation between Hidden layer 2 and Output layer
Q3. Consider the following neural network and the matrices below. Use these to answer the following questions.
A. Using the neural network, calculate the forward-propagated value for the following input:
[0.2, 0.1, 0.4, -0.5]
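The weight matrices for Q3 are given only in the figure, so the sketch below uses hypothetical placeholder matrices W1 and W2 and an assumed 4-3-2 layout with a ReLU hidden layer and a linear output; substitute the actual matrices from the figure to obtain the required value.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical placeholder matrices -- the actual W1, W2 are given in the
# figure accompanying Q3 and should be substituted here. Shapes assume a
# 4-input, one-hidden-layer (3 units), 2-output network purely for illustration.
W1 = np.array([[ 0.2, -0.1,  0.4],
               [ 0.5,  0.3, -0.2],
               [-0.3,  0.1,  0.6],
               [ 0.1, -0.4,  0.2]])    # shape (4, 3): input -> hidden layer 1
W2 = np.array([[ 0.7, -0.5],
               [ 0.2,  0.4],
               [-0.1,  0.3]])          # shape (3, 2): hidden layer 1 -> output

x = np.array([0.2, 0.1, 0.4, -0.5])    # input from part A

h1 = relu(x @ W1)    # hidden layer 1 activations (ReLU assumed)
y_hat = h1 @ W2      # output layer (linear output assumed)
print(h1, y_hat)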
B. For the given neural network, an input of [0.3, 4.3, 2.5, 0.7] produces the output [2.348, 1.732]. Find the gradient for the hidden layer 1 weights when the target output is [0.5, 0.8] and the loss function is the mean squared error,
MSE = \frac{1}{2N} \sum_{i=1}^{N} (y - y')^2,
where y is the ground truth, y' is the predicted value, and N is the batch size.
For example, dw1(0,0) = 0.277680. Calculate the gradients for the remaining hidden layer 1 weights.
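A matching sketch of the backward pass, under the same assumptions as in part A (placeholder W1 and W2, an assumed 4-3-2 layout, ReLU hidden layer, linear output); with the actual matrices from the figure substituted in, the final line yields the hidden layer 1 gradient and should reproduce dw1(0,0) = 0.277680.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Placeholder shapes only -- substitute the W1, W2 from the Q3 figure.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # input -> hidden layer 1
W2 = rng.normal(size=(3, 2))   # hidden layer 1 -> output

x = np.array([0.3, 4.3, 2.5, 0.7])   # input from part B
t = np.array([0.5, 0.8])             # target output from part B

# forward pass (ReLU hidden layer, linear output assumed)
z1 = x @ W1
h1 = relu(z1)
y_hat = h1 @ W2

# backward pass for MSE = (1/2N) * sum (y - y')^2 with batch size N = 1
N = 1
dL_dy = (y_hat - t) / N        # dL/dy', shape (2,)
dL_dh1 = dL_dy @ W2.T          # back through the output weights
dL_dz1 = dL_dh1 * (z1 > 0)     # ReLU derivative: 1 where z1 > 0, else 0
dL_dW1 = np.outer(x, dL_dz1)   # gradient for the hidden layer 1 weights
print(dL_dW1)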
C. Your colleague suggests replacing the ReLU of the Q3 neural network with the function defined below:
f(x) = \begin{cases} 1 & \text{for } x > 0 \\ -1 & \text{for } x \le 0 \end{cases}
Which of the following are correct with respect to this newly proposed activation function? Explain your answer (a short numerical sketch follows the statements below).
1. The newly proposed activation function prevents vanishing gradients.
2. The newly proposed activation function prevents the dead ReLU problem.
3. The newly proposed activation function will let the network converge correctly since all the gradient values will be the same.
4. The proposed activation function will not allow the neural network to converge.
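As referenced above, here is a quick numerical check of the proposed activation, using only its definition from part C; it exposes the property that the four statements hinge on.

import numpy as np

def f(x):
    # proposed replacement for ReLU: +1 for x > 0, -1 for x <= 0
    return np.where(x > 0, 1.0, -1.0)

# finite-difference derivative at a few points away from the jump at 0
eps = 1e-4
for x0 in [-2.0, -0.5, 0.5, 3.0]:
    deriv = (float(f(x0 + eps)) - float(f(x0 - eps))) / (2 * eps)
    print(f"x = {x0:+.1f}   f(x) = {float(f(x0)):+.0f}   numerical derivative = {deriv:.1f}")
# Every printed derivative is 0: the function is flat on both sides of the jump,
# so backpropagation through this activation passes no gradient at all.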
Q4. You come across a nonlinear function that evaluates to 1 if its input is nonnegative and to 0 otherwise, i.e.
f(x) = \begin{cases} 1 & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}
A friend recommends you use this non-linearity in your convolutional neural network with the Adam optimizer.
Would you follow their advice? Why or why not?