Deep Learning Assignment 2: Shallow Neural Networks - Solutions
Problem 1: Binary Classification with ReLU and Sigmoid (25 points)

Given:
Architecture: [2, 3, 1]
W⁽¹⁾ = [[0.6, -0.4], [-0.2, 0.8], [0.5, 0.3]], b⁽¹⁾ = [0.2, -0.1, 0.0]ᵀ

W⁽²⁾ = [0.7, -0.3, 0.9], b⁽²⁾ = 0.1


X = [[1.5, -0.8], [2.0, 1.2]], Y = [1, 0]

Part A: Forward Propagation (10 points)


Step 1: Compute Z⁽¹⁾ and A⁽¹⁾

Z⁽¹⁾ = W⁽¹⁾X + b⁽¹⁾

For example 1 (x₁ = [1.5, 2.0]):

Z⁽¹⁾₁ = [0.6×1.5 + (-0.4)×2.0 + 0.2] = [0.9 - 0.8 + 0.2] = [0.3]


Z⁽¹⁾₂ = [-0.2×1.5 + 0.8×2.0 + (-0.1)] = [-0.3 + 1.6 - 0.1] = [1.2]
Z⁽¹⁾₃ = [0.5×1.5 + 0.3×2.0 + 0.0] = [0.75 + 0.6] = [1.35]

For example 2 (x₂ = [-0.8, 1.2]):

Z⁽¹⁾₁ = [0.6×(-0.8) + (-0.4)×1.2 + 0.2] = [-0.48 - 0.48 + 0.2] = [-0.76]


Z⁽¹⁾₂ = [-0.2×(-0.8) + 0.8×1.2 + (-0.1)] = [0.16 + 0.96 - 0.1] = [1.02]
Z⁽¹⁾₃ = [0.5×(-0.8) + 0.3×1.2 + 0.0] = [-0.4 + 0.36] = [-0.04]

Z⁽¹⁾ = [[0.3, -0.76], [1.2, 1.02], [1.35, -0.04]]

Apply ReLU: A⁽¹⁾ = max(0, Z⁽¹⁾)


A⁽¹⁾ = [[0.3, 0], [1.2, 1.02], [1.35, 0]]

Step 2: Compute Z⁽²⁾ and A⁽²⁾

Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾

For example 1:
Z⁽²⁾₁ = 0.7×0.3 + (-0.3)×1.2 + 0.9×1.35 + 0.1 = 0.21 - 0.36 + 1.215 + 0.1 = 1.165
For example 2:
Z⁽²⁾₂ = 0.7×0 + (-0.3)×1.02 + 0.9×0 + 0.1 = 0 - 0.306 + 0 + 0.1 = -0.206

Z⁽²⁾ = [1.165, -0.206]

Apply Sigmoid: A⁽²⁾ = 1/(1 + e⁻ᶻ⁽²⁾)


A⁽²⁾₁ = 1/(1 + e⁻¹·¹⁶⁵) = 1/(1 + 0.312) = 0.762
A⁽²⁾₂ = 1/(1 + e⁰·²⁰⁶) = 1/(1 + 1.229) = 0.449

A⁽²⁾ = [0.762, 0.449]

Step 3: Calculate Cost J

J = -1/m ∑[y⁽ⁱ⁾log(a⁽²⁾⁽ⁱ⁾) + (1-y⁽ⁱ⁾)log(1-a⁽²⁾⁽ⁱ⁾)]

J = -1/2 [1×log(0.762) + 0×log(1-0.762) + 0×log(0.449) + 1×log(1-0.449)]


J = -1/2 [log(0.762) + log(0.551)]
J = -1/2 [-0.272 + (-0.596)] = -1/2 × (-0.868) = 0.434
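
The forward pass above is easy to check mechanically. Below is a minimal NumPy sketch (not part of the original assignment; variable names such as W1 and b1 are illustrative) that stacks both examples as columns of X and should reproduce Z⁽¹⁾, A⁽²⁾ and J up to rounding.

python

import numpy as np

# Problem 1 parameters and data (columns of X are the two training examples)
W1 = np.array([[0.6, -0.4], [-0.2, 0.8], [0.5, 0.3]])   # (3, 2)
b1 = np.array([[0.2], [-0.1], [0.0]])                    # (3, 1)
W2 = np.array([[0.7, -0.3, 0.9]])                        # (1, 3)
b2 = np.array([[0.1]])                                   # (1, 1)
X = np.array([[1.5, -0.8], [2.0, 1.2]])                  # (2, m), m = 2
Y = np.array([[1, 0]])                                   # (1, m)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward propagation
Z1 = W1 @ X + b1            # (3, m): [[0.3, -0.76], [1.2, 1.02], [1.35, -0.04]]
A1 = np.maximum(0, Z1)      # ReLU
Z2 = W2 @ A1 + b2           # (1, m): [[1.165, -0.206]]
A2 = sigmoid(Z2)            # approx [[0.762, 0.449]]

# Binary cross-entropy cost
m = X.shape[1]
J = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
print(J)                    # approx 0.434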

Part B: Backward Propagation (12 points)


Step 1: Output layer gradients

dZ⁽²⁾ = A⁽²⁾ - Y = [0.762, 0.449] - [1, 0] = [-0.238, 0.449]

dW⁽²⁾ = 1/m × dZ⁽²⁾ × (A⁽¹⁾)ᵀ


dW⁽²⁾ = 1/2 × [-0.238, 0.449] × [[0.3, 0], [1.2, 1.02], [1.35, 0]]ᵀ
dW⁽²⁾ = 1/2 × [-0.238×0.3 + 0.449×0, -0.238×1.2 + 0.449×1.02, -0.238×1.35 + 0.449×0]
dW⁽²⁾ = 1/2 × [-0.071, 0.172, -0.321] = [-0.036, 0.086, -0.161]

db⁽²⁾ = 1/m × sum(dZ⁽²⁾) = 1/2 × (-0.238 + 0.449) = 0.106

Step 2: Hidden layer gradients

dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ g'⁽¹⁾(Z⁽¹⁾)

(W⁽²⁾)ᵀ × dZ⁽²⁾ = [[0.7], [-0.3], [0.9]] × [-0.238, 0.449]


= [[-0.167, 0.314], [0.071, -0.135], [-0.214, 0.404]]

ReLU derivative: g'⁽¹⁾(Z⁽¹⁾) = [[1, 0], [1, 1], [1, 0]]

dZ⁽¹⁾ = [[-0.167, 0], [0.071, -0.135], [-0.214, 0]]

dW⁽¹⁾ = 1/m × dZ⁽¹⁾ × Xᵀ


dW⁽¹⁾ = 1/2 × [[-0.167, 0], [0.071, -0.135], [-0.214, 0]] × [[1.5, 2.0], [-0.8, 1.2]]
dW⁽¹⁾ = [[-0.125, -0.167], [0.107, -0.010], [-0.161, -0.214]]
db⁽¹⁾ = 1/m × sum(dZ⁽¹⁾, axis=1) = [[-0.084], [-0.032], [-0.107]]

Dimensions:

dZ⁽²⁾: (1, 2)

dW⁽²⁾: (1, 3)

db⁽²⁾: (1, 1)
dZ⁽¹⁾: (3, 2)

dW⁽¹⁾: (3, 2)
db⁽¹⁾: (3, 1)
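
Continuing the earlier sketch, the backward pass is a few vectorized lines; it assumes the variables (Z1, A1, A2, X, Y, W2, m) from the forward-pass snippet and should agree with the hand-computed gradients above up to rounding.

python

# Backward propagation (continues the Problem 1 forward-pass sketch)
dZ2 = A2 - Y                                    # (1, m)
dW2 = (dZ2 @ A1.T) / m                          # (1, 3)
db2 = np.sum(dZ2, axis=1, keepdims=True) / m    # (1, 1)

dZ1 = (W2.T @ dZ2) * (Z1 > 0)                   # ReLU derivative applied as a 0/1 mask, (3, m)
dW1 = (dZ1 @ X.T) / m                           # (3, 2)
db1 = np.sum(dZ1, axis=1, keepdims=True) / m    # (3, 1)

for name, grad in [("dW2", dW2), ("db2", db2), ("dW1", dW1), ("db1", db1)]:
    print(name, grad.shape)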

Part C: Parameter Update (3 points)


With α = 0.1:

W⁽¹⁾_new = W⁽¹⁾ - α × dW⁽¹⁾


W⁽¹⁾_new = [[0.6, -0.4], [-0.2, 0.8], [0.5, 0.3]] - 0.1 × [[-0.125, -0.167], [0.107, -0.010], [-0.161, -0.214]]
W⁽¹⁾_new = [[0.613, -0.383], [-0.211, 0.801], [0.516, 0.321]]

b⁽¹⁾_new = [0.2, -0.1, 0.0] - 0.1 × [-0.084, -0.032, -0.107] = [0.208, -0.097, 0.011]

W⁽²⁾_new = [0.7, -0.3, 0.9] - 0.1 × [-0.036, 0.086, -0.161] = [0.704, -0.309, 0.916]

b⁽²⁾_new = 0.1 - 0.1 × 0.106 = 0.089
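
The update itself is one line per parameter. A short sketch, reusing the parameter arrays and gradients from the previous snippets with α = 0.1:

python

alpha = 0.1  # learning rate

# Gradient descent step (in-place update of the NumPy parameter arrays)
W1 -= alpha * dW1
b1 -= alpha * db1
W2 -= alpha * dW2
b2 -= alpha * db2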

Problem 2: Multi-Example Training with Tanh Activation (25 points)

Given:
Architecture: [3, 4, 1]

W⁽¹⁾ = [[0.3, -0.5, 0.2], [0.4, 0.1, -0.6], [-0.2, 0.7, 0.3], [0.5, -0.3, 0.4]]

b⁽¹⁾ = [0.1, 0.0, -0.2, 0.3]ᵀ

W⁽²⁾ = [0.6, -0.4, 0.8, 0.2], b⁽²⁾ = -0.1

X = [[2.0, 0.5, -1.2], [-0.8, 1.5, 0.7], [1.1, -2.0, 0.3]], Y = [0, 1, 1]

Part A: Activation Function Analysis (8 points)


1. Calculate tanh(1.0): tanh(1.0) = (e¹ - e⁻¹)/(e¹ + e⁻¹) = (2.718 - 0.368)/(2.718 + 0.368) = 2.350/3.086 = 0.762

2. Compute tanh'(1.0): tanh'(z) = 1 - tanh²(z), so tanh'(1.0) = 1 - (0.762)² = 1 - 0.581 = 0.419 (see the numerical check after this list)


3. Range comparison:

Sigmoid: (0, 1)

Tanh: (-1, 1)

4. Why tanh is preferred:

Tanh is zero-centered, making optimization easier

Tanh has stronger gradients than sigmoid around z = 0 (maximum derivative 1 vs. 0.25)

Better gradient flow in deep networks

Part B: Complete Forward-Backward Pass (17 points)


Forward Propagation:

Z⁽¹⁾ = W⁽¹⁾X + b⁽¹⁾

For all examples:

Z⁽¹⁾ = [[0.3×2.0 + (-0.5)×(-0.8) + 0.2×1.1 + 0.1], [...], [...], [...]]


= [[1.12], [0.32], [0.89], [1.74]] (example 1)
= [[0.21], [1.09], [0.85], [0.13]] (example 2)
= [[-0.59], [-1.67], [-1.91], [-0.17]] (example 3)

A⁽¹⁾ = tanh(Z⁽¹⁾) = [[0.81, 0.21, -0.53], [0.31, 0.80, -0.94], [0.71, 0.69, -0.95], [0.94, 0.13, -0.17]]

Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾


Z⁽²⁾ = [1.51, 0.89, -1.42]

A⁽²⁾ = sigmoid(Z⁽²⁾) = [0.82, 0.71, 0.19]

Cost Function:
J = -1/3 [0×log(0.82) + 1×log(1-0.82) + 1×log(0.71) + 0×log(1-0.71) + 1×log(0.19) + 0×log(1-0.19)]
J = -1/3 [log(0.18) + log(0.71) + log(0.19)] ≈ 1.24

Backward Propagation:
dZ⁽²⁾ = A⁽²⁾ - Y = [0.82, -0.29, -0.81]
dW⁽²⁾ = 1/3 × dZ⁽²⁾ × (A⁽¹⁾)ᵀ = [-0.13, 0.15, 0.29, -0.19]
db⁽²⁾ = 1/3 × sum(dZ⁽²⁾) = -0.09

dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ (1 - (A⁽¹⁾)²)


[Detailed calculations would follow similar pattern]

Gradient Dimensions:

dW⁽²⁾: (1, 4) ✓

db⁽²⁾: (1, 1) ✓
dW⁽¹⁾: (4, 3) ✓

db⁽¹⁾: (4, 1) ✓
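
A compact way to check the Part B arithmetic is to run the same computation in NumPy. The sketch below is written for this [3, 4, 1] tanh/sigmoid network with the given parameters; the variable names are illustrative and it prints the cost and the gradient shapes.

python

import numpy as np

# Problem 2 parameters and data (columns of X are the three training examples)
W1 = np.array([[0.3, -0.5, 0.2],
               [0.4, 0.1, -0.6],
               [-0.2, 0.7, 0.3],
               [0.5, -0.3, 0.4]])             # (4, 3)
b1 = np.array([[0.1], [0.0], [-0.2], [0.3]])  # (4, 1)
W2 = np.array([[0.6, -0.4, 0.8, 0.2]])        # (1, 4)
b2 = np.array([[-0.1]])
X = np.array([[2.0, 0.5, -1.2],
              [-0.8, 1.5, 0.7],
              [1.1, -2.0, 0.3]])              # (3, m), m = 3
Y = np.array([[0, 1, 1]])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
m = X.shape[1]

# Forward pass with a tanh hidden layer
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)
J = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

# Backward pass; tanh'(Z1) = 1 - tanh(Z1)^2 = 1 - A1^2
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

print(J, dW2.shape, db2.shape, dW1.shape, db1.shape)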

Problem 3: Vectorized Implementation (25 points)

Given:
Architecture: [2, 4, 1]

W⁽¹⁾ = [[0.2, -0.3], [-0.1, 0.5], [0.6, 0.1], [-0.4, 0.7]]

b⁽¹⁾ = [0.1, -0.05, 0.2, 0.0]ᵀ


W⁽²⁾ = [0.3, -0.2, 0.5, 0.1], b⁽²⁾ = 0.05

X = [[1.2, -0.8, 0.5, 2.1], [-1.5, 1.0, -0.3, 0.7]], Y = [1, 0, 1, 0]

Part A: Forward Propagation (10 points)


Step 1: Compute Z⁽¹⁾
Z⁽¹⁾ = W⁽¹⁾X + b⁽¹⁾

Matrix dimensions: W⁽¹⁾(4×2) × X(2×4) + b⁽¹⁾(4×1) → Z⁽¹⁾(4×4)

Z⁽¹⁾ = [[0.2×1.2 + (-0.3)×(-1.5) + 0.1], [...], [...], [...]]


= [[0.79, -0.46, 0.19, 0.63],
[-0.70, 0.58, -0.16, -0.14],
[0.57, -0.38, 0.50, 1.33],
[1.57, 1.02, -0.41, 1.33]]

Step 2: Apply ReLU
A⁽¹⁾ = max(0, Z⁽¹⁾) = [[0.79, 0, 0.19, 0.63], [0, 0.58, 0, 0], [0.57, 0, 0.50, 1.33], [1.57, 1.02, 0, 1.33]]

Step 3: Compute Z⁽²⁾ and A⁽²⁾
Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾ = [0.76, 0.15, 0.31, 0.86]
A⁽²⁾ = sigmoid(Z⁽²⁾) = [0.68, 0.54, 0.58, 0.70]

Broadcasting: b⁽¹⁾(4×1) is broadcast to (4×4) by repeating across columns.
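
The broadcasting described above is exactly what NumPy does when b⁽¹⁾ is stored as a (4, 1) column vector. A minimal sketch of the vectorized forward pass (names illustrative), useful mainly for checking the shapes and the broadcast:

python

import numpy as np

# Problem 3 parameters and data (m = 4 examples as columns of X)
W1 = np.array([[0.2, -0.3], [-0.1, 0.5], [0.6, 0.1], [-0.4, 0.7]])  # (4, 2)
b1 = np.array([[0.1], [-0.05], [0.2], [0.0]])                        # (4, 1)
W2 = np.array([[0.3, -0.2, 0.5, 0.1]])                               # (1, 4)
b2 = np.array([[0.05]])
X = np.array([[1.2, -0.8, 0.5, 2.1],
              [-1.5, 1.0, -0.3, 0.7]])                               # (2, 4)

Z1 = W1 @ X + b1                 # b1 (4, 1) broadcasts across the 4 columns -> (4, 4)
A1 = np.maximum(0, Z1)           # ReLU
Z2 = W2 @ A1 + b2                # (1, 4)
A2 = 1.0 / (1.0 + np.exp(-Z2))   # sigmoid

print(Z1.shape, A2.shape)        # (4, 4) (1, 4)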

Part B: Cost Computation (5 points)


Vectorized Cost:

python

J = -1/m * np.sum(Y * np.log(A[2]) + (1-Y) * np.log(1-A[2]))


J = -1/4 × [1×log(0.68) + 1×log(1-0.54) + 1×log(0.58) + 1×log(1-0.70)]
J = -1/4 × [log(0.68) + log(0.46) + log(0.58) + log(0.30)] ≈ 0.73

Part C: Backward Propagation (10 points)


Step 1: Compute dZ⁽²⁾
dZ⁽²⁾ = A⁽²⁾ - Y = [0.68-1, 0.54-0, 0.58-1, 0.70-0] = [-0.32, 0.54, -0.42, 0.70]

This works because: ∂J/∂z⁽²⁾ = ∂J/∂a⁽²⁾ × ∂a⁽²⁾/∂z⁽²⁾ = (a⁽²⁾-y)/[a⁽²⁾(1-a⁽²⁾)] × a⁽²⁾(1-a⁽²⁾) = a⁽²⁾-y
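
This cancellation can be sanity-checked with a finite-difference approximation of ∂J/∂z⁽²⁾ for a single example (a toy check, not part of the assignment; z = 0.8 and y = 1 are arbitrary).

python

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
loss = lambda z, y: -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))

z, y, eps = 0.8, 1.0, 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # central difference
analytic = sigmoid(z) - y                                    # a - y

print(numeric, analytic)   # both approx -0.310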

Step 2: Layer 2 gradients
dW⁽²⁾ = 1/m × dZ⁽²⁾ × (A⁽¹⁾)ᵀ, dimensions: (1×4) × (4×4) → (1×4)
dW⁽²⁾ = 1/4 × [-0.32, 0.54, -0.42, 0.70] × (A⁽¹⁾)ᵀ = [0.13, 0.09, 0.07, 0.21]

db⁽²⁾ = 1/m × np.sum(dZ⁽²⁾) = 1/4 × 0.50 = 0.125

Step 3: Layer 1 gradients
dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ g'(Z⁽¹⁾), with ReLU derivative g'(Z⁽¹⁾) = (Z⁽¹⁾ > 0)

Dimensions: (4×1) × (1×4) ⊙ (4×4) → (4×4)

Step 4: Final gradients
dW⁽¹⁾ = 1/m × dZ⁽¹⁾ × Xᵀ, dimensions: (4×4) × (4×2) → (4×2)

db⁽¹⁾ = 1/m × np.sum(dZ⁽¹⁾, axis=1, keepdims=True)


Dimensions: (4×1)
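
Putting Steps 1-4 together, a hedged vectorized sketch of the backward pass (it assumes the variables Z1, A1, A2, X, W2 from the Part A snippet, plus Y below) with the expected shapes asserted explicitly:

python

# Vectorized backward pass for Problem 3 (reuses Z1, A1, A2, X, W2 from the Part A sketch)
Y = np.array([[1, 0, 1, 0]])
m = X.shape[1]

dZ2 = A2 - Y                                   # (1, 4)
dW2 = (dZ2 @ A1.T) / m                         # (1, 4) x (4, 4) -> (1, 4)
db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # (1, 1)

dZ1 = (W2.T @ dZ2) * (Z1 > 0)                  # (4, 1) x (1, 4), masked by ReLU' -> (4, 4)
dW1 = (dZ1 @ X.T) / m                          # (4, 4) x (4, 2) -> (4, 2)
db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (4, 1)

assert dW2.shape == (1, 4) and db2.shape == (1, 1)
assert dW1.shape == (4, 2) and db1.shape == (4, 1)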

Problem 4: Activation Function Comparison (25 points)

Given:
Architecture: [1, 2, 1]

x = 2.0, y = 1

W⁽¹⁾ = [0.5, -0.3]ᵀ, b⁽¹⁾ = [0.1, 0.2]ᵀ


W⁽²⁾ = [0.4, 0.6], b⁽²⁾ = 0.0

Part A: ReLU Hidden Layer (10 points)


Forward Propagation:
Z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾ = [0.5×2.0 + 0.1, -0.3×2.0 + 0.2] = [1.1, -0.4]
A⁽¹⁾ = ReLU(Z⁽¹⁾) = [1.1, 0]

Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾ = 0.4×1.1 + 0.6×0 + 0.0 = 0.44


A⁽²⁾ = sigmoid(0.44) = 0.608

Backward Propagation:
dZ⁽²⁾ = A⁽²⁾ - Y = 0.608 - 1 = -0.392
dW⁽²⁾ = dZ⁽²⁾ × (A⁽¹⁾)ᵀ = -0.392 × [1.1, 0] = [-0.431, 0]
db⁽²⁾ = dZ⁽²⁾ = -0.392

dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ ReLU'(Z⁽¹⁾) = [0.4, 0.6]ᵀ × (-0.392) ⊙ [1, 0] = [-0.157, 0]


dW⁽¹⁾ = dZ⁽¹⁾ × x = [-0.157, 0] × 2.0 = [-0.314, 0]
db⁽¹⁾ = dZ⁽¹⁾ = [-0.157, 0]
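
With a single training example the whole ReLU case fits in a few lines. A minimal sketch (names illustrative) that should reproduce the numbers above up to rounding:

python

import numpy as np

# Problem 4, Part A: one example, ReLU hidden layer
x, y = 2.0, 1.0
W1 = np.array([[0.5], [-0.3]])    # (2, 1)
b1 = np.array([[0.1], [0.2]])     # (2, 1)
W2 = np.array([[0.4, 0.6]])       # (1, 2)
b2 = np.array([[0.0]])

Z1 = W1 * x + b1                  # [[1.1], [-0.4]]
A1 = np.maximum(0, Z1)            # [[1.1], [0.0]]
Z2 = W2 @ A1 + b2                 # [[0.44]]
A2 = 1.0 / (1.0 + np.exp(-Z2))    # approx 0.608

dZ2 = A2 - y                      # approx -0.392
dW2 = dZ2 * A1.T                  # approx [[-0.431, 0.0]]
dZ1 = (W2.T * dZ2) * (Z1 > 0)     # approx [[-0.157], [0.0]]
dW1 = dZ1 * x                     # approx [[-0.313], [0.0]]

print(A2, dW2, dW1)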

Part B: Tanh Hidden Layer (10 points)


Forward Propagation:
Z⁽¹⁾ = [1.1, -0.4] (same as the ReLU case)
A⁽¹⁾ = tanh(Z⁽¹⁾) = [0.800, -0.380]
Z⁽²⁾ = 0.4×0.800 + 0.6×(-0.380) = 0.320 - 0.228 = 0.092


A⁽²⁾ = sigmoid(0.092) = 0.523

Backward Propagation:
dZ⁽²⁾ = 0.523 - 1 = -0.477
dW⁽²⁾ = -0.477 × [0.800, -0.380] = [-0.382, 0.181]
db⁽²⁾ = -0.477

dZ⁽¹⁾ = [0.4, 0.6]ᵀ × (-0.477) ⊙ (1 - tanh²(Z⁽¹⁾))


tanh'(Z⁽¹⁾) = [1 - 0.800², 1 - (-0.380)²] = [0.360, 0.856]
dZ⁽¹⁾ = [-0.191, -0.286] ⊙ [0.360, 0.856] = [-0.069, -0.245]

dW⁽¹⁾ = [-0.069, -0.245] × 2.0 = [-0.138, -0.490]


db⁽¹⁾ = [-0.069, -0.245]
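
The tanh variant only changes the hidden activation and its derivative. A short sketch of the lines that differ, assuming x, y, W2, b2 and Z1 from the Part A snippet:

python

# Problem 4, Part B: same network, tanh hidden layer (reuses x, y, W2, b2, Z1 from Part A)
A1 = np.tanh(Z1)                    # approx [[0.800], [-0.380]]
Z2 = W2 @ A1 + b2                   # approx [[0.092]]
A2 = 1.0 / (1.0 + np.exp(-Z2))      # approx 0.523

dZ2 = A2 - y                        # approx -0.477
dW2 = dZ2 * A1.T                    # approx [[-0.382, 0.181]]
dZ1 = (W2.T * dZ2) * (1 - A1 ** 2)  # approx [[-0.069], [-0.245]]
dW1 = dZ1 * x                       # approx [[-0.137], [-0.490]]

print(A2, dW2, dW1)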

Part C: Analysis (5 points)


1. Range Comparison:

ReLU: [0, ∞) - can produce very large values


Tanh: (-1, 1) - bounded output, zero-centered

2. Vanishing Gradient Problem: Sigmoid suffers from vanishing gradients because (a numerical illustration follows this list):

σ'(z) = σ(z)(1-σ(z)) has maximum value of 0.25

For large |z|, gradient approaches 0

In deep networks, gradients become exponentially small

3. When to Use:

ReLU: Most common choice for hidden layers


Computationally efficient

Avoids vanishing gradient problem

Can suffer from "dying ReLU" problem


Tanh: Better than sigmoid for hidden layers
Zero-centered (helps optimization)

Still can suffer from vanishing gradients for very deep networks

Good when you need bounded, zero-centered outputs
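
The vanishing-gradient point in item 2 can be made concrete: stacking layers multiplies local derivatives, and a sigmoid derivative never exceeds 0.25. A toy illustration (the depth L = 10 and the value z = 0.5 are arbitrary assumptions):

python

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                                     # an arbitrary pre-activation value
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))   # approx 0.235 (never exceeds 0.25)
d_tanh = 1 - np.tanh(z) ** 2                # approx 0.786
d_relu = 1.0 if z > 0 else 0.0              # ReLU derivative is exactly 1 for z > 0

L = 10                                      # hypothetical number of stacked layers
print(d_sigmoid ** L, d_tanh ** L, d_relu ** L)   # ~5e-07 vs ~0.09 vs 1.0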
