Deep Learning Assignment 2: Shallow Neural Networks - Solutions
Problem 1: Binary Classification with ReLU and Sigmoid (25 points)
Given:
Architecture: [2, 3, 1]
W⁽¹⁾ = [[0.6, -0.4], [-0.2, 0.8], [0.5, 0.3]], b⁽¹⁾ = [0.2, -0.1, 0.0]ᵀ
W⁽²⁾ = [0.7, -0.3, 0.9], b⁽²⁾ = 0.1
X = [[1.5, -0.8], [2.0, 1.2]], Y = [1, 0]
Part A: Forward Propagation (10 points)
Step 1: Compute Z⁽¹⁾ and A⁽¹⁾
Z⁽¹⁾ = W⁽¹⁾X + b⁽¹⁾
For example 1 (x₁ = [1.5, 2.0]):
Z⁽¹⁾₁ = [0.6×1.5 + (-0.4)×2.0 + 0.2] = [0.9 - 0.8 + 0.2] = [0.3]
Z⁽¹⁾₂ = [-0.2×1.5 + 0.8×2.0 + (-0.1)] = [-0.3 + 1.6 - 0.1] = [1.2]
Z⁽¹⁾₃ = [0.5×1.5 + 0.3×2.0 + 0.0] = [0.75 + 0.6] = [1.35]
For example 2 (x₂ = [-0.8, 1.2]):
Z⁽¹⁾₁ = [0.6×(-0.8) + (-0.4)×1.2 + 0.2] = [-0.48 - 0.48 + 0.2] = [-0.76]
Z⁽¹⁾₂ = [-0.2×(-0.8) + 0.8×1.2 + (-0.1)] = [0.16 + 0.96 - 0.1] = [1.02]
Z⁽¹⁾₃ = [0.5×(-0.8) + 0.3×1.2 + 0.0] = [-0.4 + 0.36] = [-0.04]
Z⁽¹⁾ = [[0.3, -0.76], [1.2, 1.02], [1.35, -0.04]]
Apply ReLU: A⁽¹⁾ = max(0, Z⁽¹⁾)
A⁽¹⁾ = [[0.3, 0], [1.2, 1.02], [1.35, 0]]
Step 2: Compute Z⁽²⁾ and A⁽²⁾
Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾
For example 1:
Z⁽²⁾₁ = 0.7×0.3 + (-0.3)×1.2 + 0.9×1.35 + 0.1 = 0.21 - 0.36 + 1.215 + 0.1 = 1.165
For example 2:
Z⁽²⁾₂ = 0.7×0 + (-0.3)×1.02 + 0.9×0 + 0.1 = 0 - 0.306 + 0 + 0.1 = -0.206
Z⁽²⁾ = [1.165, -0.206]
Apply Sigmoid: A⁽²⁾ = 1/(1 + e⁻ᶻ⁽²⁾)
A⁽²⁾₁ = 1/(1 + e⁻¹·¹⁶⁵) = 1/(1 + 0.312) = 0.762
A⁽²⁾₂ = 1/(1 + e⁰·²⁰⁶) = 1/(1 + 1.229) = 0.449
A⁽²⁾ = [0.762, 0.449]
Step 3: Calculate Cost J
J = -1/m ∑[y⁽ⁱ⁾log(a⁽²⁾⁽ⁱ⁾) + (1-y⁽ⁱ⁾)log(1-a⁽²⁾⁽ⁱ⁾)]
J = -1/2 [1×log(0.762) + 0×log(1-0.762) + 0×log(0.449) + 1×log(1-0.449)]
J = -1/2 [log(0.762) + log(0.551)]
J = -1/2 [-0.272 + (-0.596)] = -1/2 × (-0.868) = 0.434
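As a quick numerical check (not part of the required hand calculation), the Part A forward pass and cost can be reproduced in a few lines of NumPy; the variable names and array shapes below are my own choices.
python
import numpy as np

W1 = np.array([[0.6, -0.4], [-0.2, 0.8], [0.5, 0.3]])
b1 = np.array([[0.2], [-0.1], [0.0]])
W2 = np.array([[0.7, -0.3, 0.9]])
b2 = np.array([[0.1]])
X = np.array([[1.5, -0.8], [2.0, 1.2]])   # columns are examples
Y = np.array([[1, 0]])
m = X.shape[1]

Z1 = W1 @ X + b1                 # (3, 2): [[0.3, -0.76], [1.2, 1.02], [1.35, -0.04]]
A1 = np.maximum(0, Z1)           # ReLU
Z2 = W2 @ A1 + b2                # (1, 2): approx [1.165, -0.206]
A2 = 1.0 / (1.0 + np.exp(-Z2))   # sigmoid: approx [0.762, 0.449]
J = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
print(J)                         # approx 0.434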
Part B: Backward Propagation (12 points)
Step 1: Output layer gradients
dZ⁽²⁾ = A⁽²⁾ - Y = [0.762, 0.449] - [1, 0] = [-0.238, 0.449]
dW⁽²⁾ = 1/m × dZ⁽²⁾ × (A⁽¹⁾)ᵀ
dW⁽²⁾ = 1/2 × [-0.238, 0.449] × [[0.3, 0], [1.2, 1.02], [1.35, 0]]ᵀ
dW⁽²⁾ = 1/2 × [-0.238×0.3 + 0.449×0, -0.238×1.2 + 0.449×1.02, -0.238×1.35 + 0.449×0]
dW⁽²⁾ = 1/2 × [-0.071, 0.172, -0.321] = [-0.036, 0.086, -0.161]
db⁽²⁾ = 1/m × sum(dZ⁽²⁾) = 1/2 × (-0.238 + 0.449) = 0.106
Step 2: Hidden layer gradients
dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ g'⁽¹⁾(Z⁽¹⁾)
(W⁽²⁾)ᵀ × dZ⁽²⁾ = [[0.7], [-0.3], [0.9]] × [-0.238, 0.449]
= [[-0.167, 0.314], [0.071, -0.135], [-0.214, 0.404]]
ReLU derivative: g'⁽¹⁾(Z⁽¹⁾) = [[1, 0], [1, 1], [1, 0]]
dZ⁽¹⁾ = [[-0.167, 0], [0.071, -0.135], [-0.214, 0]]
dW⁽¹⁾ = 1/m × dZ⁽¹⁾ × Xᵀ
dW⁽¹⁾ = 1/2 × [[-0.167, 0], [0.071, -0.135], [-0.214, 0]] × [[1.5, 2.0], [-0.8, 1.2]]
dW⁽¹⁾ = [[-0.125, -0.167], [0.107, -0.010], [-0.161, -0.214]]
db⁽¹⁾ = 1/m × sum(dZ⁽¹⁾, axis=1) = [[-0.084], [-0.032], [-0.107]]
Dimensions:
dZ⁽²⁾: (1, 2)
dW⁽²⁾: (1, 3)
db⁽²⁾: (1, 1)
dZ⁽¹⁾: (3, 2)
dW⁽¹⁾: (3, 2)
db⁽¹⁾: (3, 1)
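The gradients above can be verified with a self-contained NumPy sketch (it repeats the Part A forward pass; the names are my own, not mandated by the assignment).
python
import numpy as np

W1 = np.array([[0.6, -0.4], [-0.2, 0.8], [0.5, 0.3]]); b1 = np.array([[0.2], [-0.1], [0.0]])
W2 = np.array([[0.7, -0.3, 0.9]]);                     b2 = np.array([[0.1]])
X = np.array([[1.5, -0.8], [2.0, 1.2]]); Y = np.array([[1, 0]]); m = X.shape[1]

# Forward pass (Part A)
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

# Backward pass (Part B)
dZ2 = A2 - Y                                   # (1, 2)
dW2 = dZ2 @ A1.T / m                           # (1, 3): approx [-0.036, 0.086, -0.161]
db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # (1, 1): approx 0.105 (0.106 above, up to rounding)
dZ1 = (W2.T @ dZ2) * (Z1 > 0)                  # (3, 2), ReLU derivative applied as a mask
dW1 = dZ1 @ X.T / m                            # (3, 2), matches the hand values up to rounding
db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (3, 1): approx [-0.083, -0.032, -0.107]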
Part C: Parameter Update (3 points)
With α = 0.1:
W⁽¹⁾_new = W⁽¹⁾ - α × dW⁽¹⁾
W⁽¹⁾_new = [[0.6, -0.4], [-0.2, 0.8], [0.5, 0.3]] - 0.1 × [[-0.125, -0.167], [0.107, -0.010], [-0.161, -0.214]]
W⁽¹⁾_new = [[0.613, -0.383], [-0.211, 0.801], [0.516, 0.321]]
b⁽¹⁾_new = [0.2, -0.1, 0.0] - 0.1 × [-0.084, -0.032, -0.107] = [0.208, -0.097, 0.011]
W⁽²⁾_new = [0.7, -0.3, 0.9] - 0.1 × [-0.036, 0.086, -0.161] = [0.704, -0.309, 0.916]
b⁽²⁾_new = 0.1 - 0.1 × 0.106 = 0.089
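The same update rule can be written once as a small helper (a sketch of my own; packing the parameters and gradients into dictionaries is an assumption, not something the assignment prescribes).
python
import numpy as np

def gradient_descent_step(params, grads, alpha=0.1):
    """One vanilla gradient-descent update: theta_new = theta - alpha * dtheta."""
    return {name: value - alpha * grads["d" + name] for name, value in params.items()}

# Example with the layer-2 weights of this problem:
params = {"W2": np.array([[0.7, -0.3, 0.9]])}
grads = {"dW2": np.array([[-0.036, 0.086, -0.161]])}
print(gradient_descent_step(params, grads))   # approx [[0.704, -0.309, 0.916]]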
Problem 2: Multi-Example Training with Tanh Activation (25 points)
Given:
Architecture: [3, 4, 1]
W⁽¹⁾ = [[0.3, -0.5, 0.2], [0.4, 0.1, -0.6], [-0.2, 0.7, 0.3], [0.5, -0.3, 0.4]]
b⁽¹⁾ = [0.1, 0.0, -0.2, 0.3]ᵀ
W⁽²⁾ = [0.6, -0.4, 0.8, 0.2], b⁽²⁾ = -0.1
X = [[2.0, 0.5, -1.2], [-0.8, 1.5, 0.7], [1.1, -2.0, 0.3]], Y = [0, 1, 1]
Part A: Activation Function Analysis (8 points)
1. Calculate tanh(1.0):
tanh(1.0) = (e¹ - e⁻¹)/(e¹ + e⁻¹) = (2.718 - 0.368)/(2.718 + 0.368) = 2.350/3.086 = 0.762
2. Compute tanh'(1.0):
tanh'(z) = 1 - tanh²(z), so tanh'(1.0) = 1 - (0.762)² = 1 - 0.581 = 0.419 (≈ 0.420 without intermediate rounding; both values are checked numerically in the sketch after this list)
3. Range comparison:
Sigmoid: (0, 1)
Tanh: (-1, 1)
4. Why tanh is preferred:
Tanh is zero-centered, making optimization easier
Tanh has stronger gradients than sigmoid (its maximum derivative is 1 at z = 0, versus 0.25 for sigmoid)
Better gradient flow in deep networks
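Both values from items 1 and 2 can be confirmed numerically (a quick check, not part of the graded answer):
python
import numpy as np

t = np.tanh(1.0)      # approx 0.7616
dt = 1.0 - t ** 2     # approx 0.4200 (the hand value 0.419 comes from rounding tanh(1.0) to 0.762 first)
print(t, dt)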
Part B: Complete Forward-Backward Pass (17 points)
Forward Propagation:
Z⁽¹⁾ = W⁽¹⁾X + b⁽¹⁾
For all examples (the columns of X are the examples: x₁ = [2.0, -0.8, 1.1], x₂ = [0.5, 1.5, -2.0], x₃ = [-1.2, 0.7, 0.3]):
Z⁽¹⁾₁ = 0.3×2.0 + (-0.5)×(-0.8) + 0.2×1.1 + 0.1 = 0.6 + 0.4 + 0.22 + 0.1 = 1.32 (unit 1, example 1), and similarly:
Z⁽¹⁾ = [[1.32], [0.06], [-0.83], [1.98]] (example 1)
= [[-0.90], [1.55], [0.15], [-0.70]] (example 2)
= [[-0.55], [-0.59], [0.62], [-0.39]] (example 3)
A⁽¹⁾ = tanh(Z⁽¹⁾) = [[0.867, -0.716, -0.501], [0.060, 0.914, -0.530], [-0.681, 0.149, 0.551], [0.963, -0.604, -0.371]]
Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾
Z⁽²⁾ = [0.044, -0.897, 0.178]
A⁽²⁾ = sigmoid(Z⁽²⁾) = [0.511, 0.290, 0.544]
Cost Function:
J = -1/3 [0×log(0.511) + 1×log(1-0.511) + 1×log(0.290) + 0×log(1-0.290) + 1×log(0.544) + 0×log(1-0.544)]
J = -1/3 [log(0.489) + log(0.290) + log(0.544)] = -1/3 [(-0.715) + (-1.238) + (-0.609)] = 0.854
Backward Propagation:
dZ⁽²⁾ = A⁽²⁾ - Y = [0.511, -0.710, -0.456]
dW⁽²⁾ = 1/3 × dZ⁽²⁾ × (A⁽¹⁾)ᵀ ≈ [0.393, -0.126, -0.235, 0.363]
db⁽²⁾ = 1/3 × sum(dZ⁽²⁾) ≈ -0.218
dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ (1 - (A⁽¹⁾)²)
dW⁽¹⁾ = 1/3 × dZ⁽¹⁾ × Xᵀ and db⁽¹⁾ = 1/3 × sum(dZ⁽¹⁾, axis=1) follow the same pattern as Problem 1; all of these values are reproduced in the sketch after the dimension list below.
Gradient Dimensions:
dW⁽²⁾: (1, 4) ✓
db⁽²⁾: (1, 1) ✓
dW⁽¹⁾: (4, 3) ✓
db⁽¹⁾: (4, 1) ✓
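The sketch below reproduces the Part B numbers end to end in NumPy (self-contained; the variable names are my own). It also computes the layer-1 gradients that were only sketched above.
python
import numpy as np

W1 = np.array([[0.3, -0.5, 0.2], [0.4, 0.1, -0.6], [-0.2, 0.7, 0.3], [0.5, -0.3, 0.4]])
b1 = np.array([[0.1], [0.0], [-0.2], [0.3]])
W2 = np.array([[0.6, -0.4, 0.8, 0.2]]); b2 = np.array([[-0.1]])
X = np.array([[2.0, 0.5, -1.2], [-0.8, 1.5, 0.7], [1.1, -2.0, 0.3]])   # columns are examples
Y = np.array([[0, 1, 1]]); m = X.shape[1]

# Forward pass
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))                    # approx [0.511, 0.290, 0.544]
J = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m    # approx 0.854

# Backward pass
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m                           # approx [0.393, -0.126, -0.235, 0.363]
db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # approx -0.218
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)             # tanh'(z) = 1 - tanh(z)**2
dW1 = dZ1 @ X.T / m                            # (4, 3)
db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (4, 1)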
Problem 3: Vectorized Implementation (25 points)
Given:
Architecture: [2, 4, 1]
W⁽¹⁾ = [[0.2, -0.3], [-0.1, 0.5], [0.6, 0.1], [-0.4, 0.7]]
b⁽¹⁾ = [0.1, -0.05, 0.2, 0.0]ᵀ
W⁽²⁾ = [0.3, -0.2, 0.5, 0.1], b⁽²⁾ = 0.05
X = [[1.2, -0.8, 0.5, 2.1], [-1.5, 1.0, -0.3, 0.7]], Y = [1, 0, 1, 0]
Part A: Forward Propagation (10 points)
Step 1: Compute Z⁽¹⁾
Z⁽¹⁾ = W⁽¹⁾X + b⁽¹⁾
Matrix dimensions: W⁽¹⁾(4×2) × X(2×4) + b⁽¹⁾(4×1) → Z⁽¹⁾(4×4)
Z⁽¹⁾ = [[0.2×1.2 + (-0.3)×(-1.5) + 0.1, ...], [...], [...], [...]]
= [[0.79, -0.36, 0.29, 0.31],
[-0.92, 0.53, -0.25, 0.09],
[0.77, -0.18, 0.47, 1.53],
[-1.53, 1.02, -0.41, -0.35]]
Step 2: Apply ReLU
A⁽¹⁾ = max(0, Z⁽¹⁾) = [[0.79, 0, 0.29, 0.31], [0, 0.53, 0, 0.09], [0.77, 0, 0.47, 1.53], [0, 1.02, 0, 0]]
Step 3: Compute Z⁽²⁾ and A⁽²⁾
Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾ = [0.67, 0.05, 0.37, 0.89]
A⁽²⁾ = sigmoid(Z⁽²⁾) = [0.66, 0.51, 0.59, 0.71]
Broadcasting: b⁽¹⁾(4×1) is broadcast to (4×4) by repeating across columns.
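The same forward pass in NumPy (a sketch to confirm the dimension analysis and the broadcasting; the names are my own):
python
import numpy as np

W1 = np.array([[0.2, -0.3], [-0.1, 0.5], [0.6, 0.1], [-0.4, 0.7]])
b1 = np.array([[0.1], [-0.05], [0.2], [0.0]])
W2 = np.array([[0.3, -0.2, 0.5, 0.1]]); b2 = np.array([[0.05]])
X = np.array([[1.2, -0.8, 0.5, 2.1], [-1.5, 1.0, -0.3, 0.7]])   # columns are examples

Z1 = W1 @ X + b1                 # (4, 2) @ (2, 4) + (4, 1) -> (4, 4); b1 broadcasts across columns
A1 = np.maximum(0, Z1)           # ReLU
Z2 = W2 @ A1 + b2                # (1, 4): approx [0.67, 0.05, 0.37, 0.89]
A2 = 1.0 / (1.0 + np.exp(-Z2))   # approx [0.66, 0.51, 0.59, 0.71]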
Part B: Cost Computation (5 points)
Vectorized Cost:
python
J = -1/m * np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
With Y = [1, 0, 1, 0], the y=1 examples contribute log(a⁽²⁾) and the y=0 examples contribute log(1-a⁽²⁾):
J = -1/4 × [log(0.66) + log(1-0.51) + log(0.59) + log(1-0.71)]
J = -1/4 × [log(0.66) + log(0.49) + log(0.59) + log(0.29)] = -1/4 × (-2.89) ≈ 0.72
Part C: Backward Propagation (10 points)
Step 1: dZ⁽²⁾
dZ⁽²⁾ = A⁽²⁾ - Y = [0.66-1, 0.51-0, 0.59-1, 0.71-0] = [-0.34, 0.51, -0.41, 0.71]
This works because: ∂J/∂z⁽²⁾ = ∂J/∂a⁽²⁾ × ∂a⁽²⁾/∂z⁽²⁾ = (a⁽²⁾-y)/[a⁽²⁾(1-a⁽²⁾)] × a⁽²⁾(1-a⁽²⁾) = a⁽²⁾-y
Step 2: Layer 2 gradients
dW⁽²⁾ = 1/m × dZ⁽²⁾ × (A⁽¹⁾)ᵀ
Dimensions: (1×4) × (4×4) → (1×4)
dW⁽²⁾ = 1/4 × [-0.34, 0.51, -0.41, 0.71] × (A⁽¹⁾)ᵀ ≈ [-0.04, 0.08, 0.16, 0.13]
db⁽²⁾ = 1/m × np.sum(dZ⁽²⁾) = 1/4 × 0.47 ≈ 0.12
Step 3: Layer 1 gradients
dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ g'(Z⁽¹⁾), where the ReLU derivative is g'(Z⁽¹⁾) = (Z⁽¹⁾ > 0)
Dimensions: (4×1) × (1×4) ⊙ (4×4) → (4×4)
Step 4: Final gradients
dW⁽¹⁾ = 1/m × dZ⁽¹⁾ × Xᵀ
Dimensions: (4×4) × (4×2) → (4×2)
db⁽¹⁾ = 1/m × np.sum(dZ⁽¹⁾, axis=1, keepdims=True)
Dimensions: (4×1)
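A self-contained NumPy sketch of the full backward pass (it repeats the Part A forward pass; the variable names are my own):
python
import numpy as np

W1 = np.array([[0.2, -0.3], [-0.1, 0.5], [0.6, 0.1], [-0.4, 0.7]])
b1 = np.array([[0.1], [-0.05], [0.2], [0.0]])
W2 = np.array([[0.3, -0.2, 0.5, 0.1]]); b2 = np.array([[0.05]])
X = np.array([[1.2, -0.8, 0.5, 2.1], [-1.5, 1.0, -0.3, 0.7]])
Y = np.array([[1, 0, 1, 0]]); m = X.shape[1]

Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

dZ2 = A2 - Y                                   # approx [-0.34, 0.51, -0.41, 0.71]
dW2 = dZ2 @ A1.T / m                           # approx [-0.04, 0.08, 0.16, 0.13]
db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # approx 0.12
dZ1 = (W2.T @ dZ2) * (Z1 > 0)                  # (4, 4)
dW1 = dZ1 @ X.T / m                            # (4, 2)
db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (4, 1)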
Problem 4: Activation Function Comparison (25 points)
Given:
Architecture: [1, 2, 1]
x = 2.0, y = 1
W⁽¹⁾ = [0.5, -0.3]ᵀ, b⁽¹⁾ = [0.1, 0.2]ᵀ
W⁽²⁾ = [0.4, 0.6], b⁽²⁾ = 0.0
Part A: ReLU Hidden Layer (10 points)
Forward Propagation:
Z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾ = [0.5×2.0 + 0.1, -0.3×2.0 + 0.2] = [1.1, -0.4]
A⁽¹⁾ = ReLU(Z⁽¹⁾) = [1.1, 0]
Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾ = 0.4×1.1 + 0.6×0 + 0.0 = 0.44
A⁽²⁾ = sigmoid(0.44) = 0.608
Backward Propagation:
dZ⁽²⁾ = A⁽²⁾ - Y = 0.608 - 1 = -0.392
dW⁽²⁾ = dZ⁽²⁾ × (A⁽¹⁾)ᵀ = -0.392 × [1.1, 0] = [-0.431, 0]
db⁽²⁾ = dZ⁽²⁾ = -0.392
dZ⁽¹⁾ = (W⁽²⁾)ᵀ × dZ⁽²⁾ ⊙ ReLU'(Z⁽¹⁾) = [0.4, 0.6]ᵀ × (-0.392) ⊙ [1, 0] = [-0.157, 0]
dW⁽¹⁾ = dZ⁽¹⁾ × x = [-0.157, 0] × 2.0 = [-0.314, 0]
db⁽¹⁾ = dZ⁽¹⁾ = [-0.157, 0]
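A compact NumPy check of the ReLU case (a sketch; treating x, y, and the biases as column vectors is my own choice):
python
import numpy as np

x = np.array([[2.0]]); y = np.array([[1.0]])
W1 = np.array([[0.5], [-0.3]]); b1 = np.array([[0.1], [0.2]])
W2 = np.array([[0.4, 0.6]]);    b2 = np.array([[0.0]])

Z1 = W1 @ x + b1                              # [[1.1], [-0.4]]
A1 = np.maximum(0, Z1)                        # [[1.1], [0.0]]
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))    # approx 0.608

dZ2 = A2 - y                                  # approx -0.392
dW2 = dZ2 @ A1.T                              # approx [-0.431, 0]
db2 = dZ2
dZ1 = (W2.T @ dZ2) * (Z1 > 0)                 # approx [-0.157, 0]
dW1 = dZ1 @ x.T                               # approx [-0.314, 0]
db1 = dZ1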
Part B: Tanh Hidden Layer (10 points)
Forward Propagation:
Z⁽¹⁾ = [1.1, -0.4] (same as the ReLU case)
A⁽¹⁾ = tanh(Z⁽¹⁾) = [0.800, -0.380]
Z⁽²⁾ = 0.4×0.800 + 0.6×(-0.380) = 0.320 - 0.228 = 0.092
A⁽²⁾ = sigmoid(0.092) = 0.523
Backward Propagation:
dZ⁽²⁾ = 0.523 - 1 = -0.477
dW⁽²⁾ = -0.477 × [0.800, -0.380] = [-0.382, 0.181]
db⁽²⁾ = -0.477
dZ⁽¹⁾ = [0.4, 0.6]ᵀ × (-0.477) ⊙ (1 - tanh²(Z⁽¹⁾))
tanh'(Z⁽¹⁾) = [1 - 0.800², 1 - (-0.380)²] = [0.360, 0.856]
dZ⁽¹⁾ = [-0.191, -0.286] ⊙ [0.360, 0.856] = [-0.069, -0.245]
dW⁽¹⁾ = [-0.069, -0.245] × 2.0 = [-0.138, -0.490]
db⁽¹⁾ = [-0.069, -0.245]
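The tanh case mirrors the sketch above; only the hidden activation and its derivative change.
python
import numpy as np

x = np.array([[2.0]]); y = np.array([[1.0]])
W1 = np.array([[0.5], [-0.3]]); b1 = np.array([[0.1], [0.2]])
W2 = np.array([[0.4, 0.6]]);    b2 = np.array([[0.0]])

Z1 = W1 @ x + b1                              # [[1.1], [-0.4]]
A1 = np.tanh(Z1)                              # approx [[0.800], [-0.380]]
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))    # approx 0.523

dZ2 = A2 - y                                  # approx -0.477
dW2 = dZ2 @ A1.T                              # approx [-0.382, 0.181]
db2 = dZ2
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # approx [-0.069, -0.245]
dW1 = dZ1 @ x.T                               # approx [-0.14, -0.49]
db1 = dZ1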
Part C: Analysis (5 points)
1. Range Comparison:
ReLU: [0, ∞) - can produce very large values
Tanh: (-1, 1) - bounded output, zero-centered
2. Vanishing Gradient Problem: Sigmoid suffers from vanishing gradients because:
σ'(z) = σ(z)(1-σ(z)) has a maximum value of 0.25 (at z = 0)
For large |z|, the gradient approaches 0
In deep networks, repeatedly multiplying by these small factors makes the gradients exponentially small (see the short numerical check at the end of this section)
3. When to Use:
ReLU: Most common choice for hidden layers
Computationally efficient
Mitigates the vanishing gradient problem (its derivative is exactly 1 for positive inputs)
Can suffer from "dying ReLU" problem
Tanh: Better than sigmoid for hidden layers
Zero-centered (helps optimization)
Still can suffer from vanishing gradients for very deep networks
Good when you need bounded, zero-centered outputs
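A short numerical check of the gradient-magnitude claims above (my own sketch, not part of the graded answer): the sigmoid derivative peaks at 0.25, the tanh derivative peaks at 1.0, and the ReLU derivative is exactly 1 for any positive input.
python
import numpy as np

z = np.linspace(-6, 6, 2001)
sigmoid = 1.0 / (1.0 + np.exp(-z))
print(np.max(sigmoid * (1 - sigmoid)))   # 0.25, at z = 0
print(np.max(1 - np.tanh(z) ** 2))       # 1.0, at z = 0
relu_grad = (z > 0).astype(float)        # ReLU derivative: 1 for z > 0, 0 otherwise
print(relu_grad.max())                   # 1.0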