21CS743 | DEEP LEARNING
Module-02:
Feed forward Networks and Deep Learning
1. Introduction to Feed forward Neural Networks
1.1 Basic Concepts
A feed forward neural network is the simplest form of an artificial neural network
(ANN).
Information moves in only one direction: forward, from input nodes through hidden
nodes to output nodes.
No cycles or loops exist in the network structure.
1.2 Historical Context
1. Origins
o Inspired by biological neural networks.
o First proposed by Warren McCulloch and Walter Pitts (1943).
o Significant advancement with the perceptron by Frank Rosenblatt (1958).
2. Evolution
o Transition from single-layer to multi-layer networks.
o Development of backpropagation in 1986.
o Modern deep learning revolution (2012–present).
Dept.,of AD Page 1
1.3 Network Architecture
1. Input Layer
Receives raw input data
No computation performed
Number of neurons equals the number of input features
Standardization/normalization often applied here
2. Hidden Layers
Performs intermediate computations
Can have multiple hidden layers
Each neuron is connected to all neurons in the previous layer
3. Output Layer
Produces the final network output
Number of neurons depends on the problem type
Classification: typically one neuron per class
Regression: usually one neuron
1.4 Activation Functions
1. Sigmoid (Logistic)
Formula: σ(x) = 1 / (1 + e^(-x))
Range: (0, 1)
Used in: Binary classification
Properties:
o Smooth gradient
o Clear prediction probability
o Suffers from the vanishing gradient problem
2. Hyperbolic Tangent (tanh)
Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Range: (-1, 1)
Often performs better than sigmoid
Properties:
o Zero-centered
o Stronger gradients than sigmoid
o Still has the vanishing gradient issue
3. ReLU (Rectified Linear Unit)
Formula: f(x) = max(0, x)
Most commonly used
Helps mitigate the vanishing gradient problem
Properties:
o Computationally efficient
o No saturation in the positive region
o Suffers from the dying ReLU problem
4. Leaky ReLU
Formula: f(x) = max(0.01x, x)
Addresses the dying ReLU problem
Small negative slope
Properties:
o Never completely dies
o Allows for negative values
o More robust than standard ReLU
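The four activation functions above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation; the function names and the sample input are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; it peaks at 0.25 and shrinks for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    """Hyperbolic tangent: zero-centered, outputs in (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """ReLU: keeps positive inputs, zeroes out negative ones."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: small slope (0.01, as in the formula above) for negative inputs."""
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, 0.0, 2.0])  # sample inputs (illustrative)
print(sigmoid(x))
print(tanh(x))
print(relu(x))
print(leaky_relu(x))
print(sigmoid_grad(np.array([0.0, 10.0])))  # the second value is tiny: saturation
```

The last line hints at why sigmoid suffers from vanishing gradients: its derivative is at most 0.25 and becomes very small once the input saturates.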
2. Gradient-Based Learning
2.1 Understanding Gradients
1. Definition
Gradient is a vector of partial derivatives
Points in the direction of the steepest increase
Used to minimize the loss function
2. Properties
Direction indicates the fastest increase
Magnitude indicates steepness
Negative gradient is used for minimization
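As a small illustration of these properties, the sketch below computes the gradient of a toy two-variable function and takes one step against it. The function, the starting point, and the step size are assumptions chosen for the example.

```python
import numpy as np

def f(theta):
    """Toy loss f(x, y) = x^2 + 3*y^2, minimized at (0, 0)."""
    x, y = theta
    return x**2 + 3.0 * y**2

def grad_f(theta):
    """Gradient: the vector of partial derivatives (df/dx, df/dy)."""
    x, y = theta
    return np.array([2.0 * x, 6.0 * y])

theta = np.array([1.0, 1.0])   # illustrative starting point
g = grad_f(theta)              # points toward the steepest increase
step = 0.1                     # illustrative step size
theta_new = theta - step * g   # move along the negative gradient

print("gradient:", g)
print("loss before:", f(theta), "loss after:", f(theta_new))
```

Moving against the gradient lowers the loss here; a step size that is too large could overshoot the minimum instead.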
2.2 Cost Functions
1. Mean Squared Error (MSE)
o Used for regression problems
o Formula: MSE = (1/n) * Σ (y_true - y_pred)^2
o Properties:
Always non-negative
Penalizes larger errors more
Differentiable
2. Cross-Entropy Loss
o Used for classification problems
o Formula: -Σ (y_true * log(y_pred))
o Properties:
Measures probability distribution difference
Better for classification than MSE
Provides stronger gradients
3. Huber Loss
o Combines MSE and MAE
o Less sensitive to outliers
o Formula:
L = 0.5 * (y - f(x))^2              if |y - f(x)| ≤ δ
L = δ * |y - f(x)| - 0.5 * δ^2      otherwise
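The three cost functions can be sketched in NumPy as below. A few choices here are assumptions rather than part of the notes: the cross-entropy and Huber losses are averaged over samples, predictions are clipped to avoid log(0), and the sample arrays are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one-hot targets of shape (n, classes), averaged over samples."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err**2
    linear = delta * err - 0.5 * delta**2
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 6.0])   # the last prediction is an outlier
print("MSE:", mse(y_true, y_pred))
print("Huber:", huber(y_true, y_pred))

labels = np.array([[1.0, 0.0]])      # one sample, true class 0
probs = np.array([[0.5, 0.5]])       # an uncertain prediction
print("Cross-entropy:", cross_entropy(labels, probs))
```

On this data the single outlier contributes 9 to the squared error but only 2.5 to the Huber loss, which is the sense in which Huber is less sensitive to outliers.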
2.3 Gradient Descent Types
1. Batch Gradient Descent
o Uses the entire dataset for each update
o More stable but slower
o Formula: θ = θ - α * ∇J(θ)
o Memory intensive for large datasets
2. Stochastic Gradient Descent (SGD)
o Updates parameters after each sample
o Faster but less stable
o Better for large datasets
o High variance in parameter updates
3. Mini-batch Gradient Descent
o Compromise between batch and SGD
o Updates parameters after small batches
o Most commonly used in practice
o Typical batch sizes: 32, 64, 128
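The three variants differ only in how many samples feed each update, so one routine can cover them all. The sketch below fits a line with mini-batch gradient descent; the synthetic data, learning rate, and epoch count are assumptions for illustration. Setting batch_size to the dataset size gives batch gradient descent, and setting it to 1 gives SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 1))
y = 2.0 * X[:, 0] + 1.0   # noise-free targets from the line y = 2x + 1

def minibatch_gd(X, y, batch_size, lr=0.1, epochs=200):
    """Fit y ≈ w*x + b by gradient descent on the batch MSE."""
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx, 0] + b - y[idx]       # prediction error on this batch
            grad_w = 2.0 * np.mean(err * X[idx, 0])
            grad_b = 2.0 * np.mean(err)
            w -= lr * grad_w                       # theta = theta - alpha * grad J(theta)
            b -= lr * grad_b
    return w, b

w, b = minibatch_gd(X, y, batch_size=32)
print("mini-batch estimate:", w, b)

w_full, b_full = minibatch_gd(X, y, batch_size=len(y))
print("full-batch estimate:", w_full, b_full)
```

Both runs should recover a slope close to 2 and an intercept close to 1; the mini-batch run simply performs more, noisier updates per epoch.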
4. Advanced Optimizers
a) Adam (Adaptive Moment Estimation)
Combines momentum and RMSprop
Adaptive learning rates
Formula includes first and second moments
b) RMSprop
Adaptive learning rates
Divides each update by the square root of a running average of squared gradients
c) Momentum
Adds fraction of previous update
Can help move past shallow local minima and plateaus
Reduces oscillation
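The update rules of the three optimizers can be sketched as single-step functions. This is a simplified illustration on a toy quadratic loss; the hyperparameter defaults (beta, rho, b1, b2, eps, learning rates) are commonly used values assumed here, not values given in the notes.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy loss J(theta) = sum(theta^2)."""
    return 2.0 * theta

def momentum_step(theta, vel, lr=0.1, beta=0.9):
    vel = beta * vel - lr * grad(theta)        # keep a fraction of the previous update
    return theta + vel, vel

def rmsprop_step(theta, sq_avg, lr=0.01, rho=0.9, eps=1e-8):
    g = grad(theta)
    sq_avg = rho * sq_avg + (1 - rho) * g**2   # running average of squared gradients
    return theta - lr * g / (np.sqrt(sq_avg) + eps), sq_avg

def adam_step(theta, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g                  # first moment: average gradient
    v = b2 * v + (1 - b2) * g**2               # second moment: average squared gradient
    m_hat = m / (1 - b1**t)                    # bias correction for zero initialization
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

start = np.array([1.0, -2.0])
zeros = np.zeros_like(start)

theta_mom, _ = momentum_step(start, zeros)
theta_rms, _ = rmsprop_step(start, zeros)
theta_adam1, _, _ = adam_step(start, zeros, zeros, t=1)

# Run Adam for many steps to watch it approach the minimum at 0
theta, m, v = start.copy(), zeros.copy(), zeros.copy()
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, t)
print("after 1000 Adam steps:", theta)
```

Note that the first Adam step moves each parameter by roughly the learning rate regardless of the gradient's size, which is one way to see the "adaptive learning rate" property.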
3. Backpropagation and Chain Rule
3.1 Chain Rule Fundamentals
1. Mathematical Basis
o df/dx = df/dy * dy/dx
o Allows computation of composite function derivatives
o Essential for neural network training
2. Application in Neural Networks
o Computes gradients layer by layer
o Propagates error backwards
o Updates weights based on contribution to error
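A small worked example of the chain rule, checked against a numerical derivative. The two functions are assumptions chosen for illustration: y = 3x + 1 and f = y^2, so df/dx = 2y * 3.

```python
def y_of_x(x):
    """Inner function: y = 3x + 1."""
    return 3.0 * x + 1.0

def f_of_y(y):
    """Outer function: f = y^2."""
    return y ** 2

x = 2.0
y = y_of_x(x)            # y = 7
df_dy = 2.0 * y          # derivative of the outer function at y
dy_dx = 3.0              # derivative of the inner function
df_dx = df_dy * dy_dx    # chain rule: df/dx = df/dy * dy/dx

# Central-difference check of the same derivative
h = 1e-6
numeric = (f_of_y(y_of_x(x + h)) - f_of_y(y_of_x(x - h))) / (2 * h)

print("chain rule:", df_dx)
print("numerical:", numeric)
```

Backpropagation applies this same product of local derivatives once per layer, moving from the output back toward the input.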
3.2 Forward Pass
1. Input Processing
Data normalization
Weight initialization
Bias addition
2. Layer Computation
# Pseudo-code for forward pass (A starts as the input X; each layer has its own W and b)
A = X
for W, b in network:
    Z = W @ A + b        # Linear transformation (matrix product)
    A = activation(Z)    # Apply activation function
3. Output Generation
Final layer activation
Prediction computation
Error calculation
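The forward pass can be made concrete with a two-layer network. The sketch below assumes a column-per-example layout, a ReLU hidden layer, and a sigmoid output; the layer sizes and the random initialization are illustrative choices only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass; X has shape (n_features, m), one column per example."""
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1      # hidden layer: linear transformation
    A1 = relu(Z1)         # hidden layer: activation
    Z2 = W2 @ A1 + b2     # output layer: linear transformation
    A2 = sigmoid(Z2)      # output layer: probability-like prediction
    return A2

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, m = 3, 4, 1, 5   # illustrative sizes
params = (
    rng.normal(size=(n_hidden, n_in)) * 0.1,
    np.zeros((n_hidden, 1)),
    rng.normal(size=(n_out, n_hidden)) * 0.1,
    np.zeros((n_out, 1)),
)
X = rng.normal(size=(n_in, m))
A2 = forward(X, params)
print(A2.shape)   # one prediction per example
```

The bias vectors have shape (units, 1), so NumPy broadcasting adds the same bias to every example column.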
3.3 Backward Pass
1. Error Calculation
Compare output with target
Calculate loss using cost function
Initialize gradient computation
2. Weight Updates
Calculate gradients using chain rule
Update weights:
W_new = W_old - learning_rate * gradient
Update biases similarly
3. Detailed Steps
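# Pseudo-code for backward pass (m = number of examples, one example per column)
Output layer:
    dZ = A - Y    # Holds for a linear output with MSE (with a 1/2 factor), or a sigmoid/softmax output with cross-entropy
    dW = (1/m) * dZ @ A_prev.T
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
Hidden layers:
    dA = W_next.T @ dZ_next              # Error propagated back from the next layer
    dZ = dA * activation_derivative(Z)
    dW = (1/m) * dZ @ A_prev.T
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)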
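The backward-pass equations can be checked numerically. The sketch below assumes a two-layer network with a tanh hidden layer, a linear output, and the loss J = (1/(2m)) Σ (A2 - Y)^2, for which the output-layer error is A2 - Y. All sizes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                                   # number of examples (columns)
X = rng.normal(size=(3, m))
Y = rng.normal(size=(1, m))
W1 = rng.normal(size=(4, 3)) * 0.5
b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4)) * 0.5
b2 = np.zeros((1, 1))

def loss(W1_):
    """Loss as a function of W1 only, used for the numerical check."""
    A1_ = np.tanh(W1_ @ X + b1)
    A2_ = W2 @ A1_ + b2
    return np.sum((A2_ - Y) ** 2) / (2 * m)

# Forward pass
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
A2 = W2 @ A1 + b2

# Backward pass: output layer
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m

# Backward pass: hidden layer
dA1 = W2.T @ dZ2                        # error propagated back through W2
dZ1 = dA1 * (1.0 - A1 ** 2)             # tanh'(Z1) = 1 - tanh(Z1)^2
dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# Numerical gradient for a single entry of W1
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += h
W1_minus[0, 0] -= h
numeric = (loss(W1_plus) - loss(W1_minus)) / (2 * h)

print("backprop:", dW1[0, 0], "numerical:", numeric)
```

A gradient check like this is a common way to catch sign or transpose mistakes before training with a hand-written backward pass.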
4. Regularization for Deep Learning
4.1 L1 Regularization
1. Mathematical Form
Adds absolute value of weights to loss
Formula: L1=λ∑ |W|
2. Properties
Feature selection capability
Produces sparse models
Less sensitive to outliers
4.2 L2 Regularization
1. Mathematical Form
Adds squared weights to loss
Formula: L2 = λ∑ W^2
Prevents large weights
2. Properties
Smooth weight decay
No sparse solutions
More stable training
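The two penalties and their gradients can be sketched as follows. The weight matrix and the value of λ are illustrative assumptions; in practice the penalty is added to the data loss and its gradient is added to the weight gradient.

```python
import numpy as np

def l1_penalty(W, lam):
    """L1 term: lam * sum of absolute weights."""
    return lam * np.sum(np.abs(W))

def l1_grad(W, lam):
    """(Sub)gradient of the L1 term: constant magnitude, sign of each weight."""
    return lam * np.sign(W)

def l2_penalty(W, lam):
    """L2 term: lam * sum of squared weights."""
    return lam * np.sum(W ** 2)

def l2_grad(W, lam):
    """Gradient of the L2 term: proportional to each weight."""
    return 2.0 * lam * W

W = np.array([[0.5, -1.0], [0.0, 2.0]])   # illustrative weights
lam = 0.1                                  # illustrative regularization strength

print("L1 penalty:", l1_penalty(W, lam), "gradient:", l1_grad(W, lam))
print("L2 penalty:", l2_penalty(W, lam), "gradient:", l2_grad(W, lam))
```

The L1 gradient pushes every nonzero weight toward zero with the same force, which tends to drive small weights exactly to zero (sparsity). The L2 gradient shrinks with the weight, so weights decay smoothly but rarely reach zero.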
4.3 Dropout
1. Basic Concept
Randomly deactivate neurons
Probability p of keeping each neuron
Different network for each training batch
2. Implementation Details
# Pseudo-code for (inverted) dropout; p is the probability of keeping a neuron
mask = np.random.binomial(1, p, size=A.shape)   # 1 = keep, 0 = drop
A = A * mask
A = A / p    # Scale to maintain expected value
3. Training vs. Testing
Used only during training
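No extra scaling is needed at inference when activations are divided by p during training (inverted dropout); without that division, activations are scaled by p at inference
Acts like an ensemble of sub-networks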
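The training versus testing behavior can be sketched with a single function that takes a mode flag. This assumes inverted dropout with keep probability p; the function name, the `training` flag, and the test array are illustrative.

```python
import numpy as np

def dropout(A, p, training, rng):
    """Inverted dropout: drop units in training, pass activations through at inference."""
    if not training:
        return A                              # no masking or scaling at inference
    mask = rng.binomial(1, p, size=A.shape)   # 1 = keep, 0 = drop
    return A * mask / p                       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
A = np.ones((1000, 100))                      # illustrative activations

out_train = dropout(A, p=0.8, training=True, rng=rng)
out_test = dropout(A, p=0.8, training=False, rng=rng)

print("training mean:", out_train.mean())     # close to 1.0 on average
print("inference mean:", out_test.mean())
```

Kept units are scaled up by 1/p during training, so the average activation seen by the next layer matches what it sees at inference.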
4.4 Early Stopping
1. Implementation
Monitor validation error
Save best model
Stop when validation error stops improving (in practice, after it has failed to improve for several consecutive epochs)
2. Benefits
Prevents overfitting
Reduces training time
Automatic model selection
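A minimal sketch of the early stopping logic, applied to a pre-recorded list of validation losses. The `patience` parameter (how many non-improving epochs to tolerate) and the loss values are assumptions for illustration; in a real loop the "save best model" step would store the weights.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # a real loop would save the model here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch          # stop training early
    return best_epoch, len(val_losses) - 1        # never triggered: ran to the end

val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]   # illustrative values
best, stopped = early_stopping_epoch(val_losses, patience=2)
print("best epoch:", best, "stopped at epoch:", stopped)
```

With these values the best loss occurs at epoch 2 and training stops at epoch 4, after two epochs without improvement.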
5. Advanced Concepts
5.1 Batch Normalization
1. Purpose
Normalizes layer inputs
Reduces internal covariate shift
Speeds up training
2. Algorithm
# Pseudo-code for batch normalization (training mode); x has shape (batch, features)
mean = np.mean(x, axis=0)
var = np.var(x, axis=0)
x_norm = (x - mean) / np.sqrt(var + eps)   # eps: small constant for numerical stability
out = gamma * x_norm + beta                # gamma, beta: learnable scale and shift
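To see the effect of these equations, the sketch below normalizes a batch whose features start far from zero mean and unit variance. It covers the training-mode forward computation only; the input distribution and the eps value are illustrative assumptions.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = np.mean(x, axis=0)
    var = np.var(x, axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # batch of 64, 4 features

out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print("mean per feature:", out.mean(axis=0))       # approximately 0
print("std per feature:", out.std(axis=0))         # approximately 1

shifted = batch_norm_forward(x, gamma=2.0 * np.ones(4), beta=np.ones(4))
print("mean with beta=1:", shifted.mean(axis=0))   # approximately 1
print("std with gamma=2:", shifted.std(axis=0))    # approximately 2
```

The second call shows that gamma and beta let the network choose any output scale and offset, so normalization does not limit what the layer can represent.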
5.2 Weight Initialization
1. Xavier/Glorot Initialization
Variance = 2 / (n_in + n_out)
Suitable for tanh activation
2. He Initialization
Variance = 2 / n_in
Better for ReLU activation
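Both schemes can be sketched by drawing weights from a zero-mean normal distribution with the stated variance. Using a normal (rather than uniform) distribution, the function names, and the layer sizes are assumptions made for this example.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Xavier/Glorot: variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_init(n_in, n_out, rng):
    """He: variance 2 / n_in."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W_xavier = xavier_init(512, 256, rng)   # illustrative layer: 512 inputs, 256 outputs
W_he = he_init(512, 256, rng)

print("Xavier variance:", W_xavier.var(), "target:", 2.0 / (512 + 256))
print("He variance:", W_he.var(), "target:", 2.0 / 512)
```

The sample variances should land close to their targets; He initialization uses the larger variance to compensate for ReLU zeroing out roughly half of the activations.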
6. Practical Implementation
6.1 Network Design Considerations
1. Architecture Choices
Number of layers
Neurons per layer
Activation functions
2. Hyperparameter Selection
Learning rate
Batch size
Regularization strength
1. Basic Concepts
Explain the role of activation functions in neural networks
Compare and contrast different types of gradient descent
Describe the vanishing gradient problem
2. Mathematical Problems
Calculate gradients for a simple 2-layer network
Implement batch normalization equations
Compute different loss functions
3. Implementation Challenges
Design a network for MNIST classification
Implement dropout in Python
Create a custom loss function
6.2 Training Process
1. Data Preparation
Splitting data
Normalization
Augmentation
2. Training Loop
Forward pass
Loss computation
Backward pass
Parameter updates
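The steps above can be sketched end to end on a small problem. To stay short, this example trains a single-layer logistic classifier with full-batch gradient descent on synthetic data and skips augmentation; the data, the 80/20 split, the learning rate, and the epoch count are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: the label depends on two underlying standard-normal factors
Z = rng.normal(size=(400, 2))
y = (Z[:, 0] + Z[:, 1] > 0).astype(float)
X = Z * np.array([5.0, 0.5]) + np.array([10.0, -3.0])   # features on different scales

# 1. Data preparation: shuffle, split, normalize with training statistics only
idx = rng.permutation(len(X))
train_idx, val_idx = idx[:320], idx[320:]
mean, std = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
X_train, X_val = (X[train_idx] - mean) / std, (X[val_idx] - mean) / std
y_train, y_val = y[train_idx], y[val_idx]

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# 2. Training loop
w, b, lr = np.zeros(2), 0.0, 0.5
losses = []
for epoch in range(200):
    p = predict_proba(X_train, w, b)                       # forward pass
    loss = -np.mean(y_train * np.log(p + 1e-12)
                    + (1 - y_train) * np.log(1 - p + 1e-12))   # loss computation
    losses.append(loss)
    grad_z = (p - y_train) / len(y_train)                  # backward pass
    grad_w, grad_b = X_train.T @ grad_z, grad_z.sum()
    w -= lr * grad_w                                       # parameter updates
    b -= lr * grad_b

val_acc = np.mean((predict_proba(X_val, w, b) > 0.5) == y_val)
print("first loss:", losses[0], "last loss:", losses[-1])
print("validation accuracy:", val_acc)
```

The validation set is normalized with the training mean and standard deviation so that no information from held-out data leaks into training.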
Key Formulas Reference Sheet
1. Activation Functions
Sigmoid: σ(x) = 1 / (1 + e⁻ˣ)
Tanh: tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
ReLU: f(x) = max(0, x)
2. Loss Functions
Mean Squared Error (MSE): (1/n) ∑(y_true - y_pred)²
Cross-Entropy: -∑ (y_true × log(y_pred))
3. Regularization
L1 Regularization: L₁ = λ∑|W|
L2 Regularization: L₂ = λ∑ W²
4. Gradient Descent
Update Rule: θ = θ - α∇J(θ)
Momentum: v = βv - α∇J(θ)
Common Issues and Solutions
1. Vanishing Gradients
Use ReLU activation
Implement batch normalization
Try residual connections
2. Overfitting
Add dropout
Use regularization
Implement early stopping
3. Poor Convergence
Adjust learning rate
Try different optimizers
Check data normalization