Artificial intelligence:
techniques that enable machines to mimic human behavior
Machine learning:
ability to learn without explicit instructions
Deep learning:
extract patterns from raw data using neural networks
Why deep learning:
hand-coding features is time-consuming and does not scale
Why use it now:
-easier to collect and store large amounts of data
-higher performing hardware
-new techniques and frameworks, and new architectures and models
Neural network basics:
Activation functions:
Activation functions introduce nonlinearity into the network. Consider, for example, separating green points from red points: without an activation function the network can only draw a linear decision boundary, so no matter how the points are dispersed it behaves like linear regression. Activation functions allow it to capture higher-level, nonlinear patterns.
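A minimal NumPy sketch of three common activation functions (the function names are just for this illustration), with a comment on why stacking layers without one stays linear:

import numpy as np

def sigmoid(z):            # squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):               # squashes values into (-1, 1)
    return np.tanh(z)

def relu(z):               # zeroes out negative values
    return np.maximum(0.0, z)

# Without an activation, two "layers" collapse into one linear map:
# W2 @ (W1 @ x) == (W2 @ W1) @ x, so the model can only draw straight
# decision boundaries, exactly like linear regression.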
Multi-layered neural networks:
-multi-output perceptron: instead of a single output we have multiple outputs computed from the same inputs, but with different weights for each output
-single hidden layer network: this network has one hidden layer in which the model learns patterns; each node in the layer computes a weighted sum of the inputs X with its weights W, plus a bias
This value is then passed through an activation function to add nonlinearity and then passed to the output layer, which has its own weights and activation function (see the sketch below)
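A minimal NumPy sketch of this forward pass; the weight shapes, names, and example values are illustrative assumptions:

import numpy as np

def forward(x, W1, b1, W2, b2, g=np.tanh):
    # hidden layer: weighted sum of inputs X with weights W plus a bias,
    # passed through an activation g to add nonlinearity
    z1 = W1 @ x + b1
    h = g(z1)
    # output layer: its own weights and activation
    z2 = W2 @ h + b2
    return g(z2)

x = np.array([0.5, -1.2, 3.0])                  # 3 inputs
W1 = np.random.randn(4, 3); b1 = np.zeros(4)    # 4 hidden nodes
W2 = np.random.randn(2, 4); b2 = np.zeros(2)    # 2 outputs
print(forward(x, W1, b1, W2, b2))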
Quantifying loss:
the loss of the network measures the difference between the predicted value and the actual value for a single example, and is written as L(f(x^(i); W), y^(i)), where f(x^(i); W) is the prediction and y^(i) is the actual value
Empirical loss:
measures the average loss over the entire dataset: J(W) = (1/n)∑ L(f(x^(i); W), y^(i)); also known as the cost function, objective function, or risk
Binary cross entropy:
this loss function is specifically used for classification problems (0/1 output):
J(W) = -(1/n)∑ [ y^(i)·log(ŷ^(i)) + (1 − y^(i))·log(1 − ŷ^(i)) ]
Mean squared error:
this is used for regression models with continuous output:
J(W) = (1/n)∑ (y^(i) − ŷ^(i))²
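A small NumPy sketch of both losses; the epsilon clipping is an added assumption to avoid log(0):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # classification loss for targets in {0, 1}
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mean_squared_error(y, y_hat):
    # regression loss for continuous targets
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, y_hat), mean_squared_error(y, y_hat))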
Training neural networks:
training a neural network means updating the weights of the model until the loss (the difference between predicted and actual values) is minimal
Gradient descent:
-initialize weights randomly
-loop until convergence (either we run out of iterations or the improvement becomes too small):
-calculate the gradient ∂J(W)/∂W
-update the weights: W ← W − η·∂J(W)/∂W
where η is the learning rate, which indicates how big a step we take
-return weights
This approach is computationally expensive because the gradient is computed over every training example at each step (see the sketch below)
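A minimal sketch of the gradient descent loop above, assuming a simple linear model with MSE loss so the gradient can be written by hand; the dataset and names are illustrative:

import numpy as np

def gradient_descent(X, y, lr=0.01, max_iters=1000, tol=1e-6):
    w = np.random.randn(X.shape[1])                 # initialize weights randomly
    for _ in range(max_iters):                      # loop until convergence
        grad = 2.0 / len(y) * X.T @ (X @ w - y)     # gradient of MSE over ALL examples
        step = lr * grad                            # lr (eta) controls the step size
        w -= step                                   # update the weights
        if np.linalg.norm(step) < tol:              # improvement too small -> stop
            break
    return w                                        # return weights

X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(gradient_descent(X, y))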
Stochastic gradient descent:
instead of using the entire dataset like regular gradient descent, stochastic gradient descent picks a single random example at each step, updates the weights based on it, and continues until convergence; the updates are noisier because each one is based on a single random example
Mini-batch gradient descent:
this one is a compromise between regular and stochastic gradient descent: it picks a small batch of examples, computes the gradient on that batch, and updates the weights (see the sketch below)
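Only the gradient computation changes between the variants; a sketch of a single mini-batch update that could replace the full-batch gradient step in the loop above (batch size 1 would give stochastic gradient descent, the full set would give regular gradient descent):

import numpy as np

def minibatch_step(X, y, w, lr=0.01, batch_size=32):
    idx = np.random.choice(len(y), size=batch_size, replace=False)  # pick a small part of the set
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)  # gradient on the mini-batch only
    return w - lr * grad                            # noisier than full-batch, cheaper per step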
Batch normalization:
Batch normalization is a technique that helps models train faster and more stably; the distribution of inputs to each layer keeps changing during training, which is known as internal covariate shift
Normalization:
Also known as min-max scaling, it compresses all values to lie between 0 and 1 while preserving the shape of the data
Standardization:
transforms the data so that the mean is 0 and the variance is 1 (it rescales the data but does not by itself make the distribution normal)
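Both transforms in a couple of NumPy lines (x is an arbitrary 1-D array of example values):

import numpy as np

x = np.array([3.0, 7.0, 10.0, 50.0])
minmax = (x - x.min()) / (x.max() - x.min())    # normalization: values in [0, 1], shape preserved
standard = (x - x.mean()) / x.std()             # standardization: mean 0, variance 1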
Batch normalization process:
-select a mini batch
-compute mean : μ = (1/m)∑(x_i)
-compute variance: σ² = (1/m)∑(x_i - μ)²
-normalize: x̂_i = (x_i - μ)/√(σ² + ε), where ε prevents division by zero
-scale and shift: y_i = γx̂_i + β where γ is a learnable scale and β is a learnable shift
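A NumPy sketch of the four steps above for one mini-batch of a single feature; γ and β would be learned during training, here they are just fixed defaults:

import numpy as np

def batch_norm(x_batch, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x_batch.mean()                             # mini-batch mean
    var = x_batch.var()                             # mini-batch variance
    x_hat = (x_batch - mu) / np.sqrt(var + eps)     # normalize (eps prevents division by zero)
    return gamma * x_hat + beta                     # learnable scale and shift

print(batch_norm(np.array([2.0, 4.0, 6.0, 8.0])))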
Benefits:
-centers input around 0
-achieve accuracy faster
-better performance
-removes the need for a separate standardization step
-adds a slight regularization effect, reducing the need for other regularization
-each epoch takes longer, but the model converges in fewer epochs
Computer vision:
the goal is to discover what is going on in the world from visual data and to predict upcoming events based on it. This field has applications in medicine, automation, accessibility, and robotics
What do computers see:
to a computer, an image is a matrix of numbers with a width, a height, and a depth; the depth is the number of channels. An RGB image has three channels, so a 1080*1080 image is stored as 1080*1080*3
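For example, in NumPy (the pixel values are random stand-ins):

import numpy as np

rgb_image = np.random.randint(0, 256, size=(1080, 1080, 3), dtype=np.uint8)
print(rgb_image.shape)                  # (1080, 1080, 3): height x width x 3 color channels
gray_image = rgb_image.mean(axis=2)     # a grayscale image has a single channel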
Computer vision tasks:
Regression: predicting a continuous value
Classification: predicting the class of an object
Manual feature extraction:
to identify features manually, we first need domain knowledge, then a definition of the feature we
want to extract, then we detect features and classify them
Fully connected neural network:
a fully connected network has every neuron in the current layer connected to every neuron in the previous layer
in this architecture, the input image is flattened into a vector of pixel values; this loses the spatial structure, and being fully connected means there are a lot of weights
to solve this issue, we connect only a patch of the input to each neuron in the hidden layer, basically giving it a window onto the original input
we slide this window across the whole input to define the connections
this technique is called convolution (see the sketch below)
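A naive NumPy sketch of convolution: slide a small window (the filter) over the input and compute a weighted sum at each position (no stride or padding handling, just the idea):

import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i+kh, j:j+kw]           # the window of the original input
            out[i, j] = np.sum(patch * kernel)      # weighted sum -> one output value
    return out

edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)      # a simple vertical-edge kernel
print(convolve2d(np.random.rand(5, 5), edge_filter).shape)   # (3, 3)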
CNN:
this is a neural network that learns features by applying filters (kernels) and is usually composed of:
-convolutional layers for feature extraction
-an activation function to add nonlinearity, usually ReLU
-pooling, either max or average, to downsize the resulting feature maps
for a neuron in the hidden layer:
-it takes a patch
-calculates the weighted sum over the patch and adds a bias
The output of a convolutional layer is a volume whose height and width are spatial dimensions and whose depth is the number of filters, i.e. the number of feature maps extracted
The ReLU activation function replaces all negative values after a convolution with 0
Pooling:
pooling is downsampling while retaining spatial invariance; it is done by sliding a window across the feature maps and keeping the maximum value (max pooling) or the average value (average pooling), as in the sketch below
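A sketch of ReLU followed by 2*2 max pooling on a small feature map (NumPy; a stride equal to the pool size is assumed):

import numpy as np

def relu(fmap):
    return np.maximum(fmap, 0.0)                    # negative values become 0

def max_pool(fmap, size=2):
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max()                # keep only the strongest response
    return out

fmap = np.random.randn(4, 4)
print(max_pool(relu(fmap)))                         # 2x2 downsampled feature map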
CNN for classification:
after convolution, activation, and pooling, the resulting features are passed to a fully connected layer for classification, which outputs the probability of the image belonging to each class
LeNet-5:
-introduced in 1998 (earlier LeNet versions date back to 1989) for handwritten character recognition, but was limited by the hardware of the time
Key features:
-convolutional layers extract features
-tanh activation function
-fully connected layer for classification
-sparse connections reduce complexity
Architecture:
two convolutional layers, each followed by a pooling (subsampling) layer, then fully connected layers and an output layer (see the sketch below)
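A hedged PyTorch-style sketch of the commonly cited LeNet-5 layout described above (32*32 grayscale input, tanh activations, pooling after each convolution, fully connected layers at the end); exact filter counts follow the widely used version:

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),    # 32x32 -> 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # 14x14 -> 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),             # fully connected layers for classification
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)    # torch.Size([1, 10])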
Advantages:
-simple and efficient for small datasets
-low complexity
-demonstrates effectiveness of CNNs
Disadvantages:
-limited to small inputs (32*32)
-not effective for complex datasets
-requires preprocessing
VGG-Net:
developed by the Visual Geometry Group and introduced in 2014, with either 16 or 19 layers; used for image recognition and detection
VGG-19 adds 3 extra convolutional layers over VGG-16, which gives slightly better performance
Advantages:
-easy to implement
-small (3*3) filters extract better features with reduced computational load
-widely available pre trained models
-versatile
Disadvantages:
-very slow to train
-heavy memory and compute requirements
-inefficient due to its very large number of parameters
Applications:
-pre trained models in transfer learning
-object detection and feature extraction
GoogLeNet:
introduced in 2014, also known as Inception v1; it focuses on both accuracy and efficiency
Why it was introduced:
-shallow architectures like LeNet-5 and AlexNet were reaching their limits
-vanishing gradients
-high computational costs in deeper networks
Architectural highlights:
Uses inception modules for parallel processing at different scales
Uses global average pooling instead of large fully connected layers
Uses auxiliary classifiers during training to combat vanishing gradients
What are inception modules:
introduced with GoogLeNet; they capture features more efficiently than earlier models by processing the input through several filter sizes in parallel (see the sketch below)
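A simplified PyTorch sketch of one inception module: the same input goes through 1*1, 3*3, and 5*5 convolution branches plus a pooling branch in parallel, and the resulting feature maps are concatenated; the channel counts here are arbitrary assumptions, not GoogLeNet's exact ones:

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)          # 1x1 conv
        self.branch3 = nn.Sequential(                               # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 24, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                               # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 8, kernel_size=1), nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(                           # pooling branch
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 8, kernel_size=1))

    def forward(self, x):
        # run all branches on the same input in parallel and stack along the channel dimension
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

print(InceptionModule(32)(torch.randn(1, 32, 28, 28)).shape)         # torch.Size([1, 56, 28, 28])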
Evolution of inception architecture:
-Inception-v2: batch normalization and more efficient convolutions
-Inception-v3: factorized convolutions
-Inception-v4 and ResNet hybrids: residual connections for better training
What is GoogLeNet used for:
-image classification
-object detection
-image segmentation
-video analysis
Advantages:
-efficient feature extraction
-reduced parameters
-improved accuracy
Disadvantages:
-complex architecture
-requires significant expertise to customize
-computationally expensive during inference
Legacy of GoogLeNet:
-set a foundation for modular architectures
-paved the way for hybrid models such as the Inception-ResNet variants
-inspired lightweight models
Residual networks:
early architectures added more layers to reduce error rates, but this led to new problems such as vanishing gradients and higher computational cost
Why ResNets:
the issue was that deeper networks tend to perform worse than shallower networks due to vanishing or exploding gradients
This was solved by introducing residual blocks with skip connections
Residual blocks and skip connections:
provide two paths for information to flow, effectively allowing the network to bypass layers it deems unnecessary; this lets the network learn residuals and mitigates vanishing and exploding gradients (see the sketch below)
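A PyTorch sketch of a basic residual block, assuming equal input and output channel counts: one path goes through the stacked layers, while the skip connection adds the input straight back, so the layers only need to learn the residual:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))   # path 1: the stacked layers
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)                  # path 2: skip connection adds x back

print(ResidualBlock(16)(torch.randn(1, 16, 8, 8)).shape)   # torch.Size([1, 16, 8, 8])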
How resnet works:
inspired by the VGG-19 architecture, the original ResNet is a 34-layer network with shortcut (skip) connections; the CIFAR dataset was used to test depths of up to 1000 layers
Why do they work:
-overcome problems with very deep networks
-poorly performing layers can be bypassed by the network
-superior accuracy
Sequence modeling applications:
-machine translation
-image captioning
-sentiment classification
Neurons with recurrence:
in a neuron with recurrence, the neuron maintains an internal state called the hidden state; each output depends on the current input and on the previous state (the context).
Recurrent neural networks :
in an RNN, information about past inputs is maintained through a state. At every step while processing a sequence, a recurrence relation is applied: h_t = f_W(h_{t-1}, x_t), where h_t is the new state, h_{t-1} the previous state, and x_t the current input
RNN state update and output:
to update the hidden state, we take the input vector and:
apply the tanh activation function to the hidden-to-hidden weights times the previous state plus the input-to-hidden weights times the input, which gives the new state: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)
To get the output:
we multiply the hidden-to-output weights by the new state: ŷ_t = W_hy·h_t (see the sketch below)
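A NumPy sketch of a single recurrent step using the update and output rules above; the weight shapes and the example sequence are illustrative assumptions:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # new hidden state from previous state + input
    y_t = W_hy @ h_t                            # output from the new state
    return h_t, y_t

hidden, inputs, outputs = 8, 4, 3
W_xh = np.random.randn(hidden, inputs)
W_hh = np.random.randn(hidden, hidden)
W_hy = np.random.randn(outputs, hidden)

h = np.zeros(hidden)
for x_t in np.random.randn(5, inputs):          # process a sequence of length 5
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)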
Backpropagation through time:
when we unroll an RNN, we see that at each time step t it calculates a hidden state and an output; the model then computes a loss L_t by comparing the prediction with the actual value, and the total loss is L = ∑_t L_t
The gradient of this total loss is then backpropagated through the unrolled RNN
The problem of long term dependency:
when computing gradients with BPTT, derivatives are multiplied repeatedly; if these factors are small, the repeated multiplication causes the gradients to shrink exponentially as we move backwards through time. This means errors from long-term dependencies contribute less and less, which biases the network towards short-term dependencies