Artificial Intelligence

The document provides an overview of artificial intelligence, machine learning, and deep learning, detailing techniques such as neural networks, activation functions, and various training methods like gradient descent. It discusses the architecture and applications of convolutional neural networks (CNNs), including notable models like LeNet-5, VGG-Net, and GoogleNet, as well as the challenges of deep networks addressed by residual networks. Additionally, it covers sequence modeling applications and the mechanics of recurrent neural networks (RNNs), highlighting issues such as long-term dependency in training.


Artificial intelligence:

techniques that enable machines to mimic human behavior

Machine learning:
ability to learn without explicit instructions

Deep learning:
extract patterns from raw data using neural networks

Why deep learning:


hand-coding features is time-consuming and not scalable

Why use deep learning now:

-it is easier to collect and store large amounts of data

-higher-performing hardware

-new techniques, frameworks, architectures and models


Neural network basics:

Activation functions :

Activation functions are used to introduce nonlinearity into the network. Consider, for example, trying to separate green points from red points: without an activation function, the network can only draw a straight (linear) decision boundary.

No matter how the data is dispersed, the network would always act like linear regression; activation functions allow it to recognize more complex, nonlinear patterns.
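As a rough illustration (not part of the original notes), here is a minimal NumPy sketch of three commonly used activation functions:

```python
import numpy as np

# Minimal sketch of common activation functions; the input values are made up.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # keeps positive values, zeroes out negatives

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(z), tanh(z), relu(z))
```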
Multi-layered neural networks:
-multi-output perceptron: instead of a single output we have multiple outputs that share the same inputs but have different weights

-single-layer network: this network has one hidden layer in which the model learns patterns; each node in the layer computes a weighted sum of the inputs X with its weights W, plus a bias

This value is then passed through an activation function to add nonlinearity and then passed to the output layer; the output layer has its own weights and activation function
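A minimal sketch of this forward pass, with hypothetical layer sizes (3 inputs, 4 hidden units, 2 outputs) chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output-layer weights and biases

def relu(z):
    return np.maximum(0.0, z)

def forward(x):
    h = relu(W1 @ x + b1)    # hidden layer: weighted sum + bias, then activation
    return W2 @ h + b2       # output layer has its own weights (its activation is omitted here)

print(forward(np.array([1.0, -0.5, 2.0])))
```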

Quantifying loss:
the network's loss quantifies the difference between the predicted value and the actual value; for a single training example it can be written as L(f(x_i; W), y_i), where f(x_i; W) is the prediction and y_i is the actual value

Empirical loss:

Measures the average loss over the entire dataset, J(W) = (1/n)∑_i L(f(x_i; W), y_i); also known as the cost function, objective function or empirical risk
Binary cross entropy:
this loss function is specifically used for classification problems (0/1 output): J(W) = -(1/n)∑_i [y_i log f(x_i; W) + (1 - y_i) log(1 - f(x_i; W))]

Mean squared error:


this is used for regression models with continuous output: J(W) = (1/n)∑_i (y_i - f(x_i; W))²
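As a hedged illustration, both loss functions can be sketched in NumPy as follows (the array values are made up):

```python
import numpy as np

# Sketch of the two loss functions; y_true holds actual values, y_pred model predictions.
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(mean_squared_error(np.array([2.0, 3.5]), np.array([2.5, 3.0])))
```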

Training a neural network:


training a neural network means updating the weights of the model until the difference between the predicted values and the actual values (the loss) is as small as possible

Gradient descent:
-initialize the weights randomly

-loop until convergence, meaning we either run out of iterations or the improvements become too small:

-calculate the gradient of the loss with respect to the weights

-update the weights: W ← W - η∇J(W), where η is the learning rate and indicates how big a step we take

-return the weights

This approach is computationally expensive because every update is computed over all training examples
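A minimal sketch of this loop on a toy one-dimensional loss J(w) = (w - 3)², chosen only for illustration:

```python
import numpy as np

# Toy loss J(w) = (w - 3)^2, whose gradient is 2 * (w - 3); loss and learning rate are illustrative.
def gradient(w):
    return 2.0 * (w - 3.0)

w = np.random.randn()        # initialize the weights randomly
eta = 0.1                    # learning rate: how big a step we take
for _ in range(100):         # loop until convergence (here: a fixed iteration budget)
    w -= eta * gradient(w)   # update the weights against the gradient

print(w)                     # converges toward the minimum at w = 3
```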

Stochastic gradient descent:

Instead of using the entire dataset like regular gradient descent, stochastic gradient descent picks a random example at each step, updates the weights based on that single example, and continues until convergence; the updates are noisier due to the random choice of example

Mini batches gradient descent:

This one is a compromise between full gradient descent and stochastic gradient descent: it picks a small batch of examples, computes the gradient over that batch, and then updates the weights
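A rough sketch of a mini-batch update for a simple linear model (the data, batch size and learning rate are illustrative assumptions):

```python
import numpy as np

# Mini-batch gradient descent on linear regression with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                # 1000 examples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0])  # targets built from a known weight vector

w = np.zeros(5)
eta, batch_size = 0.05, 32
for _ in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # pick a small random batch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)            # gradient of MSE on the batch
    w -= eta * grad                                           # update the weights

print(w)   # approaches the true weight vector
```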

Batch normalization:

Batch normalization is a technique used to help models train faster and more stably; the distribution of each layer's inputs keeps changing during training, which is known as internal covariate shift

Normalization:

Also known as min-max scaling, it compresses all values to be between 0 and 1 but preserves the shape of the data

Standardization:
transforms the data so the mean is 0 and the variance is 1 (the same scale as a standard normal distribution)
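A small sketch contrasting the two rescaling schemes on a toy array:

```python
import numpy as np

# Min-max scaling vs. standardization on made-up values.
x = np.array([2.0, 4.0, 6.0, 10.0])

x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max scaling: values in [0, 1]
x_std = (x - x.mean()) / x.std()                 # standardization: mean 0, variance 1

print(x_minmax, x_std)
```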

Batch normalization process:


-select a mini batch

-compute mean : μ = (1/m)∑(x_i)

-compute variance: σ² = (1/m)∑(x_i - μ)²

-normalize: x̂_i = (x_i - μ)/√(σ² + ε), where ε prevents division by zero

-scale and shift: y_i = γx̂_i + β where γ is a learnable scale and β is a learnable shift
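A minimal sketch of this forward pass, assuming the mini-batch is a (batch, features) NumPy array and γ, β are initialized to 1 and 0:

```python
import numpy as np

# Batch-normalization forward pass as described above; gamma/beta defaults are illustrative.
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize (eps avoids division by zero)
    return gamma * x_hat + beta            # scale and shift

batch = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(32, 10))
print(batch_norm(batch).mean(axis=0).round(6))   # roughly 0 per feature
```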
Benefits:
-centers the inputs around 0

-reaches a given accuracy faster

-better overall performance

-no separate standardization layer is needed

-reduces the need for additional regularization

-each epoch takes longer, but convergence requires fewer epochs

Computer vision:
discover what is going on in the world from visual data and predict upcoming events based on it. This field has applications in medicine, automation, accessibility and robotics

What do computers see:


to a computer an image is a matrix of numbers with a width, a height and a depth; the depth represents how many channels the image has. An RGB image has three channels, so a 1080*1080 RGB image is stored as a 1080*1080*3 matrix

Computer vision tasks:


Regression: predicting a continuous value

Classification: predicting the class of an object

Manual feature extraction:


to identify features manually, we first need domain knowledge, then a definition of the features we want to extract; we then detect those features and classify them

Fully connected neural network:


a fully connected network has every neuron in the current layer connected to every neuron in the previous layer

in this architecture, the input is flattened into a vector of pixel values. This leads to the loss of the spatial structure, and being fully connected means there are a lot of weights

to solve this issue, we connect only patches of the input to each neuron in the hidden layer, essentially giving it only a window of the original input

we slide this window across the whole image to define the connections

this technique is called convolution
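As a rough sketch (image size and filter values are illustrative), sliding a window across the image and taking a weighted sum at each position looks like this:

```python
import numpy as np

# Single-channel 2D convolution: slide a small filter across the image and take
# a weighted sum at each position.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # window of the original input
            out[i, j] = np.sum(patch * kernel)  # weighted sum over the patch
    return out

image = np.random.default_rng(0).random((6, 6))
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # hypothetical vertical-edge kernel
print(conv2d(image, edge_filter).shape)         # (4, 4) feature map
```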

CNN:
this is a neural network that learns features by applying filters (kernels) and is usually composed of:
- convolutional layers for feature extraction

- an activation to add nonlinearity, usually ReLU

- pooling, either max or average, to downsize the resulting feature maps

for a neuron in the hidden layer:


-it takes a patch

-calculates the weighted sum and adds the bias

The output of a convolutional layer is a volume whose height and width are spatial dimensions and whose depth is the number of filters (feature maps) extracted

The ReLU activation function replaces all negative values after a convolution with 0

Pooling:
pooling is downsampling while retaining spatial invariance; it is done by sliding a window across the feature maps and picking the maximum value (max pooling) or the average value (average pooling)
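A minimal sketch of 2x2 max pooling on a single feature map (sizes are illustrative):

```python
import numpy as np

# 2x2 max pooling with stride 2 on one feature map.
def max_pool2x2(feature_map):
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            pooled[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()  # keep the strongest response
    return pooled

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2x2(fmap))   # 2x2 output, each value is the max of a 2x2 block
```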
CNN for classification:
after convolution, activation and pooling, the resulting features are passed to a fully connected layer for classification, which outputs the probability that the image belongs to each class
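For illustration only, a small classifier of this shape could be sketched in PyTorch as follows (the layer sizes are hypothetical, assuming 32*32 RGB inputs and 10 classes):

```python
import torch
import torch.nn as nn

# Hypothetical small CNN: convolution -> ReLU -> pooling twice, then a fully connected classifier.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution: feature extraction
    nn.ReLU(),                                    # nonlinearity
    nn.MaxPool2d(2),                              # pooling: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully connected layer: class scores
)

logits = model(torch.randn(1, 3, 32, 32))
print(logits.shape)   # torch.Size([1, 10]); a softmax over these gives class probabilities
```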

LeNet-5:
-introduced in 1989 for character recognition but was limited by the hardware of the time

Key features:
-convolutional layers extract features

-tanh activation function

-fully connected layer for classification

-sparse connections reduce complexity

Architecture: alternating convolutional and subsampling (pooling) layers, followed by fully connected layers and an output layer
Advantages:
-simple and efficient for small datasets

-low complexity

-demonstrated the effectiveness of CNNs

Disadvantages:
-limited to small inputs (32*32)

-not effective for complex datasets

-requires pre-processing

VGG-Net:
developed by the Visual Geometry Group, it has either 16 or 19 layers and is used for image recognition and detection tasks; introduced in 2014

VGG-19 adds 3 extra convolutional layers, which gives slightly better performance

Advantages:
-easy to implement

-small filters extract better features with reduced computational load

-widely available pre-trained models

-versatile
Disadvantages:
-very slow to train

-heavy memory and compute requirements

-inefficient in its use of parameters (very large model size)

Applications:
-pre-trained models for transfer learning

-object detection and feature extraction

GoogLeNet:
introduced in 2014, also known as Inception v1; it focuses on both accuracy and efficiency

Why it was introduced:

-the limitations of shallow architectures like LeNet-5 and AlexNet

-vanishing gradients

-high computational costs in deeper networks

Architectural highlights:

Uses inception modules for parallel processing at different scales

Uses global average pooling

Uses auxiliary classifiers during training to mitigate vanishing gradients

What are inception modules:


introduced with GoogLeNet, they capture features more efficiently than earlier models by processing the input through filters of several sizes in parallel
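As a simplified sketch (the channel counts are illustrative and not GoogLeNet's actual configuration), an inception-style module with parallel branches could look like this in PyTorch:

```python
import torch
import torch.nn as nn

# Simplified inception-style module: parallel branches with different filter sizes
# whose outputs are concatenated along the channel dimension.
class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)   # stack feature maps from all scales

out = InceptionModule(3)(torch.randn(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 64, 32, 32])
```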
Evolution of inception architecture:
-Inception-v2: batch normalization and more efficient convolutions

-Inception-v3: factorized convolutions and further refinements

-Inception-v4 and ResNet hybrids: residual connections for better training

What is GoogLeNet used for:


-image classification

-object detection

-image segmentation

-video analysis

Advantages:
-efficient feature extraction

-reduced parameters

-improved accuracy

Disadvantages:
-complex architecture

-requires significant expertise to customize

-computationally expensive during inference

Legacy of GoogLeNet:


-set a foundation for modular architectures

-paved the way for hybrid models

-inspired lightweight models

Residual networks:
early architectures used more layers to reduce error rates, but this led to new problems such as vanishing gradients and higher computational costs
Why ResNets:
the issue was that deeper networks tend to perform worse than shallower networks due to vanishing or exploding gradients

This was solved by introducing residual blocks with skip connections

Residual blocks and skip connections:


provide two paths for information to flow, effectively allowing the network to bypass layers it deems unnecessary; this lets the network learn residuals and mitigates vanishing and exploding gradients
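A hedged PyTorch sketch of a basic residual block with a skip connection (the layer sizes are illustrative, not ResNet's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Basic residual block: two conv layers plus a skip connection that adds the input back.
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                          # skip connection: keep the original input
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # add the shortcut path: the block learns a residual
        return F.relu(out)

print(ResidualBlock(16)(torch.randn(1, 16, 8, 8)).shape)   # same shape as the input
```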

How ResNet works:


inspired by the VGG-19 architecture, it is a 34-layer network with shortcut (skip) connections; the CIFAR dataset was used to test depths of up to 1000 layers

Why do they work:


-overcome problems with very deep networks

-poorly performing layers are bypassed by the network

-superior accuracy

Sequence modeling applications:

-machine translation

-image captioning

-sentiment classification

Neurons with recurrence:


in a neuron with recurrence, there is an additional slot for the state of the neuron, called the hidden state; each output of the neuron depends on its current input and on its previous context (state).
Recurrent neural networks:
in an RNN, information about past inputs is maintained through a state. At every step while processing a sequence, a recurrence relation is applied: h_t = f_W(h_(t-1), x_t), where h_t is the new state, h_(t-1) the previous state and x_t the current input

RNN state update and output:


to update the hidden state, we take the input vector and:

apply the tanh activation function to (hidden-to-hidden weights × previous hidden state) plus (input-to-hidden weights × input), which gives the new state: h_t = tanh(W_hh h_(t-1) + W_xh x_t)

To get the output:

we multiply the hidden-to-output weights by the new state: ŷ_t = W_hy h_t
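A minimal NumPy sketch of one RNN step, with illustrative dimensions (3-dimensional input, 5-dimensional hidden state, 2-dimensional output):

```python
import numpy as np

# Single RNN step; the weight matrices are random stand-ins for learned parameters.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(5, 3))   # input-to-hidden weights
W_hh = rng.normal(size=(5, 5))   # hidden-to-hidden weights
W_hy = rng.normal(size=(2, 5))   # hidden-to-output weights

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # update the hidden state
    y_t = W_hy @ h_t                            # output computed from the new state
    return h_t, y_t

h = np.zeros(5)
for x_t in [np.array([1.0, 0.0, -1.0]), np.array([0.5, 0.5, 0.5])]:
    h, y = rnn_step(x_t, h)   # the same weights are reused at every time step
print(y)
```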

Backpropagation through time:


When we unroll an RNN, we notice that at each time step t it calculates a hidden state and an output; the model then calculates the loss by comparing the prediction with the actual value, and the total loss is the sum of the losses over all time steps: L = ∑_t L_t

The gradient of this total loss is then backpropagated through the unrolled RNN

The problem of long-term dependencies:


when computing gradients with BPTT, derivatives are repeatedly multiplied together; if these factors are small, the repeated multiplication causes the gradients to shrink exponentially as we move backwards through time. This means errors from distant time steps contribute less and less, which biases the network towards short-term dependencies
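A toy numeric illustration of this effect (the factor 0.5 is an arbitrary stand-in for a small derivative):

```python
# Repeatedly multiplying a small derivative factor shrinks the backpropagated signal exponentially.
factor = 0.5
gradient = 1.0
for step in range(1, 51):
    gradient *= factor
    if step in (5, 20, 50):
        print(f"after {step} steps back in time: {gradient:.2e}")
# after 50 steps the contribution is ~8.9e-16: long-term errors barely influence the update
```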
