Convolutional Neural Network
Unit-II
Introduction to Convolutional Neural Network
• A CNN is a deep learning neural network designed for processing structured arrays of data such as images. A CNN is a feed-forward neural network, often with up to 20 or 30 layers.
• It is also called a ConvNet.
• In a CNN, 'convolution' refers to the mathematical operation that combines two functions to produce a third function.
• Two images, represented as two matrices, are multiplied to produce an output that is used to extract information from the image.
Introduction to Convolutional Neural Network
• A CNN represents the input data in the form of multidimensional arrays. It works well with large amounts of labelled data. A CNN extracts features from each portion of the input, called the receptive field, and assigns weights to each neuron based on the significance of its receptive field.
• Instead of preprocessing the data to derive features such as textures and shapes, a CNN takes the image's raw pixel data as input and learns how to extract these features and, ultimately, infer what object they constitute.
• The goal of a CNN is to reduce the images so that they are easier to process, without losing features that are valuable for accurate prediction.
Introduction to Convolutional Neural Network
• A CNN is made up of numerous layers, such as convolutional layers, pooling layers and fully connected layers, and it uses a back-propagation algorithm to learn spatial hierarchies of data automatically and adaptively.
• Most of the large digital companies have adopted CNNs for image recognition, e.g. Google, Amazon, Instagram, Facebook etc.
• A CNN is a neural network consisting of multiple convolutional layers, used mainly for image processing, classification, segmentation and other correlated data.
Advantages of CNN
• A CNN automatically detects the important features without any human supervision
• Computationally efficient
• Higher accuracy
• Weight sharing
• Minimizes computation in comparison with a regular neural network
• Makes use of the same knowledge across all image locations
Disadvantages
• Adversarial attacks: feeding the network 'bad' examples to cause misclassification
• A CNN requires a lot of training data
• Tends to be slower because of operations such as max pooling
Applications of CNN
• Image classification, e.g. determining whether satellite images contain mountains and valleys, or recognizing handwriting; also image segmentation and signal processing
• Object detection: self-driving cars, AI-powered surveillance systems and smart homes can identify objects in photos and in real time, then classify and label them
• Voice synthesis: Google Assistant's voice synthesizer uses DeepMind's WaveNet ConvNet model
• Astrophysics: making sense of radio telescope data and predicting the probable visual image that represents that data
Basic structure of CNN
Basic architecture of CNN
• Image dimensions are 12*12*4
1. Input layer: accepts the image of width 12, height 12 and depth 4.
2. Convolution layer: computes the output volume by taking the dot product between the image filters and image patches. If there are 10 filters, the volume is computed as 12*12*10 (see the sketch below).
3. Activation function layer: applies an activation function to each element of the output of the convolutional layer. Common activation functions are ReLU, Sigmoid, Tanh, Leaky ReLU etc. This does not change the volume obtained at the convolution layer, so it remains 12*12*10.
4. Pool layer: mainly reduces the volume of the intermediate output, which enables faster computation of the model.
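A minimal Keras sketch of this stack (the 12*12*4 input and 10 filters come from the slide; 'same' padding is an assumption needed for the spatial size to stay at 12*12):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

model = Sequential()
# convolution + ReLU: 10 filters, 'same' padding keeps the spatial size at 12*12
model.add(Conv2D(10, (3, 3), padding='same', activation='relu',
                 input_shape=(12, 12, 4)))
# pooling halves the spatial dimensions: 12*12*10 -> 6*6*10
model.add(MaxPooling2D((2, 2)))
model.summary()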
Convolution operation
• The convolution operation extracts/preserves important features from the input. It allows the network to detect horizontal and vertical edges of an image and then, based on those edges, build high-level features.
• Suppose we are tracking the location of a spaceship with a laser sensor. The laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real valued, i.e. we can get a different reading from the laser sensor at any instant in time.
• Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate, we would like to take a weighted average of several measurements.
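With a weighting function w(a) that gives more weight to recent measurements, this weighted average is exactly the convolution operation (this is the standard definition from the deep learning literature; in the discrete case the integral becomes a sum):

s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da

s(t) = \sum_{a} x(a)\, w(t - a) \quad \text{(discrete case)}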
Convolution operation
• The convolution operation uses three elements: the input image, a feature detector and a feature map.
• It involves an input matrix and a filter, also known as a kernel. The input matrix can be the pixel values of a grayscale image, whereas the filter is a relatively small matrix that detects edges by darkening areas of the input image where there are transitions from brighter to darker areas.
• Filters can be vertical, horizontal or diagonal.
• The input image is converted into binary 1s and 0s. The convolution operation is known as the feature detector of a CNN.
Convolution operation
• The feature detector is also referred to as a kernel or a filter. At each step, the kernel is multiplied element-wise by the input data values within its bounds and the products are summed, creating a single entry in the output feature map.
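A minimal NumPy sketch of this sliding-window computation (conv2d_valid is a hypothetical helper written for illustration; note that CNNs typically compute the cross-correlation shown here, without flipping the kernel):

import numpy as np

def conv2d_valid(image, kernel):
    # slide the kernel over the image; at each position, multiply
    # element-wise and sum to produce one entry of the feature map
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out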
Convolution operation
• The size of the image matrix is image height * image width * number of channels.
• A grayscale image has 1 channel and a color image has 3 channels.
• Kernel: a kernel is a small matrix of numbers used in image convolutions.
• Example of a kernel:
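For instance, a 3*3 vertical edge-detection kernel (one common illustrative choice, not the only possibility):

[[-1, 0, 1],
 [-1, 0, 1],
 [-1, 0, 1]]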
Convolution operation
• A CNN develops multiple feature detectors and uses them to produce several feature maps, which are referred to as convolutional layers.
• Through training, the network determines which features are important in order to scan images and categorize them more accurately.
• Gradient descent is used to train the parameters in this layer.
Convolution operation
• Components of convolutional layers:
a) Filters
b) Activation maps
c) Parameter sharing
d) Layer-specific hyper-parameters
• Filters are functions whose width and height are smaller than the width and height of the input volume.
Sparse Interactions
• Sparse interactions are achieved by making the kernel smaller than the input.
[Figure: sparse connectivity in a CNN compared with dense connectivity in a traditional NN.]
• Fewer parameters reduce the storage requirements and improve the model's statistical efficiency; computing the output also requires fewer operations.
Sparse Interactions
• Since CNNs have deep layers, even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
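A quick Keras comparison of the savings (the 32*32 input and layer sizes are illustrative assumptions): a dense layer on the flattened image needs one weight per input-output pair, while a convolutional layer needs only the kernel weights.

from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten

dense_model = Sequential()
dense_model.add(Flatten(input_shape=(32, 32, 1)))
dense_model.add(Dense(100))   # 1024*100 + 100 = 102,500 parameters
dense_model.summary()

conv_model = Sequential()
conv_model.add(Conv2D(100, (3, 3), input_shape=(32, 32, 1)))  # 3*3*100 + 100 = 1,000 parameters
conv_model.summary()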
Parameter sharing
• Parameter sharing is used in a CNN to control the total parameter count: convolutional layers reduce the parameter count by sharing parameters.
• The kernel weights are shared across the input; instead of learning a separate set of kernel parameters for each location, we learn only a single set of parameters.
• A photo of a cat, for example, can be translated one pixel to the right and still be a photo of a cat. By sharing parameters across several image locations, CNNs take this property into account: different locations in the input are computed with the same feature (a hidden unit with the same weights). This means that whether the cat appears in column i or column i + 1 of the image, we can find it with the same cat detector.
Parameter Sharing
• Parameter sharing is the use of the same parameters for more than one function in a model.
• Note: in a traditional NN, each element of the weight matrix is used exactly once when computing the output of a layer.
• CNN tied weights: the value of the weight applied to one input is tied to the value of a weight applied at another location in the CNN.
• Convolution shares the same parameters across all spatial locations.
Equivariant Representation
• The convolution function is equivariant to translation. This means that shifting the input and then applying convolution is equivalent to applying convolution to the input and then shifting the result.
• If we move the object in the input, its representation will move by the same amount in the output.
• General definition:
representation(transform(x)) = transform(representation(x))
• Convolution is not equivariant to other operations such as changes in scale or rotation.
• Shifting the input 2D image right/left or up/down shifts the output of the convolution by the same amount rather than changing it.
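A small NumPy/SciPy check of this property (scipy.signal.correlate2d computes the cross-correlation that CNN layers actually use; the bar image and kernel are illustrative):

import numpy as np
from scipy.signal import correlate2d

image = np.zeros((8, 8))
image[2:5, 3] = 1.0                  # a short vertical bar
kernel = np.array([[0., 1., 0.],
                   [0., 1., 0.],
                   [0., 1., 0.]])

# convolve, then shift the result down by one row
a = np.roll(correlate2d(image, kernel, mode='valid'), 1, axis=0)
# shift the input down by one row, then convolve
b = correlate2d(np.roll(image, 1, axis=0), kernel, mode='valid')

print(np.allclose(a, b))             # True: translation equivariance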
Padding
• Padding is the process of adding one or more rings of zero-valued pixels around the boundaries of an image, in order to increase its effective size.
• Zero padding helps to make the output dimensions independent of the kernel size.
• The convolution operation reduces the size of the (q+1)th layer in comparison with the size of the qth layer. This reduction in size is undesirable because it tends to lose some information along the borders of the image.
Padding
• Three common zero-padding strategies are:
a) Valid convolution: no padding is added to the input feature map, and the output feature map is smaller than the input feature map. This is useful when we want to reduce the spatial dimensions of the feature maps.
b) Same padding: padding is added to the input feature map so that the size of the output feature map is the same as that of the input feature map. This is useful when we want to preserve the spatial dimensions of the feature maps.
c) Full padding: the other extreme, where enough zeros are added for every pixel to be visited k times in each dimension, producing an output feature map larger than the input.
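A short Keras check of the 'valid' vs. 'same' strategies on an 8*8 input with a 3*3 kernel (illustrative sizes; Keras does not offer full padding directly):

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# 'valid': no padding, 8*8 -> 6*6
model.add(Conv2D(1, (3, 3), padding='valid', input_shape=(8, 8, 1)))
# 'same': zero padding preserves the size, 6*6 -> 6*6
model.add(Conv2D(1, (3, 3), padding='same'))
model.summary()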
Padding
• The number of pixels to add for padding can be calculated from the kernel size and the desired output feature map size, as shown below. The most common padding value is zero-padding, which involves adding zeros to the borders of the input feature map.
• Padding can help reduce the loss of information at the borders of the input feature map and can improve the performance of the model. However, it also increases the computational cost of the convolution operation.
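The standard relationship between these quantities, for input size n, kernel size k, padding p and stride s (a general formula assumed from the common CNN literature):

o = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1

For same padding with stride 1 and odd k, this gives p = (k - 1)/2. Example: n = 8, k = 3, p = 0, s = 1 gives o = 6, matching the valid convolution above.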
Stride
• A convolution layer applies several different kernels to the input. This allows the extraction of several different features at all locations in the input; in each layer, a single kernel is not applied, but multiple kernels are used as different feature detectors.
• The stride indicates the pace at which the filter moves horizontally and vertically over the pixels of the input image during convolution.
• Stride is a parameter of the neural network's filter that modifies the amount of movement over the image or video, and it is one way of compressing image and video data.
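For example, using the size formula above, a stride of 2 on an 8*8 input with a 3*3 kernel and no padding gives floor((8 - 3)/2) + 1 = 3, which this Keras snippet (illustrative sizes) confirms:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# stride 2: the filter jumps two pixels at a time, 8*8 -> 3*3
model.add(Conv2D(1, (3, 3), strides=(2, 2), input_shape=(8, 8, 1)))
model.summary()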
ReLU layer
• The ReLU layer removes every negative value from the filtered image and replaces it with zero. The function only activates a node when its input is above zero; when the input is below zero, the output is zero.
• At its core, the ReLU function applies a very straightforward rule: if the input is greater than zero, it leaves it unchanged; otherwise, it sets it to zero.
• The rectifier function is used as the activation function in a CNN to increase the nonlinearity of the network. By removing negative values from the neurons' input signals, the rectifier function effectively replaces the black (negative-valued) pixels in the image with gray (zero-valued) pixels.
# Defining the ReLU function in Python
def relu(x):
    return max(0, x)
Pooling
• The pooling layer reduces the height and width of the input.
• It helps to reduce:
- computation
- the number of parameters
- overfitting
• No learning takes place in this layer.
• A pooling operation is added after the convolution layer and may be repeated one or more times in a given model.
• A pooling layer typically reduces the size of each feature map by a factor of 2.
• A pooling layer applied to a feature map of 6*6 (36 pixels) results in an output pooled feature map of 3*3 (9 pixels).
Pooling
• Pooling is also called subsampling; it is used to reduce the dimensionality of the feature maps produced by the convolution operation.
• Max pooling, average pooling and global pooling are common pooling operations.
Pooling
• Max pooling: works by selecting the maximum value from every pool. Max pooling retains the most prominent features of the feature map, and the returned image is sharper than the original image.
• Average pooling: works by taking the average of the pool. Average pooling retains the average values of the features in the feature map. It smooths the image while keeping the essence of the features.
• Global pooling: reduces each channel in the feature map to a single value.
from numpy import asarray
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import AveragePooling2D
# define input data
data =
[[0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0]]
data = asarray(data)
data = data.reshape(1, 8, 8, 1)
# create model
model = Sequential()
model.add(Conv2D(1, (3,3), activation='relu', input_shape=(8, 8, 1)))
model.add(AveragePooling2D())
# summarize model
model.summary()
# define a vertical line detector
detector = [[[[0]],[[1]],[[0]]],
            [[[0]],[[1]],[[0]]],
            [[[0]],[[1]],[[0]]]]
weights = [asarray(detector), asarray([0.0])]
# store the weights in the model
model.set_weights(weights)
# apply filter to input data
yhat = model.predict(data)
# enumerate rows
for r in range(yhat.shape[1]):
    # print each column in the row
    print([yhat[0,r,c,0] for c in range(yhat.shape[2])])

Output of model.summary():

Layer (type)                  Output Shape        Param #
=========================================================
conv2d_1 (Conv2D)             (None, 6, 6, 1)     10
average_pooling2d_1 (Average  (None, 3, 3, 1)     0
=========================================================
Total params: 10
Trainable params: 10
Non-trainable params: 0

Output of the pooled feature map:

[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
Convolution Variants
Convolution Variants: Transposed
• Also called deconvolutions (a misnomer) or fractionally strided convolutions; they are widely used in deep learning for upsampling.
• A transposed convolution increases the size of the output feature map. The idea is to regain the original spatial resolution while performing a convolution.
• It carries out a regular convolution but reverts its spatial transformation.
• The figure below shows how a transposed convolution with a 2*2 kernel is computed for a 2*2 input tensor.
Convolution Variants: Transposed
Transposed convolution with a 2*2 kernel
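A minimal Keras sketch of this enlargement (the 2*2 kernel and 2*2 input come from the figure; stride 1 is an assumption, giving a 3*3 output):

from keras.models import Sequential
from keras.layers import Conv2DTranspose

model = Sequential()
# transposed convolution: a 2*2 input grows to a 3*3 output with a 2*2 kernel
model.add(Conv2DTranspose(1, (2, 2), input_shape=(2, 2, 1)))
model.summary()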
Convolution Variants: Dilated
• Expands the window size without increasing the number of weights, by inserting zero values into the convolution kernels.
• Can be used in real-time applications where processing power and RAM are limited.
• Also called atrous convolutions. The dilation parameter decides the spacing between the filter weights while performing the convolution.
Convolution Variants: Dilated (dilation factor d = 2)
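A short Keras sketch of dilation (illustrative sizes): with dilation rate 2, a 3*3 kernel covers a 5*5 window while still holding only 9 weights.

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# dilation_rate=2 spreads the 3*3 taps over a 5*5 window: 8*8 -> 4*4
model.add(Conv2D(1, (3, 3), dilation_rate=(2, 2), input_shape=(8, 8, 1)))
model.summary()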
Fully connected layer
• The neuron applies a linear transformation to the input vector through a weight matrix. A non-linear transformation is then applied to the product through a non-linear activation function f:
y = f(Wx + b)
where W is the weight matrix, x the input vector and b the bias.
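A minimal NumPy sketch of one fully connected layer (ReLU is chosen here as the example activation f; the sizes are illustrative):

import numpy as np

def dense(x, W, b):
    # linear transformation Wx + b followed by the non-linear activation f (ReLU)
    return np.maximum(0.0, W @ x + b)

W = np.random.randn(4, 3)    # 4 output units, 3 input features
b = np.zeros(4)
x = np.array([1.0, -2.0, 0.5])
print(dense(x, W, b))        # a length-4 output vector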
Interleaving between layers
• The convolution, pooling and ReLU layers are typically interleaved in a neural network in order to increase the expressive power of the network.
• The ReLU layers often follow the convolutional layers.
• After two or three sets of convolution-ReLU combinations, one might have a max-pooling layer.
• Ex. CRCRP, CRCRCRP
• Here 'C' is a convolutional layer, 'R' is a ReLU layer and the max-pooling layer is denoted by 'P'.
• Ex. CRCRPCRCRPCRCRPF, where 'F' denotes a fully connected layer (a Keras sketch of the CRCRP pattern follows).
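A sketch of the CRCRP pattern in Keras (the filter counts and the 32*32*3 input are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))  # C, R
model.add(Conv2D(32, (3, 3), activation='relu'))                           # C, R
model.add(MaxPooling2D((2, 2)))                                            # P
model.summary()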
LeNet-5 – Digit Classification
• LeNet-5 is a convolutional neural network that was created by LeCun in 1998. It includes 7 layers, excluding the input layer.
• It consists of two parts:
i) a convolutional encoder consisting of convolutional and subsampling layers
ii) a dense block consisting of three fully connected layers
LeNet-5 – Digit Classification
• Layer C1 is a convolutional layer with 6 feature maps, each of size 28*28.
• Layer S2 is a subsampling layer with 6 feature maps of size 14*14.
• Layer C3 is a convolutional layer with 16 feature maps of size 10*10.
• Layer S4 is a subsampling layer with 16 feature maps of size 5*5.
• Layer C5 is a convolutional layer with 120 feature maps of size 1*1.
• Layer F6 contains 84 units and is fully connected to C5.
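A Keras sketch of this architecture (tanh activations and average pooling approximate the original paper; the 32*32*1 input is the padded MNIST digit, and the softmax output is a modern substitute for the original RBF output layer):

from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(6, (5, 5), activation='tanh', input_shape=(32, 32, 1)))  # C1: 6 @ 28*28
model.add(AveragePooling2D((2, 2)))                                       # S2: 6 @ 14*14
model.add(Conv2D(16, (5, 5), activation='tanh'))                          # C3: 16 @ 10*10
model.add(AveragePooling2D((2, 2)))                                       # S4: 16 @ 5*5
model.add(Conv2D(120, (5, 5), activation='tanh'))                         # C5: 120 @ 1*1
model.add(Flatten())
model.add(Dense(84, activation='tanh'))                                   # F6
model.add(Dense(10, activation='softmax'))                                # output
model.summary()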
CNN learning – Non-linearity functions
• The weight layers in a CNN are often followed by nonlinear activation functions.
• Common activation functions are Sigmoid, Tanh, Algebraic Sigmoid, ReLU, Leaky ReLU and Exponential Linear Units.
Loss Function
• A loss function computes the difference between the estimated output of the model (the prediction) and the correct output.
• All algorithms in ML rely on minimizing or maximizing a function, which we call the "objective function".
• The group of functions that are minimized are called loss functions.
• A loss function is used to calculate the difference between the predicted output and the actual output.
• Loss functions can be classified into two groups: one for classification (discrete values 0, 1, 2, ...) and the other for regression (continuous values).
Loss Function for Regression
• Loss functions for regression involve predicting a specific value that is continuous in nature.
• Estimating the price of a house or predicting stock prices are examples.
• Mean Square Error (MSE): a commonly used regression loss function. It is the mean of the squared distances between the target values and the predicted values; see the formula below.
• Advantage: for small errors, MSE helps the model converge to the minimum efficiently.
• Drawbacks: a) squaring the values magnifies large errors; b) MSE is sensitive to outliers.
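The formula, for n samples with targets y_i and predictions \hat{y}_i:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2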
Loss Function for Classification
• Loss functions for classification involve predicting a discrete class output. This means dividing the dataset into different, unique classes based on different parameters, so that a new and unseen record can be put into one of these classes.
1. Hinge loss
• A loss function used notably by SVMs. It helps the SVM make a decision boundary with a certain margin distance.
• It is used for binary classification; the formula is given below.
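For a true label t in {-1, +1} and a raw model score y, the hinge loss is:

L(y, t) = \max(0,\ 1 - t \cdot y)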
Loss Function for Classification
2. Squared hinge loss
• It simply calculates the square of the hinge loss value.
• It smooths the surface of the error function, making it easier to optimize numerically.
• A typical application is classifying email into spam and not spam.
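Using the same notation as for the hinge loss:

L(y, t) = \left( \max(0,\ 1 - t \cdot y) \right)^2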