
Classifying Images with Deep Convolutional Neural Networks
UNIT-III
History
• In 1995, Yann LeCun and Yoshua Bengio introduced the concept of convolutional neural networks.
Convolution
1D (continuous, discrete), with input $f$ and kernel $g$; the output is sometimes called the feature map:

$(f * g)(x) = \int_{-\infty}^{\infty} f(\tau)\, g(x - \tau)\, d\tau = \sum_{\tau=0}^{N-1} f(\tau)\, g(x - \tau)$

2D (continuous, discrete):

$(f * g)(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(\tau, \kappa)\, g(x - \tau, y - \kappa)\, d\tau\, d\kappa = \sum_{\tau=0}^{N-1} \sum_{\kappa=0}^{N-1} f(\tau, \kappa)\, g(x - \tau, y - \kappa)$
Convolution Properties
• Commutative: f * g = g * f
• Associative: (f * g) * h = f * (g * h)
• Homogeneous: f * (λg) = λ (f * g)
• Additive (Distributive): f * (g + h) = f * g + f * h
• Shift-Invariant: f * g(x − x0, y − y0) = (f * g)(x − x0, y − y0)
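A minimal numeric sketch of the discrete 1D convolution above, plus a check of the commutative and additive properties, using NumPy (my choice; the slides do not prescribe a library — `np.convolve` computes the same sum):

```python
import numpy as np

def conv1d(f, g):
    """Discrete 1D convolution: (f*g)[x] = sum_t f[t] * g[x - t]."""
    out = np.zeros(len(f) + len(g) - 1)
    for x in range(len(out)):
        for t in range(len(f)):
            if 0 <= x - t < len(g):
                out[x] += f[t] * g[x - t]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
h = np.array([2.0, -1.0, 4.0])

print(np.allclose(conv1d(f, g), np.convolve(f, g)))        # True: matches NumPy's built-in
print(np.allclose(np.convolve(f, g), np.convolve(g, f)))   # commutative: f*g == g*f
print(np.allclose(np.convolve(f, g + h),
                  np.convolve(f, g) + np.convolve(f, h)))  # additive: f*(g+h) == f*g + f*h
```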
ConvNet
• ConvNet architectures for images:
  – A fully-connected structure does not scale to large images.
  – The explicit assumption that the inputs are images allows us to encode certain properties into the architecture.
  – These make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
• 3D volumes: neurons are arranged in 3 dimensions: width, height, depth.
• Fully-connected layers in traditional neural networks can be inefficient for processing
large images. Each neuron in a fully-connected layer is connected to every neuron in the
previous layer, resulting in a large number of parameters and computational complexity.

• ConvNets are specifically designed for processing grid-like data such as images. They take
advantage of the spatial structure and correlation present in images, making them well-
suited for tasks like image classification, object detection, and segmentation.

• ConvNets are designed to capture hierarchical features in an image by using convolutional layers. These layers learn local patterns and progressively combine them to form complex, high-level features. This hierarchical feature extraction is crucial for recognizing patterns in images.
• Convolutional layers in ConvNets are designed to exploit the local connectivity and shared
weights, making the forward pass computationally more efficient compared to fully-
connected layers. This design choice helps reduce the overall computational cost.

• Convolutional layers use parameter sharing, which significantly reduces the number of
parameters compared to fully-connected layers. This parameter sharing leverages the
assumption that local patterns and features are useful across different spatial locations.

• ConvNets have been highly successful in computer vision tasks due to their ability to
automatically learn hierarchical features from images while efficiently handling the spatial
structure of the data. This makes them a popular choice for tasks ranging from image
classification to more complex tasks like object detection and segmentation.
Convnets

Layers used to build ConvNets:
• A stacked sequence of layers; 3 main types: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.
• Every layer of a ConvNet transforms one volume of activations to another through a differentiable function.
The replicated feature approach
• Use many different copies of the same feature detector with different positions. (In the figure, the red connections all have the same weight.)
  – Could also replicate across scale and orientation (tricky and expensive).
  – Replication greatly reduces the number of free parameters to be learned.
• Use several different feature types, each with its own map of replicated detectors.
  – Allows each patch of image to be represented in several ways.
Backpropagation with weight constraints

• It is easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.
• We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints.
  – So if the weights started off satisfying the constraints, they will continue to satisfy them.

To constrain $w_1 = w_2$ we need $\Delta w_1 = \Delta w_2$.
Compute $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$, then use $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$ as the gradient for both $w_1$ and $w_2$.
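A toy illustration of this constraint trick (plain Python, with made-up gradient values): compute both gradients as usual, then apply their sum to each tied weight so the two stay equal.

```python
w1 = w2 = 0.5                  # weights start off satisfying the constraint w1 == w2
dE_dw1, dE_dw2 = 0.3, -0.1     # gradients computed as usual (hypothetical values)

tied_grad = dE_dw1 + dE_dw2    # use dE/dw1 + dE/dw2 for both w1 and w2
lr = 0.1
w1 -= lr * tied_grad
w2 -= lr * tied_grad
print(w1, w2)                  # 0.48 0.48 -- the constraint is preserved
```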
What does replicating the feature detectors achieve?
• Equivariant activities: Replicated features do not make the
neural activities invariant to translation. The activities are
equivariant.
(Figure: a translated image produces a correspondingly translated pattern of active neurons.)

• Invariant knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.
Pooling the outputs of replicated feature
detectors
• Get a small amount of translational invariance at each level by averaging four
neighboring replicated detectors to give a single output to the next level.
– This reduces the number of inputs to the next layer of feature extraction, thus allowing
us to have many more different feature maps.
– Taking the maximum of the four works slightly better.

• Problem: After several levels of pooling, we have lost information about the precise positions of things.
  – This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
Example Architecture for CIFAR-10
• [INPUT - CONV - RELU - POOL - FC]

• INPUT [32x32x3] : the raw pixel values of the image

• CONV will compute the output of neurons that are connected to local regions in the input. With
12 filters, the output volume is [32x32x12]

• RELU : apply an elementwise activation function, such as the max(0,x)

• POOL will perform a downsampling operation along the spatial dimensions (width, height),
resulting in volume such as [16x16x12].
• FC layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10
numbers correspond to a class score, such as among the 10 categories of CIFAR-10
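A minimal PyTorch sketch of the [INPUT - CONV - RELU - POOL - FC] pipeline above (PyTorch is my choice; the slides name no framework, nor a filter size for the CONV layer, so 3x3 filters with padding 1 are an assumption). Note PyTorch uses channels-first tensors, so [32x32x3] becomes (3, 32, 32).

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),  # CONV: 12 filters -> (12, 32, 32)
    nn.ReLU(),                                   # RELU: elementwise max(0, x)
    nn.MaxPool2d(kernel_size=2, stride=2),       # POOL: downsample -> (12, 16, 16)
    nn.Flatten(),
    nn.Linear(12 * 16 * 16, 10),                 # FC: 10 CIFAR-10 class scores
)

x = torch.randn(1, 3, 32, 32)   # a batch with one fake 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 10])
```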
Convolution Layer
• The Conv layer is the core building block of a CNN

• The parameters consist of a set of learnable filters.

• Every filter is small spatially (width and height), but extends through the full depth of the input volume, e.g., 5x5x3.

• During the forward pass, we slide (convolve) each filter across the width and height of the input volume
and compute dot products between the entries of the filter and the input at any position.

• produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.
• Intuitively, the network will learn filters that activate when they see some type of visual feature

• A set of filters in each CONV layer


– each of them will produce a separate 2-dimensional activation map
– We will stack these activation maps along the depth dimension and produce the output volume.
Convolutions: More detail

A 32x32x3 image: width 32, height 32, depth 3.

Andrej Karpathy
Convolutions: More detail
32x32x3 image, 5x5x3 filter.
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.

Andrej Karpathy
Convolutions: More detail
Convolution Layer
32x32x3 image, 5x5x3 filter.
Each filter position yields 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).

Andrej Karpathy
Convolutions: More detail
Convolution Layer
32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations to produce a 28x28x1 activation map.

Andrej Karpathy
Convolutions: More detail
Convolution Layer
Now consider a second, green filter: convolving the 32x32x3 image with a second 5x5x3 filter over all spatial locations produces a second 28x28x1 activation map.

Andrej Karpathy
Convolutions: More detail
For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps.

We stack these up to get a “new image” of size 28x28x6!

Andrej Karpathy
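The 28x28x6 “new image” can be confirmed with a single conv layer; a quick check in PyTorch (framework choice is mine):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # six 5x5x3 filters, stride 1, no padding
x = torch.randn(1, 3, 32, 32)    # one 32x32x3 image (channels first)
print(conv(x).shape)             # torch.Size([1, 6, 28, 28]) -- i.e. 28x28x6
```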
Convolutions: More detail
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions.
32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6.

Andrej Karpathy
Convolutions: More detail
Preview: a ConvNet is a sequence of Convolutional Layers, interspersed with activation functions.
32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → ….

Andrej Karpathy
Convolutions: More detail
(Preview figure, from recent Yann LeCun slides.)

Andrej Karpathy
Convolutions: More detail
One filter => one activation map. (Example: 5x5 filters, 32 total.)

We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image).

Adapted from Andrej Karpathy, Kristen Grauman


Convolutions: More detail
A closer look at spatial dimensions:
32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations → 28x28x1 activation map.

Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 1 => 5x5 output.

Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 2 => 3x3 output!

Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 3? It doesn’t fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.

Andrej Karpathy
Convolutions: More detail
For an NxN input and an FxF filter, the output size is:
(N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\

Andrej Karpathy
Convolutions: More detail
In practice: it is common to zero-pad the border.
e.g. 7x7 input, 3x3 filter applied with stride 1, padded with a 1-pixel border => 7x7 output!
(Recall: output size = (N - F) / stride + 1.)

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the spatial size:
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

With padding, the output size is (N + 2*padding - F) / stride + 1.
Andrej Karpathy
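The general formula (N + 2*padding - F)/stride + 1 is easy to wrap in a small helper; a minimal sketch (the function name is my own):

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv layer: (N + 2*P - F) / S + 1."""
    size = (n + 2 * pad - f) / stride + 1
    if not size.is_integer():
        raise ValueError(f"{f}x{f} filter with stride {stride} does not fit a {n}x{n} input")
    return int(size)

print(conv_output_size(7, 3, stride=1))        # 5
print(conv_output_size(7, 3, stride=2))        # 3
print(conv_output_size(7, 3, stride=1, pad=1)) # 7  (padding preserves the size)
# conv_output_size(7, 3, stride=3) raises: the filter does not fit
```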
Convolutions: More detail
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10.

Andrej Karpathy
Convolutions: More detail
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76*10 = 760.

Andrej Karpathy
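The 760-parameter count can be read straight off a framework layer; a quick check in PyTorch (an assumption, as before):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)     # 760 = 10 filters * (5*5*3 weights + 1 bias)
```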
Spatial arrangement
• Three hyperparameters control the size of the output volume:
  – Depth: the number of filters, each learning to look for something different in the input.
  – Stride: the stride with which we slide the filter.
  – Zero-padding: padding the input volume with zeros around the border.
Spatial arrangement
• We compute the spatial size of the output volume as a function of
  – the input volume size (W)
  – the receptive field size of the Conv Layer neurons (F)
  – the stride with which they are applied (S)
  – the amount of zero padding used (P) on the border.
• The number of neurons that “fit” is given by (W − F + 2P)/S + 1.
  – For a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output.
  – With stride 2 we would get a 3x3 output.
  – See also: Understanding and Calculating the number of Parameters in Convolution Neural Networks (CNNs), by Rakshith Vasudev, Towards Data Science.
Parameter Sharing
• Parameter sharing controls the number of parameters.
• If there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias, this adds up to 290,400 * 364 = 105,705,600 parameters in the first layer of the ConvNet alone.
• We reduce this by parameter sharing.
• We now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (including the 96 biases).
• During backpropagation, every neuron in the volume computes the gradient for its weights, but these gradients are added up across each depth slice and only update a single set of weights per slice.
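The numbers above correspond to an AlexNet-style first layer (96 filters of size 11x11x3). A short PyTorch check of the shared count against the unshared one (the stride value is an assumption):

```python
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
shared = sum(p.numel() for p in conv1.parameters())
print(shared)                            # 34944 = 96*11*11*3 weights + 96 biases

# Without sharing, each of the 55*55*96 neurons would carry its own 363 weights + 1 bias:
print(55 * 55 * 96 * (11 * 11 * 3 + 1))  # 105705600
```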
Spatial Pooling
• Sum or max over non-overlapping / overlapping regions
• Role of pooling:
  • Invariance to small transformations
  • Larger receptive fields (neurons see more of the input)

Adapted from Rob Fergus; figure from Andrej Karpathy

Pooling Layer
• Insertion of a pooling layer:
  – reduces the spatial size of the representation,
  – reduces the number of parameters and computation in the network, and hence also controls overfitting.
• The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
• The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2 -- this downsamples every depth slice in the input by 2 along both width and height.
• The MAX operation takes a max over 4 numbers (a little 2x2 region in some depth slice).
• The depth dimension remains unchanged.
General pooling layer
• Accepts a volume of size W1×H1×D1
• Requires two hyperparameters:
– their spatial extent F
– the stride S
• Produces a volume of size W2×H2×D2 where:
– W2=(W1−F)/S+1
– H2=(H1−F)/S+1
– D2=D1
• Introduces zero parameters
• Other pooling functions: Average pooling, L2-
norm pooling
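The pooling size formulas can be checked with a standard 2x2, stride-2 max pool; a sketch in PyTorch (again my framework choice):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # F=2, S=2
x = torch.randn(1, 12, 16, 16)                 # W1 = H1 = 16, D1 = 12
print(pool(x).shape)   # torch.Size([1, 12, 8, 8]): W2 = (16-2)/2 + 1 = 8, depth unchanged
```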
General pooling

• Backpropagation: the backward pass for a max(x, y) operation routes the gradient to the input that had the highest value in the forward pass.
• Hence, during the forward pass of a pooling layer you may keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
Fully-connected layer
• Neurons in a fully connected layer have full connections to all
activations in the previous layer
• Their activations can hence be computed with a matrix
multiplication followed by a bias offset.
• Converting FC layers to CONV layers
• the only difference between FC and CONV layers is that the
neurons in the CONV layer are connected only to a local
region in the input, and that many of the neurons in a CONV
volume share parameters.
• However, the neurons in both layers still compute dot
products, so their functional form is identical.
Converting FC layers to CONV layers
• For any CONV layer there is an FC layer that implements the same forward
function.
• The weight matrix would be a large matrix that is mostly zero except for at
certain blocks (due to local connectivity) where the weights in many of the
blocks are equal (due to parameter sharing).
• Conversely, any FC layer can be converted to a CONV layer.
• For example, an FC layer with K=4096 that is looking at some input volume
of size 7×7×512
• can be equivalently expressed as a CONV layer with F=7,P=0,S=1,K=4096.
• In other words, we are setting the filter size to be exactly the size of the
input volume, and hence the output will simply be 1×1×4096 since only a
single depth column “fits” across the input volume, giving identical result
as the initial FC layer.
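A sketch of this FC→CONV conversion, copying the FC weights into an equivalent 7x7 conv layer (PyTorch, with random weights; the reshape layout is an assumption that matches PyTorch's channels-first flattening):

```python
import torch
import torch.nn as nn

fc = nn.Linear(7 * 7 * 512, 4096)            # FC layer looking at a 7x7x512 input volume
conv = nn.Conv2d(512, 4096, kernel_size=7)   # F=7, P=0, S=1, K=4096

# Reshape the FC weight matrix into conv filters so both layers compute the same function.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))                            # shape (1, 4096)
out_conv = conv(x).flatten(1)                        # the 1x1x4096 output volume, flattened
print(torch.allclose(out_fc, out_conv, atol=1e-4))   # True (up to floating-point error)
```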
ConvNet Architectures
Layer Patterns
• The most common architecture
• stacks a few CONV-RELU layers,
• follows them with POOL layers,
• and repeats this pattern until the image has been merged spatially
to a small size.
• At some point, it is common to transition to fully-connected layers.
The last fully-connected layer holds the output, such as the class
scores. In other words, the most common ConvNet architecture
follows the pattern:
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
• N >= 0 (and usually N <= 3), M >= 0, K >= 0
Prefer a stack of small-filter CONV layers to one large-receptive-field CONV layer.
Compare three layers of 3x3 CONV vs a single CONV layer with 7x7 receptive fields.
• The receptive field size is identical in spatial extent (7x7), but the single 7x7 layer has several disadvantages:
  1. Its neurons compute a linear function over the input, while the stack of three CONV layers contains non-linearities that make their features more expressive.
  2. If we suppose that all the volumes have C channels, the single 7x7 CONV layer contains C×(7×7×C) = 49C² parameters, while the three 3x3 CONV layers contain 3×(C×(3×3×C)) = 27C² parameters.
• Intuitively, stacking CONV layers with tiny filters, as opposed to having one CONV layer with big filters, allows us to express more powerful features of the input with fewer parameters.
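The 49C² vs 27C² comparison for a concrete channel count (C = 64 is my example, not from the slides):

```python
C = 64  # example channel count

params_7x7 = C * (7 * 7 * C)        # one 7x7 CONV layer: 49*C^2 (biases ignored)
params_3x3 = 3 * (C * (3 * 3 * C))  # three stacked 3x3 CONV layers: 27*C^2

print(params_7x7, params_3x3)       # 200704 110592 -- same 7x7 receptive field, ~45% fewer parameters
```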
Practical matters
Data Augmentation (Jittering)
• Create virtual training samples
  – Horizontal flip
  – Random crop
  – Color casting
  – Geometric distortion

Deep Image [Wu et al. 2015]

Jia-bin Huang
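The jittering operations listed above map onto standard torchvision transforms (torchvision and the specific parameter values are my choices; the slides name no library):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # horizontal flip
    transforms.RandomCrop(32, padding=4),                   # random crop
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4),                 # color casting
    transforms.RandomAffine(degrees=10, shear=5),           # geometric distortion
    transforms.ToTensor(),
])
# Pass train_transform to the training dataset, e.g. CIFAR10(root, train=True, transform=train_transform)
```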
Transfer Learning

“You need a lot of data if you want to train/use CNNs”

Andrej Karpathy
Transfer Learning with CNNs
Source: classification on ImageNet. Target: some other task/data.

1. Train on ImageNet.
2. Small dataset: freeze the pretrained layers and train only the final layer(s) on the target task.
3. Medium dataset: finetuning — more data = retrain more of the network (or all of it).

Another option: use the network as a feature extractor and train an SVM on the extracted features for the target task.

Adapted from Andrej Karpathy


Transfer Learning with CNNs
Earlier layers are more generic, later layers more specific. Given the target dataset:

• Very little data, very similar dataset: use a linear classifier on the top layer.
• Very little data, very different dataset: you’re in trouble… try a linear classifier on features from different stages.
• Quite a lot of data, very similar dataset: finetune a few layers.
• Quite a lot of data, very different dataset: finetune a larger number of layers.

Andrej Karpathy
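A sketch of the small-dataset case above (freeze the pretrained backbone, train only a new head), using a torchvision ResNet-18; the model choice, target class count, and layer split are my own assumptions, not from the slides:

```python
import torch.nn as nn
from torchvision import models

# 1. Start from a network pretrained on ImageNet (requires a recent torchvision).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2. Small dataset: freeze all pretrained layers...
for param in model.parameters():
    param.requires_grad = False

# ...and train only a new classifier head for the (assumed) 10 target classes.
model.fc = nn.Linear(model.fc.in_features, 10)

# 3. Medium dataset: instead of freezing everything, leave the last block(s) trainable
#    and finetune them together with the new head at a small learning rate.
```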
Image Segmentation
Segmentation divides an image into its constituent regions or objects.
Segmentation of images is a difficult task in image processing and is still under research.

Segmentation allows us to extract objects from images.

Segmentation is unsupervised learning.

Model-based object extraction, e.g., template matching, is supervised learning.
What it is useful for
After successfully segmenting an image, the contours of objects can be extracted using edge detection and/or border-following techniques.

The shape of objects can be described.

Based on shape, texture, and color, objects can be identified.

Image segmentation techniques are extensively used in similarity searches, e.g.:
http://elib.cs.berkeley.edu/photos/blobworld/
Segmentation Algorithms

Segmentation algorithms are based on one of two basic properties of color, gray values, or texture: discontinuity and similarity.

The first category partitions an image based on abrupt changes in intensity, such as edges in an image.

The second category partitions an image into regions that are similar according to a predefined criterion. The histogram thresholding approach falls under this category.
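A minimal example of the histogram-thresholding (similarity-based) category, using OpenCV's Otsu method followed by border following; OpenCV and the input filename are assumptions, and any thresholding routine would do:

```python
import cv2

img = cv2.imread("coins.png", cv2.IMREAD_GRAYSCALE)   # hypothetical grayscale input image

# Otsu picks the threshold automatically from the gray-level histogram.
t, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("chosen threshold:", t)

# Object contours (borders) can then be extracted from the binary mask.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print("objects found:", len(contours))
```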
❖Domain spaces
spatial domain (row-column (rc) space)

histogram spaces

color space

texture space

other complex feature space
