Unit 3 DL
Kernel / Convolution
1D (continuous, discrete):

$$(f * g)(x) = \int_{\tau=-\infty}^{\infty} f(\tau)\, g(x-\tau)\, d\tau = \sum_{\tau=0}^{N-1} f(\tau)\, g(x-\tau)$$

Here g is the kernel; the output is sometimes called a feature map.

2D (continuous, discrete):

$$(f * g)(x, y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(\tau_1, \tau_2)\, g(x-\tau_1, y-\tau_2)\, d\tau_1\, d\tau_2 = \sum_{\tau_1=0}^{N-1}\sum_{\tau_2=0}^{N-1} f(\tau_1, \tau_2)\, g(x-\tau_1, y-\tau_2)$$
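A minimal NumPy sketch of the discrete 2D convolution above (valid region only; the function name and test shapes are illustrative, not from the slides):

```python
import numpy as np

def conv2d(f, g):
    """Discrete 2D convolution of image f with kernel g (valid region only)."""
    kh, kw = g.shape
    # Flip the kernel so this matches the convolution formula g(x - tau);
    # cross-correlation would skip the flip.
    g_flipped = g[::-1, ::-1]
    out_h = f.shape[0] - kh + 1
    out_w = f.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(f[i:i+kh, j:j+kw] * g_flipped)
    return out

image = np.random.rand(7, 7)
kernel = np.random.rand(3, 3)
feature_map = conv2d(image, kernel)   # shape (5, 5)
```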
Convolution Properties
• Commutative:
f*g = g*f
• Associative:
(f*g)*h = f*(g*h)
• Homogeneous:
f*(λg) = λ(f*g)
• Additive (Distributive):
f*(g+h) = f*g + f*h
• Shift-Invariant:
f(x,y) * g(x-x0, y-y0) = (f*g)(x-x0, y-y0)
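These properties are easy to check numerically in 1D with np.convolve (a quick sanity check added here, not part of the original slides):

```python
import numpy as np

f = np.random.rand(5)
g = np.random.rand(5)
h = np.random.rand(5)

# Commutative: f*g = g*f
assert np.allclose(np.convolve(f, g), np.convolve(g, f))
# Associative: (f*g)*h = f*(g*h)
assert np.allclose(np.convolve(np.convolve(f, g), h),
                   np.convolve(f, np.convolve(g, h)))
# Homogeneous: f*(lam*g) = lam*(f*g)
lam = 2.5
assert np.allclose(np.convolve(f, lam * g), lam * np.convolve(f, g))
# Distributive: f*(g+h) = f*g + f*h
assert np.allclose(np.convolve(f, g + h),
                   np.convolve(f, g) + np.convolve(f, h))
```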
ConvNet
• ConvNet architectures for images:
• ConvNets are specifically designed for processing grid-like data such as images. They take
advantage of the spatial structure and correlation present in images, making them well-
suited for tasks like image classification, object detection, and segmentation.
• Convolutional layers use parameter sharing, which significantly reduces the number of
parameters compared to fully-connected layers. This parameter sharing leverages the
assumption that local patterns and features are useful across different spatial locations.
• ConvNets have been highly successful in computer vision tasks due to their ability to
automatically learn hierarchical features from images while efficiently handling the spatial
structure of the data. This makes them a popular choice for tasks ranging from image
classification to more complex tasks like object detection and segmentation.
ConvNets
[Figure: an image and a translated copy of it; a ConvNet responds similarly to both]
• CONV will compute the output of neurons that are connected to local regions in the input. With 12 filters, the output volume is [32x32x12].
• POOL will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
• The FC layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. A sketch of this pipeline follows.
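A PyTorch sketch of this CONV -> POOL -> FC pipeline, chosen so the shapes match the volumes quoted above; the 3x3 kernel with padding 1 is an assumption, since the slide only fixes the number of filters:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 12, kernel_size=3, padding=1)  # 12 filters, preserves 32x32
pool = nn.MaxPool2d(kernel_size=2, stride=2)       # halves width and height
fc   = nn.Linear(12 * 16 * 16, 10)                 # 10 CIFAR-10 class scores

x = torch.randn(1, 3, 32, 32)        # one CIFAR-10-sized image
x = conv(x)                          # -> [1, 12, 32, 32]
x = pool(x)                          # -> [1, 12, 16, 16]
scores = fc(x.flatten(start_dim=1))  # -> [1, 10]
```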
Convolution Layer
• The Conv layer is the core building block of a CNN.
• Every filter is small spatially (width and height), but extends through the full depth of the input volume, e.g. 5x5x3.
• During the forward pass, we slide (convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at every position.
• This produces a 2-dimensional activation map that gives the responses of that filter at every spatial position.
• Intuitively, the network will learn filters that activate when they see some type of visual feature.
[Figure: a 32x32x3 input image, width 32, height 32, depth 3]
(Slide credit: Andrej Karpathy)
Convolutions: More detail
[Figure: a 5x5x3 filter positioned on the 32x32x3 image]
At each position the filter yields 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product, plus a bias).
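That single number can be computed directly; a toy NumPy illustration with a random patch and filter:

```python
import numpy as np

chunk = np.random.rand(5, 5, 3)   # a 5x5x3 patch of the 32x32x3 image
filt  = np.random.rand(5, 5, 3)   # the 5x5x3 filter
bias  = 0.1

value = np.sum(chunk * filt) + bias   # 75-dimensional dot product + bias
# equivalently: np.dot(chunk.ravel(), filt.ravel()) + bias
```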
Sliding the 5x5x3 filter over all spatial positions of the 32x32x3 image produces a 28x28x1 activation map.
Consider a second (green) 5x5x3 filter: it produces a second 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps, which stack into a 28x28x6 output volume.
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions. E.g., CONV + ReLU with 6 filters of size 5x5x3 maps the 32x32x3 input to a 28x28x6 volume.
Stacking further: a second CONV + ReLU with 10 filters of size 5x5x6 maps the 28x28x6 volume to 24x24x10, and so on.
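A quick PyTorch check of the shapes in this chain (stride 1, no padding, as in the slide):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
conv1 = nn.Conv2d(3, 6, kernel_size=5)    # six 5x5x3 filters
conv2 = nn.Conv2d(6, 10, kernel_size=5)   # ten 5x5x6 filters

h1 = torch.relu(conv1(x))   # -> [1, 6, 28, 28]
h2 = torch.relu(conv2(h1))  # -> [1, 10, 24, 24]
```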
[Figure: preview of the learned feature hierarchy, from recent Yann LeCun slides]
One filter => one activation map. [Figure: example learned 5x5 filters (32 total) and a resulting 28x28 activation map]
A closer look at spatial dimensions:
• 7x7 input (spatially), assume a 3x3 filter
• sliding it with stride 1 => 5x5 output
• the same 3x3 filter applied with stride 2 => 3x3 output
• applied with stride 3? It doesn't fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
• stride 1 => (7 - 3)/1 + 1 = 5
• stride 2 => (7 - 3)/2 + 1 = 3
• stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer, so stride 3 doesn't fit)
In practice: it is common to zero pad the border.
e.g. a 7x7 input, 3x3 filter applied with stride 1, zero-padded with a 1-pixel border => 7x7 output!
(recall: (N - F) / stride + 1, where the padded size is now N = 9, so (9 - 3)/1 + 1 = 7)
In general it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F - 1)/2, which preserves the spatial size:
• F = 3 => zero pad with 1
• F = 5 => zero pad with 2
• F = 7 => zero pad with 3
With padding, the output size is (N + 2*padding - F) / stride + 1.
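The formula is easy to wrap in a small helper (the function name is hypothetical):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution: (N + 2P - F) / S + 1."""
    size = (n + 2 * padding - f) / stride + 1
    if not size.is_integer():
        raise ValueError(f"filter does not fit: got {size}")
    return int(size)

conv_output_size(7, 3, stride=1)             # 5
conv_output_size(7, 3, stride=2)             # 3
conv_output_size(7, 3, stride=1, padding=1)  # 7 (size preserved)
# conv_output_size(7, 3, stride=3)           # raises ValueError (2.33 does not fit)
```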
Spatial arrangement
• Three hyperparameters control the size of the output volume:
– Depth: the number of filters, each learning to look for something different in the input.
– Stride: the step with which we slide the filter.
– Zero-padding: padding the input volume with zeros around the border.
• We compute the spatial size of the output volume as a function of:
– the input volume size (W)
– the receptive field size of the Conv Layer neurons (F)
– the stride with which they are applied (S)
– the amount of zero padding used (P) on the border.
• The number of neurons that "fit" is given by (W - F + 2P)/S + 1.
– For a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output.
– With stride 2 we would get a 3x3 output.
– See also: "Understanding and Calculating the Number of Parameters in Convolution Neural Networks (CNNs)" by Rakshith Vasudev, Towards Data Science.
Parameter Sharing
• Parameter sharing controls the number of parameters.
• If there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias, this adds up to 290,400 * 364 = 105,705,600 parameters in the first layer of the ConvNet alone.
• Parameter sharing reduces this dramatically.
• We now have only 96 unique sets of weights (one per depth slice), for a total of 96*11*11*3 = 34,848 unique weights; adding the 96 biases gives 34,944 parameters.
• During backpropagation, every neuron in the volume computes the gradient for its weights, but these gradients are added up across each depth slice, and only a single set of weights per slice is updated.
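The same arithmetic, spelled out in Python (these numbers match AlexNet's first conv layer: 11x11x3 filters, 96 depth slices, 55x55 spatial output):

```python
neurons           = 55 * 55 * 96      # 290,400 neurons in the output volume
params_per_neuron = 11 * 11 * 3 + 1   # 363 weights + 1 bias = 364
without_sharing   = neurons * params_per_neuron   # 105,705,600 parameters

shared_weights = 96 * 11 * 11 * 3     # 34,848 unique weights
with_sharing   = shared_weights + 96  # + 96 biases = 34,944 parameters
```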
Spatial Pooling
• Sum or max over non-overlapping / overlapping regions
• Role of pooling:
• Invariance to small transformations
• Larger receptive fields (neurons see more of input)
[Figure: max pooling and sum pooling over regions of a feature map]
(Slide credit: Andrej Karpathy)
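A minimal NumPy sketch of 2x2 non-overlapping max pooling (assumes even input dimensions; sum pooling would just swap max for sum):

```python
import numpy as np

def max_pool_2x2(x):
    """Max over non-overlapping 2x2 regions of a 2D feature map."""
    h, w = x.shape
    # Split into 2x2 blocks, then take the max within each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
max_pool_2x2(fmap)   # 2x2 output: [[5, 7], [13, 15]]
```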
Transfer Learning with CNNs
Source task: classification on ImageNet. Target: some other task/data.
Freeze the early layers (reuse the pretrained features) and train only the later layers on the target task.
(Slide credit: Andrej Karpathy)
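A sketch of this recipe in PyTorch, assuming a recent torchvision and ResNet-18 as the ImageNet-pretrained source model (any pretrained backbone works the same way; the 10 target classes are an assumption):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained on ImageNet

# Freeze these: keep all pretrained weights fixed.
for param in model.parameters():
    param.requires_grad = False

# Train this: replace the final layer with a new head for the target task.
num_target_classes = 10   # assumption: 10 classes in the target dataset
model.fc = nn.Linear(model.fc.in_features, num_target_classes)
# Only model.fc.parameters() are now trainable.
```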
Image Segmentation
Segmentation divides an image into its constituent regions or objects.
Image segmentation is a difficult problem in image processing and is still under active research.
Segmentation can operate in different feature spaces: histogram space, color space, texture space.
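As a toy illustration of segmentation in color space, threshold one channel to split an image into two regions (deliberately simplistic, just to make the idea concrete):

```python
import numpy as np

image = np.random.rand(64, 64, 3)   # stand-in RGB image with values in [0, 1]

# Segment in color space: pixels whose red channel exceeds a threshold
# form one region, the rest form the other.
mask = image[:, :, 0] > 0.5         # boolean region map, shape (64, 64)
regions = mask.astype(np.uint8)     # label image: 0 or 1 per pixel
```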