
Classifying Images with Deep Convolutional Neural Networks
UNIT-III
History
• In 1995, Yann LeCun and Yoshua Bengio introduced the concept of convolutional neural networks.
Convolution
1D (continuous, discrete), with input $f$ and kernel $g$; the output is sometimes called the feature map:

$(f * g)(x) = \int_{-\infty}^{\infty} f(\tau)\, g(x - \tau)\, d\tau = \sum_{\tau=0}^{N-1} f(\tau)\, g(x - \tau)$

2D (continuous, discrete):

$(f * g)(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(\tau, \kappa)\, g(x - \tau, y - \kappa)\, d\tau\, d\kappa = \sum_{\tau=0}^{N-1} \sum_{\kappa=0}^{N-1} f(\tau, \kappa)\, g(x - \tau, y - \kappa)$
Convolution Properties
• Commutative: f * g = g * f
• Associative: (f * g) * h = f * (g * h)
• Homogeneous: f * (λg) = λ (f * g)
• Additive (Distributive): f * (g + h) = f * g + f * h
• Shift-Invariant: f * g(x − x0, y − y0) = (f * g)(x − x0, y − y0)
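A minimal numeric sketch of the discrete 1D convolution above, plus a check of the commutative and additive properties, using NumPy (my choice; the slides do not prescribe a library — `np.convolve` computes the same sum):

```python
import numpy as np

def conv1d(f, g):
    """Discrete 1D convolution: (f*g)[x] = sum_t f[t] * g[x - t]."""
    out = np.zeros(len(f) + len(g) - 1)
    for x in range(len(out)):
        for t in range(len(f)):
            if 0 <= x - t < len(g):
                out[x] += f[t] * g[x - t]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
h = np.array([2.0, -1.0, 4.0])

print(np.allclose(conv1d(f, g), np.convolve(f, g)))        # True: matches NumPy's built-in
print(np.allclose(np.convolve(f, g), np.convolve(g, f)))   # commutative: f*g == g*f
print(np.allclose(np.convolve(f, g + h),
                  np.convolve(f, g) + np.convolve(f, h)))  # additive: f*(g+h) == f*g + f*h
```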
ConvNet
• ConvNet architectures for images:
  – A fully-connected structure does not scale to large images.
  – The explicit assumption that the inputs are images allows us to encode certain properties into the architecture.
  – These make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
• 3D volumes: neurons are arranged in 3 dimensions: width, height, depth.
• Fully-connected layers in traditional neural networks can be inefficient for processing
large images. Each neuron in a fully-connected layer is connected to every neuron in the
previous layer, resulting in a large number of parameters and computational complexity.

• ConvNets are specifically designed for processing grid-like data such as images. They take
advantage of the spatial structure and correlation present in images, making them well-
suited for tasks like image classification, object detection, and segmentation.

• ConvNets are designed to capture hierarchical features in an image by using convolutional layers. These layers learn local patterns and progressively combine them to form complex, high-level features. This hierarchical feature extraction is crucial for recognizing patterns in images.
• Convolutional layers in ConvNets are designed to exploit the local connectivity and shared
weights, making the forward pass computationally more efficient compared to fully-
connected layers. This design choice helps reduce the overall computational cost.

• Convolutional layers use parameter sharing, which significantly reduces the number of
parameters compared to fully-connected layers. This parameter sharing leverages the
assumption that local patterns and features are useful across different spatial locations.

• ConvNets have been highly successful in computer vision tasks due to their ability to
automatically learn hierarchical features from images while efficiently handling the spatial
structure of the data. This makes them a popular choice for tasks ranging from image
classification to more complex tasks like object detection and segmentation.
Convnets

Layers used to build ConvNets:
• A stacked sequence of layers; 3 main types: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.
• Every layer of a ConvNet transforms one volume of activations to another through a differentiable function.
The replicated feature approach
• Use many different copies of the same feature detector with different positions. (In the figure, the red connections all have the same weight.)
  – Could also replicate across scale and orientation (tricky and expensive).
  – Replication greatly reduces the number of free parameters to be learned.
• Use several different feature types, each with its own map of replicated detectors.
  – Allows each patch of image to be represented in several ways.
Backpropagation with weight constraints

• It is easy to modify the backpropagation algorithm to incorporate linear constraints between the weights.
• We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints.
  – So if the weights started off satisfying the constraints, they will continue to satisfy them.

To constrain $w_1 = w_2$ we need $\Delta w_1 = \Delta w_2$.
Compute $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$, then use $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$ as the gradient for both $w_1$ and $w_2$.
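A toy illustration of this constraint trick (plain Python, with made-up gradient values): compute both gradients as usual, then apply their sum to each tied weight so the two stay equal.

```python
w1 = w2 = 0.5                  # weights start off satisfying the constraint w1 == w2
dE_dw1, dE_dw2 = 0.3, -0.1     # gradients computed as usual (hypothetical values)

tied_grad = dE_dw1 + dE_dw2    # use dE/dw1 + dE/dw2 for both w1 and w2
lr = 0.1
w1 -= lr * tied_grad
w2 -= lr * tied_grad
print(w1, w2)                  # 0.48 0.48 -- the constraint is preserved
```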
What does replicating the feature detectors achieve?
• Equivariant activities: Replicated features do not make the
neural activities invariant to translation. The activities are
equivariant.
(Figure: a translated image produces a correspondingly translated pattern of active neurons.)

• Invariant knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.
Pooling the outputs of replicated feature
detectors
• Get a small amount of translational invariance at each level by averaging four
neighboring replicated detectors to give a single output to the next level.
– This reduces the number of inputs to the next layer of feature extraction, thus allowing
us to have many more different feature maps.
– Taking the maximum of the four works slightly better.

• Problem: After several levels of pooling, we have lost information about the precise positions of things.
  – This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
Example Architecture for CIFAR-10
• [INPUT - CONV - RELU - POOL - FC]

• INPUT [32x32x3] : the raw pixel values of the image

• CONV will compute the output of neurons that are connected to local regions in the input. With
12 filters, the output volume is [32x32x12]

• RELU : apply an elementwise activation function, such as the max(0,x)

• POOL will perform a downsampling operation along the spatial dimensions (width, height),
resulting in volume such as [16x16x12].
• FC layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10
numbers correspond to a class score, such as among the 10 categories of CIFAR-10
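A minimal PyTorch sketch of the [INPUT - CONV - RELU - POOL - FC] pipeline above (PyTorch is my choice; the slides name no framework, nor a filter size for the CONV layer, so 3x3 filters with padding 1 are an assumption). Note PyTorch uses channels-first tensors, so [32x32x3] becomes (3, 32, 32).

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),  # CONV: 12 filters -> (12, 32, 32)
    nn.ReLU(),                                   # RELU: elementwise max(0, x)
    nn.MaxPool2d(kernel_size=2, stride=2),       # POOL: downsample -> (12, 16, 16)
    nn.Flatten(),
    nn.Linear(12 * 16 * 16, 10),                 # FC: 10 CIFAR-10 class scores
)

x = torch.randn(1, 3, 32, 32)   # a batch with one fake 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 10])
```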
Convolution Layer
• The Conv layer is the core building block of a CNN

• The parameters consist of a set of learnable filters.

• Every filter is small spatially (width and height), but extends through the full depth of the input volume, e.g., 5x5x3.

• During the forward pass, we slide (convolve) each filter across the width and height of the input volume
and compute dot products between the entries of the filter and the input at any position.

• produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.
• Intuitively, the network will learn filters that activate when they see some type of visual feature

• A set of filters in each CONV layer


– each of them will produce a separate 2-dimensional activation map
– We will stack these activation maps along the depth dimension and produce the output volume.
Convolutions: More detail

A 32x32x3 image: width 32, height 32, depth 3.

Andrej Karpathy
Convolutions: More detail
32x32x3 image, 5x5x3 filter.
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.

Andrej Karpathy
Convolutions: More detail
Convolution Layer
32x32x3 image, 5x5x3 filter.
Each filter position yields 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).

Andrej Karpathy
Convolutions: More detail
Convolution Layer
32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations to produce a 28x28x1 activation map.

Andrej Karpathy
Convolutions: More detail
Convolution Layer
Now consider a second, green filter: convolving the 32x32x3 image with a second 5x5x3 filter over all spatial locations produces a second 28x28x1 activation map.

Andrej Karpathy
Convolutions: More detail
For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps.

We stack these up to get a “new image” of size 28x28x6!

Andrej Karpathy
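The 28x28x6 “new image” can be confirmed with a single conv layer; a quick check in PyTorch (framework choice is mine):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # six 5x5x3 filters, stride 1, no padding
x = torch.randn(1, 3, 32, 32)    # one 32x32x3 image (channels first)
print(conv(x).shape)             # torch.Size([1, 6, 28, 28]) -- i.e. 28x28x6
```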
Convolutions: More detail
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions.
32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6.

Andrej Karpathy
Convolutions: More detail
Preview: a ConvNet is a sequence of Convolutional Layers, interspersed with activation functions.
32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → ….

Andrej Karpathy
Convolutions: More detail
(Preview figure, from recent Yann LeCun slides.)

Andrej Karpathy
Convolutions: More detail
One filter => one activation map. (Example: 5x5 filters, 32 total.)

We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image).

Adapted from Andrej Karpathy, Kristen Grauman


Convolutions: More detail
A closer look at spatial dimensions:
32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations → 28x28x1 activation map.

Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 1 => 5x5 output.

Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 2 => 3x3 output!

Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 3? It doesn’t fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.

Andrej Karpathy
Convolutions: More detail
For an NxN input and an FxF filter, the output size is:
(N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\

Andrej Karpathy
Convolutions: More detail
In practice: it is common to zero-pad the border.
e.g. 7x7 input, 3x3 filter applied with stride 1, padded with a 1-pixel border => 7x7 output!
(Recall: output size = (N - F) / stride + 1.)

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the spatial size:
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

With padding, the output size is (N + 2*padding - F) / stride + 1.
Andrej Karpathy
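The general formula (N + 2*padding - F)/stride + 1 is easy to wrap in a small helper; a minimal sketch (the function name is my own):

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv layer: (N + 2*P - F) / S + 1."""
    size = (n + 2 * pad - f) / stride + 1
    if not size.is_integer():
        raise ValueError(f"{f}x{f} filter with stride {stride} does not fit a {n}x{n} input")
    return int(size)

print(conv_output_size(7, 3, stride=1))        # 5
print(conv_output_size(7, 3, stride=2))        # 3
print(conv_output_size(7, 3, stride=1, pad=1)) # 7  (padding preserves the size)
# conv_output_size(7, 3, stride=3) raises: the filter does not fit
```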
Convolutions: More detail
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10.

Andrej Karpathy
Convolutions: More detail
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76*10 = 760.

Andrej Karpathy
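The 760-parameter count can be read straight off a framework layer; a quick check in PyTorch (an assumption, as before):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)     # 760 = 10 filters * (5*5*3 weights + 1 bias)
```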
Spatial arrangement
• Three hyperparameters control the size of the output volume:
  – Depth: the number of filters, each learning to look for something different in the input.
  – Stride: the stride with which we slide the filter.
  – Zero-padding: padding the input volume with zeros around the border.
Spatial arrangement
• We compute the spatial size of the output volume as a function of
  – the input volume size (W)
  – the receptive field size of the Conv Layer neurons (F)
  – the stride with which they are applied (S)
  – the amount of zero padding used (P) on the border.
• The number of neurons that “fit” is given by (W − F + 2P)/S + 1.
  – For a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output.
  – With stride 2 we would get a 3x3 output.
  – See also: Understanding and Calculating the number of Parameters in Convolution Neural Networks (CNNs), by Rakshith Vasudev, Towards Data Science.
Parameter Sharing
• Parameter sharing controls the number of parameters.
• If there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias, this adds up to 290,400 * 364 = 105,705,600 parameters in the first layer of the ConvNet alone.
• We reduce this by parameter sharing.
• We now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (including the 96 biases).
• During backpropagation, every neuron in the volume computes the gradient for its weights, but these gradients are added up across each depth slice and only update a single set of weights per slice.
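The numbers above correspond to an AlexNet-style first layer (96 filters of size 11x11x3). A short PyTorch check of the shared count against the unshared one (the stride value is an assumption):

```python
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
shared = sum(p.numel() for p in conv1.parameters())
print(shared)                            # 34944 = 96*11*11*3 weights + 96 biases

# Without sharing, each of the 55*55*96 neurons would carry its own 363 weights + 1 bias:
print(55 * 55 * 96 * (11 * 11 * 3 + 1))  # 105705600
```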
Spatial Pooling
• Sum or max over non-overlapping / overlapping regions
• Role of pooling:
  • Invariance to small transformations
  • Larger receptive fields (neurons see more of the input)

Adapted from Rob Fergus; figure from Andrej Karpathy

Pooling Layer
• Insertion of a pooling layer:
  – reduces the spatial size of the representation,
  – reduces the number of parameters and computation in the network, and hence also controls overfitting.
• The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
• The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2 -- this downsamples every depth slice in the input by 2 along both width and height.
• The MAX operation takes a max over 4 numbers (a little 2x2 region in some depth slice).
• The depth dimension remains unchanged.
General pooling layer
• Accepts a volume of size W1×H1×D1
• Requires two hyperparameters:
– their spatial extent F
– the stride S
• Produces a volume of size W2×H2×D2 where:
– W2=(W1−F)/S+1
– H2=(H1−F)/S+1
– D2=D1
• Introduces zero parameters
• Other pooling functions: Average pooling, L2-
norm pooling
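The pooling size formulas can be checked with a standard 2x2, stride-2 max pool; a sketch in PyTorch (again my framework choice):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # F=2, S=2
x = torch.randn(1, 12, 16, 16)                 # W1 = H1 = 16, D1 = 12
print(pool(x).shape)   # torch.Size([1, 12, 8, 8]): W2 = (16-2)/2 + 1 = 8, depth unchanged
```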
General pooling

• Backpropagation: the backward pass for a max(x, y) operation routes the gradient to the input that had the highest value in the forward pass.
• Hence, during the forward pass of a pooling layer you may keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
Fully-connected layer
• Neurons in a fully connected layer have full connections to all
activations in the previous layer
• Their activations can hence be computed with a matrix
multiplication followed by a bias offset.
• Converting FC layers to CONV layers
• the only difference between FC and CONV layers is that the
neurons in the CONV layer are connected only to a local
region in the input, and that many of the neurons in a CONV
volume share parameters.
• However, the neurons in both layers still compute dot
products, so their functional form is identical.
Converting FC layers to CONV layers
• For any CONV layer there is an FC layer that implements the same forward
function.
• The weight matrix would be a large matrix that is mostly zero except for at
certain blocks (due to local connectivity) where the weights in many of the
blocks are equal (due to parameter sharing).
• Conversely, any FC layer can be converted to a CONV layer.
• For example, an FC layer with K=4096 that is looking at some input volume
of size 7×7×512
• can be equivalently expressed as a CONV layer with F=7,P=0,S=1,K=4096.
• In other words, we are setting the filter size to be exactly the size of the
input volume, and hence the output will simply be 1×1×4096 since only a
single depth column “fits” across the input volume, giving identical result
as the initial FC layer.
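A sketch of this FC→CONV conversion, copying the FC weights into an equivalent 7x7 conv layer (PyTorch, with random weights; the reshape layout is an assumption that matches PyTorch's channels-first flattening):

```python
import torch
import torch.nn as nn

fc = nn.Linear(7 * 7 * 512, 4096)            # FC layer looking at a 7x7x512 input volume
conv = nn.Conv2d(512, 4096, kernel_size=7)   # F=7, P=0, S=1, K=4096

# Reshape the FC weight matrix into conv filters so both layers compute the same function.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))                            # shape (1, 4096)
out_conv = conv(x).flatten(1)                        # the 1x1x4096 output volume, flattened
print(torch.allclose(out_fc, out_conv, atol=1e-4))   # True (up to floating-point error)
```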
ConvNet Architectures
Layer Patterns
• The most common architecture
• stacks a few CONV-RELU layers,
• follows them with POOL layers,
• and repeats this pattern until the image has been merged spatially
to a small size.
• At some point, it is common to transition to fully-connected layers.
The last fully-connected layer holds the output, such as the class
scores. In other words, the most common ConvNet architecture
follows the pattern:
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
• N >= 0 (and usually N <= 3), M >= 0, K >= 0
Prefer a stack of small-filter CONV layers to one large-receptive-field CONV layer.
Compare three layers of 3x3 CONV vs a single CONV layer with 7x7 receptive fields.
• The receptive field size is identical in spatial extent (7x7), but the single 7x7 layer has several disadvantages:
  1. Its neurons compute a linear function over the input, while the stack of three CONV layers contains non-linearities that make their features more expressive.
  2. If we suppose that all the volumes have C channels, the single 7x7 CONV layer contains C×(7×7×C) = 49C² parameters, while the three 3x3 CONV layers contain 3×(C×(3×3×C)) = 27C² parameters.
• Intuitively, stacking CONV layers with tiny filters, as opposed to having one CONV layer with big filters, allows us to express more powerful features of the input with fewer parameters.
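The 49C² vs 27C² comparison for a concrete channel count (C = 64 is my example, not from the slides):

```python
C = 64  # example channel count

params_7x7 = C * (7 * 7 * C)        # one 7x7 CONV layer: 49*C^2 (biases ignored)
params_3x3 = 3 * (C * (3 * 3 * C))  # three stacked 3x3 CONV layers: 27*C^2

print(params_7x7, params_3x3)       # 200704 110592 -- same 7x7 receptive field, ~45% fewer parameters
```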
Practical matters
Data Augmentation (Jittering)
• Create virtual training samples
  – Horizontal flip
  – Random crop
  – Color casting
  – Geometric distortion

Deep Image [Wu et al. 2015]

Jia-bin Huang
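The jittering operations listed above map onto standard torchvision transforms (torchvision and the specific parameter values are my choices; the slides name no library):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # horizontal flip
    transforms.RandomCrop(32, padding=4),                   # random crop
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4),                 # color casting
    transforms.RandomAffine(degrees=10, shear=5),           # geometric distortion
    transforms.ToTensor(),
])
# Pass train_transform to the training dataset, e.g. CIFAR10(root, train=True, transform=train_transform)
```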
Transfer Learning

“You need a lot of data if you want to train/use CNNs”

Andrej Karpathy
Transfer Learning with CNNs
Source: classification on ImageNet. Target: some other task/data.

1. Train on ImageNet.
2. Small dataset: freeze the pretrained layers and train only the final layer(s) on the target task.
3. Medium dataset: finetuning — more data = retrain more of the network (or all of it).

Another option: use the network as a feature extractor and train an SVM on the extracted features for the target task.

Adapted from Andrej Karpathy


Transfer Learning with CNNs
Earlier layers are more generic, later layers more specific. Given the target dataset:

• Very little data, very similar dataset: use a linear classifier on the top layer.
• Very little data, very different dataset: you’re in trouble… try a linear classifier on features from different stages.
• Quite a lot of data, very similar dataset: finetune a few layers.
• Quite a lot of data, very different dataset: finetune a larger number of layers.

Andrej Karpathy
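A sketch of the small-dataset case above (freeze the pretrained backbone, train only a new head), using a torchvision ResNet-18; the model choice, target class count, and layer split are my own assumptions, not from the slides:

```python
import torch.nn as nn
from torchvision import models

# 1. Start from a network pretrained on ImageNet (requires a recent torchvision).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2. Small dataset: freeze all pretrained layers...
for param in model.parameters():
    param.requires_grad = False

# ...and train only a new classifier head for the (assumed) 10 target classes.
model.fc = nn.Linear(model.fc.in_features, 10)

# 3. Medium dataset: instead of freezing everything, leave the last block(s) trainable
#    and finetune them together with the new head at a small learning rate.
```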
Image Segmentation
Segmentation divides an image into its constituent regions or objects.
Segmentation of images is a difficult task in image processing and is still under research.

Segmentation allows us to extract objects from images.

Segmentation is unsupervised learning.

Model-based object extraction, e.g., template matching, is supervised learning.
What it is useful for
After successfully segmenting an image, the contours of objects can be extracted using edge detection and/or border-following techniques.

The shape of objects can be described.

Based on shape, texture, and color, objects can be identified.

Image segmentation techniques are extensively used in similarity searches, e.g.:
http://elib.cs.berkeley.edu/photos/blobworld/
Segmentation Algorithms

Segmentation algorithms are based on one of two basic properties of color, gray values, or texture: discontinuity and similarity.

The first category partitions an image based on abrupt changes in intensity, such as edges in an image.

The second category partitions an image into regions that are similar according to a predefined criterion. The histogram thresholding approach falls under this category.
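A minimal example of the histogram-thresholding (similarity-based) category, using OpenCV's Otsu method followed by border following; OpenCV and the input filename are assumptions, and any thresholding routine would do:

```python
import cv2

img = cv2.imread("coins.png", cv2.IMREAD_GRAYSCALE)   # hypothetical grayscale input image

# Otsu picks the threshold automatically from the gray-level histogram.
t, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("chosen threshold:", t)

# Object contours (borders) can then be extracted from the binary mask.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print("objects found:", len(contours))
```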
❖Domain spaces
spatial domain (row-column (rc) space)

histogram spaces

color space

texture space

other complex feature space
