ResNet

Some slides were adapted/taken from various sources, including Andrew Ng's Coursera lectures, the CS231n: Convolutional Neural Networks for Visual Recognition lectures (Stanford University), lectures from the University of Waterloo, Canada, Aykut Erdem et al.'s tutorial on Deep Learning in Computer Vision, Ismini Lourentzou's lecture slides on "Introduction to Deep Learning", Ramprasaath's lecture slides, and many more. We thankfully acknowledge them. Students are requested to use this material for their study only and NOT to distribute it.
In this Lecture

• Introducing a breakthrough neural network architecture introduced in 2015.
• Why go deep?
• What is the problem with learning deep networks?
• ResNet and how it allows us to gain more performance via deeper networks.
• Some results, improvements and further work.

[Figure: ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners by year.]
Deep vs Shallow Networks
What happens when we continue stacking deeper layers on a “plain” convolutional
neural network?

[Figure: training error and test error vs. iterations for 20-layer and 56-layer plain networks.]

The 56-layer model performs worse on both training and test error.
-> The deeper model performs worse, but this is not caused by overfitting!
Deeper models are harder to optimize

• The deeper model should be able to perform at least as well as the shallower model.
• A solution by construction: copy the learned layers from the shallower model and set the additional layers to identity mappings.

Challenges

• Deeper neural networks start to degrade in performance.
• Vanishing/exploding gradients – may require extremely careful parameter initialization to make training work, and can still appear even with the best initialization.
• Long training times – due to the very large number of training parameters.

ResNet

• A specialized network introduced by Microsoft.
• Connects the inputs of layers to later parts of the network, creating "shortcuts".
• Simple idea – great improvements in both performance and training time.

Plain Network

Residual Blocks
x -> Big NN -> a[l]   (plain network)
x -> Big NN -> a[l] -> two layers -> a[l+2], with a skip connection carrying a[l] forward   (residual block)

a[l+2] = g(z[l+2] + a[l])
       = g(W[l+2] a[l+1] + b[l+2] + a[l])
       = g(a[l])   if W[l+2] = 0 and b[l+2] = 0

The identity function is easy to learn for a residual block (a numerical sketch follows below).
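A minimal numerical sketch of this computation (assuming ReLU for g(); the array sizes and variable names are illustrative, not from the slides):

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
a_l = relu(rng.normal(size=4))            # a[l], already a post-ReLU activation
W1, b1 = rng.normal(size=(4, 4)), np.zeros(4)
W2, b2 = np.zeros((4, 4)), np.zeros(4)    # set W[l+2] = 0 and b[l+2] = 0

a_l1 = relu(W1 @ a_l + b1)                # a[l+1]
z_l2 = W2 @ a_l1 + b2                     # z[l+2]
a_l2 = relu(z_l2 + a_l)                   # a[l+2] = g(z[l+2] + a[l])

print(np.allclose(a_l2, a_l))             # True: the block collapses to the identity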
Skip Connections “shortcuts”

• Such connections are referred to as skip connections or shortcuts. In general, similar models can skip over several layers.
• The residual part of the network is treated as a unit with an input and an output.
• The input of the residual part is added to its output – the dimensions are usually the same (see the sketch below).
• Another option is to use a projection to the output space.
• With an identity shortcut, no additional training parameters are used (a projection adds only a small number).
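A minimal PyTorch sketch of both shortcut options (this assumes the torch library; the class and dimension choices are illustrative, not the authors' code):

import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Fully connected residual unit: output = relu(F(x) + shortcut(x))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)
        # Identity shortcut when the dimensions match, otherwise a projection.
        self.shortcut = nn.Identity() if in_dim == out_dim else nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):
        residual = self.fc2(torch.relu(self.fc1(x)))
        return torch.relu(residual + self.shortcut(x))   # skip connection adds the input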

ResNet
He et al., 2015

With x = a[l] and g() being ReLU, the residual function is F(x) = W[l+2] a[l+1] + b[l+2], so the mapping H(x) of the plain layers is replaced by the new H(x) = W[l+2] a[l+1] + b[l+2] + a[l].
Slide credit: Fei-Fei Li et al.

ResNet
He et al., 2015

Referring to the original desired mapping as H(x):
The residual part now fits a new function F(x) = H(x) - x.
The original mapping is recast as F(x) + x.
It is easier to learn the residual F(x) than the original H(x).
Slide credit: Fei-Fei Li et al.
ResNet as a ConvNet
• Until now we have talked about fully connected layers.
• The ResNet idea can easily be extended to convolutional models (a sketch follows below).
• Other adaptations of this idea can easily be introduced to almost any kind of deep layered network.
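For illustration, a hedged PyTorch sketch of the same residual idea with convolutions (assuming the torch library; the channel count and the use of batch normalization here are assumptions for a self-contained example):

import torch.nn as nn
import torch.nn.functional as F

class ConvResidualBlock(nn.Module):
    """Residual block built from two 3x3 convolutions and an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # the skip connection adds the input feature maps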

ResNet Architecture

Full ResNet architecture:
- Stack residual blocks.
- Every residual block has two 3x3 conv layers.
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each spatial dimension).
- Additional conv layer at the beginning (7x7 conv, 64, /2).
- No FC layers at the end: a global average pooling layer after the last conv layer, and only an FC 1000 to the output classes.
- Total depths of 34, 50, 101, or 152 layers for ImageNet (see the depth arithmetic below).

[Figure: a residual block (3x3 conv -> relu -> 3x3 conv, with the identity x added to F(x) before the final relu) and the full stack: Input -> 7x7 conv, 64, /2 -> Pool -> 3x3 conv, 64 blocks -> 3x3 conv, 128, /2 and further 128 blocks -> ... -> 3x3 conv, 512 blocks -> Pool -> FC 1000 -> Softmax.]
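A quick sketch of the depth arithmetic (the per-model block counts below are the ones commonly quoted for these architectures and are stated here as an assumption, not taken from the slides):

# depth = convs per block * number of blocks + the initial 7x7 conv + the final FC.
configs = {
    "ResNet-34":  (2, [3, 4, 6, 3]),    # basic blocks: two 3x3 convs each
    "ResNet-50":  (3, [3, 4, 6, 3]),    # bottleneck blocks: 1x1, 3x3, 1x1
    "ResNet-101": (3, [3, 4, 23, 3]),
    "ResNet-152": (3, [3, 8, 36, 3]),
}
for name, (convs_per_block, blocks) in configs.items():
    depth = convs_per_block * sum(blocks) + 2
    print(name, depth)                   # prints 34, 50, 101, 152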
ResNet Architecture

For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet).

[Figure: bottleneck block on a 28x28x256 input. A 1x1 conv with 64 filters projects down to 28x28x64, the 3x3 conv then operates over only 64 feature maps, and a final 1x1 conv with 256 filters projects back to 256 feature maps, giving a 28x28x256 output.]
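A hedged PyTorch sketch of such a bottleneck block (assuming the torch library; channel counts follow the 256 -> 64 -> 256 example above, and the batch-norm placement is an assumption):

import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity skip connection."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)   # 256 -> 64
        self.conv3x3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)    # 64 -> 256
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv3x3(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)   # identity shortcut: input and output are both 28x28x256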
Residual Blocks (skip connections)
Deeper Bottleneck Architecture

Deeper Bottleneck Architecture (Cont.)
• Addresses the high training time of very deep networks.
• Keeps the time complexity roughly the same as a block of two 3x3 convolutions (see the comparison below).
• Allows us to increase the number of layers.
• Allows the model to converge much faster.
• The 152-layer ResNet has 11.3 billion FLOPs, while the VGG-16/19 nets have 15.3/19.6 billion FLOPs.
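A back-of-the-envelope weight count at 256 channels (biases and batch norm ignored; this is an illustrative sketch, not a figure from the paper):

# Two plain 3x3 convs on 256 channels vs. the 1x1 -> 3x3 -> 1x1 bottleneck.
plain = 2 * (3 * 3 * 256 * 256)
bottleneck = (1 * 1 * 256 * 64) + (3 * 3 * 64 * 64) + (1 * 1 * 64 * 256)
print(plain, bottleneck)   # 1179648 vs. 69632 weights: the bottleneck is far cheaper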

Why Do ResNets Work Well?

Why Do ResNets Work Well? (Cont.)
• In theory, a ResNet can represent the same functions as the corresponding plain network, but in practice, due to the above, convergence is much faster.
• No additional training parameters are introduced.
• No additional complexity is introduced.

Training ResNet in practice
• Batch Normalization after every CONV layer.
• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9).
• Learning rate: 0.1, divided by 10 when the validation error plateaus.
• Mini-batch size 256.
• Weight decay of 1e-5.
• No dropout used (a configuration sketch follows below).
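A hedged PyTorch sketch of this recipe (assuming the torch library; `model`, `train_loader`, `val_error()` and the epoch count are hypothetical placeholders, while the hyperparameters come from the bullets above):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)
# Divide the learning rate by 10 when the validation error plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
loss_fn = torch.nn.CrossEntropyLoss()      # softmax + cross-entropy loss

for epoch in range(90):                    # epoch budget is an assumption
    for images, labels in train_loader:    # mini-batches of size 256
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step(val_error())            # hypothetical validation-error function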
Loss Function
• To measure the loss of the model, a combination of cross-entropy and softmax was used.
• The network outputs are normalized with the softmax function, and the cross-entropy loss is computed on the resulting probabilities.
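As a small sketch (assuming the torch library), the explicit softmax-then-cross-entropy computation matches the combined, numerically stable loss:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 1000)                  # network outputs for a batch of 4
labels = torch.randint(0, 1000, (4,))

probs = F.softmax(logits, dim=1)                              # normalize to probabilities
manual = -torch.log(probs[torch.arange(4), labels]).mean()    # cross-entropy on probabilities
combined = F.cross_entropy(logits, labels)                    # same quantity, computed stably

print(torch.allclose(manual, combined))        # True (up to numerical precision)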

Results
Experimental results:
- Able to train very deep networks without degrading (152 layers on ImageNet, 1202 on CIFAR).
- Deeper networks now achieve lower training error, as expected.
- Swept 1st place in all ILSVRC and COCO 2015 competitions.
ILSVRC 2015 classification winner (3.6% top-5 error) – better than "human performance"! (Russakovsky 2014)
Comparing Plain to ResNet (18/34 Layers)

Comparing Plain to Deeper ResNet
[Figure: test error and training error curves.]
ResNet on More than 1000 Layers
• To further improve the learning of extremely deep ResNets, "Identity Mappings in Deep Residual Networks" (Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, 2016) suggests passing the input directly to the final residual layer, allowing the network to easily learn to pass the input as an identity mapping in both the forward and backward passes.
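Sketching the key identities from that formulation (assuming identity skip connections and an identity after-addition mapping), the signal and the gradient propagate as

\[
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i),
\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right),
\]

so any shallower unit x_l receives a direct additive signal in the forward pass and a direct gradient term from the loss E in the backward pass.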

Identity Mappings in Deep Residual Networks

Identity Mappings in Deep Residual Networks: Improvement on CIFAR-10

• Another important improvement – using Batch Normalization as pre-activation improves regularization (a pre-activation block is sketched below).
• This improvement leads to better performance for smaller networks as well.
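A hedged PyTorch sketch of a pre-activation residual block (assuming the torch library), where batch normalization and ReLU come before each convolution and the shortcut stays a pure identity:

import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation residual block: (BN -> ReLU -> conv) twice, plus identity."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))   # BN and ReLU act as pre-activation
        out = self.conv2(F.relu(self.bn2(out)))
        return out + x                          # clean identity path, no final ReLU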

Reduce Learning Time with Random Layer Drops
• Drop layers during training, and use the full network at test time.
• Residual blocks are used as the network's building blocks.
• During training, the input flows through both the shortcut and the weight layers.
• Training: each layer has a "survival probability" and is randomly dropped.
• Testing: all blocks are kept active, and each block's output is re-calibrated according to its survival probability during training (see the sketch below).
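A hedged PyTorch sketch of this stochastic-depth idea (assuming the torch library; the survival probability value and class names are illustrative):

import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly dropped during training."""
    def __init__(self, block, survival_prob=0.8):
        super().__init__()
        self.block = block                     # any residual function F(x)
        self.p = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p:
                return x + self.block(x)       # block survives this forward pass
            return x                           # block dropped: identity only
        return x + self.p * self.block(x)      # test: re-calibrate by survival probability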

