Deep Learning Workshop Series
CNN Architectures
Plan A
CONTENT
★ CNN Architectures (with interactive implementation)
○ VGG-Net: 3x3 vs 11x11 Convolution
○ Inception-Net: “1x1 convolution” vs “Fully Connected”
○ MobileNet: Depthwise (Separable) Convolutions for Training Light Models
○ ShuffleNet (not implemented)
○ SqueezeNet: Distributed Training of Networks
○ ResNet: Residuals in Convolution Operations
○ DenseNet: Dense Connections in Convolution Operations
★ Extras (short summary only)
○ Feature Pyramid Networks (FPNs)
○ Neural ODEs
Plan A
PROGRAM FLOW
10:00 - 10:30   3x3 vs 11x11
10:30 - 11:10   “1x1 conv” vs “FC”
11:10 - 12:00   Depthwise (separable) conv
12:00 - 12:45   BREAK
12:45 - 13:05   Channel shuffling
13:05 - 13:50   SqueezeNet
13:50 - 14:20   Residuals & Dense Connections in ConvNets
14:20 - 14:30   BREAK
14:30 - 15:00   Implementation of ResNet
15:00 - 15:30   Implementation of DenseNet
15:30 - 15:45   Extras
16:00 -         Q&A
Plan B
CONTENT
★ Part 1 : A Brief Intro to Convolution Operations
★ Part 2 : Popular CNN Architectures (interactive implementation)
○ VGG-Net: 3x3 vs 11x11 Convolution
○ Inception-Net: “1x1 convolution” vs “Fully Connected”
○ MobileNet: Depthwise (Separable) Convolutions for Training Light Models
○ SqueezeNet: Distributed Training of Networks
○ ResNet: Residuals in Convolution Operations
○ DenseNet: Dense Connections in Convolution Operations
★ Extras (short summary only)
○ Feature Pyramid Networks (FPNs)
○ Neural ODEs
Plan B
PROGRAM FLOW
10:00 - 10:30   Part 1: A Historical Review on Deep CNNs
10:30 - 11:00   3x3 vs 11x11
11:00 - 12:45   “1x1 conv” vs “FC”
12:45 - 13:30   BREAK
13:30 - 14:20   Depthwise (separable) conv
14:20 - 15:00   SqueezeNet
15:00 - 15:10   BREAK
15:10 - 15:30   Residuals & Dense Connections in ConvNets
15:30 - 16:00   Implementation of ResNet
16:00 - 16:30   Implementation of DenseNet
16:30 -         Closing & Q&A
Part 1 : A Brief Intro to Convolution Operations
Computer Vision
Computer vision is a field that deals with gaining high-level understanding from images or videos.
Images are represented by pixel values, e.g. a 64 x 64 x 3 array (3-channel RGB).
Goal:
Extracting meaningful features from an image:
- edges, corners, colors,
- shapes, patterns,
- statistical features (histograms, ...)
Tool:
Algorithms
- gray-scaling, thresholding,
- complex descriptors (HOG, SIFT, SURF, etc.)
...full of hand-engineering!
...well, then: we generalize all these hand-engineered techniques by applying convolution operations within neural networks.
● Convolutions are filtering operations
● Different filter (kernel) sizes unveil different image feature information
● Commonly in two steps:
1. Slide the same fixed kernel across the image
2. Calculate the dot product between the kernel and each image patch (a minimal sketch follows)
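A minimal sketch of those two steps in plain Python/NumPy (not the workshop notebook; the 4x4 image and the sharpening kernel are just illustrative values):

```python
import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):                 # 1. slide the same fixed kernel across the image
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # 2. dot product between kernel and patch
    return out

image = np.array([[1, 1, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)
print(conv2d_valid(image, sharpen))     # 2x2 feature map ("valid" convolution, no padding)
```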
Example kernels: blurring, sharpening, edge detection. Let’s try together!
Extracting Image Features via ConvOps
Extracting useful information from an image: sliding windows (kernels or filters) are used to convolve an input image (e.g. a photo of a panda).
Feature learning: early layers capture simple features (edges, colors); deeper layers capture complex features (shapes, textures).
Network Structure
Example input (4x4):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
Convolution: kernel size 3x3, stride 1, padding “same” → Max pooling: 2x2, stride 2
Convolution Visualizer (a sketch of this pipeline follows)
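A minimal sketch of the slide’s example in PyTorch (assumed framework, not necessarily the workshop notebook): a 3x3 conv with stride 1 and “same” padding, followed by 2x2 max pooling with stride 2, applied to the 4x4 input above.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)   # (batch, channels, H, W)

conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)  # padding=1 == "same" for 3x3
pool = nn.MaxPool2d(kernel_size=2, stride=2)

features = conv(x)            # shape (1, 1, 4, 4): spatial size preserved by "same" padding
downsampled = pool(features)  # shape (1, 1, 2, 2): 2x2 max pooling halves H and W
print(features.shape, downsampled.shape)
```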
A brief history...
Convolution operations were first introduced into machine learning by Yann LeCun at AT&T Bell Laboratories (LeCun et al. 1989; Fukushima 1980; Waibel 1987).
Backpropagation was first applied to ConvNets in LeCun et al. 1989, and improved with gradient-based learning in the LeNet-5 architecture (1998).
LeNet-5, LeCun et al., 1998
REFERENCES
[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation Applied to Handwritten Zip Code
Recognition; AT&T Bell Laboratories
[2] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF).
Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. Retrieved October 7, 2016.
[3] The History of Neural Networks
by Eugenio Culurciello
https://dataconomy.com/2017/04/history-neural-networks/
[4] Convolutions by AI Shack
Utkarsh Sinha
http://aishack.in/tutorials/image-convolution-examples/
[5] The History of Neural Networks
Andrew Fogg
https://www.import.io/post/history-of-deep-learning/
[6] Overview of Convolutional Neural Networks for Image Classification
Intel Academy https://software.intel.com/en-us/articles/hands-on-ai-part-15-overview-of-convolutional-neural-networks-for-image-
classification
[7] Convolution Arithmetic
https://github.com/vdumoulin/conv_arithmetic
Snippet Implementation
Part 2 : Convolutions in Deep Architectures
3x3 vs 11x11
Filter size 11x11: bigger filter size → more global information captured.
Filter size 3x3: smaller filter size → more local information captured.
(Filters are applied across all input channels: B, G, R.)
AlexNet vs VGG
● Convolution filter sizes:
○ AlexNet: 11x11, 5x5, 3x3
○ VGGNet: 3x3 only
● Network depth:
○ AlexNet: 8 layers
○ VGGNet: 16 layers
Efficiency & Computation
                          AlexNet    VGG-16
# of convolution layers   5          13
Convolution parameters    3.8M       15M
# of FC layers            3          3
FC layer parameters       59M        ~124M
Total parameters          62M        138M
ImageNet (top-5) error    17%        7.3%
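A rough way to reproduce parameter counts like those in the table, assuming torchvision is installed (exact numbers depend on the model variant used, so they will not match the slide to the last digit):

```python
import torchvision.models as models

def count_params(model):
    return sum(p.numel() for p in model.parameters())

alexnet = models.alexnet()
vgg16 = models.vgg16()
print(f"AlexNet: {count_params(alexnet) / 1e6:.1f}M parameters")   # ~61M
print(f"VGG-16:  {count_params(vgg16) / 1e6:.1f}M parameters")     # ~138M
```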
REFERENCES
[1] Different Kinds of Convolutional Filters
Soham Chatterjee
https://www.saama.com/blog/different-kinds-convolutional-filters/
[2] Image recognition by Deep Learning on mobile
https://qiita.com/negi111111/items/c46635b5d70058ebae93
[3] Paper Explanation : VGGNet
Mohit Jain
https://mohitjain.me/2018/06/07/vggnet/
[4] CNN Architectures - VGGNet
Gary Chang
https://medium.com/deep-learning-g/cnn-architectures-vggnet-e09d7fe79c45
Interactive Implementation
1x1 (pointwise) conv
Proposed in “Network in Network” by Min Lin et al. (2013)
● Micro network - mlpconv
○ uses a multilayer perceptron (FC layers) instead of a linear filter
● Better discrimination of local patches
● Cross-channel pooling
○ Weighted linear combination of feature maps → ReLU
○ Complex and learnable interactions of cross-channel information
1x1 (pointwise) conv
● adds non-linearity
● doesn’t care about spatial information
● feature pooling: shrinks the # of channels
● can replace fully-connected layers
Example: 6x6x32 input * 1x1x32x16 kernel → 6x6x16 output (sketched below)
❏ Decreases the computation (NxN conv → 1x1 conv)
❏ Decreases the parameters (FC → 1x1 conv)
❏ Gets more valuable combinations of filters: represents “M” features with “N” features
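A minimal sketch of the 6x6x32 → 6x6x16 example as a 1x1 (pointwise) convolution in PyTorch (assumed framework):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 6, 6)                  # NCHW: 32 channels, 6x6 spatial
pointwise = nn.Conv2d(32, 16, kernel_size=1)  # the 1x1x32x16 kernel from the slide
y = pointwise(x)
print(y.shape)                                # torch.Size([1, 16, 6, 6])

# weights: 32*16 = 512 (+16 biases) -- far fewer than an FC layer over all 6*6*32 inputs
print(sum(p.numel() for p in pointwise.parameters()))   # 528
```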
Inception Module
● CNN design has a lot of parameters:
○ Conv: 1x1? 3x3? 5x5?
○ Pooling: 3x3?
● Do them all (at once)!!!
● Let the network learn:
○ whatever parameters,
○ whatever combination of these filter sizes it wants to learn
● → Inception Layer
Inception → GoogLeNet
Computation (convolution) per inception layer
● 5x5 conv ⇒ (28x28)x(5x5x192)x(32) ≈ 120M
● 3x3 conv ⇒ (28x28)x(3x3x192)x(128) ≈ 170M
● 1x1 conv ⇒ (28x28)x(1x1x192)x(64) ≈ 10M
● In total; ≈ 300M computations
“Bottleneck layer”: a 1x1 conv (192 → 16 channels) followed by the 5x5 conv (16 → 32 channels)
○ (28x28) x (1x1x192) x (16) ≈ 2.4M
○ + (28x28) x (5x5x16) x (32) ≈ 10M
○ ≈ 12.4M in total
● 10x less computation!
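A quick back-of-the-envelope check of the slide’s numbers in plain Python (multiplications counted as out_H x out_W x k x k x C_in x C_out; values are the slide’s, not measured):

```python
def mults(out_hw, k, c_in, c_out):
    # multiplications for one conv layer: every output position does a k*k*c_in dot product per filter
    return out_hw * out_hw * k * k * c_in * c_out

naive_5x5  = mults(28, 5, 192, 32)                          # ~120M
bottleneck = mults(28, 1, 192, 16) + mults(28, 5, 16, 32)   # ~2.4M + ~10M ≈ 12.4M
print(f"{naive_5x5 / 1e6:.1f}M vs {bottleneck / 1e6:.1f}M "
      f"(~{naive_5x5 / bottleneck:.0f}x fewer multiplications)")
```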
Inception → GoogLeNet
GoogLeNet (Inception v1)
Vanishing gradient problem?
● Add two auxiliary classifiers
○ Prevents the middle part of the network from “dying out”
○ Regularization effect: helps prevent overfitting
● Total loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
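A minimal sketch of that total loss in PyTorch (assumed framework); it assumes the model returns main logits plus two auxiliary-classifier logits during training — the names are hypothetical:

```python
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets):
    real_loss = F.cross_entropy(main_logits, targets)
    aux_loss_1 = F.cross_entropy(aux1_logits, targets)
    aux_loss_2 = F.cross_entropy(aux2_logits, targets)
    # the auxiliary heads only contribute during training; they are dropped at inference
    return real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
```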
Inception v2
The premise:
● Reduce the “representational bottleneck”: loss of information caused by reducing the dimensions too much
● Use smart factorization methods
The solution:
● Factorize the 5x5 conv into two 3x3 convs
○ Improves computational speed: a 5x5 conv is 2.78x more expensive than a 3x3 conv
REFERENCES
[1] Network In Network
Min Lin, Qiang Chen, Shuicheng Yan
https://arxiv.org/pdf/1312.4400v3.pdf
[2] Network in Networks and 1x1 Convolutions
Andrew Ng
https://www.coursera.org/lecture/convolutional-neural-networks/networks-in-networks-and-1x1-convolutions-ZTb8x
[3] One by One [1x1] Convolution - counter-intuitively useful
Aaditya Prakash
https://iamaaditya.github.io/2016/03/one-by-one-convolution/
[4] Deep Learning series: Convolutional Neural Networks
Mike Cavaioni
https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524
[5] Going Deeper with Convolutions
Christian Szegedy et.al.
http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
[6] A Simple Guide to the Versions of the Inception Network
Bharath Raj
https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202
Interactive Implementation
Depthwise convolution
Difference:
● So far:
○ 2D convolutions were performed over all input channels
○ This lets us mix channels
● Depthwise convolution:
○ Each channel is kept separate
Approach:
● Split the input tensor into channels & split the kernel into channels
● For each channel, convolve the input channel with the corresponding filter channel → 2D tensor
● Stack the output (2D) tensors back together (see the sketch below)
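A minimal sketch of a depthwise convolution in PyTorch (assumed framework): setting groups equal to the number of input channels keeps every channel separate, i.e. one filter per channel and no channel mixing.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)                                       # 3-channel input
depthwise = nn.Conv2d(3, 3, kernel_size=5, padding=2, groups=3)   # one 5x5 filter per channel
y = depthwise(x)
print(y.shape)   # torch.Size([1, 3, 8, 8]) -- channels are filtered but never mixed
```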
Depthwise separable conv
● Depthwise convolution is commonly used in combination with an additional step → depthwise separable convolution
○ 1. Filtering (depthwise convolution)
○ 2. Combining (1x1 convolution)
Depthwise separable convolution:
● Depthwise convolution → 1x1 convolution across channels
● Far fewer operations
○ Input: 8x8x3, output: 8x8x256
○ Original conv:
■ (8x8) x (5x5x3) x (256) → 1,228,800
○ Depthwise separable conv:
■ (8x8) x (5x5x1) x (3) → 4,800
■ (8x8) x (1x1x3) x (256) → 49,152
■ Total: 53,952
○ 1,228,800 / 53,952 ≈ 23x fewer multiplications (checked in the sketch below)
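A quick check of these counts in plain Python, using the same multiplication count (out_H x out_W x k x k x C_in x C_out) as before:

```python
out_hw, k, c_in, c_out = 8, 5, 3, 256

standard  = out_hw * out_hw * k * k * c_in * c_out   # 1,228,800: one big 5x5 conv
depthwise = out_hw * out_hw * k * k * 1 * c_in       # 4,800:  filtering step (one 5x5 filter per channel)
pointwise = out_hw * out_hw * 1 * 1 * c_in * c_out   # 49,152: combining step (1x1 conv across channels)
separable = depthwise + pointwise                    # 53,952
print(standard, separable, round(standard / separable, 1))   # ~22.8x fewer multiplications
```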
MobileNet
● Depthwise separable convolution
● Shrinking hyperparameters:
○ Width multiplier (𝛼): adjusts the channel numbers
○ Resolution multiplier (ρ): adjusts input image and
feature map spatial dimensions
REFERENCES
[1] Depthwise separable convolutions for machine learning
Eli Bendersky
https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/
[2] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew G.Howard, et. al. (Google Inc.)
https://arxiv.org/pdf/1704.04861.pdf
[3] Xception: Deep Learning with Depthwise Separable Convolutions
Francois Chollet
https://arxiv.org/pdf/1610.02357.pdf
[4] A Basic Introduction to Separable Convolutions
Chi-Feng Wang
https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728
Interactive Implementation
Group Convolutions
Grouped Convolutions:
● First proposed in AlexNet
○ due to (GPU) memory constraints
● Decreases the number of operations
○ 2 groups → 2x fewer operations
● (+) Learns better representations
○ feature relationships are sparse
● (-) Outputs of a given group are derived from only a small fraction of the input channels (see the sketch below)
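A minimal sketch of a grouped convolution in PyTorch (assumed framework): with groups=2, each half of the output channels only sees half of the input channels, which halves the number of weights.

```python
import torch.nn as nn

full    = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)            # 64*64*3*3 = 36,864 weights
grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=2, bias=False)  # 2 * (32*32*3*3) = 18,432 weights
print(sum(p.numel() for p in full.parameters()),
      sum(p.numel() for p in grouped.parameters()))
```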
Channel shuffling
Channel shuffling:
● Eliminates the main side effect of grouped convolutions:
○ outputs of a given group are derived from only a small fraction of the input channels
● Solution:
○ conv1_output → channel shuffling → conv2_input
● Applies group convolutions on the 1x1 layers as well
● (!) channel shuffling is also differentiable (a sketch follows)
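A minimal sketch of the channel shuffle operation (ShuffleNet-style) in PyTorch (assumed framework): reshape the channel dimension into (groups, channels_per_group), transpose, and flatten back. Because it is built only from reshape/transpose, it is differentiable.

```python
import torch

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave the groups
    return x.view(n, c, h, w)                  # flatten back to (N, C, H, W)

x = torch.randn(1, 8, 4, 4)
print(channel_shuffle(x, groups=2).shape)      # torch.Size([1, 8, 4, 4])
```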
ShuffleNet
Channel shuffling:
● Applies group convolutions on the 1x1 layers as well
○ by grouping filters, computation decreases significantly
● (!) Remember the side effect of grouped convolutions
○ channel shuffling addresses this issue
● (!) channel shuffling is also differentiable
(Bottleneck unit with depthwise convolution vs. ShuffleNet unit with pointwise group convolution)
REFERENCES
[1] Convolutions Type
Illarion Khlestov
https://ikhlestov.github.io/pages/machine-learning/convolutions-types/#depthwise-separable-convolutions-
separable-convolutions
[2] A Tutorial on Filter Groups (Grouped Convolution)
Yani Ioannou
https://blog.yani.io/filter-group-tutorial/
[3] ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun
https://arxiv.org/abs/1707.01083
SqueezeNet
is:
● a smart and small architecture which achieves:
● → AlexNet-level accuracy (on ImageNet) with
○ 50x fewer parameters
○ ~500x smaller model size after compression
● → 3 times faster
● → a Fully Convolutional Network (FCN), i.e. no FC layers
SqueezeNet
SqueezeNet uses multiple tricks (trick → gain or analysis outcome):
● Replace 3x3 filters with 1x1 (pointwise) filters → reduces the computation by ~1/9
● Use 1x1 filters as a bottleneck (squeeze) layer → reduces channels → reduces computation
● Use 3x3 filters in the fire module → affects final accuracy
● Late downsampling → preserves feature map spatial dimensions
● Network compression → smaller networks can also be compressed
● 1x1 vs. 3x3 ratio analysis → accuracy trade-off
● Bypass connections → help alleviate the representational bottleneck caused by the squeeze layer (in the fire module)
SqueezeNet
Fire module:
● Squeeze layer: only 1x1 filters (bottleneck)
● Expand layer: 1x1 and 3x3 filters, outputs concatenated (see the sketch below)
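A minimal sketch of a fire module in PyTorch (assumed framework): a 1x1 “squeeze” layer followed by parallel 1x1 and 3x3 “expand” layers whose outputs are concatenated; the channel counts below match one common SqueezeNet configuration but are illustrative.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)          # bottleneck
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)   # concatenate expand outputs

y = Fire(96, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)(torch.randn(1, 96, 55, 55))
print(y.shape)   # torch.Size([1, 128, 55, 55])
```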
REFERENCES
[1] Notes on SqueezeNet
Hao Gao
https://medium.com/@smallfishbigsea/notes-of-squeezenet-4137d51feef4
[2] Review: SqueezeNet (Image Classification)
Sik-Ho Tsang
https://towardsdatascience.com/review-squeezenet-image-classification-e7414825581a
[3] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer
https://arxiv.org/abs/1602.07360
Interactive Implementation
Residuals in ConvNets
Power of going deeper = richer contextual information
● low-level features → high-level features as networks grow deeper (n-layer networks)
Is there any limitation to having more depth?
● Gradients vanish after repeated layer operations (plots: gradient norm vs. iteration, accuracy vs. epochs)
Recap : Backpropagation
● the relation of outputs to inputs
● gradients of a single node... now imagine all the calculations...
● all we want is a good parameter update!
● forward operations vs. backward gradient flow: strong gradients near the output become weak gradients after many layers
The core idea of ResNet is introducing an “identity shortcut (residual) connection”:
● a standard connection vs. a shortcut that skips one or more layers
● easy gradient flow via the shortcuts
(Block diagrams: plain block vs. ResNet block vs. ResNeXt block; a sketch of a residual block follows.)
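A minimal sketch of a basic residual block in PyTorch (assumed framework): two 3x3 convolutions plus an identity shortcut, so gradients can also flow directly through the addition.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # identity shortcut: F(x) + x

print(BasicBlock(64)(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```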
REFERENCES
[1] DenResNet: Ensembling Dense Networks and Residual Networks
Victor Cheung
http://cs231n.stanford.edu/reports/2017/pdfs/933.pdf
[2] An Overview of ResNet and its Variants
Vincent Fung
https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
[3] The Efficiency of Densenet
Hao Gao
https://medium.com/@smallfishbigsea/densenet-2b0889854a92
[4] Understanding and Implementing Architectures of ResNet and ResNeXt
Prakash Jay
https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-
image-cc5d0adf648e
[5] Hand-Gesture Classification using Deep Convolution and Residual Neural Network
Sandipan Dey
https://sandipanweb.wordpress.com/2018/01/20/hand-gesture-classification-using-deep-convolution-and-residual-neural-network-
with-tensorflow-keras-in-python/
Extras:
ResNet vs ResNeXt vs Inception-ResNet
● Trend of split-transform-merge
● Minor changes on ResNet
● Inception style in ResNet
● Similar convolution topology
Interactive Implementation
Dense Connections in ConvNets
Connect every layer to every other layer.
Transition layer: 1x1 conv + pooling
● Standard connectivity: successive convolutions
● ResNet connectivity: element-wise feature summation
● DenseNet connectivity: feature concatenation
● Power of feature reuse
● Maintains low complexity
● More shortcut connections → better gradient flow (supervision of gradients)
● Fewer parameters, computationally efficient
★ Bottleneck layer
● Error vs. parameters & computation trade-off (a sketch of a dense block follows)
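A minimal sketch of DenseNet-style connectivity in PyTorch (assumed framework): every layer receives the concatenation of all previous feature maps and adds growth_rate new channels.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                # layer i sees all previously produced feature maps
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1, bias=False),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # feature concatenation
        return torch.cat(features, dim=1)

print(DenseBlock(64, growth_rate=32, num_layers=4)(torch.randn(1, 64, 16, 16)).shape)
# torch.Size([1, 192, 16, 16])  -> 64 + 4*32 channels
```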
Part 3 : Extras
State-of-the-art models compared by # of parameters and # of FLOPs
Source: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
Neural ODEs
● So far we have been using a layered (discrete) approach in neural networks
● It works well for discriminating classes
● but falls short for continuous events
● such as health records taken at random times...
Is it possible to achieve continuity?
Let’s look a bit closer at a neural net:
● A plain network vs. a residual network: the residual update h(t+1) = h(t) + f(h(t)) is very similar to an Euler step
● Neural ODEs re-parameterize the continuous dynamics of the hidden states by an ODE: dh(t)/dt = f(h(t), t, θ)
● ResNet (discrete layers) vs. ODE-Net (continuous depth); a small sketch follows
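A minimal sketch contrasting the two views in PyTorch (assumed framework); the dynamics function f and the fixed-step Euler loop are illustrative only — a real ODE-Net would use an adaptive ODE solver (e.g. the torchdiffeq library):

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 4))  # hypothetical dynamics f(h)
h = torch.randn(1, 4)

# ResNet view: one discrete update per layer, h_{t+1} = h_t + f(h_t)
h_resnet = h + f(h)

# ODE view: dh/dt = f(h); many small Euler steps approximate the continuous dynamics
dt, h_ode = 0.1, h.clone()
for _ in range(10):
    h_ode = h_ode + dt * f(h_ode)
```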
Feature Pyramid Networks
Top-down pathway restores resolution with rich semantic information.
Bottom-Up Pathway:
● a backbone (e.g. ResNet) downscales feature maps with (strided) convolutions and pooling
Top-Down Pathway:
● applies 1x1 convolutions to the lateral (bottom-up) feature maps, upsamples the coarser maps with nearest-neighbour interpolation, and merges them by element-wise addition of feature maps (a sketch follows)
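A minimal sketch of one FPN top-down merge step in PyTorch (assumed framework): a 1x1 lateral convolution on the bottom-up map, nearest-neighbour upsampling of the coarser map, and element-wise addition. The channel counts and spatial sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lateral = nn.Conv2d(512, 256, kernel_size=1)   # 1x1 conv on the bottom-up feature map

c4 = torch.randn(1, 512, 28, 28)               # bottom-up feature map (finer resolution)
p5 = torch.randn(1, 256, 14, 14)               # top-down feature map (coarser resolution)

p4 = lateral(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")  # merge by addition
print(p4.shape)   # torch.Size([1, 256, 28, 28])
```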
Interactive Implementation
Appendix : State-of-the-art
Cheat Sheet
Convolution Operations
> Convolutions are basically a filtering operation used in the CV world
> Extracting useful information from images
> Sliding windows (kernels or filters) are used to convolve an input image

Convolution in CNN Architectures
> convolution: filtering
> stride: sliding step size
> padding: control output size
> pooling: downsampling

1x1 conv
> feature pooling
> decreases parameters
> decreases computation
> adds nonlinearity
> e.g. 6x6x32 * 1x1x32x16 → 6x6x16

Inception / GoogLeNet
> use bottleneck layers
> decreases computation (~10x)
> auxiliary loss layers → regularization
> factorize bigger conv layers

Depthwise Convolution
> each channel is kept separate
> split input & kernels into channels
> convolve each input channel with the corresponding filter channel
> stack the output (2D) tensors back together

MobileNet
> depthwise separable convolutions
> shrinking hyperparameters:
> width multiplier: adjusts # of channels
> resolution multiplier: adjusts input image and feature map resolutions

Channel Shuffling / ShuffleNet
> eliminates the main side effect of grouped convs
> side effect: outputs are derived only from certain channels; shuffle the channels after grouped convolutions
> apply group convolutions also on 1x1 layers
> note: channel shuffling is also differentiable!

11x11 vs 3x3
> bigger filters capture more global information
> smaller filters capture more local information
> AlexNet uses 11x11, 5x5 and 3x3
> VGGNet uses only 3x3 filters
> VGGNet proved the effectiveness of going deeper

Residual Nets
> identity shortcut (residual) connections
> help gradient flow
> skip one or more layers
> deeper architectures work better

ResNeXt
> Inception style in ResNet
> depth concatenation, same convolution topology
> high cardinality helps in decreasing validation error
> new hyper-parameter: cardinality → width size

DenseNet
> connecting all layers to the other layers
> strong gradient flow
> more diversified features
> allowing feature re-use
> more memory hungry, but computationally more efficient

THANK YOU FOR JOINING US TODAY!
Machine Learning Tokyo