LeNet-5
• Architecture:
o Input: 32x32 grayscale image
o Layers:
▪ CONV1: 6 5x5 filters (C1)
▪ POOL1: 2x2 subsampling (S2)
▪ CONV2: 16 5x5 filters (C3)
▪ POOL2: 2x2 subsampling (S4)
▪ CONV3: 120 5x5 filters (C5)
▪ FC1: 84 neurons (F6)
▪ Output: 10 neurons (Output layer)
Total Layers: 7
• Working:
o The first convolutional layer (C1) extracts features such as edges from the input
image.
o The first pooling layer (S2) reduces the spatial dimensions, retaining essential
features.
o The second convolutional layer (C3) extracts more complex features.
o The second pooling layer (S4) further reduces spatial dimensions.
o The third convolutional layer (C5) processes these features.
o Fully connected layers (F6 and Output) classify the image based on the extracted
features.
• Pros:
o Simple architecture, efficient for small datasets like MNIST.
o Demonstrated the power of convolutional networks for digit recognition.
• Cons:
o Limited scalability to more complex tasks.
o Less effective on larger, more diverse datasets.
• Comparison: Foundational model, paved the way for more complex CNN architectures.
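The layer sizes above can be traced with the standard convolution output formula, out = (in − kernel + 2·pad) // stride + 1. A minimal sketch (helper names `conv_out` and `pool_out` are illustrative, not from any library):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution: (size - kernel + 2*pad) // stride + 1."""
    return (size - kernel + 2 * pad) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a pooling layer."""
    return (size - kernel) // stride + 1

size = 32                 # 32x32 grayscale input
size = conv_out(size, 5)  # C1: 6 5x5 filters   -> 28x28x6
size = pool_out(size)     # S2: 2x2 subsampling -> 14x14x6
size = conv_out(size, 5)  # C3: 16 5x5 filters  -> 10x10x16
size = pool_out(size)     # S4: 2x2 subsampling -> 5x5x16
size = conv_out(size, 5)  # C5: 120 5x5 filters -> 1x1x120
print(size)  # 1
```

Note that C5's 5x5 filters exactly cover the 5x5 input, so its 1x1x120 output behaves like a fully connected layer feeding F6.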
AlexNet
• Architecture:
o Input: 227x227x3 RGB image
o Layers:
▪ CONV1: 96 11x11 filters, stride 4
▪ POOL1: 3x3 max pooling, stride 2
▪ NORM1: Local Response Normalization
▪ CONV2: 256 5x5 filters, stride 1, pad 2
▪ POOL2: 3x3 max pooling, stride 2
▪ NORM2: Local Response Normalization
▪ CONV3: 384 3x3 filters, stride 1, pad 1
▪ CONV4: 384 3x3 filters, stride 1, pad 1
▪ CONV5: 256 3x3 filters, stride 1, pad 1
▪ POOL3: 3x3 max pooling, stride 2
▪ FC1: 4096 neurons
▪ FC2: 4096 neurons
▪ Output: 1000 neurons (one per class)
Total Layers: 8
• Working:
o Uses ReLU activations to introduce non-linearity.
o Local Response Normalization helps in generalization.
o The convolutional layers extract hierarchical features from the input image.
o Pooling layers reduce spatial dimensions and computational load.
o Fully connected layers classify the image based on the extracted features.
o Dropout is used in fully connected layers to prevent overfitting.
• Pros:
o Introduced ReLU, dropout, and data augmentation.
o Demonstrated the power of deep learning on large-scale image recognition tasks.
• Cons:
o High computational requirements.
o Large memory usage.
• Comparison: Revolutionized computer vision, basis for subsequent architectures.
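The same output-size formula explains AlexNet's dimensions, including why FC1 receives 6x6x256 = 9216 inputs. A sketch (the `out_size` helper is illustrative):

```python
def out_size(size, kernel, stride, pad=0):
    """Output spatial size: (size - kernel + 2*pad) // stride + 1."""
    return (size - kernel + 2 * pad) // stride + 1

size = 227
size = out_size(size, 11, 4)    # CONV1: 11x11, stride 4 -> 55x55x96
size = out_size(size, 3, 2)     # POOL1: 3x3, stride 2   -> 27x27x96
size = out_size(size, 5, 1, 2)  # CONV2: 5x5, pad 2      -> 27x27x256
size = out_size(size, 3, 2)     # POOL2: 3x3, stride 2   -> 13x13x256
size = out_size(size, 3, 1, 1)  # CONV3: 3x3, pad 1      -> 13x13x384 (CONV4/CONV5 also keep 13x13)
size = out_size(size, 3, 2)     # POOL3: 3x3, stride 2   -> 6x6x256
print(size * size * 256)  # 9216 inputs into FC1
```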
ZFNet
• Architecture:
o Similar to AlexNet with modifications:
▪ CONV1: 96 7x7 filters, stride 2
▪ POOL1: 3x3 max pooling, stride 2
▪ NORM1: Local Response Normalization
▪ CONV2: 256 5x5 filters, stride 2, pad 2
▪ POOL2: 3x3 max pooling, stride 2
▪ NORM2: Local Response Normalization
▪ CONV3: 384 3x3 filters, stride 1, pad 1
▪ CONV4: 384 3x3 filters, stride 1, pad 1
▪ CONV5: 256 3x3 filters, stride 1, pad 1
▪ POOL3: 3x3 max pooling, stride 2
▪ FC1: 4096 neurons
▪ FC2: 4096 neurons
▪ Output: 1000 neurons (one per class)
Total Layers: 8
• Working:
o Refinement of AlexNet, used to visualize intermediate layers.
o Employed Deconvolutional Network (DeconvNet) to project features back to
pixel space.
o Provides better understanding and visualization of learned features.
• Pros:
o Improved accuracy and feature visualization.
o Provided insights into inner workings of CNNs.
• Cons:
o Increased complexity and computational demand.
• Comparison: Refinement of AlexNet, slight performance improvement, better feature
visualization.
VGGNet
• Architecture:
o Input: 224x224x3 RGB image
o Layers:
▪ Stack of CONV layers: Only 3x3 filters with stride 1 and pad 1.
▪ Pooling layers: 2x2 max pooling with stride 2.
▪ Example: CONV3-64, POOL2, CONV3-128, POOL2, etc.
▪ Final layers: FC layers with 4096 neurons each, followed by a softmax
output layer.
Total Layers: 16 (13 convolutional layers + 3 fully connected layers)
• Working:
o Consistent use of small filters (3x3) throughout the network.
o Increased depth to improve learning capacity.
o Max pooling layers reduce spatial dimensions while retaining important features.
o Fully connected layers perform classification.
• Pros:
o Uniform design, deeper networks with smaller filters.
o Improved accuracy on large datasets.
• Cons:
o Large number of parameters, high computational cost.
o Memory-intensive, especially in fully connected layers.
• Comparison: Outperformed earlier networks, deeper and more uniform structure,
influenced future architectures.
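VGGNet's preference for stacked 3x3 filters can be checked by counting weights: two stacked 3x3 convolutions have the same 5x5 receptive field as one 5x5 convolution but fewer parameters (and an extra non-linearity between them). A quick calculation, assuming 64 channels in and out and ignoring biases:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a kxk convolution layer (biases ignored)."""
    return k * k * c_in * c_out

C = 64
one_5x5 = conv_params(5, C, C)      # 5*5*64*64  = 102400 weights
two_3x3 = 2 * conv_params(3, C, C)  # 2*9*64*64  =  73728 weights
print(one_5x5, two_3x3)  # same receptive field, ~28% fewer weights
```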
GoogLeNet (Inception)
• Architecture:
o Input: 224x224x3 RGB image
o Inception Modules: Combination of 1x1, 3x3, and 5x5 convolutions, along with
max pooling, concatenated along the depth dimension.
o Output: 1000 neurons (one per class)
Total Layers: 22 (including 9 inception modules, each with multiple layers)
• Working:
o Utilizes multiple filter sizes in parallel to capture different features.
o Inception modules enable a deeper network with fewer parameters.
o Removes fully connected layers, reducing the number of parameters.
o Incorporates auxiliary classifiers to improve gradient flow.
• Pros:
o Efficient architecture with reduced parameters.
o High accuracy with less computational cost.
• Cons:
o Complex architecture, challenging to tune.
o Inception modules require careful design.
• Comparison: Won ILSVRC 2014, introduced inception modules, significantly fewer
parameters than VGGNet, more efficient.
ResNet
• Architecture:
o Input: 224x224x3 RGB image
o Residual Blocks: Identity shortcut connections to skip layers.
o Example: CONV1, CONV2_x, CONV3_x, ..., FC (1000 neurons)
o Residual block: two 3x3 convolution layers with an identity shortcut (deeper variants such as ResNet-50 use a 1x1-3x3-1x1 bottleneck block instead).
Total Layers: 50 (includes multiple residual blocks)
• Working:
o Enables training of very deep networks by using identity shortcuts to bypass one
or more layers.
o Mitigates vanishing gradient problem by providing a direct path for gradients.
o Allows deeper networks to be trained effectively.
• Pros:
o Improved accuracy, easier training of deep networks.
o Faster convergence, better performance on large-scale tasks.
• Cons:
o Increased complexity, more computations required.
• Comparison: Won ILSVRC 2015, set new benchmarks in deep learning, foundational
for extremely deep networks.
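The core idea of the residual block is y = F(x) + x: the layers learn a residual F rather than the full mapping, and if F is zero the block reduces to the identity, so added depth cannot hurt in principle. A toy sketch in plain Python (scalar lists stand in for feature maps; `zero_f` is a hypothetical residual function, not real conv layers):

```python
def residual_block(x, f):
    """y = ReLU(F(x) + x): add the identity shortcut, then activate."""
    fx = f(x)  # F(x): stands in for the block's stacked conv layers
    return [max(0.0, a + b) for a, b in zip(fx, x)]

# A degenerate F that has learned nothing: the block passes x through (up to ReLU).
zero_f = lambda v: [0.0] * len(v)
x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_f))  # [1.0, 0.0, 3.0]
```

During backpropagation the shortcut also gives gradients a direct additive path around F, which is what mitigates the vanishing-gradient problem.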
These models represent significant advancements in the field of deep learning and computer
vision, each building upon the strengths and addressing the limitations of its predecessors.
Inception Modules of GoogLeNet
Inception modules are a core component of the GoogLeNet architecture, designed to capture a
wide range of features by using multiple filter sizes and pooling operations in parallel. Here's a
detailed breakdown:
Structure of an Inception Module
An Inception module consists of four parallel paths:
1. 1x1 Convolution Path:
o Reduces the dimensionality and computational cost by decreasing the depth of the
input volume.
o Acts as a bottleneck layer that helps in reducing the computational complexity.
2. 1x1 Convolution followed by 3x3 Convolution Path:
o The 1x1 convolution reduces the depth of the input volume, thus reducing
computational cost.
o The 3x3 convolution extracts medium-sized features from the input.
3. 1x1 Convolution followed by 5x5 Convolution Path:
o Similar to the previous path, the 1x1 convolution reduces the depth of the input.
o The 5x5 convolution extracts larger features from the input.
4. 3x3 Max Pooling followed by 1x1 Convolution Path:
o The 3x3 max pooling reduces the spatial dimensions while retaining the most important
features.
o The 1x1 convolution after pooling reduces the depth of the pooled feature maps.
Example of an Inception Module
Let's consider an example where the input to the Inception module is a feature map of size 28×28×192:
1. 1x1 Convolution Path:
o 1×1 conv, 64 filters
o Output: 28×28×64
2. 1x1 Convolution followed by 3x3 Convolution Path:
o 1×1 conv, 96 filters
o 3×3 conv, 128 filters
o Output: 28×28×128
3. 1x1 Convolution followed by 5x5 Convolution Path:
o 1×1 conv, 16 filters
o 5×5 conv, 32 filters
o Output: 28×28×32
4. 3x3 Max Pooling followed by 1x1 Convolution Path:
o 3×3 max pool, stride 1 (with padding, so the 28×28 spatial size is preserved)
o 1×1 conv, 32 filters
o Output: 28×28×32
Concatenation of Paths
The outputs from the four paths are concatenated along the depth dimension to form the final
output of the Inception module:
• Final output size: 28×28×(64 + 128 + 32 + 32) = 28×28×256
Benefits of Inception Modules
1. Parallel Convolutions: By using different filter sizes in parallel, the network can capture features
at multiple scales, improving the representational power.
2. Dimensionality Reduction: 1x1 convolutions help in reducing the computational complexity by
lowering the depth of the input volume before applying larger convolutions.
3. Efficiency: The module achieves a balance between computational efficiency and the ability to
capture a wide range of features.
Comparison to Traditional Convolutions
• Traditional convolutions typically use a single filter size at each layer, which may not capture all
relevant features.
• Inception modules, by contrast, use multiple filter sizes simultaneously, providing a richer and
more diverse feature representation.
Example: GoogLeNet Inception Module Design
The actual design of an Inception module in GoogLeNet can vary in terms of the number of
filters used in each path, but the core concept remains the same: multiple convolutions and
pooling operations in parallel, followed by concatenation.
Conclusion
Inception modules are a key innovation of the GoogLeNet architecture, enabling it to achieve
high performance with fewer parameters compared to earlier models like AlexNet and VGGNet.
This design philosophy has influenced many subsequent architectures, emphasizing the
importance of efficient feature extraction and multi-scale processing.