
Image classification

1
Image classification
Contents

1. The importance of image classification


2. Image classification
3. Supervised learning
4. Decision-theoretic classification
5. Classification based on local features
6. The bag of visual words approach
7. The deep learning approach

2
The importance of image classification

3
Image classification
Definition
■ Image classification is the task of using computer vision and machine learning algorithms to extract meaning from
an image
■ This action could be as simple as assigning a label to what the image contains, or as advanced as interpreting the
contents of an image and returning a human-readable sentence

“dog”

“a dog on the grass next to some flowers”

■ Image classification, at its very core, is the task of assigning a label to an image from a predefined set of
categories.

Image classifier → dog: 95%; cat: 4%; panda: 1%

categories = {“dog”, “cat”, “panda”}

4
Image classification
The semantic gap
■ The semantic gap is the difference between how humans perceive an image and how that image is represented in
a computer

5
Image classification
The semantic gap
■ To bridge the semantic gap, we must somehow extract features that describe the image contents so that the
computer gets more information than raw pixel values.
■ Feature extraction is the process of taking an input image, applying an algorithm, and obtaining a feature vector
that quantifies some aspect of the image (its spatial distribution, its color, its texture …).
■ By means of feature extraction, we convert a WxH image into an N-dimensional feature vector, regardless of the
input image size:

W×H image → Feature extraction → N-dimensional feature vector

6
Image classification
The semantic gap
■ Question: what kind of features would you use to design an image classification system that classifies these three
characters?

7
Image classification
The semantic gap
■ Question: what kind of features would you use to design an image classification system that classifies these three
kinds of animals?

8
Image classification
The semantic gap
■ In “classic” computer vision, features are “hand-engineered”, that is, they are designed to capture specific traits of
the image. Examples of features include Local binary patterns (LBP, typically used for texture description), or
Histograms of Oriented Gradients (HOG, which represents the directions of the gradients across the image)

Image → Feature extraction → Supervised classifier → “dog”

■ In “modern” computer vision, based on deep learning, features are learned automatically by a neural network, which
also performs the job of assigning labels.

Image → Deep neural network → “dog”

9
Image classification
Challenges

■ Besides the semantic gap, image classification systems must face several challenges

10
Image classification
Supervised learning

■ Supervised machine learning: we need a set of labeled training data containing multiple examples of each class of
objects we want to recognize, so that we can build a model (the “classifier”) that learns the differences between
each category


(figure: labeled training examples, several images annotated as “dog”, several as “cat” and several as “panda”)

■ Therefore, supervised machine learning requires having annotated data which is representative of the categories
we want to be able to recognize.

11
Image classification
Supervised learning

■ Supervised machine learning requires splitting the available data into training and test subsets
■ The training set is used by our classifier to “learn” what each category looks like, by making predictions on the
input data and correcting itself when predictions are wrong.
■ The test set is used to evaluate the performance of the classifier on new, unseen data.

■ Typical training and test splits are 75%-25%, 80%-20%, 67%-33%, …

■ The training and test subsets must be independent (they must not overlap) and should contain the same proportion of each category
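■ A minimal sketch of such a split, assuming scikit-learn and toy feature vectors (split ratio and class names are only illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 300 images described by 16-dimensional feature vectors, 3 categories
X = np.random.rand(300, 16)
y = np.repeat(["dog", "cat", "panda"], 100)

# 75%-25% split; stratify=y keeps the same proportion of each category in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)  # (225, 16) (75, 16)
```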

12
Image classification
Supervised learning

■ A typical strategy for validating classifiers is cross-validation.


■ The goal of cross-validation is to assess how the performance of a classifier will generalize to new data, estimating
how accurately the classifier will perform in practice.
■ One round of cross-validation involves partitioning the available annotated data into complementary training and
test subsets.
■ To reduce variability, in most methods multiple rounds of cross-validation are performed using different training/test
partitions, and the validation results are averaged over the rounds to estimate the classifier performance.
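■ A minimal sketch of k-fold cross-validation, assuming scikit-learn and toy data (the classifier and the number of folds are only illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(300, 16)        # toy feature vectors
y = np.repeat([0, 1, 2], 100)      # toy labels for 3 categories

# 5 rounds: each round trains on 4/5 of the data and tests on the remaining 1/5
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=cv)

print(scores.mean(), scores.std()) # averaged estimate of the classifier performance
```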

13
Image classification
Supervised learning

■ The performance of a classifier can be measured using different metrics.


■ Most of these metrics can be defined upon the confusion matrix of the classifier.
■ On the confusion matrix, we use the terms:
■ positive and negative refer to the classifier's prediction for a specific class
■ true and false refer to whether that prediction corresponds to the true class of the objects
■ Then, we define:
■ True positives (TP) are the cases when the predicted class is positive and that matches the true class of the
object → match
■ False positives (FP) are the cases when the predicted class of the object is positive but the true class is
negative → mistake
■ False negatives (FN) are the cases when the predicted class of the object is negative but the true class is
positive → mistake
■ True Negatives (TN) are the cases when the predicted class of the object is negative and that matches the true
class of the object → match

14
Image classification
Supervised learning

■ Based on these ideas, we define the following classification performance metrics for each category:
■ Accuracy: number of correct predictions made by the classifier over all predictions made.

■ Precision: number of correct predictions made by the classifier over all positive predictions made.

■ Recall: number of correct predictions made by the classifier over all possible positive cases.
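■ In terms of the TP, FP, FN and TN counts defined on the previous slide, these per-category metrics are computed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)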

15
Image classification
Supervised learning

■ Accuracy, precision and recall are computed on each category, and an averaged value over all categories is given
as a final performance metric of the classifier.
■ Often, precision and recall are combined in a single metric known as F1-measure:

F1 = 2 · (precision · recall) / (precision + recall)

■ What is a good performance for a classifier? It depends on the classification task, on the number of categories, and
many other factors. We should take into account that, in a classification task with N balanced categories, a totally
random classifier would attain an accuracy of about 100/N (%).
■ As a reference, the latest deep learning classifiers attain accuracies higher than 90% in image classification tasks
comprising 1000 categories.

16
Image classification
Supervised learning

■ When we train a supervised classifier on the training set, and then evaluate it on the test set, we often find that it
performs correctly.
■ But it happens sometimes that when we apply the classifier on images outside both the training and testing sets, it
performs poorly.
■ This is a problem of generalization: the ability of a classifier to correctly predict the class label of
an image that was not part of its training or testing data.
■ If a classifier does not generalize properly, we should ask ourselves whether the training dataset contains sufficient
variability (viewpoint variation, intra-class variation, etc.) for the classifier to learn properly. If not, we will need to gather
more training data and/or try a different supervised classification algorithm.

■ There exist many supervised classification algorithms, based on different machine learning paradigms (neural
networks, support vector machines, Bayesian classifiers, …).
■ We will introduce next a very simple classifier based on the paradigm of decision theory.

17
Image classification
Decision-theoretic classification: the k-NN classifier

■ Imagine we want to classify an image represented by the feature vector x = (x1, x2, …, xn) into one of W classes.
■ Each image in the training dataset is also represented by an n-dimensional vector.
■ The k-nearest neighbor (k-NN) classifier is a simplified version of the matching-based decision-theoretic classification
■ The image we want to classify is compared against all the images in the training database, by computing the
distance between the vectors that represent them
■ The image is assigned to the majority class among the k nearest neighbors

(figure: training data, and training data plus a test image; with k=1 the test image is assigned to the red class, with k=3 to the green class)
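■ A minimal sketch of a k-NN classifier, assuming scikit-learn and toy 3-dimensional feature vectors (class names and dataset size are only illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: 90 feature vectors (3-D) with their class labels
X_train = np.random.rand(90, 3)
y_train = np.repeat(["red", "green", "blue"], 30)

# "Training" a k-NN classifier only stores the data; nothing is actually learned
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# Classifying a new image: compute its distance to every training vector
# and take a majority vote among the k nearest neighbors
x_test = np.random.rand(1, 3)
print(knn.predict(x_test))
```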

18
Image classification
Decision-theoretic classification: the k-NN classifier

■ Pros and cons of the k-NN classifier:


■ Pros
■ The k-NN algorithm is extremely simple to implement and understand
■ It takes no time to train (in fact, it does not learn anything)
■ Cons
■ Classifying a new test object requires a comparison to every single object in the training set, which makes working with
larger datasets computationally prohibitive (although there exist optimized versions of the k-NN classifier)
■ The k-NN classifier is better suited to low-dimensional data (and images are high dimensional)

19
Practical exercise
Exercise 1 – k-NN image classification using color descriptors

20
Image classification
Limitations of our current approach

■ The implementation of the classifier in the previous exercise is limited to color classification. It is a
global descriptor that characterizes the image color as a whole.
■ In other words, our classifier would be unable to distinguish a green BMW from Kermit the frog

■ Therefore, we need a strategy to recognize that the object appearing in the image is a car, not a frog nor
anything else.
■ We will study two approaches to perform object classification:
■ Local features and Bag-of-visual-words (classic computer vision)
■ Deep neural networks (modern computer vision)

21
Image classification
Global vs local descriptors

■ Why do local descriptors work?

22
Image classification
Classification based on local features

■ Local features make it possible to distinguish objects from different categories

■ How are local features found?


■ Extracting keypoints
■ Creating descriptors of the visual appearance of the vicinity of each keypoint

23
Image classification
Classification based on local features

■ Keypoint extraction
■ Relevant visual information is in image regions where transitions occur

(figure: uniform areas → low relevance; areas with transitions → high relevance)

■ The goal is to detect points in the image located along edges, on corners, on dark blobs surrounded by light
areas, etc.
■ There exist many approaches to keypoint detection:
■ Corner detection: Harris, FAST, minimum eigenvalue …
■ Blobs: SIFT, SURF …

Harris (top-50) SURF (top-50)
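■ A minimal keypoint-detection sketch, assuming OpenCV and a synthetic test image (parameter values are only illustrative; SURF is patented and often unavailable in standard builds, so SIFT is used for the blob-like detector):

```python
import cv2
import numpy as np

# Synthetic grayscale image: a bright square on a dark background (corners at its vertices)
gray = np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(gray, (60, 60), (140, 140), 255, -1)

# Harris corner response: high values near corner-like transitions
harris = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
print(np.count_nonzero(harris > 0.01 * harris.max()))  # strong corner responses

# SIFT keypoints (blob-like, scale invariant)
sift = cv2.SIFT_create(nfeatures=50)
keypoints = sift.detect(gray, None)
print(len(keypoints))
```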

24
Image classification
Classification based on local features

■ Feature description
■ The goal is to describe the visual appearance of the pixels surrounding the keypoints, thus providing local visual
appearance information of the object
■ A very typical approach is to encode local shape information using a descriptor called histogram of oriented
gradients (HOG)

25
Image classification
Classification based on local features

■ Feature description
■ Histogram of oriented gradients (HOG) in a nutshell:
■ Set a region of interest (ROI) of predefined size around each keypoint (e.g. 24x24, 32x32 pixels …)
■ Divide the ROI into a predefined number of cells (e.g. 4x4)
■ Inside each cell, compute the magnitude and orientation of the gradient of each pixel
■ Quantize the orientation of the gradient into 8 directions and create an 8-dimensional histogram that contains, in each bin,
the sum of the magnitudes of the gradients that point in that direction. This allows finding the dominant orientation of the
keypoint
■ Concatenate the histograms of the 4x4=16 cells to obtain the HOG descriptor of the keypoint, which would be a 16x8=128-
dimensional vector

■ Variations to this general description: ROI size, number of cells, overlap between cells, number of quantized gradient
directions …
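■ A minimal sketch of the recipe above, assuming scikit-image and a random 32x32 ROI (the cell layout matches the 4x4 cells / 8 orientations example):

```python
import numpy as np
from skimage.feature import hog

# Hypothetical 32x32 grayscale ROI extracted around a keypoint
roi = np.random.rand(32, 32)

# 4x4 cells of 8x8 pixels, 8 orientation bins per cell -> 16 x 8 = 128-dimensional descriptor
descriptor = hog(roi,
                 orientations=8,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(1, 1))  # no overlap between cells in this simple variant
print(descriptor.shape)                   # (128,)
```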

26
Image classification
Classification based on local features

■ It is highly likely that, in images containing the same type of object:


■ Similar keypoints will be extracted
■ Similar descriptors will be obtained

28
Image classification
Classification based on local features

■ A problem: if the object appears with different sizes in the images, different keypoints will be detected (even on the
same image if rescaled)

scale x0.25

scale x1

■ This fact requires performing keypoint extraction at different image scales by means of scale invariant keypoint
detection methods (e.g. SIFT, SURF), but this topic lies beyond the scope of this course.

29
Image classification
The bag of visual words approach

■ The main problem of classification based on local features is that each image is represented with a
descriptor of a different size (one descriptor per keypoint, and the number of keypoints varies across images)
■ Moreover, this approach is not very robust, as it is highly sensitive to variations in the object appearance
■ Furthermore, classification algorithms are more efficient when all objects are represented with
descriptors of fixed length, which among other issues, simplifies computing distances between images
■ The solution to this problem is the so-called bag of visual words approach to image classification

30
Image classification
The bag of visual words approach

■ The process generates a histogram of visual word occurrences that represents an image. These
histograms are used to train an image category classifier.
■ The idea is borrowed from natural language processing, in particular from text classification, where a
vocabulary of words is built:

Vocabulary = {football, goal, NBA, Messi, …, unemployment, growth, crisis, taxes, …}
(the first terms are indicative of “sports” documents; the last ones, of “economy” documents)

■ Each document is then represented as a histogram of word counts, and these histograms are used to
classify documents into categories


■ Since images do not actually contain discrete words, we need to construct a “vocabulary” of
representative “visual words” that allow classifying the objects that appear in the image

31
Image classification
The bag of visual words approach

■ This vocabulary of visual words will be built from local features that are representative of the categories
we want to classify

… … … …

■ Notice that we only consider which visual words appear in the image; their spatial location is not
considered

32
Image classification
The bag of visual words approach

■ Another issue worth mentioning is that we must take into account image variability when building our
vocabulary

?
… …

■ This means that we will have to extract local features from many images representative of each category,
and use them to build the vocabulary.

33
Image classification
The bag of visual words approach

■ Create the bag of features (or visual vocabulary) by extracting feature descriptors from representative images of each
category.
■ Extract keypoints from the image: this can be done with a feature detector (e.g. Harris corner detector) or by a
grid approach
■ Obtain a feature descriptor for each keypoint or grid element (e.g. HOG – histogram of oriented gradients)
■ Cluster feature descriptors to obtain groups of similar descriptors
■ Use the centroid of each cluster to obtain a vocabulary of visual words describing the whole dataset

34
Image classification
The bag of visual words approach

■ The clustering step: recall that in images containing the same type of object, similar keypoints will be extracted and
similar descriptors will be obtained

■ Once the images of all categories have gone through the descriptor computation process, we apply a clustering
algorithm to form groups of similar descriptors.

35
Image classification
The bag of visual words approach

■ Clustering algorithms create groups (aka clusters) of objects according to their similarity
■ In our case, objects are feature descriptors, and clusters will end up yielding visual words
■ The most typical clustering algorithm is k-means: it obtains compact clusters, operates on objects represented as
numerical attributes, and we must select the desired number of clusters K (i.e. size of the vocabulary)
■ How k-means works
■ Pick K descriptors, either randomly or based on some heuristic
■ Each one of these K descriptors is considered to be the representative of a visual word (or cluster centroid)
■ Assign each descriptor in the image set to the cluster that minimizes the distance to its cluster centroid
■ Re-compute the cluster centroids by averaging all the descriptors that have been assigned to each of them
■ Repeat the two previous steps until convergence (i.e. until no descriptors change cluster)
■ The centroid of each cluster is considered as a visual word of our vocabulary

■ To obtain discriminant vocabularies, the value of K tends to be high (∼thousands)
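■ A minimal vocabulary-building sketch, assuming scikit-learn and random descriptors in place of real HOG/SIFT features (the vocabulary size K is only illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stack of local descriptors (e.g. 128-D) extracted from many training images
descriptors = np.random.rand(10000, 128)

K = 1000  # vocabulary size; in practice it tends to be in the thousands
kmeans = MiniBatchKMeans(n_clusters=K, random_state=0).fit(descriptors)

vocabulary = kmeans.cluster_centers_  # each cluster centroid is one visual word
print(vocabulary.shape)               # (1000, 128)
```

MiniBatchKMeans is used here only because it scales better to large K; plain k-means produces the same kind of vocabulary.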

36
Image classification
The bag of visual words approach

■ Once the visual vocabulary has been built:


■ For each training image, detect and extract features from the image and then use the approximate nearest
neighbor algorithm to construct a feature histogram for each image
■ Histogram bin values are increased based on the proximity of the descriptor to a particular cluster center.
■ The histogram length corresponds to the number of visual words that were created in the previous step.
■ The histogram becomes a feature vector for the image

■ In its simplest form, the feature histogram is a visual word count. A more elaborate version uses the tf·idf
weighting scheme (term frequency x inverse document frequency)
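■ A minimal sketch of the histogram construction, reusing the kmeans vocabulary from the previous sketch (exact nearest-word assignment is used here instead of an approximate nearest neighbor search):

```python
import numpy as np

def bovw_histogram(image_descriptors, kmeans):
    """Quantize the local descriptors of one image against the visual vocabulary
    and return a normalized histogram of visual word counts."""
    words = kmeans.predict(image_descriptors)  # nearest visual word for each descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()                   # fixed-length feature vector for the image

# Example: 350 descriptors extracted from one image -> one K-dimensional histogram
hist = bovw_histogram(np.random.rand(350, 128), kmeans)
print(hist.shape)
```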

37
Image classification
The bag of visual words approach

■ Steps of the training process:


■ Repeat the previous step for each image in the training set to create training data for the classifier

38
Image classification
The bag of visual words approach

■ To classify new, unseen images:


■ For each test image, detect and extract features from the image and then use the approximate nearest neighbor
algorithm to construct a feature histogram for the new image
■ Feed this feature histogram to the classifier for it to predict the corresponding category

39
Image classification
The bag of visual words approach

■ Summarizing
■ The bag of visual words model is more robust than using local features directly for image classification, as it
relies on a vocabulary of visual words
■ The construction of the vocabulary requires a specific image set, which must be representative of the image
categories we want to recognize later
■ It allows representing each image with a fixed-size histogram of visual word occurrences

40
Image classification
Example of bag of visual words classification

41
Practical exercise
Exercise 2 – Image classification using bag of words and SVM classifier

42
Image classification
The deep learning approach

■ Since 2012, deep neural networks have caused a paradigm shift in computer vision:

Classic pipeline: Image → Feature extraction → N-dimensional descriptor → Classifier → “dog”

Deep learning pipeline: Image → Deep neural network → “dog”
■ But why?

43
Image classification
The deep learning approach

■ The computer vision research community organized an annual competition called “ImageNet Large
Scale Visual Recognition Challenge”
■ The ILSVRC challenge consisted of comparing the performance of different proposals for solving an
image classification problem involving more than a million images and 1,000 categories.

images of class “hammer”

https://www.image-net.org/challenges/LSVRC/

44
Image classification
The deep learning approach

■ In 2012, a convolutional neural network (CNN) called AlexNet outperformed all competitors in the
ImageNet 2012 Challenge, which consisted in classifying more than a million images into 1,000
categories, obtaining an 84% (top-5) accuracy (the best competitors, based on classical approaches,
achieved accuracies around 75%)

Alex Krizhevsky

45
Image classification
What are neural networks?

■ General architecture of artificial neural networks:


(diagram: neurons in consecutive layers connected by weights w_ij^(l), e.g. w_11^(1), w_11^(2), w_11^(3), …)

■ Elements:
■ Input layer: data
■ Hidden layers (>2: deep, >10: very deep)
■ Output layer: classification
■ Weights

46
Image classification
The Perceptron network

■ Let’s study the simplest neural network


■ The Perceptron network: a linear two-class classifier
■ It is a neural network with a single layer of trainable weights (no hidden layers)
■ For each input, we have an associated weight
■ We compute the weighted sum of inputs (the input vector x is extended with a constant 1, so w4 acts as a bias):
net = w1 · x1 + w2 · x2 + w3 · x3 + w4 · 1 = W · xᵀ

■ And then we apply an activation function, like the step function, whose output (1 if net > 0, 0 otherwise)
is the label assigned at classification time
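■ A minimal numerical sketch of this forward pass, assuming toy weights and inputs (numpy only):

```python
import numpy as np

def step(net):
    """Step activation: 1 if the weighted sum is positive, 0 otherwise."""
    return 1 if net > 0 else 0

W = np.array([0.5, -0.2, 0.1, -0.3])  # weights w1..w3 plus the bias weight w4
x = np.array([1.0, 2.0, 3.0, 1.0])    # feature vector extended with a constant 1

net = np.dot(W, x)                    # net = w1*x1 + w2*x2 + w3*x3 + w4*1
print(step(net))                      # label assigned at classification time (0 or 1)
```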

47
Image classification
The Perceptron network

■ However, the step function is not differentiable, which causes problems when applying gradient
descent to train our network
■ For this reason, continuous and differentiable activation functions are preferred, like:

Sigmoid: 1 / (1 + e⁻ˣ)          Hyperbolic tangent: tanh(x)

ReLU: f(x) = 0 if x < 0; x if x ≥ 0          Leaky ReLU: f(x) = αx if x < 0; x if x ≥ 0

48
Image classification
The Perceptron network

■ The most widely used algorithm to train the Perceptron network is the backpropagation algorithm
■ This algorithm goes like this:
■ The network weights are randomly initialized
■ The backpropagation process consists of iteratively executing:
– The forward pass: the training data is propagated through the network and the output predictions are obtained
– The backward pass: we update the network weights to minimize the classification error at the output of the network.

■ The forward and backward passes are repeated over the whole training dataset. Each one of these
cycles is called an epoch
■ When the predefined number of epochs is achieved, or when a sufficiently low classification error is
obtained on the training data, the training process concludes
■ The weights of the network are “frozen”, and it now can be used to classify new, unseen test data

49
Image classification
The Perceptron network

■ During the backward pass, we compute the classification error on the training data, also known as
the loss function, and try to find its minimum using gradient descent

■ Example of loss minimization:

50
Image classification
From neural networks to deep neural networks

■ The linear nature of the Perceptron network limits its practical applicability to linearly separable
problems

■ For this reason, multilayer Perceptron networks were the first step in the evolution towards deep
neural networks

Perceptron network with 2 hidden layers

51
Image classification
From neural networks to deep neural networks

■ More complex neural networks (with more hidden layers) can solve increasingly complex classification
problems
■ Example: a sample of the MNIST dataset
■ Data: 8x8 images, ∼1800 images
■ Network architecture: 64-32-16-10
precision recall f1-score
0 1.00 1.00 1.00
1 0.98 1.00 0.99
2 0.98 1.00 0.99
3 0.98 0.93 0.95
4 0.95 1.00 0.97
5 0.94 0.97 0.96
6 1.00 1.00 1.00
7 1.00 1.00 1.00
8 0.97 0.95 0.96
9 1.00 0.96 0.98
avg / total 0.98 0.98 0.98

52
Image classification
From neural networks to deep neural networks

■ However, when the classification problem becomes more complex, multilayer perceptrons start showing
their limitations
■ Example #1:
■ The full MNIST dataset
■ 28x28 images, 70000 images (7000 per class)
■ Network architecture: 784-256-128-10

Perfect learning!

53
Image classification
From neural networks to deep neural networks

■ Example #2:
■ The CIFAR-10 dataset

■ 32x32 RGB images, 60000 images (6000 per class)


■ Network architecture: 3072-1024-512-10

Overfitting!

54
Image classification
From neural networks to deep neural networks
■ Conclusions:
■ Standard neural networks (multilayer perceptrons) can solve relatively simple classification problems
■ However, they fail to obtain high classification accuracy on challenging image datasets that have variations in object
appearance:
■ Translation
■ Rotation
■ Viewpoint
■ Intra-class variation

■ An additional limitation of standard neural networks is that the size of the input layer depends on the image resolution:
■ MNIST: 28x28 pixels → 784 nodes
■ CIFAR-10: 32x32x3 pixels → 3072 nodes
■ If we wanted to work with higher resolution input images, the size of the network would increase dramatically!
■ For 250x250 RGB images → 187500 nodes and the corresponding weights, as layers are fully connected

■ To overcome these limitations, a special type of neural networks emerged: convolutional neural networks (CNN)

55
Image classification
Convolutional neural networks

■ The main difference between CNN and standard neural networks is that they have different types of
layers
■ In standard NN, each neuron of a layer is connected to each neuron in the following layer by a weight: fully-
connected (FC) layers
■ In CNN:
■ we introduce convolutional layers instead of FC layers
■ we only use FC layers at the very end of the network, to obtain class probabilities
■ we introduce other types of layers, like pooling (to reduce the height and width of the data that propagates along the
network), non-linear activations (like reLU), drop-out (to reduce overfitting), etc.
(figure: a CNN architecture with a 224x224x3 input volume)

56
Image classification
Convolutional neural networks

■ The main characteristic of CNN is that they apply convolutions at each convolutional layer
■ Convolution: image * filter
(figure: convolution of a 224x224x3 input volume with a filter)

57
Image classification
Convolutional neural networks

■ The main characteristic of CNN is that they apply convolutions at each convolutional layer
■ We already know convolutions for:
■ Image denoising: averaging, Gaussian filters
K[m, n] = [1⁄9 1⁄9 1⁄9; 1⁄9 1⁄9 1⁄9; 1⁄9 1⁄9 1⁄9]   (uniform smoothing kernel)

■ Edge detection: Prewitt, Sobel filters

K[m, n] = [1 0 −1; 1 0 −1; 1 0 −1]   (horizontal derivative kernel)

■ Notice that we predefine the filter weights to accomplish a goal/extract a specific feature
■ In fact, the values of the kernel somehow “describe” the local feature we want to detect in the image
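■ A minimal sketch of this idea, assuming SciPy and a toy image with a single vertical transition:

```python
import numpy as np
from scipy.signal import convolve2d

# Horizontal derivative kernel: responds strongly near vertical transitions
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

# Toy image: a dark region (0) next to a bright region (1) -> one vertical edge
image = np.zeros((5, 6))
image[:, 3:] = 1.0

response = convolve2d(image, K, mode="same", boundary="symm")
print(np.abs(response))  # high magnitudes around the transition, ~0 on uniform areas
```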

58
Image classification
Convolutional neural networks

■ For example:

K[m, n] = [1 0 −1; 1 0 −1; 1 0 −1]   (horizontal derivative kernel)

59
Image classification
Convolutional neural networks

■ For example:

K[m, n] = [1 0 −1; 1 0 −1; 1 0 −1]   (horizontal derivative kernel)

60
Image classification
Convolutional neural networks

■ For example:

K[m, n] = [1 0 −1; 1 0 −1; 1 0 −1]   (horizontal derivative kernel)

61
Image classification
Convolutional neural networks

■ For example:

K[m, n] = [1 0 −1; 1 0 −1; 1 0 −1]   (horizontal derivative kernel)

■ The horizontal derivative kernel obtains a high response in those pixels located near a vertical transition (local
feature), and a low response in pixels located on uniform areas

62
Image classification
Convolutional neural networks

■ Visualizing kernels as images helps us understand what they detect:

63
Image classification
Convolutional neural networks

■ Thus, analyzing the weights of a kernel (and visualizing them as an image), we can infer what type of image local
features it will detect
■ In classic image processing and computer vision, these kernels are predefined to detect specific types of features
■ The paradigm shift caused by CNN lies in the fact that
■ CNN apply hundreds of kernels on the images
■ The weights of these kernels are learned to extract the optimal features from the images to minimize the
classification loss function

64
Image classification
Convolutional neural networks

■ What type of features do the filters in the successive convolutional layers learn to extract? For example:
■ Detect edges from raw pixel data in the first layer
■ Use these edges to detect shapes (i.e., “blobs”) in the second layer
■ Use these shapes to detect higher-level features such as facial structures, parts of a car, etc. in the highest
layers of the network

■ This is called hierarchical feature learning from data

65
Image classification
Convolutional neural networks

■ More interestingly, depending on the data we use to train the CNN, different types of features will be
extracted by the network:

■ This means that the extracted features are optimal for the specific classification task we work on

66
Image classification
Convolutional neural networks
■ The most common types of layers used to build a CNN are:
■ Convolutional
■ Activation
■ Pooling
■ Fully-connected
■ Batch normalization
■ Dropout

■ Only convolutional and fully-connected layers contain weights that need to be learned during training
■ The layers are organized as 3D volumes: width x height x depth
■ To work with CIFAR-10, the input layer would be 32x32x3
■ The layers are not fully connected: neurons in subsequent layers are connected to a small region of the preceding
layer → this reduces the number of weights
■ The output layer is a 1x1xN volume containing class scores, where N is the number of categories

67
Image classification
Convolutional neural networks

■ Convolutional layers

■ Each kernel (of size 3x3, 5x5, etc.) is convolved with all the channels of the input volume
■ Each convolution produces a 2D “activation map” (also called “feature map”)

■ The size (width and height) of the activation maps depends on the kernel size, the input volume width and
height, and the step size (known as stride) used when computing the convolution

68
Image classification
Convolutional neural networks

■ Convolutional layers
■ AlexNet: the input volume is 224x224x3
■ It applies 96 11x11 kernels with stride 4
■ This generates a 55x55 feature map for each kernel (224/4 ≅ 55)

69
Image classification
Convolutional neural networks

■ Convolutional layers

70
Image classification
Convolutional neural networks

■ Activation layers
■ After each convolutional layer comes an activation layer
■ An activation layer takes a volume at its input, and applies an activation function (sigmoid, tanh, reLU…) to
each of its values:

■ By doing so, the network only keeps the highest responses of the filters in the previous convolution layer, thus
focusing on “important” features

■ For example, in AlexNet, the 96 feature maps obtained after the convolution in the first layer go through a reLU
activation function before entering the second layer

71
Image classification
Convolutional neural networks

■ Pooling layers
■ Their goal is to reduce the size of the volumes along the network

■ Remember that this size reduction can also be achieved by the convolutional layers, if we use a stride >1
■ For this reason, some authors argue that pooling layers could be replaced by convolutional layers

72
Image classification
Convolutional neural networks

■ Pooling layers
■ AlexNet (input volume 224x224x3):
■ It applies 256 5x5 kernels with stride 1
■ This generates a 55x55 feature map for each kernel (256 maps in total)
■ 2x2 Max Pooling is applied to reduce each of the 256 feature maps to size 27x27

73
Image classification
Convolutional neural networks

■ Fully-connected layers
■ FC layers are placed at the end of the network
■ After the last convolution layer, feature maps are flattened to a one-dimensional vector, which is fed to a FC layer
to perform classification
■ They are typically placed before a softmax classifier, which normalizes class scores to probabilities
(figure, AlexNet with 224x224x3 input: 3x3 Max Pooling; 256x13x13 → 256x4x4 → 1x4096)

74
Image classification
Convolutional neural networks

■ Batch normalization and drop-out layers


■ BN layers normalize data so that training the network takes fewer epochs and is more stable
■ Normalization is done by:
– Removing the mean of the data
– Dividing the data by its standard deviation

■ Drop-out layers randomly disconnect connections between consecutive layers, aiming to avoid
overfitting

75
Image classification
Convolutional neural networks

■ So, how are CNN built?


■ By stacking convolutional, pooling, activation and FC layers in a specific order we obtain a CNN architecture
■ Typically, these layers are stacked in the following sequential order (a code sketch of this pattern is shown below):

INPUT => [[CONV => RELU] * N => POOL?] * M => [FC => RELU] * K => FC

with 0 ≤ N ≤ 3, M ≥ 0, 0 ≤ K ≤ 2
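■ A minimal sketch of this stacking pattern, assuming TensorFlow/Keras and CIFAR-10-sized inputs (the number of filters and layer sizes are only illustrative; the course exercises may use other tools such as Matlab):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),            # CIFAR-10 input volume
    layers.Conv2D(32, (3, 3), padding="same"),  # CONV
    layers.Activation("relu"),                  # RELU
    layers.MaxPooling2D(pool_size=(2, 2)),      # POOL -> 16x16x32
    layers.Conv2D(64, (3, 3), padding="same"),
    layers.Activation("relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),      # -> 8x8x64
    layers.Flatten(),
    layers.Dense(128, activation="relu"),       # FC => RELU
    layers.Dropout(0.5),                        # drop-out to reduce overfitting
    layers.Dense(10, activation="softmax"),     # final FC -> class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```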

76
Image classification
Convolutional neural networks

■ Examples of CNN architectures:


■ ShallowNet

■ AlexNet

77
Image classification
Convolutional neural networks

■ Examples of CNN architectures:


■ VGGNet

78
Image classification
Convolutional neural networks

■ Training convolutional neural networks


■ Training a deep neural network means learning the optimal value of millions of parameters (in AlexNet, 61M)
■ This makes learning processes long and not always easy

■ Why have deep neural networks emerged as the leading approach to image classification?
■ Availability of very large annotated datasets (millions of images)
■ Availability of specialized hardware for parallel computation (GPU)

■ This makes it possible to build deeper networks and train them with more data than ever, which allows increasing
classification accuracy due to the particular behaviour of deep learning:

79
Image classification
Convolutional neural networks

■ Different ways of using deep neural networks in image classification tasks:


■ Training from scratch
■ Using pre-trained networks: replacing part of the network layers with randomly initialized layers and retraining only
those (“fine-tuning”), as sketched below
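■ A minimal fine-tuning sketch, assuming TensorFlow/Keras and the pre-trained VGG16 weights (the new head and the 3 output classes are only illustrative):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained convolutional base, without its original fully-connected head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # new, randomly initialized layers
    layers.Dense(3, activation="softmax"),  # e.g. {"dog", "cat", "panda"}
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# Only the new head is trained; the last convolutional block can optionally be unfrozen later
```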

80
Image classification
Convolutional neural networks

■ Example 1:

81
Image classification
Convolutional neural networks

■ Example 2:

82
Beyond image classification
CNNs for image segmentation

■ Convolutional neural networks can also be applied to image segmentation


■ Image classification vs segmentation

Characteristic | Classification | Segmentation
Dimensionality | Spatial reduction | Spatial preservation
Application | Recognition | Precise delimitation
Training annotations | Labels | Pixel maps

■ In classification, CNNs compress image information (feature extraction) to derive a label


■ In segmentation, we need CNNs to output an image of the same size as its input
■ So, how do we adapt CNNs to preserve spatial information?

83
Beyond image classification
CNNs for image segmentation
■ U-Net: one of the most widely used deep neural networks for image segmentation

(figure: U-Net encoder-decoder architecture with residual connections between encoder and decoder)

■ Encoder: convolutional and pooling layers for feature extraction (like in “classic” CNNs), with progressive spatial
resolution reduction
■ Decoder: upsampling layers and transposed convolutions for progressive spatial resolution increase
■ Residual connections: transfer features from the encoder to the decoder for pixel-level segmentation

84
Beyond image classification
CNNs for image segmentation

■ The encoder of U-Net:


■ The encoder (or contracting path) follows the typical architecture of a convolutional network
■ It consists of the repeated application of convolutional layers, each followed by a ReLU layer and a max pooling
operation with stride 2 for downsampling
■ At each downsampling step we double the number of feature channels

85
Beyond image classification
CNNs for image segmentation

■ The decoder of U-Net:


■ The decoder (or expansive path) upsamples the feature map generated by the encoder to produce an image of
the same size as the input
■ The key elements for doing the upsampling are the transposed convolutional (or deconvolutional) and the
unpooling layers
■ Transposed convolution

■ A convolutional layer with stride=2 halves the horizontal and vertical dimensions of the feature map
■ A transposed convolutional layer with stride=2 duplicates the horizontal and vertical dimensions of the feature map

■ The stride parameter determines the internal zero-padding between pixels of the image
(figure: 3x3 input convolved with a 3x3 kernel → 6x6 output)
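■ A minimal sketch showing the size doubling, assuming TensorFlow/Keras (filter count and kernel size are only illustrative):

```python
import numpy as np
from tensorflow.keras import layers

# Batch of one 3x3 feature map with 8 channels
feature_map = np.random.rand(1, 3, 3, 8).astype("float32")

# Stride-2 transposed convolution duplicates the spatial dimensions
upsampled = layers.Conv2DTranspose(filters=8, kernel_size=(3, 3),
                                   strides=2, padding="same")(feature_map)
print(upsampled.shape)  # (1, 6, 6, 8): the 3x3 map becomes 6x6
```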

86
Beyond image classification
CNNs for image segmentation

■ The decoder of U-Net:


■ Unpooling layers

87
Beyond image classification
CNNs for image segmentation

■ The residual connections


■ Also known as “skip connections”, their goal is to keep details. Why?
■ As the number of layers in the encoder and decoder increases, we effectively "shrink" the feature map more and
more. As such, the encoder may discard features that are more detailed in favor of more general features
■ To make sure that the network works with features that are both general and detailed, we can
reintroduce them by making every decoder layer incorporate the feature map from its corresponding
encoder layer
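■ A minimal sketch of one U-Net decoder step with a skip connection, assuming TensorFlow/Keras (layer sizes are only illustrative):

```python
from tensorflow.keras import layers

def decoder_block(decoder_features, encoder_features, num_filters):
    # Upsample the decoder feature map (transposed convolution with stride 2)
    x = layers.Conv2DTranspose(num_filters, (2, 2), strides=2, padding="same")(decoder_features)
    # Skip connection: concatenate the corresponding encoder feature map
    x = layers.Concatenate()([x, encoder_features])
    # Two convolutions fuse the general (decoder) and detailed (encoder) features
    x = layers.Conv2D(num_filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.Conv2D(num_filters, (3, 3), padding="same", activation="relu")(x)
    return x
```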

88
Beyond image classification
Other applications of CNNs

■ Other applications of convolutional neural networks in computer vision


■ Object detection

89
Beyond image classification
Other applications of CNNs

■ Other applications of convolutional neural networks in computer vision


■ Super-resolution

90
Beyond image classification
Other applications of CNNs

■ Other applications of convolutional neural networks in computer vision


■ Image manipulation: for instance, Neural Style Transfer

91
Practical exercise
Exercise 3 – Image classification using a neural network

92
Image classification
More on deep learning using Matlab

93
