T4 - Image Classification
1
Image classification
Contents
2
The importance of image classification
3
Image classification
Definition
■ Image classification is the task of using computer vision and machine learning algorithms to extract meaning from
an image
■ This action could be as simple as assigning a label to what the image contains, or as advanced as interpreting the
contents of an image and returning a human-readable sentence
■ Image classification, at its very core, is the task of assigning a label to an image from a predefined set of
categories.
[Figure: an input image is fed to an image classifier, which outputs “dog: 95%; cat: 4%; panda: 1%”]
4
Image classification
The semantic gap
■ The semantic gap is the difference between how humans perceive an image and how this image is represented for
a computer
5
Image classification
The semantic gap
■ To bridge the semantic gap, we must somehow extract features that describe the image contents so that the
computer gets more information than raw pixel values.
■ Feature extraction is the process of taking an input image, applying an algorithm, and obtaining a feature vector
that quantifies some aspect of the image (its spatial distribution, its color, its texture …).
■ By means of feature extraction, we convert a WxH image into an N-dimensional feature vector, regardless of the
input image size:
[Figure: a W×H input image goes through feature extraction, producing an N-dimensional feature vector]
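■ As a minimal sketch of such a feature extractor, assuming a hypothetical global color-histogram descriptor (in the spirit of Exercise 1): any W×H RGB image is mapped to a vector of the same fixed length.

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """Map a WxHx3 uint8 RGB image to a fixed-length feature vector.

    The vector concatenates one histogram per color channel, so its
    dimension is 3 * bins_per_channel regardless of the image size.
    """
    features = []
    for channel in range(3):
        hist, _ = np.histogram(image[..., channel],
                               bins=bins_per_channel, range=(0, 256))
        features.append(hist)
    feats = np.concatenate(features).astype(np.float32)
    return feats / feats.sum()  # normalize so the image size does not matter

# Two images of different sizes yield feature vectors of the same length
img_small = np.random.randint(0, 256, (32, 48, 3), dtype=np.uint8)
img_large = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
print(color_histogram(img_small).shape, color_histogram(img_large).shape)  # (24,) (24,)
```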
6
Image classification
The semantic gap
■ Question: what kind of features would you use to design an image classification system that classifies these three
characters?
7
Image classification
The semantic gap
■ Question: what kind of features would you use to design an image classification system that classifies these three
kinds of animals?
8
Image classification
The semantic gap
■ In “classic” computer vision, features are “hand-engineered”, that is, they are designed to capture specific traits of
the image. Examples of features include Local binary patterns (LBP, typically used for texture description), or
Histograms of Oriented Gradients (HOG, which represents the directions of the gradients across the image)
[Figure: image → feature extraction → supervised classifier → “dog”]
■ In “modern” computer vision, based on deep learning, features are learnt automatically by a neural network, which
also performs the label assignment.
[Figure: image → deep neural network → “dog”]
9
Image classification
Challenges
■ Besides the semantic gap, image classification systems must face several challenges
10
Image classification
Supervised learning
■ Supervised machine learning: we need a set of labeled training data containing multiple examples of each class of
objects we want to recognize, so that we can build a model (the “classifier”) that learns the differences between
each category
[Figure: multiple example training images labeled “dog”, “cat” and “panda”]
■ Therefore, supervised machine learning requires having annotated data which is representative of the categories
we want to be able to recognize.
11
Image classification
Supervised learning
■ Supervised machine learning requires splitting the available data into training and test subsets
■ The training set is used by our classifier to “learn” what each category looks like, by making predictions on the
input data and then correcting itself when predictions are wrong.
■ The test set is used to evaluate the performance of the classifier on new, unseen data.
■ The training and test subsets must be independent (they must not overlap) and should contain a similar proportion of each category
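■ A minimal sketch of such a split on hypothetical feature vectors X with labels y (scikit-learn is one common way to do it; the stratify option keeps the class proportions similar in both subsets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 feature vectors of dimension 24, labels in {0, 1, 2}
X = np.random.rand(100, 24)
y = np.random.randint(0, 3, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # 25% of the data reserved for evaluation
    stratify=y,       # keep the same proportion of each category in both subsets
    random_state=0)   # reproducible split
```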
12
Image classification
Supervised learning
13
Image classification
Supervised learning
[Figure: true vs. false predictions for each category (true/false positives and negatives)]
14
Image classification
Supervised learning
■ Based on these ideas, we define the following classification performance metrics for each category:
■ Accuracy: number of correct predictions made by the classifier over all predictions made.
■ Precision: number of correct predictions made by the classifier over all positive predictions made.
■ Recall: number of correct predictions made by the classifier over all positive possible cases.
15
Image classification
Supervised learning
■ Accuracy, precision and recall are computed on each category, and an averaged value over all categories is given
as a final performance metric of the classifier.
■ Often, precision and recall are combined in a single metric known as F1-measure:
F1 = 2 · (precision · recall) / (precision + recall)
■ What is a good performance for a classifier? It depends on the classification task, on the number of categories, and
many other factors. We should take into account that, in a classification task with N categories, a totally random
classifier would attain a 100/N (%) accuracy rate.
■ As a reference, the latest deep learning classifiers attain accuracies higher than 90% in image classification tasks
comprising 1000 categories.
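■ As a sketch of how these per-category metrics and their macro-average can be computed from a set of predictions (hypothetical labels, using scikit-learn):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["dog", "dog", "cat", "cat", "panda", "panda"]
y_pred = ["dog", "cat", "cat", "cat", "panda", "dog"]

print("accuracy :", accuracy_score(y_true, y_pred))
# average="macro" averages the per-category values; average=None would give one value per category
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
```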
16
Image classification
Supervised learning
■ When we train a supervised classifier on the training set, and then evaluate it on the test set, we often find that it
performs correctly.
■ But it happens sometimes that when we apply the classifier on images outside both the training and testing sets, it
performs poorly.
■ This is a problem of generalization: the ability of a classifier to correctly predict the class label of
an image that is not part of its training or test data.
■ If a classifier does not generalize properly, we should ask ourselves whether the training dataset contains sufficient
variability (viewpoint variation, intra-class variation, etc.) for the classifier to learn properly. If not, we will need to gather
more training data and/or try a different supervised classification algorithm.
■ There exist many supervised classification algorithms, based on different machine learning paradigms (neural
networks, support vector machines, Bayesian classifiers, …).
■ We will introduce next a very simple classifier based on the paradigm of decision theory.
17
Image classification
Decision-theoretic classification: the k-NN classifier
■ Imagine we want to classify an image represented by the feature vector x = (x₁, x₂, …, xₙ) into one of W classes.
■ Each image in the training dataset is also represented by an n-dimensional vector.
■ The k-nearest neighbor (k-NN) classifier is a simple matching-based decision-theoretic classifier
■ The image we want to classify is compared against all the images in the training database, by computing the
distance between the vectors that represent them
■ The image is assigned to the majority class among the k nearest neighbors
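■ A minimal sketch of this classifier on hypothetical feature vectors (scikit-learn's KNeighborsClassifier implements exactly this majority vote among the k nearest training vectors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: n-dimensional feature vectors with their class labels
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
y_train = np.array(["dog", "dog", "cat", "cat", "cat"])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3
knn.fit(X_train, y_train)                   # "training" just stores the vectors

# The new image's feature vector is compared against the whole training set,
# and the majority class among its 3 nearest neighbours is returned
x_new = np.array([[0.85, 0.15]])
print(knn.predict(x_new))                   # -> ['dog']
```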
18
Image classification
Decision-theoretic classification: the k-NN classifier
19
Practical exercise
Exercise 1 – k-NN image classification using color descriptors
20
Image classification
Limitations of our current approach
■ The classifier implemented in the previous exercise is limited to color-based classification: it relies on a
global descriptor that characterizes the image color as a whole.
■ In other words, our classifier would be unable to distinguish a green BMW from Kermit the frog
■ Therefore, we need a strategy to recognize that the object appearing in the image is a car, not a frog nor
anything else.
■ We will study two approaches to perform object classification:
■ Local features and Bag-of-visual-words (classic computer vision)
■ Deep neural networks (modern computer vision)
21
Image classification
Global vs local descriptors
22
Image classification
Classification based on local features
23
Image classification
Classification based on local features
■ Keypoint extraction
■ Relevant visual information is in image regions where transitions occur
[Figure: uniform image regions have low relevance; regions with transitions (edges, corners) have high relevance]
■ The goal is to detect points in the image located along edges, on corners, on dark blobs surrounded by light
areas, etc.
■ There exist many approaches to keypoint detection:
■ Corner detection: Harris, FAST, minimum eigenvalue …
■ Blobs: SIFT, SURF …
24
Image classification
Classification based on local features
■ Feature description
■ The goal is to describe the visual appearance of the pixels surrounding the keypoints, thus providing local visual
appearance information of the object
■ A very typical approach is to encode local shape information using a descriptor called histogram of oriented
gradients (HOG)
25
Image classification
Classification based on local features
■ Feature description
■ Histogram of oriented gradients (HOG) in a nutshell:
■ Set a region of interest (ROI) of predefined size around each keypoint (e.g. 24x24, 32x32 pixels …)
■ Divide the ROI into a predefined number of cells (e.g. 4x4)
■ Inside each cell, compute the magnitude and orientation of the gradient of each pixel
■ Quantize the orientation of the gradient into 8 directions and create an 8-bin histogram that contains, in each bin,
the sum of the magnitudes of the gradients that point in that direction. This allows finding the dominant orientation of the
keypoint
■ Concatenate the histograms of the 4x4=16 cells to obtain the HOG descriptor of the keypoint, which would be a 16x8=128-
dimensional vector
■ Variations to this general description: ROI size, number of cells, overlap between cells, number of quantized gradient
directions …
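■ A sketch of a HOG descriptor computed with scikit-image on a hypothetical 32×32 ROI; with 8×8-pixel cells (4×4 cells) and 8 orientations, this yields the 16×8 = 128-dimensional vector described above:

```python
import numpy as np
from skimage.feature import hog

roi = np.random.rand(32, 32)  # 32x32 grayscale ROI around a keypoint

descriptor = hog(roi,
                 orientations=8,          # 8 quantized gradient directions
                 pixels_per_cell=(8, 8),  # 4x4 = 16 cells in a 32x32 ROI
                 cells_per_block=(1, 1))  # one histogram per cell

print(descriptor.shape)  # (128,) = 16 cells x 8 orientations
```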
26
Image classification
Classification based on local features
28
Image classification
Classification based on local features
■ A problem: if the object appears with different sizes in the images, different keypoints will be detected (even on the
same image if rescaled)
[Figure: keypoints detected on the same image at scale ×1 and at scale ×0.25]
■ This fact requires performing keypoint extraction at different image scales by means of scale invariant keypoint
detection methods (e.g. SIFT, SURF), but this topic lies beyond the scope of this course.
29
Image classification
The bag of visual words approach
■ The main problem of classification based on local features is that each image is represented with a
descriptor of different size
■ Moreover, this approach is not very robust, as it is highly sensitive to variations in the object appearance
■ Furthermore, classification algorithms are more efficient when all objects are represented with
descriptors of fixed length, which, among other things, simplifies computing distances between images
■ The solution to this problem is the so-called bag of visual words approach to image classification
30
Image classification
The bag of visual words approach
■ The process generates a histogram of visual word occurrences that represent an image. These
histograms are used to train an image category classifier.
■ The idea is borrowed from natural language processing, in particular from text classification, where a
vocabulary of words is built:
[Figure: a vocabulary of words built from documents of different categories, e.g. “sports”]
■ Each document is then represented as a histogram of word counts, and these histograms are used to
classify documents into categories
■ Since images do not actually contain discrete words, we need to construct a “vocabulary” of
representative “visual words” that allow classifying the objects that appear in the image
31
Image classification
The bag of visual words approach
■ This vocabulary of visual words will be built from local features that are representative of the categories
we want to classify
■ Notice that we only consider which visual words appear in the image; their spatial location is not
considered
32
Image classification
The bag of visual words approach
■ Another issue worth mentioning is that we must take into account image variability when building our
vocabulary
■ This means that we will have to extract local features from many images representative of each category,
and use them to build the vocabulary.
33
Image classification
The bag of visual words approach
■ Create the bag of features (or visual vocabulary) by extracting feature descriptors from representative images of each
category.
■ Extract keypoints from the image: this can be done with a feature detector (e.g. Harris corner detector) or by a
grid approach
■ Obtain a feature descriptor for each keypoint or grid element (e.g. HOG – histogram of oriented gradients)
■ Cluster feature descriptors to obtain groups of similar descriptors
■ Use the centroid of each cluster to obtain a vocabulary of visual words describing the whole dataset
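■ A sketch of this vocabulary-building step, assuming keypoint descriptors (e.g. 128-D HOG vectors) have already been extracted from a set of representative images; k-means clustering then yields the K visual words (the cluster centroids):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical descriptors: one array of 128-D vectors per training image
descriptors_per_image = [np.random.rand(np.random.randint(50, 200), 128)
                         for _ in range(20)]

all_descriptors = np.vstack(descriptors_per_image)   # pool the descriptors of all images

K = 100                                               # vocabulary size
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_descriptors)
vocabulary = kmeans.cluster_centers_                  # K visual words (cluster centroids)

# Encoding an image: map each of its descriptors to the nearest visual word
# and count occurrences -> a fixed-length (K-dimensional) histogram
words = kmeans.predict(descriptors_per_image[0])
histogram = np.bincount(words, minlength=K)
print(vocabulary.shape, histogram.shape)              # (100, 128) (100,)
```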
34
Image classification
The bag of visual words approach
■ The clustering step: recall that in images containing the same type of object, similar keypoints will be extracted and
similar descriptors will be obtained
■ Once the images of all categories have gone through the descriptor computation process, we apply a clustering
algorithm to form groups of similar descriptors.
35
Image classification
The bag of visual words approach
■ Clustering algorithms create groups (aka clusters) of objects according to their similarity
■ In our case, objects are feature descriptors, and clusters will end up yielding visual words
■ The most typical clustering algorithm is k-means: it obtains compact clusters, operates on objects represented as
numerical attributes, and we must select the desired number of clusters K (i.e. size of the vocabulary)
■ How k-means works
■ Pick K descriptors, either randomly or based on some heuristic
■ Each one of these K descriptors is considered to be the representative of a visual word (or cluster centroid)
■ Assign each descriptor in the image set to the cluster that minimizes the distance to its cluster centroid
■ Re-compute the cluster centroids by averaging all the descriptors that have been assigned to them
■ Repeat the two previous steps until convergence (i.e. until no descriptors change cluster)
■ The centroid of each cluster is considered as a visual word of our vocabulary
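■ A minimal from-scratch sketch of these two alternating steps (assignment and centroid update) on hypothetical descriptors:

```python
import numpy as np

def kmeans(descriptors, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick K descriptors at random as the initial centroids (visual words)
    centroids = descriptors[rng.choice(len(descriptors), K, replace=False)]
    labels = np.zeros(len(descriptors), dtype=int)
    for _ in range(n_iter):
        # 2. Assign each descriptor to the cluster with the nearest centroid
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # convergence: no descriptor changes cluster
            break
        labels = new_labels
        # 3. Re-compute each centroid as the mean of its assigned descriptors
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = descriptors[labels == k].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(np.random.rand(500, 128), K=10)
print(centroids.shape)   # (10, 128) -> 10 visual words
```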
36
Image classification
The bag of visual words approach
■ In its simplest form, the feature histogram is a visual word count. A more elaborate version uses the tf·idf
weighting scheme (term frequency × inverse document frequency)
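■ A sketch of the tf·idf weighting applied to a small set of hypothetical visual-word histograms (rows = images, columns = visual words); words that appear in almost every image are down-weighted:

```python
import numpy as np

# Hypothetical raw counts: 4 images, vocabulary of 6 visual words
counts = np.array([[5, 0, 1, 0, 2, 0],
                   [4, 1, 0, 0, 3, 0],
                   [0, 6, 2, 1, 0, 0],
                   [1, 5, 0, 2, 0, 1]], dtype=float)

tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency within each image
df = (counts > 0).sum(axis=0)                     # number of images containing each word
idf = np.log(len(counts) / df)                    # inverse document frequency
tfidf = tf * idf                                  # weighted histograms

print(tfidf.round(3))
```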
37
Image classification
The bag of visual words approach
38
Image classification
The bag of visual words approach
39
Image classification
The bag of visual words approach
■ Summarizing
■ The bag of visual words model is more robust than using local features directly for image classification, as it
relies on a vocabulary of visual words
■ The construction of the vocabulary requires a specific image set, which must be representative of the image
categories we want to recognize later
■ It allows representing each image with a fixed-size histogram of visual word occurrences
40
Image classification
Example of bag of visual words classification
41
Practical exercise
Exercise 2 – Image classification using bag of words and SVM classifier
42
Image classification
The deep learning approach
■ Since 2012, deep neural networks have caused a paradigm shift in computer vision:
[Figure: classic pipeline: image → feature extraction → N-dimensional descriptor → classifier → “dog”; deep learning pipeline: image → deep neural network → “dog”]
■ But why?
43
Image classification
The deep learning approach
■ The computer vision research community organized an annual competition called “ImageNet Large
Scale Visual Recognition Challenge”
■ The ILSVRC challenge consisted of comparing the performance of different proposals for solving an
image classification problem involving more than a million images and 1,000 categories.
https://www.image-net.org/challenges/LSVRC/
44
Image classification
The deep learning approach
■ In 2012, a convolutional neural network (CNN) called AlexNet outperformed all competitors in the
ImageNet 2012 Challenge, consisting of classifying more than a million images into 1,000 categories,
obtaining an 84% (top-5) accuracy (the best competitors, based on classical approaches, achieved
accuracies around 75%)
Alex Krizhevsky
45
Image classification
What are neural networks?
[Figure: a multilayer neural network; weights such as w₃₂⁽¹⁾ and w₃₂⁽³⁾ connect neurons in consecutive layers]
■ Elements:
■ Input layer: data
■ Hidden layers (>2: deep, >10: very deep)
■ Output layer: classification
■ Weights
46
Image classification
The Perceptron network
47
Image classification
The Perceptron network
■ However, the step function is not differentiable, which causes problems when applying gradient
descent to train our network
■ For this reason, continuous and differentiable activation functions are preferred, like:
the sigmoid σ(x) = 1 / (1 + e⁻ˣ) and the hyperbolic tangent tanh(x)
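■ As a small sketch, both functions can be evaluated directly and are smooth everywhere:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # smooth, differentiable, output in (0, 1)

x = np.linspace(-5, 5, 11)
print(sigmoid(x))      # squashes values into (0, 1)
print(np.tanh(x))      # squashes values into (-1, 1)
```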
48
Image classification
The Perceptron network
■ The most widely used algorithm to train the Perceptron network is the backpropagation algorithm
■ This algorithm goes like this:
■ The network weights are randomly initialized
■ The backpropagation process consists of iteratively executing:
– The forward pass: the training data is propagated through the network and the output predictions are obtained
– The backward pass: we update the network weights to minimize the classification error at the output of the network.
■ The forward and backward passes are repeated over the whole training dataset. Each one of these
cycles is called an epoch
■ When the predefined number of epochs is achieved, or when a sufficiently low classification error is
obtained on the training data, the training process concludes
■ The weights of the network are “frozen”, and it now can be used to classify new, unseen test data
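■ A minimal sketch of this training loop written with PyTorch (a hypothetical two-layer network on random data); the forward pass, the loss computation and the backward pass with gradient descent appear explicitly:

```python
import torch
import torch.nn as nn

# Toy data: 100 samples of 20 features, 3 classes
X = torch.randn(100, 20)
y = torch.randint(0, 3, (100,))

# Weights are randomly initialized when the layers are created
model = nn.Sequential(nn.Linear(20, 16), nn.Sigmoid(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()                          # classification error (loss function)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent on the weights

for epoch in range(50):          # one epoch = one pass over the training set
    scores = model(X)            # forward pass: propagate the data, obtain predictions
    loss = loss_fn(scores, y)    # how wrong are the predictions?
    optimizer.zero_grad()
    loss.backward()              # backward pass: gradients of the loss w.r.t. the weights
    optimizer.step()             # update the weights to reduce the loss
```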
49
Image classification
The Perceptron network
■ During the backward pass, we compute the classification error on the training data, also known as
the loss function, and try to find its minimum using gradient descent
50
Image classification
From neural networks to deep neural networks
■ The linear nature of the Perceptron network limits its practical applicability to linearly separable
problems
■ For this reason, multilayer Perceptron networks were the first step in the evolution towards deep
neural networks
51
Image classification
From neural networks to deep neural networks
■ More complex neural networks (with more hidden layers) can solve increasingly complex classification
problems
■ Example: a sample of the MNIST dataset
■ Data: 8x8 images, ∼1800 images
■ Network architecture: 64-32-16-10
              precision   recall   f1-score
0                  1.00     1.00       1.00
1                  0.98     1.00       0.99
2                  0.98     1.00       0.99
3                  0.98     0.93       0.95
4                  0.95     1.00       0.97
5                  0.94     0.97       0.96
6                  1.00     1.00       1.00
7                  1.00     1.00       1.00
8                  0.97     0.95       0.96
9                  1.00     0.96       0.98
avg / total        0.98     0.98       0.98
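■ A sketch that reproduces an experiment of this kind with scikit-learn (the 8×8 digits dataset and a 64-32-16-10 network; the exact figures will vary with the random seed):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

digits = load_digits()                         # ~1800 images of 8x8 pixels, 10 classes
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, stratify=digits.target, random_state=0)

# Input layer: 64 (8x8 pixels), hidden layers: 32 and 16, output layer: 10 classes
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)

print(classification_report(y_test, mlp.predict(X_test)))
```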
52
Image classification
From neural networks to deep neural networks
■ However, when the classification problem becomes more complex, multilayer perceptrons start showing
their limitations
■ Example #1:
■ The full MNIST dataset
■ 28x28 images, 70000 images (7000 per class)
■ Network architecture: 784-256-128-10
Perfect learning!
53
Image classification
From neural networks to deep neural networks
■ Example #2:
■ The CIFAR-10 dataset
Overfitting!
54
Image classification
From neural networks to deep neural networks
■ Conclusions:
■ Standard neural networks (multilayer perceptrons) can solve relatively simple classification problems
■ However, they fail to obtain high classification accuracy on challenging image datasets that have variations in object
appearance:
■ Translation
■ Rotation
■ Viewpoint
■ Intra-class variation
■ An additional limitation of standard neural networks is that the size of the input layer depends on the image resolution:
■ MNIST: 28x28 pixels → 784 nodes
■ CIFAR-10: 32x32x3 pixels → 3072 nodes
■ If we wanted to work with higher resolution input images, the size of the network would increase dramatically!
■ For 250x250 RGB images → 187500 nodes and the corresponding weights, as layers are fully connected
■ To overcome these limitations, a special type of neural networks emerged: convolutional neural networks (CNN)
55
Image classification
Convolutional neural networks
■ The main difference between CNN and standard neural networks is that they have different types of
layers
■ In standard NN, each neuron of a layer is connected to each neuron in the following layer by a weight: fully-
connected (FC) layers
■ In CNN:
■ we introduce convolutional layers instead of FC layers
■ we only use FC layers at the very end of the network, to obtain class probabilities
■ we introduce other types of layers, like pooling (to reduce the height and width of the data that propagates along the
network), non-linear activations (like reLU), drop-out (to reduce overfitting), etc.
[Figure: AlexNet-like CNN architecture with a 224×224×3 input volume]
56
Image classification
Convolutional neural networks
■ The main characteristic of CNN is that they apply convolutions at each convolutional layer
■ Convolution: image * filter
57
Image classification
Convolutional neural networks
■ The main characteristic of CNN is that they apply convolutions at each convolutional layer
■ We already know convolutions for:
■ Image denoising: averaging, Gaussian filters
K[m, n] = [ 1/9 1/9 1/9 ; 1/9 1/9 1/9 ; 1/9 1/9 1/9 ]   Uniform smoothing kernel
■ Edge detection / feature extraction: derivative filters
K[m, n] = [ 1 0 −1 ; 1 0 −1 ; 1 0 −1 ]   Horizontal derivative kernel
■ Notice that we predefine the filter weights to accomplish a goal/extract a specific feature
■ In fact, the values of the kernel somehow “describe” the local feature we want to detect in the image
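■ A sketch of that horizontal derivative kernel in action on a synthetic image; the convolution responds strongly only where the intensity changes along the horizontal direction (a vertical edge):

```python
import numpy as np
from scipy.signal import convolve2d

# Synthetic image: dark left half, bright right half (a vertical edge in the middle)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]], dtype=float)   # horizontal derivative kernel

response = convolve2d(image, K, mode="valid")
print(response)   # nonzero only in the columns around the vertical edge
```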
58
Image classification
Convolutional neural networks
■ For example:
K[m, n] = [ 1 0 −1 ; 1 0 −1 ; 1 0 −1 ]   Horizontal derivative kernel
[Figure: the kernel convolved with an example image; strong responses appear where the intensity changes horizontally (vertical edges)]
59
Image classification
Convolutional neural networks
63
Image classification
Convolutional neural networks
■ Thus, analyzing the weights of a kernel (and visualizing them as an image), we can infer what type of image local
features it will detect
■ In classic image processing and computer vision, these kernels are predefined to detect specific types of features
■ The paradigm shift caused by CNN lies in the fact that
■ CNN apply hundreds of kernels on the images
■ The weights of these kernels are learned to extract the optimal features from the images to minimize the
classification loss function
64
Image classification
Convolutional neural networks
■ What type of features do the filters in the successive convolutional layers learn to extract? For example:
■ Detect edges from raw pixel data in the first layer
■ Use these edges to detect shapes (i.e., “blobs”) in the second layer
■ Use these shapes to detect higher-level features such as facial structures, parts of a car, etc. in the highest
layers of the network
65
Image classification
Convolutional neural networks
■ More interestingly, depending on the data we use to train the CNN, different types of features will be
extracted by the network:
■ This means that the extracted features are optimal for the specific classification task we work on
66
Image classification
Convolutional neural networks
■ The most common types of layers used to build a CNN are:
■ Convolutional
■ Activation
■ Pooling
■ Fully-connected
■ Batch normalization
■ Dropout
■ Only convolutional and fully-connected layers contain weights that need to be learned during training
■ The layers are organized as 3D volumes: width x height x depth
■ To work with CIFAR-10, the input layer would be 32x32x3
■ The layers are not fully connected: neurons in subsequent layers are connected to a small region of the preceding
layer → this reduces the number of weights
■ The output layer is a 1x1xN volume containing class scores, where N is the number of categories
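■ A sketch of a small CNN of this kind for CIFAR-10-sized inputs (32×32×3), written with PyTorch; the layer types listed above appear in order, and only the Conv2d and Linear (fully-connected) layers carry learnable weights:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional: 3x32x32 -> 16x32x32
    nn.BatchNorm2d(16),                          # batch normalization
    nn.ReLU(),                                   # activation
    nn.MaxPool2d(2),                             # pooling: 16x32x32 -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), # convolutional: -> 32x16x16
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: -> 32x8x8
    nn.Flatten(),                                # -> vector of 2048 values
    nn.Dropout(0.5),                             # dropout: reduce overfitting
    nn.Linear(32 * 8 * 8, 10),                   # fully-connected: class scores for 10 categories
)

scores = model(torch.randn(1, 3, 32, 32))        # one CIFAR-10-sized image
print(scores.shape)                              # torch.Size([1, 10])
```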
67
Image classification
Convolutional neural networks
■ Convolutional layers
■ Each kernel (of size 3x3, 5x5, etc.) is convolved with all the channels of the input volume
■ Each convolution produces a 2D “activation map” (also called “feature map”)
■ The size (width and height) of the activation maps depends on the kernel size, the input volume width and
height, and the step size (known as stride) used when computing the convolution
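■ As a sketch of how these quantities interact (a standard relation, not stated explicitly in the slide): the activation-map width is ⌊(W − F + 2P) / S⌋ + 1, where W is the input width, F the kernel size, P the padding and S the stride (and analogously for the height).

```python
def activation_map_size(W, F, S=1, P=0):
    """Width (or height) of the activation map produced by a convolution."""
    return (W - F + 2 * P) // S + 1

print(activation_map_size(32, 3, S=1, P=1))   # -> 32 (padding preserves the size)
print(activation_map_size(32, 5, S=1, P=0))   # -> 28
print(activation_map_size(32, 3, S=2, P=1))   # -> 16 (stride 2 halves the size)
```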
68
Image classification
Convolutional neural networks
■ Convolutional layers
■ AlexNet: 224 x 224
69
Image classification
Convolutional neural networks
■ Convolutional layers
70
Image classification
Convolutional neural networks
■ Activation layers
■ After each convolutional layer comes an activation layer
■ An activation layer takes a volume at its input, and applies an activation function (sigmoid, tanh, reLU…) to
each of its values:
■ By doing so, the network only keeps the highest responses of the filters in the previous convolution layer, thus
focusing on “important” features
■ For example, in AlexNet, the 96 feature maps obtained after the convolution in the first layer go through a reLU
activation function before entering the second layer
71
Image classification
Convolutional neural networks
■ Pooling layers
■ Their goal is to reduce the size of the volumes along the network
■ Remember that this size reduction can also be achieved by convolutional layers, if we use a stride >1
■ For this reason, some authors argue that pooling layers could be replaced by convolutional layers
72
Image classification
Convolutional neural networks
■ Pooling layers
■ Example: AlexNet (224×224×3 input volume)
73
Image classification
Convolutional neural networks
■ Fully-connected layers
■ FC layers are placed at the end of the network
■ After the last convolution layer, feature maps are flattened to a one-dimensional vector, which is fed to a FC layer
to perform classification
■ They are typically placed before a softmax classifier, that normalizes class scores to probabilities
[Figure: in AlexNet (224×224×3 input), the last feature maps (256×13×13) go through a 3×3 max pooling (256×4×4), are flattened, and feed a 1×4096 fully-connected layer]
74
Image classification
Convolutional neural networks
■ Drop-out layers randomly disconnect connections between consecutive layers, with the aim of reducing
overfitting
75
Image classification
Convolutional neural networks
■ A common pattern for stacking layers in a CNN: INPUT → [[CONV → reLU] × N → POOL?] × M → [FC → reLU] × K → FC,
with 0 ≤ N ≤ 3, M ≥ 0, 0 ≤ K ≤ 2
76
Image classification
Convolutional neural networks
■ AlexNet
77
Image classification
Convolutional neural networks
78
Image classification
Convolutional neural networks
■ Why have deep neural networks emerged as the leading approach to image classification?
■ Availability of very large annotated datasets (millions of images)
■ Availability of specialized hardware for parallel computation (GPU)
■ This makes it possible to build deeper networks and train them with more data than ever, which allows increasing
classification accuracy due to the particular behaviour of deep learning:
79
Image classification
Convolutional neural networks
80
Image classification
Convolutional neural networks
■ Example 1:
81
Image classification
Convolutional neural networks
■ Example 2:
82
Beyond image classification
CNNs for image segmentation
83
Beyond image classification
CNNs for image segmentation
■ U-Net: one of the most widely used deep neural networks for image segmentation
[Figure: U-Net architecture, with an encoder, a decoder, and residual connections between them]
■ Encoder: convolutional and pooling layers for feature extraction (as in “classic” CNNs), with progressive spatial
resolution reduction
■ Decoder: upsampling layers and transposed convolutions for progressive spatial resolution increase
■ Residual connections: carry features from the encoder to the decoder for pixel-level segmentation
84
Beyond image classification
CNNs for image segmentation
85
Beyond image classification
CNNs for image segmentation
■ A convolutional layer with stride=2 halves the horizontal and vertical dimensions of the feature map
■ A transposed convolutional layer with stride=2 doubles the horizontal and vertical dimensions of the feature map
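■ A sketch with PyTorch showing both effects on a hypothetical 1×8×16×16 feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)   # batch x channels x height x width

down = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(8, 8, kernel_size=3, stride=2, padding=1, output_padding=1)

print(down(x).shape)        # torch.Size([1, 8, 8, 8])   -> spatial dimensions halved
print(up(down(x)).shape)    # torch.Size([1, 8, 16, 16]) -> spatial dimensions doubled back
```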
86
Beyond image classification
CNNs for image segmentation
87
Beyond image classification
CNNs for image segmentation
88
Beyond image classification
Other applications of CNNs
89
Beyond image classification
Other applications of CNNs
90
Beyond image classification
Other applications of CNNs
91
Practical exercise
Exercise 3 – Image classification using a neural network
92
Image classification
More on deep learning using Matlab
93