Department: Mathematics & Computer Science
Master of DPEIC – First year
Semester 2
OR & Artificial Intelligence
Chapter III - Supervised Machine Learning
Supervised ML Algorithms
How Supervised Learning Works?
• In supervised learning, models are trained on a labelled dataset, from which the model learns what characterizes each type of data. Once training is complete, the model is evaluated on held-out test data and then predicts the output.
• The working of supervised learning can be understood from the example and diagram below:
Steps Involved in Supervised Learning
• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into training, validation, and test sets (see the sketch after this list).
• Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
• Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
• Execute the algorithm on the training dataset. Sometimes a validation set (a held-out subset of the training data) is needed to tune control parameters.
• Evaluate the accuracy of the model on the test set. If the model predicts the correct outputs, the model is accurate.
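A minimal sketch of the split step, assuming scikit-learn and a synthetic dataset as a stand-in for real labelled data:

```python
# Minimal sketch: split a labelled dataset into train / validation / test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=42)  # synthetic stand-in

# First carve out the test set (20%), then split the remainder
# into training and validation sets.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test.
```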
Key Concepts
To master supervised learning, you must understand the following four concepts:
1. The Dataset
2. The learning algorithm
3. The Model and its parameters
4. The Cost Function
Steps Involved in Supervised Learning
1) The Dataset
We talk about supervised learning when we provide a machine with
many examples (x, y) in order to make it learn the relationship that
connects x to y.
Steps Involved in Supervised Learning
1) The Dataset
• The variable y is called the Target. This is the
value we are trying to predict.
• The variable x is called a Feature. A Feature
influences the value of y, and we generally
have many Features (x1, x2, …) in our
Dataset, which we group together in a
matrix X.
Example: a Dataset gathers examples of
apartments with their price y as well as some of
their characteristics (Features).
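For instance, such a Dataset could be laid out as follows (the column names and values here are invented for illustration):

```python
# Hypothetical apartment Dataset: each row is one example (x, y);
# "surface", "rooms", "floor" are Features (matrix X), "price" is the Target y.
import pandas as pd

data = pd.DataFrame({
    "surface": [45, 70, 30, 95],                  # m^2
    "rooms":   [2, 3, 1, 4],
    "floor":   [1, 5, 2, 3],
    "price":   [150000, 240000, 110000, 320000],  # Target y
})
X = data[["surface", "rooms", "floor"]]  # Feature matrix X
y = data["price"]                        # Target vector y
```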
Steps Involved in Supervised Learning
2) The learning algorithm
• The main objective in Supervised Learning is to find the model parameters that
minimize the Cost Function. To do this, we use a learning algorithm, the most
common example being the Gradient Descent algorithm.
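A minimal gradient descent sketch for a linear model f(x) = a·x + b minimizing the mean squared error (the data and learning rate are illustrative assumptions):

```python
# Gradient descent on the MSE cost for a linear model f(x) = a*x + b.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])   # roughly y = 2x + 1

a, b = 0.0, 0.0          # model parameters
lr = 0.01                # learning rate
m = len(x)

for _ in range(5000):
    y_pred = a * x + b
    # Gradients of the cost J(a, b) = (1/m) * sum((y_pred - y)^2)
    grad_a = (2 / m) * np.sum((y_pred - y) * x)
    grad_b = (2 / m) * np.sum(y_pred - y)
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # should approach a ≈ 2, b ≈ 1
```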
Steps Involved in Supervised Learning
3) The Model and its parameters
• A model is developed from the Dataset. It can be a linear model or a non-linear one.
• We define a, b, c, etc. as the parameters of the model.
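For example (a standard formulation, not taken from the slides), a linear model with parameters a, b and a non-linear model with parameters a, b, c can be written as:

$$f(x) = ax + b \quad \text{(linear)}, \qquad f(x) = ax^2 + bx + c \quad \text{(non-linear)}$$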
Steps Involved in Supervised Learning
4) The Cost Function
A model can produce errors when making
predictions compared to the actual values in our
dataset. These errors measure how well the
model is performing: a lower error indicates
a better fit to the data.
The method by which we aggregate these errors
to measure the overall performance of the model is
known as the Cost Function or Loss Function.
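A standard example of such a function (the slide does not show the formula) is the Mean Squared Error over m examples:

$$J = \frac{1}{m} \sum_{i=1}^{m} \bigl( f(x_i) - y_i \bigr)^2$$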
Steps Involved in Supervised Learning
4) The Cost Function
• A 'good' model is generally characterized by
its ability to make accurate predictions on
new, previously unseen data.
• The smaller the value returned by the Cost
Function, the smaller the differences
between the predicted and actual values,
indicating a better-performing model.
Types of Supervised ML Algorithms
• Supervised learning can be further divided into two types of problems: regression and classification.
Regression vs. Classification in ML
Recap
Regression Algorithm vs. Classification Algorithm:
• Output: in Regression, the output variable must be continuous or a real value; in Classification, the output variable must be a discrete value.
• Task: the regression algorithm maps the input value (x) to a continuous output variable (y); the classification algorithm maps the input value (x) to a discrete output variable (y).
• Data: Regression algorithms are used with continuous data; Classification algorithms are used with discrete data.
• Goal: in Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
• Examples: regression problems such as weather prediction and house price prediction; classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.
• Subtypes: Regression can be divided into Linear and Non-linear Regression; Classification can be divided into Binary and Multi-class Classifiers.
Choosing the most appropriate algorithm
1. Problem Nature: classification or regression.
2. Data Characteristics: dataset size, feature types, feature dimensionality, data quality, …
3. Model Complexity and Interpretability: complexity, interpretability, …
4. Experience and Domain Knowledge: previous successes and expertise.
5. Model Updates and Scalability: static vs. dynamic data, scalability, …
Performance Evaluation
Generalization and overfitting
Main challenge of Supervised learning:
• It is relatively easy to train a model that "works" well (low prediction error) on
the training data. Extreme example: learning "by rote".
• Generalization: the ability of the model to make good predictions on data whose
label is unknown.
• Overfitting: when performance is better on the training data than on new data.
Over-fitting and Under-fitting
1. Over-fitting - Example
• Over-fitting occurs when the model follows the training data so closely that it
pays too much attention to noise. The model learns the relationship
between features and labels in so much detail that it also picks up the noise.
Over-fitting and Under-fitting
2. Under-fitting - Example
• Under-fitting is the opposite of over-fitting. This is when the model
does not approximate the function well enough and is therefore unable
to capture the underlying trend of the data.
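A minimal sketch contrasting the two regimes, assuming scikit-learn and synthetic data: a degree-1 polynomial under-fits, a degree-15 polynomial over-fits, and an intermediate degree balances the two.

```python
# Compare polynomial models of increasing flexibility on noisy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
# Degree 1 under-fits (both errors high); degree 15 over-fits
# (training error low, test error high); degree 4 is a better balance.
```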
Over-fitting and Under-fitting
Training and test set
Cross validation
• Goal: use all the data for training and validation, and obtain an average
performance estimate.
• We separate the data set into K blocks (folds).
• In practice, K=5 or K=10 most often (a balance between the number of
experiments and the size of each training set).
• We use each block in turn as a validation set and the union of the others
as a training set (see the sketch below).
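A minimal K-fold cross-validation sketch with scikit-learn, K=5 (the classifier and dataset are illustrative choices):

```python
# 5-fold cross-validation: each fold serves once as the validation set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance over the K folds
```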
Model Selection: Validation Set
How do we determine the best model among those learned:
- with different learning algorithms;
- with different hyperparameter values for the same algorithm?
• Idea: select the one with the best performance on the test set.
• Problem: we can then no longer estimate the generalization error, because the test
data has already been used.
- Solution: we separate the data into 3 sets: training, validation, and test.
Model Selection: Cross-Validation
Hyper-parameters Tuning
GridSearchCV systematically works through every combination in a parameter grid,
cross-validating as it goes to determine which combination gives the
best performance. It is thorough but can be slow for large datasets and many
parameters.
RandomizedSearchCV samples a fixed number of parameter settings from specified
distributions. This approach can be faster and more efficient, especially when
dealing with a large hyper-parameter space, as it does not try every combination
but samples at random across a wide range of values. A sketch of both follows.
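A minimal sketch of both approaches (the SVM estimator and the parameter grids are illustrative assumptions):

```python
# Exhaustive grid search vs. randomized search over SVM hyper-parameters.
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Tries every combination in the grid, with 5-fold cross-validation.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Samples 10 settings at random from the given distribution.
rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2)},
                          n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```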
Hyper-parameters Tuning - Example
Evaluation of a Classification model: Confusion Matrix
• The confusion matrix is a matrix used to determine the performance of
classification models for a given set of test data. It can only be
determined if the true values of the test data are known.
• The matrix itself is easy to understand, but the related terminology
may be confusing. Because it shows the errors of the model's performance in
the form of a matrix, it is also known as an error matrix.
Confusion Matrix in Machine Learning
Some features of the confusion matrix are given below:
• For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3 classes, a
3×3 table; and so on.
• The matrix has two dimensions, predicted values and actual values, along with the
total number of predictions.
• Predicted values are the values predicted by the model, and actual values are the
true values of the given observations.
Confusion Matrix in Machine Learning
• It looks like the table below (counts reconstructed from the totals quoted on the next slide):

              Predicted: No    Predicted: Yes
Actual: No    TN = 65          FP = 8           (73)
Actual: Yes   FN = 3           TP = 24          (27)
Total         68               32               (n = 100)
Confusion Matrix in Machine Learning
From the previous example, we can conclude that:
• The table is given for a two-class classifier with two predictions, "Yes"
and "No". Here, "Yes" means the patient has the disease, and "No" means the
patient does not have the disease.
• The classifier made a total of 100 predictions. Out of 100 predictions, 89
are correct and 11 are incorrect.
• The model predicted "Yes" 32 times and "No" 68 times, whereas the actual
"Yes" occurred 27 times and the actual "No" 73 times.
Multi-class Classification: Confusion Matrix
Binary classification problem vs. Multiclass classification problem (figure; rows: actual class, columns: predicted class)
Calculations using Confusion Matrix
We can compute various performance measures for the model from this matrix,
such as the model's accuracy. These calculations are given below:
Sensitivity (Recall) = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Specificity = TN / (TN + FP)
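As a check, these formulas can be evaluated on the counts derived from the earlier 100-prediction example (TP = 24, TN = 65, FP = 8, FN = 3):

```python
# Metrics computed from the confusion-matrix counts of the earlier example.
TP, TN, FP, FN = 24, 65, 8, 3

sensitivity = TP / (TP + FN)                   # 24/27 ≈ 0.889
accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 89/100 = 0.89
precision   = TP / (TP + FP)                   # 24/32 = 0.75
specificity = TN / (TN + FP)                   # 65/73 ≈ 0.890
print(sensitivity, accuracy, precision, specificity)
```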
ROC Curve
ROC Curve: the ROC is a graph displaying
a classifier's performance across all possible
thresholds. The curve plots the true positive
rate (on the y-axis) against the false positive
rate (on the x-axis).
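A minimal sketch of computing and plotting a ROC curve, assuming scikit-learn and an illustrative classifier and dataset:

```python
# ROC curve from predicted probabilities: one (FPR, TPR) point per threshold.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]          # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_te, probs)  # sweep over all thresholds
print(roc_auc_score(y_te, probs))              # area under the curve
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.show()
```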
Evaluation of a regression model
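The slide's content is not in the extracted text; common metrics for evaluating a regression model include the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination R². A minimal sketch with scikit-learn (the values are illustrative):

```python
# Common regression metrics computed from predicted vs. actual values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

print(mean_absolute_error(y_true, y_pred))           # MAE
print(mean_squared_error(y_true, y_pred))            # MSE
print(np.sqrt(mean_squared_error(y_true, y_pred)))   # RMSE
print(r2_score(y_true, y_pred))                      # R^2
```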
Some ML Algorithms
Regression solutions
Types of Regression Algorithm:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. K-Nearest Neighbors Regression
5. Decision Tree Regression
6. Random Forest Regression
7. Artificial Neural Networks (ANN)
8. …
Classification solutions
Classification Algorithms can be further divided into the following types:
1. K-Nearest Neighbors (KNN)
2. Decision Tree
3. Random Forest
4. Support Vector Machines (SVM)
5. Artificial Neural Networks
6. Logistic Regression (LR)
7. Naïve Bayes
8. …
K-Nearest Neighbors Algorithm (KNN)
K-Nearest Neighbors Algorithm (KNN)
The K-NN (K-Nearest Neighbors) algorithm is one of the simplest
classification algorithms. It is used to identify data points that are
separated into multiple classes in order to predict the class of a new
sample.
K-NN is a non-parametric, lazy learning algorithm. It classifies new
cases based on a similarity measure (i.e., distance functions).
KNN Algorithm - Example
Input data:
A dataset D.
A distance function d.
An integer K.
For a new observation X whose output variable y we want to predict, Do:
1. Compute the distances between observation X and all other observations in
the dataset D.
2. Retain the K observations of dataset D closest to X according to the distance
function d.
3. Take the values of y of the K retained observations:
   1. If we perform a regression, compute the mean (or median) of the retained y values.
   2. If we perform a classification, compute the mode of the retained y values.
4. Return the value computed in step 3 as the value predicted by K-NN
for observation X.
End Algorithm
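Below is a minimal sketch transcribing this algorithm into Python with NumPy (the toy dataset, Euclidean distance, and K=3 are illustrative assumptions):

```python
# K-NN prediction: distances -> K nearest -> mean (regression) or mode (classification).
import numpy as np
from collections import Counter

def knn_predict(D_X, D_y, x_new, K=3, regression=False):
    # Step 1: distances from x_new to every observation in the dataset D
    dists = np.linalg.norm(D_X - x_new, axis=1)
    # Step 2: indices of the K closest observations
    nearest = np.argsort(dists)[:K]
    # Step 3: aggregate the retained y values
    y_k = D_y[nearest]
    if regression:
        return y_k.mean()                      # mean of the K retained values
    return Counter(y_k).most_common(1)[0][0]   # mode of the K retained labels

D_X = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
D_y = np.array([0, 0, 1, 1])
print(knn_predict(D_X, D_y, np.array([1.5, 1.5]), K=3))  # -> 0
```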
K-Nearest Neighbors Algorithm (KNN)
To predict the category label y of a new point x (classification):
• Find the k nearest neighbors (according to some distance metric).
• Assign the majority label to the new point.
To predict a numeric value y of a new point x (regression):
• Find the k nearest neighbors.
• "Average" the values associated with the neighbors.
If we change k, we may get a different prediction!
kNN Prediction: What Label?
Linear and Logistic Regression Algorithms
Linear Regression algorithm
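The figures for these slides are not in the extracted text. As a minimal sketch of fitting a linear model with scikit-learn (the data is illustrative):

```python
# Fit a linear model y = a*x + b and predict a new point.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])       # roughly y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)     # slope a and intercept b
print(model.predict([[5.0]]))            # prediction for a new point
```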
The Math Behind LR
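The formulas on these slides are not present in the extracted text; a standard formulation of simple linear regression with gradient descent is:

$$f(x) = ax + b, \qquad J(a,b) = \frac{1}{m} \sum_{i=1}^{m} \bigl( f(x_i) - y_i \bigr)^2$$

with the gradient descent updates, for learning rate $\alpha$:

$$a \leftarrow a - \alpha \frac{\partial J}{\partial a}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$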
LR & LR - Difference
Pr. Soufiane HAMIDA 71
LR & LR - Difference
Pr. Soufiane HAMIDA 72
LR & LR - Difference
Pr. Soufiane HAMIDA 73
LR & LR - Difference
Pr. Soufiane HAMIDA 74
LR & LR - Difference
Pr. Soufiane HAMIDA 75
LR & LR - Difference
Pr. Soufiane HAMIDA 76
LR & LR - Difference
Pr. Soufiane HAMIDA 77
LR & LR - Difference
Pr. Soufiane HAMIDA 78
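The comparison slides' content is not in the extracted text; the core difference is that linear regression fits a straight line to predict a continuous value, while logistic regression passes the same linear combination through a sigmoid to output a probability for classification:

$$\text{Linear: } \hat{y} = ax + b, \qquad \text{Logistic: } \hat{y} = \sigma(ax + b) = \frac{1}{1 + e^{-(ax+b)}} \in (0, 1)$$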
Applications of LR
Use case – Predicting Numbers
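The use-case slides' content is not in the extracted text. As one plausible illustration (an assumption, not necessarily the slides' example), logistic regression can be used to recognize handwritten digits:

```python
# Recognize handwritten digits (0-9) with logistic regression.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # 8x8 images, labels 0-9
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))                 # test accuracy
print(clf.predict(X_te[:5]), y_te[:5])       # predicted vs. actual digits
```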
Naive Bayes algorithm
Naive Bayes algorithm
• The Naive Bayes Classifier is a popular algorithm in Machine Learning. It is a
supervised learning algorithm used for classification, and it is particularly
useful for text classification problems.
• The naive Bayes classifier is based on Bayes' theorem, a classic of
probability theory built on conditional probabilities.
Naive Bayes algorithm
Conditional probabilities:
• What is the probability that an event occurs,
• given that another event has already happened?
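Bayes' theorem expresses this as follows, for events A and B:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$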
Naive Bayes algorithm - Example
Naive Bayes algorithm - USE CASES
The naive Bayes classifier can be applied in various scenarios; one of the
classic use cases for this learning model is document classification, which
involves determining whether a document belongs to certain categories
or not (a code sketch follows the list). It is used for:
• Spam filtering.
• Sentiment analysis.
• Recommendation systems.
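A minimal spam-filtering sketch with scikit-learn's MultinomialNB (the toy messages and labels are invented for illustration):

```python
# Text classification with naive Bayes: bag-of-words counts + MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["win a free prize now", "meeting at noon tomorrow",
          "free money click here", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free prize meeting"]))  # predicted category
```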
PW
Unsupervised Machine Learning
What is Unsupervised Learning?
• As the name suggests, unsupervised learning is a machine learning technique in
which models are not supervised using a labelled training dataset. Instead, the model
itself finds hidden patterns and insights in the given data. It can be compared to
the learning that takes place in the human brain when learning new things.
• Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data according to similarities, and
represent the dataset in a compressed format.
Example - Unsupervised Learning
• Suppose an unsupervised learning algorithm is given an
input dataset containing images of different types of cats and
dogs. The algorithm is never trained on the given dataset,
which means it has no prior idea about the features of
the dataset. Its task is to identify the image features on its own.
The unsupervised learning algorithm will perform this task by
clustering the image dataset into groups according to the
similarities between images.
Why use Unsupervised Learning?
Below are some main reasons that describe the importance of unsupervised learning:
• Unsupervised learning is helpful for finding useful insights in the data.
• Unsupervised learning is much like how a human learns to think from their own
experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes
it all the more important.
• In the real world, we do not always have input data with corresponding output,
and solving such cases requires unsupervised learning.
Working of Unsupervised Learning
The working of unsupervised learning can be understood from the diagram below:
Types of Unsupervised Learning Algorithm
Below is the list of some popular unsupervised learning algorithms:
• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Independent Component Analysis
• Apriori algorithm
Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks as compared to
supervised learning because, in unsupervised learning, we don't have labeled
input data.
• Unsupervised learning is often preferable in practice because unlabeled data is
much easier to obtain than labeled data.
Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than supervised learning, as
there is no corresponding output to learn from.
• The result of an unsupervised learning algorithm might be less accurate, since
the input data is not labeled and the algorithm does not know the exact output in
advance.
K-Means Clustering Algorithm
• K-Means Clustering is an unsupervised learning algorithm used to solve
clustering problems in machine learning and data science.
• It groups the unlabeled dataset into different clusters. Here K defines the
number of predefined clusters to be created in the process: if K=2, there will be
two clusters; for K=3, three clusters; and so on.
K-Means Clustering Algorithm
• It allows us to cluster the data into different groups and provides a convenient
way to discover the categories of groups in an unlabeled dataset on its own,
without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of the algorithm is to minimize the sum of distances between the
data points and their corresponding cluster centroids.
K-Means Clustering Algorithm
• The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it can no longer improve the clusters. The
value of k must be predetermined.
• The k-means clustering algorithm mainly performs two tasks:
1. Determines the best values for the K center points (centroids) by an iterative process.
2. Assigns each data point to its closest k-center. The data points near a
particular k-center form a cluster.
• Hence each cluster contains data points with some commonalities, away from the
other clusters.
How does the K-Means Algorithm Work?
• The working of the K-Means algorithm is explained in the steps below (a code sketch follows the list):
1. Step-1: Select the number K to decide the number of clusters.
2. Step-2: Select K random points or centroids (they need not come from the input dataset).
3. Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
4. Step-4: Calculate the variance and place a new centroid for each cluster.
5. Step-5: Repeat the third step, i.e., reassign each data point to the new
closest centroid of its cluster.
6. Step-6: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.
7. Step-7: The model is ready.
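A minimal sketch of these steps using scikit-learn's KMeans (synthetic 2-D data and K=2 are illustrative assumptions):

```python
# K-Means on synthetic 2-D data: fit, then inspect centroids and assignments.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=2, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids
print(km.labels_[:10])       # cluster assigned to the first 10 points
```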
How does the K-Means Algorithm Work?
• Suppose we have two variables, M1 and M2. The x-y scatter plot of these two
variables is given below.
Let's take the number k of clusters to be K=2, to group the dataset into two
different clusters.
How does the K-Means Algorithm Work?
• We need to choose k random points or
centroids to form the clusters. These points
can be points from the dataset or any
other points. Here we select two points as
k-points that are not part of our dataset.
Consider the following image:
How does the K-Means Algorithm Work?
• Now we assign each data point of the
scatter plot to its closest k-point or centroid.
We compute this with the usual mathematics
for the distance between two points. To
visualize the assignment, we draw the median
line (perpendicular bisector) between the two
centroids.
How does the K-Means Algorithm Work?
• From the previous image, it is clear that
points on the left side of the line are closer to
the K1 (blue) centroid, and points to the right
of the line are closer to the yellow centroid.
Let's color them blue and yellow for clear
visualization.
How does the K-Means Algorithm Work?
• To find the closest clusters, we repeat the process with new centroids. To choose
the new centroids, we compute the center of gravity of the points in each cluster
and move each centroid there, as follows:
How does the K-Means Algorithm Work?
• Next, we reassign each data point to its new
closest centroid. For this, we repeat the
same process of finding a median line. The
new median looks like the following image:
How does the K-Means Algorithm Work?
• From the previous image, we can see that one
yellow point is on the left side of the line, and
two blue points are to the right of the line. So
these three points will be reassigned to the
other centroid.
How does the K-Means Algorithm Work?
• As reassignment has taken place, we go back
to Step-4, which is finding new centroids or
k-points.
• We repeat the process of finding the center of
gravity of the points in each cluster, so the new
centroids are as shown in the following image:
How does the K-Means Algorithm Work?
• With the new centroids, we again draw the
median line and reassign the data points,
giving the following image:
How does the K-Means Algorithm Work?
• We can see in the following image that there are no misassigned data points on
either side of the line, which means the clusters are stable and our model is formed.
How does the K-Means Algorithm Work?
• As our model is ready, we can now remove the assumed centroids, and the two
final clusters are as shown in the image below:
How to choose the value of K (number of clusters)
• The performance of the K-means clustering algorithm depends highly on the
quality of the clusters it forms, and choosing the optimal number of clusters
is a big task. There are several ways to find the optimal number of
clusters; here we discuss the most appropriate method for finding the
number of clusters, or the value of K:
Elbow Method
• The Elbow method is one of the most popular ways to find the optimal number of
clusters. It uses the concept of the WCSS value. WCSS stands for Within-Cluster
Sum of Squares, and it measures the total variation within the clusters. The
formula to calculate the value of WCSS (for 3 clusters) is given below:
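The formula itself is missing from the extracted slide; the standard form for 3 clusters with centroids $C_1$, $C_2$, $C_3$ is:

$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} d(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} d(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} d(P_i, C_3)^2$$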
Elbow Method
To find the optimal number of clusters, the elbow method follows the steps below
(a code sketch follows the list):
• Execute K-means clustering on the given dataset for different values of K
(e.g., ranging from 1 to 10).
• For each value of K, calculate the WCSS value.
• Plot a curve of the calculated WCSS values against the number of clusters K.
• The sharp point of bend, where the plot looks like an arm, is considered
the best value of K.
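A minimal sketch, assuming scikit-learn (whose inertia_ attribute is exactly the WCSS) and synthetic data:

```python
# Elbow method: run K-Means for K = 1..10 and plot the WCSS per K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K"); plt.ylabel("WCSS")
plt.show()                     # look for the 'elbow' in the curve
```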
Elbow Method
Because the graph shows a sharp bend that looks like an elbow, the method is known
as the elbow method. The graph for the elbow method looks like the image below:
PW
The End
Any questions?