Spring 2025
ECE 490: Introduction to
Machine Learning
Chapter 4: Supervised Machine Learning Algorithms
Machine Learning Lifecycle
ECE 490: Introduction to ML 2
Where are we in the life cycle now?
We are here
ECE 490: Introduction to ML 3
Supervised Learning
ECE 490: Introduction to ML 4
Recap: Supervised Learning
Supervised learning is a category of machine learning that uses labeled datasets
to train algorithms to predict outcomes and recognize patterns.
ECE 490: Introduction to ML 5
Algorithm vs Model
A model is the outcome of training an algorithm on data; it represents the
learned patterns, relationships, or predictions based on the training process.
ECE 490: Introduction to ML 6
Algorithm vs Model
ECE 490: Introduction to ML 7
Types of Supervised Machine Learning
(SML) Applications
ECE 490: Introduction to ML 8
Classification vs Regression
ECE 490: Introduction to ML 9
Classification
Classification is a type of supervised learning where the goal is to predict the
categorical label of new observations based on past observations.
ECE 490: Introduction to ML 10
Regression
Regression is another key type of supervised learning that focuses on predicting
continuous numerical values rather than categorical labels.
ECE 490: Introduction to ML 11
Examples of SML Applications
ECE 490: Introduction to ML 12
Optical Character Recognition (OCR)
The model is able to identify handwritten characters and classify each image as a
character. In this case, we are classifying the numerical digits (0–9).
ECE 490: Introduction to ML 13
Email Prioritization
The model is able to successfully detect which of the arriving emails go to spam
and which to the primary inbox.
ECE 490: Introduction to ML 14
Language Translation
The model is able to take in a sequence in
one language and output a sequence of the
same information in a different language.
Would this be classification or regression?
ECE 490: Introduction to ML 15
Language Translation
Language translation is a classification
problem because it involves predicting the
next word (or token) from a predefined
vocabulary, which can be seen as a list of
possible "classes." Each word in the
vocabulary is effectively a "class" that the
model selects based on the input context.
ECE 490: Introduction to ML 16
Linear Regression
ECE 490: Introduction to ML 17
Linear Regression
Linear regression is a statistical method
used to model the relationship between
the target variable and a feature variable
with a line.
Based on the training data points, we try to
create a line that best models the
relationships between the input feature and
the output feature.
ECE 490: Introduction to ML 18
Applications of Linear Regression
ECE 490: Introduction to ML 19
Linear Regression
Given that our dataset has one input feature and an output feature, let’s plot our
data.
ECE 490: Introduction to ML 20
Linear Regression
We want to create a line that models the training data points with the lowest error
possible. We call the modeled relationship the "best fit line".
- Best fit line
- Predicted line
ECE 490: Introduction to ML 21
Linear Regression
[Figure: a training data point, with its label on the y-axis, the input feature value on the x-axis, and the corresponding predicted value on the fitted line]
ECE 490: Introduction to ML 22
Linear Regression
We want to create a line that will not only
model the current data points, but will also
allow us to predict future outputs with high
accuracy.
[Figure: predicted value for a new input (a value not seen in the dataset)]
ECE 490: Introduction to ML 23
Linear Regression
To get to the best fit line, we will go
through multiple iterations of updates.
What is the model updating (or “learning”)?
In the case of creating a line, we update
the slope and the y-intercept.
ECE 490: Introduction to ML 24
Linear Regression
Since the model is trying to create a best fit line, it is optimizing the equation of a
line.
ŷ = w·x + b
where ŷ is the predicted value, x is the input feature value, and w and b are the model's trainable parameters.
ECE 490: Introduction to ML 25
“Learning” of machine learning algorithms
We call the variables that are updated during the training process "trainable
parameters" or "weights". In some cases, we also have a "bias" term as a
trainable parameter.
- Trainable parameters because they are getting “trained” or updated during
the learning process.
- Weights because they also carry the importance and contribution of each
input value.
- Bias because it helps direct the predictions of the model for higher
accuracy.
ECE 490: Introduction to ML 26
Linear Regression
In the case of linear regression, we have both a weight and a bias in our model.
ŷ = w·x + b, where w is the weight and b is the bias term: these are the model's trainable parameters.
ECE 490: Introduction to ML 27
How SML algorithms learn
ECE 490: Introduction to ML 28
Learning flow of SML models
ECE 490: Introduction to ML 29
SML training process
The training (or learning) process of
supervised machine learning algorithms
consists of 4 parts:
1. Preparing the training dataset
2. Initializing the algorithm
3. Loop over data points in the training
set:
a. Make a prediction (Class or Number)
b. Calculate the error of the prediction
4. Update the algorithm parameters
ECE 490: Introduction to ML 30
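To make these four parts concrete, here is a minimal Python sketch of the loop for a one-feature linear model; the toy dataset, learning rate, and epoch count are illustrative placeholders, not values from the slides.

```python
import random

# 1. The training dataset is assumed to be already prepared (numerical, split).
#    Toy (feature, label) pairs -- illustrative only.
training_set = [(0.1, 0.3), (0.4, 0.5), (0.7, 0.8), (1.0, 1.1)]

# 2. Initialize the algorithm: random trainable parameters (slope w, intercept b)
w, b = random.random(), random.random()
learning_rate = 0.1

for epoch in range(50):
    # 3. Loop over data points in the training set
    for x, y in training_set:
        y_pred = w * x + b        # 3a. make a prediction
        error = y_pred - y        # 3b. calculate the error of the prediction
        # 4. Update the algorithm parameters (gradient step for the squared error)
        w -= learning_rate * 2 * error * x
        b -= learning_rate * 2 * error
```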
Preparing the training dataset
The data preparation steps, as
covered in the previous chapters,
need:
1. EDA and feature tuning
2. Transformation to numerical
representation
3. Split the dataset for training
and testing
ECE 490: Introduction to ML 31
Training Linear Regression
We mentioned that linear regression has one bias term
and one weight that need to be updated using the
training loop. Let's see how that happens.
First, we will use a dataset of randomly created points.
ECE 490: Introduction to ML 32
Training Linear Regression
Since the dataset is random and does not actually hold
any information, we don’t need to perform EDA.
We will only pre-process the data by normalizing it.
ECE 490: Introduction to ML 33
Initializing the algorithm
Every machine learning algorithm holds its
learned patterns and connections within its
parameters, often referred to as “trainable
parameters”.
These trainable parameters, at the
beginning of the training process, are
randomly initialized to be updated and
“learned” during the training process.
ECE 490: Introduction to ML 34
Linear Regression Example
So, we will choose random initializations for our trainable parameters.
ECE 490: Introduction to ML 35
Linear Regression Example
Let's visualize the initialized line.
ECE 490: Introduction to ML 36
Looping over data points
For every data point, we feed the data point to the model so it can make a
prediction. This prediction would not be fully accurate, so it contains some error.
[Figure: for one input data point, the error is the gap between the real value and the predicted value]
ECE 490: Introduction to ML 37
Looping over data points
This error that we got, which is the difference between real and predicted values,
should guide our update of the trainable parameters.
The goal is to update the trainable parameters so that their update results in
a lower error.
ECE 490: Introduction to ML 38
Error calculation
For regression tasks, we have multiple
options for calculating the error between the
real and predicted values:
1. Mean Bias Error (MBE)
2. Mean Squared Error (MSE)
3. Mean Absolute Error (MAE)
4. Root Mean Squared Error (RMSE)
5. Huber Loss
6. Mean Logarithmic Error (MLE)
7. …
ECE 490: Introduction to ML 39
Error calculation - Loss function
If we calculate the error for more than one data point at a time, this is done using a
loss function.
The loss function aggregates the individual errors across the selected data
points, typically by computing the average or sum of the errors using a specified
error function (e.g., MSE, MAE).
ECE 490: Introduction to ML 40
Error Calculation - Mean Bias Error
Measures the average difference between predicted and actual values. It indicates
the direction of the error (positive or negative bias).
- Positive MBE: Overestimation.
- Negative MBE: Underestimation.
Useful for understanding bias in predictions but not ideal as a standalone metric
because it doesn't capture the magnitude of errors.
Error Function: eᵢ = ŷᵢ − yᵢ        Loss Function: MBE = (1/n) Σᵢ (ŷᵢ − yᵢ)
ECE 490: Introduction to ML 41
Linear Regression Example - MBE
In the linear regression model we initialized earlier, let’s try to get a prediction from
the first data point in the training set and calculate its MBE.
ECE 490: Introduction to ML 42
Error Calculation - Mean Squared Error
Measures the average squared differences between predicted and actual values.
Squaring emphasizes larger errors, making it sensitive to outliers.
Use this error when you want to penalize large errors more heavily or when
outliers are meaningful.
Error Function: eᵢ = (ŷᵢ − yᵢ)²        Loss Function: MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²
ECE 490: Introduction to ML 43
Linear Regression Example - MSE
Using the MSE, we see a significant difference in the error.
Both MSE and MBE can guide the update of trainable parameters during model
training, but they serve different purposes: MSE is more commonly used to minimize
overall prediction error, while MBE provides insight into whether the model
consistently overestimates or underestimates.
ECE 490: Introduction to ML 44
Error Calculation - Mean Absolute Error
Measures the average of absolute differences between predicted and actual
values. It treats all errors equally, making it robust to outliers.
Useful for when you want a simple, interpretable metric that is less sensitive to
outliers than MSE.
Error Function: eᵢ = |ŷᵢ − yᵢ|        Loss Function: MAE = (1/n) Σᵢ |ŷᵢ − yᵢ|
ECE 490: Introduction to ML 45
Linear Regression Example - MAE
The MAE provides a more robust measure of error by treating all deviations
equally, without disproportionately penalizing large errors, unlike MSE. This
characteristic makes MAE less sensitive to outliers and provides a more balanced
reflection of typical errors in the model.
ECE 490: Introduction to ML 46
Error Calculation - Root Mean Squared Error
The square root of MSE. It provides the error in the same units as the target
variable, making it more interpretable than MSE.
Useful for when you want a metric in the same scale as the target variable, while
still penalizing large errors more heavily.
Suppose you're predicting house prices in dollars. RMSE provides an error value
(e.g., $5,000) that is also in dollars. This tells you that, on average, your model's
prediction is about $5,000 off from the actual value.
Error Function: eᵢ = (ŷᵢ − yᵢ)²        Loss Function: RMSE = √( (1/n) Σᵢ (ŷᵢ − yᵢ)² )
ECE 490: Introduction to ML 47
Linear Regression Example - RMSE
In this case, we got RMSE and MAE as the same value. This makes sense
because we calculating the errors for one data point.
Both are expressed in the same units as the target variable, but RMSE can
sometimes overemphasize large errors, which might distort the perception of the
model's performance.
ECE 490: Introduction to ML 48
Error Calculation - Huber Loss
Combines the properties of MSE and MAE. It behaves like MSE for small errors
and switches to MAE for large errors, making it robust to outliers while maintaining
sensitivity to small errors.
Error Function: L_δ(e) = ½e² if |e| ≤ δ, and L_δ(e) = δ(|e| − ½δ) if |e| > δ, where e = y − ŷ.
Loss Function: the average of L_δ(e) over the selected data points.
You set the threshold (δ) according to the scale of the errors in your dataset.
ECE 490: Introduction to ML 49
Linear Regression Example - Huber Loss
In our case, since the error > δ, we applied the MAE-like branch of the Huber loss.
ECE 490: Introduction to ML 50
Error Calculation - MLE
Measures the error logarithmically, which reduces the impact of large errors. This
metric is useful when the target values vary over several orders of magnitude.
Useful for when handling data with widely varying scales or when large errors are
undesirable but should not dominate the metric.
Error Function: eᵢ = (log(1 + yᵢ) − log(1 + ŷᵢ))²
Loss Function: MLE = (1/n) Σᵢ (log(1 + yᵢ) − log(1 + ŷᵢ))²
ECE 490: Introduction to ML 51
Linear Regression Example - MLE
As you can see, MLE tends to be much smaller than other error metrics like MSE,
MAE, or RMSE. This is because it focuses on relative error rather than outlier
impact.
ECE 490: Introduction to ML 52
Regression Error Functions Recap
ECE 490: Introduction to ML 53
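To recap these metrics in code, here is a NumPy sketch assuming 1-D arrays y_true and y_pred; the last function writes the logarithmic error as a squared-log difference, which is an assumption on my part since the slides do not show the exact MLE formula.

```python
import numpy as np

def mbe(y_true, y_pred):                 # Mean Bias Error: signed average difference
    return np.mean(y_pred - y_true)

def mse(y_true, y_pred):                 # Mean Squared Error: emphasizes large errors
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):                 # Mean Absolute Error: treats all errors equally
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):                # Root Mean Squared Error: same units as the target
    return np.sqrt(mse(y_true, y_pred))

def huber(y_true, y_pred, delta=1.0):    # Huber loss with threshold delta
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2                  # MSE-like branch for small errors
    linear = delta * (err - 0.5 * delta)        # MAE-like branch for large errors
    return np.mean(np.where(err <= delta, quadratic, linear))

def log_error(y_true, y_pred):           # assumed squared-log form of the "MLE" metric
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
```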
Updating Algorithm Trainable Parameters
Now, the algorithm is initialized, and we made a prediction (or a set of predictions)
and calculated their error using the chosen loss function.
We want to utilize this error to perform an update to the trainable parameters.
Intuitively, we want to update the trainable parameters in a way that will minimize
the error.
This process is performed using a family of methods called optimization
algorithms (optimizers).
ECE 490: Introduction to ML 54
Optimization Algorithms - Gradient Descent
For now, we will focus on the most popular optimization algorithm: Gradient
Descent, which can be used in the training of ANY machine learning
algorithm.
Gradient descent is simply used to find the values of a function’s parameters
(coefficients) that minimize a cost function as much as possible.
ECE 490: Introduction to ML 55
Optimization Algorithms - Gradient Descent
If we were to plot the cost (or loss) function with respect to the change in
value of one trainable parameter, we would get a 2D curve.
ECE 490: Introduction to ML 56
Optimization Algorithms - Gradient Descent
If we were to plot the error with respect to the change in value of two trainable
parameters, we would get a 3D surface.
Desired point: the lowest value of the error
ECE 490: Introduction to ML 57
Optimization Algorithms - Gradient Descent
To get the lowest error, we should get the minimum of the loss (or cost) function.
How do we get the global minimum of a function?
ECE 490: Introduction to ML 58
Optimization Algorithms - Gradient Descent
The global minimum of a function occurs where the derivative of the function equals 0.
Thus, we want to move the parameter closer to where the loss is at its
minimum through incremental steps that follow the derivative of the loss
function.
Where the
derivative of the
function = 0
ECE 490: Introduction to ML 59
Optimization Algorithms - Gradient Descent
Thus, the update moves as follows:
ECE 490: Introduction to ML 60
Optimization Algorithms - Gradient Descent
Gradient Descent (GD) is an iterative process where you start at a coefficient’s
initial point (x0) and you move step by step until you reach the minimum of the
loss function.
The update rule of your position is given by this formula:
new_value = old_value − learning_rate × derivative_of_cost_function_wrt_param
ECE 490: Introduction to ML 61
Optimization Algorithms - Gradient Descent
Let’s dissect the update rule of gradient descent:
New_value = old_value - learning_rate * derivative_of_cost_function_wrt_param
1. New_value: The updated value of the parameter.
2. old_value: The previous value of the parameter. In the case of the first
update, this is the random initialization value of the parameter.
3. learning_rate: The rate of update (how fast or slow) we will move towards the
optimal coefficients/parameters.
4. derivative_of_cost_function_wrt_param: The derivative of the cost function,
used to direct the update closer to the global minimum.
ECE 490: Introduction to ML 62
Optimization Algorithms - Gradient Descent
The choice of the learning rate affects the size of the steps we are taking to get to
the minimum.
ECE 490: Introduction to ML 63
Optimization Algorithms - Gradient Descent
Let's take MSE as the choice for our loss function and compute its derivative with
respect to the trainable parameters. For ŷ = w·x + b and MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²:
∂MSE/∂w = −(2/n) Σᵢ (yᵢ − ŷᵢ)·xᵢ        ∂MSE/∂b = −(2/n) Σᵢ (yᵢ − ŷᵢ)
ECE 490: Introduction to ML 64
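As a sketch of how these derivatives drive the update (assuming the single-feature model ŷ = w·x + b; the variable names and sample values are illustrative):

```python
import numpy as np

def gradient_descent_step(w, b, x, y, learning_rate=0.1):
    """One batch gradient-descent update for y_hat = w*x + b under MSE."""
    n = len(x)
    y_hat = w * x + b
    # dMSE/dw = -(2/n) * sum((y - y_hat) * x),  dMSE/db = -(2/n) * sum(y - y_hat)
    dw = -(2.0 / n) * np.sum((y - y_hat) * x)
    db = -(2.0 / n) * np.sum(y - y_hat)
    # Gradient descent rule: new_value = old_value - learning_rate * derivative
    return w - learning_rate * dw, b - learning_rate * db

# Example usage on a few toy points:
x = np.array([0.1, 0.4, 0.7, 1.0])
y = np.array([0.3, 0.5, 0.8, 1.1])
w, b = np.random.randn(), np.random.randn()
for _ in range(100):
    w, b = gradient_descent_step(w, b, x, y)
```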
Updating Algorithm Trainable Parameters
Depending on the size and complexity
of our data, we might choose to update
our parameters:
- Every data point,
- Every batch of data points,
- After the ingestion of all the training
data points.
ECE 490: Introduction to ML 65
When do we perform this update?
We can apply gradient descent:
1. After every data point: Stochastic Gradient Descent (SGD)
2. After every batch of data points: Mini-Batch Gradient Descent
3. After all the training data points: (full-)Batch Gradient Descent
ECE 490: Introduction to ML 66
Linear Regression Example
Going back to our linear regression example, we can now train our algorithm by
performing the parameter update loop using gradient descent.
ECE 490: Introduction to ML 67
Linear Regression Example
We can now visualize the initial
and final model.
ECE 490: Introduction to ML 68
Multi-Linear Regression
ECE 490: Introduction to ML 69
Multi-Linear Regression
While we can model the relationship between a feature and the output using a
line, we can model the relationship between two input features and the output using
a plane.
ECE 490: Introduction to ML 70
Multi-Linear Regression
Using the same logic, the relationship between three or more features and the
output can be modeled using a hyperplane.
ECE 490: Introduction to ML 71
Multi-Linear Regression
Thus, multi-linear regression is modeled by a linear equation in several variables.
The number of variables depends on the number of input features:
ŷ = b + w₁x₁ + w₂x₂ + … + wₙxₙ
where ŷ is the output, x₁ … xₙ are the input features, w₁ … wₙ are the weights, and b is the bias.
ECE 490: Introduction to ML 72
Updates in multi-linear regression
We apply the gradient descent update rule for each weight in the multi-linear
regression model in each iteration.
Derivative of the cost function: ∂MSE/∂wⱼ = −(2/n) Σᵢ (yᵢ − ŷᵢ)·xᵢⱼ
Update rule: wⱼ ← wⱼ − learning_rate · ∂MSE/∂wⱼ
ECE 490: Introduction to ML 73
Classification Algorithms
ECE 490: Introduction to ML 74
Binary vs Multi-Class Classification
ECE 490: Introduction to ML 75
Binary vs Multi-Class Classification
ECE 490: Introduction to ML 76
What would the data look like?
ECE 490: Introduction to ML 77
Logistic Regression
ECE 490: Introduction to ML 78
Logistic Regression
Logistic Regression is a statistical method used for binary classification
problems, where the outcome variable is categorical with two possible outcomes.
ECE 490: Introduction to ML 79
Logistic Regression
So why does it have ‘Regression’ in its name if it is a classification problem?
Logistic Regression is an extension of linear regression and multi-linear
regression to be used for classification.
This extension is done by adding a mapping function that would allow us to map
the output of the linear regression part to a class.
ECE 490: Introduction to ML 80
Logistic Regression
The equation of logistic regression: we add a "Sigmoid" function after the
(multi-)linear regression block, so the linear output z = b + w₁x₁ + … + wₙxₙ
becomes ŷ = sigmoid(z).
What is this sigmoid function?
ECE 490: Introduction to ML 81
Logistic Regression
The sigmoid function: σ(z) = 1 / (1 + e^(−z)), which maps any real number to a
value between 0 and 1.
Given that our threshold is 0.5: if the output of the
sigmoid function is above 0.5, the data point belongs to
class A.
Otherwise, the data point belongs to class B.
ECE 490: Introduction to ML 82
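A small sketch of this mapping in code (illustrative names; default threshold 0.5): the linear block's output is pushed through the sigmoid and then compared against the threshold.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, weights, bias, threshold=0.5):
    z = np.dot(weights, x) + bias          # (multi-)linear regression block
    probability = sigmoid(z)               # map to (0, 1)
    return "A" if probability > threshold else "B"

print(predict_class(np.array([1.2, 0.5]), weights=np.array([0.8, -0.4]), bias=0.1))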
Logistic Regression
Should the threshold always be
0.5? No.
This is the default threshold.
However, depending on your data,
you might find another threshold more
suitable.
ECE 490: Introduction to ML 83
What would the line in the LR block represent?
We saw that the sigmoid function is preceded by a linear regression block.
Instead of modeling the data, the line is used as a decision boundary between the classes.
ECE 490: Introduction to ML 84
Trainable parameters in Logistic Regression
We know what the trainable parameters are in the (multi-)linear regression block
(coefficients + bias).
Do we have any trainable parameters in the sigmoid block?
ECE 490: Introduction to ML 85
Trainable parameters in Logistic Regression
What about the threshold that we use to decide if the output of the sigmoid block
maps to class A or class B?
The threshold is pre-set and not changed or “learned” during the training process.
This means that it is a fixed parameter and not a trainable parameter.
We call these types of parameters “hyper-parameters”.
ECE 490: Introduction to ML 86
Training Classification Algorithms
ECE 490: Introduction to ML 87
Classification Training Lifecycle
To update the parameters of a classification function, we will have to follow the
same training steps as before:
1. Random initialization of training parameters
2. Passing the function through one or multiple training samples
3. Calculating the error
4. Using gradient descent to update the value of the trainable parameters
Which part do you think differs between classification and regression?
ECE 490: Introduction to ML 88
Classification Training Lifecycle
To update the parameters of a classification function, we will have to follow the
same training steps as before:
1. Random initialization of training parameters
2. Passing the function through one or multiple training samples
3. Calculating the error
4. Using gradient descent to update the value of the trainable parameters
We cannot measure the error using the same metrics for classification
as for regression.
ECE 490: Introduction to ML 89
Classification Error Metrics- Binary
Let’s say we have this example where we will predict if a person is male (1) or
female (0) based on their height
ECE 490: Introduction to ML 90
Classification Error Metrics- Binary
The output of the logistic regression function will be the output of the sigmoid
function. This means that it will be a float between 0 and 1.
ECE 490: Introduction to ML 91
Binary Cross Entropy
The goal of training is to maximize the likelihood that the model assigns to the
correct labels. Instead of maximizing likelihood directly, we minimize the negative
log-likelihood, leading to:
BCE(y, ŷ) = −[ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]
ECE 490: Introduction to ML 92
Binary Cross Entropy
This function penalizes wrong predictions exponentially by taking the log of
predicted probabilities. Logarithmic scaling ensures that confident wrong
predictions are penalized much more than weakly wrong predictions.
ECE 490: Introduction to ML 93
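A brief NumPy sketch of binary cross entropy over a batch (illustrative names; a small epsilon guards against log(0)), showing how a confident wrong prediction is penalized far more than a mildly wrong one:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Average negative log-likelihood over the batch
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# True label is 1; a mildly wrong prediction vs. a confidently wrong one:
print(binary_cross_entropy(np.array([1.0]), np.array([0.6])))   # ~0.51
print(binary_cross_entropy(np.array([1.0]), np.array([0.01])))  # ~4.61
```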
Binary Cross Entropy
The loss function we saw is called Binary Cross Entropy.
- Entropy measures the uncertainty or unpredictability of a probability distribution. If an event is certain, entropy is low.
- Cross-entropy measures the distance between two probability distributions: the true distribution P (actual labels) and the predicted distribution Q (the model's predictions).
- "Binary" means that it is used to calculate the error and update binary classification problems.
ECE 490: Introduction to ML 94
Binary Cross Entropy
Imagine we have two models predicting the probability of an image being a "cat".
ECE 490: Introduction to ML 95
Binary Cross Entropy
ECE 490: Introduction to ML 96
Binary Cross Entropy: Loss vs Cost
As we mentioned before, a loss (or error) is computed for one data sample, while a
cost is computed for multiple data samples.
ECE 490: Introduction to ML 97
Classification Error Metrics
So far, with logistic regression, we saw a binary classification model that outputs a
value between 0 and 1, where:
● Anything below the threshold (0.5) belongs to one class and anything above
belongs to another.
What if we have multiple classes?
ECE 490: Introduction to ML 98
Multi-Class Classification
Multi-class classification
models output an array of
probabilities instead of one
probability (likelihood of
belonging to class A).
ECE 490: Introduction to ML 99
Classification Error Metrics
For multi-class classification, we use Categorical Cross Entropy instead of
Binary Cross Entropy.
CCE = −(1/N) Σᵢ Σⱼ yᵢⱼ · log(ŷᵢⱼ)
where k is the number of classes, ŷᵢⱼ is the predicted probability that sample i belongs to class j, and yᵢⱼ is the true (one-hot) value indicating that sample i belongs to class j.
We sum over all classes for each sample; one-hot encoding ensures only the true class contributes to the loss.
ECE 490: Introduction to ML 100
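A sketch of categorical cross entropy in code, assuming one-hot labels and predicted probabilities both of shape (N, k) (the names and toy values are illustrative):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot labels, shape (N, k); y_pred: predicted probabilities, shape (N, k)
    y_pred = np.clip(y_pred, eps, 1.0)
    # Sum over classes for each sample, then average over samples
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_true, y_pred))  # only the true class contributes
```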
Categorical Cross Entropy
ECE 490: Introduction to ML 101
Logistic Regression for MultiClass Classification
Can we use logistic regression for multiclass
classification? Yes.
To do so, we have two options:
1. One vs All algorithm. We train multiple
binary logistic regression models, each
distinguishing one class from all others.
2. If we switch the sigmoid function with a
softmax function, we can output an array of
probabilities instead of a singular probability
value.
ECE 490: Introduction to ML 102
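A minimal sketch of the softmax option mentioned above (the scores are illustrative): the linear block's per-class scores become an array of probabilities that sums to 1.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, 0.1])   # outputs of the linear block, one per class
probs = softmax(scores)              # array of class probabilities summing to 1
print(probs, probs.argmax())         # argmax gives the predicted class
```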
Gradient Descent using BCE and CCE
We perform the same steps in the gradient descent update, where we also differentiate
the cost function and insert it into the gradient update rule.
ECE 490: Introduction to ML 103
Training logistic regression from scratch
in lab ‘Classification Algorithms’
ECE 490: Introduction to ML 104
Naive Bayes
ECE 490: Introduction to ML 105
Naive Bayes
Naive Bayes is a probability-based algorithm. Logistic regression was a
linear-regression-based algorithm used to output a probability.
In Naive Bayes, however, we use a probability function directly to calculate the
probability of a data point belonging to a class.
ECE 490: Introduction to ML 106
Probability Recap
Calculating the probability of an event means finding how likely this event is to
happen.
Probabilities for a fair (calibrated) die:
ECE 490: Introduction to ML 107
Probability Recap
Expectation: Summation of all possible values of a random variable, multiplied by
the probability of each.
ECE 490: Introduction to ML 108
Probability Recap
In most cases, the probability is not equally distributed.
● What if we don't know whether the die is fair (calibrated)?
● What if we don't know whether the coin is fair?
● Then we don't know the probabilities, so we need to approximate them.
ECE 490: Introduction to ML 109
Naive Bayes
The Naive Bayes algorithm is based on Bayes' Theorem for conditional
probability.
Let's recall conditional probability:
ECE 490: Introduction to ML 110
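For reference, the standard definitions being recalled here (written out, since the slide's formula is not reproduced in the text):

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
\qquad\text{and, by Bayes' theorem,}\qquad
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```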
Naive Bayes
How does conditional probability apply to prediction problems?
In our case of classification, we want to calculate the probability of a data point
belonging to class A, to class B, to class C… Then, we will use the highest
probability to select the class.
What guides our decision in choosing the class? The information in the features.
ECE 490: Introduction to ML 111
Naive Bayes
What guides our decision in choosing the class? The information in the features.
ECE 490: Introduction to ML 112
Posterior Probability
ECE 490: Introduction to ML 113
Naive Bayes Training
Unlike KNN, which stores all training examples, Naïve Bayes compresses the data
into a small set of probability values.
During training, it computes:
- Class priors: P(Ck)– the probability of each class occurring.
- Feature likelihoods: P(X∣Ck) – how often a feature takes a certain value,
given a class.
- For categorical data: It counts the occurrences of feature values for each
class and normalizes them into probabilities.
- For continuous data: It estimates the mean and variance of the feature
values per class.
ECE 490: Introduction to ML 114
Naive Bayes Training
After computing the class priors and
feature likelihoods, it applies Bayes'
Theorem to compute posterior
probabilities.
This makes it computationally efficient
because training is just counting and
computing probabilities—there’s no
iterative optimization.
ECE 490: Introduction to ML 115
Naive Bayes Training
To summarize, the Naive Bayes algorithm is trained using three steps:
1. Counts feature occurrences per class.
2. Estimates likelihoods (of features) using probability distributions (e.g.,
Gaussian for continuous data, Multinomial for text data).
3. Applies Bayes' Theorem to compute posterior probabilities.
ECE 490: Introduction to ML 116
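A hedged sketch of these three steps for a single categorical feature, in the spirit of the weather example that follows; the toy data and variable names are illustrative, not the course's dataset.

```python
from collections import Counter, defaultdict

# Toy data: (weather, kids_play) pairs -- illustrative only
data = [("Sunny", "Yes"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Sunny", "Yes")]

# Step 1: count feature occurrences per class
class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)
for value, label in data:
    feature_counts[label][value] += 1

# Step 2: class priors P(C) and feature likelihoods P(x | C)
n = len(data)
priors = {c: class_counts[c] / n for c in class_counts}
likelihood = {c: {v: feature_counts[c][v] / class_counts[c]
                  for v in feature_counts[c]} for c in class_counts}

# Step 3: apply Bayes' Theorem (the denominator is the same for all classes, so it is omitted)
def predict(value):
    scores = {c: priors[c] * likelihood[c].get(value, 0.0) for c in priors}
    return max(scores, key=scores.get)

print(predict("Sunny"))   # class with the highest posterior score
```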
Naive Bayes Example
To solidify how Naïve Bayes functions, we will consider an
example with one feature.
The question we want to answer: will kids play if the weather
is sunny?
How do we start applying Naive Bayes?
ECE 490: Introduction to ML 117
Naive Bayes Example
Step 1: Count feature occurrences per class.
ECE 490: Introduction to ML 118
Naive Bayes Example
Step 2: Calculate the class probabilities and feature likelihoods.
In order to calculate the likelihood of kids playing with respect to the weather
condition, we begin by computing the probability of each condition and event.
ECE 490: Introduction to ML 119
Naive Bayes Example
Step 3: Apply Bayes' Theorem to compute posterior probabilities.
The model assigns the class with the highest probability
ECE 490: Introduction to ML 120
Naive Bayes Assumptions
We notice that we calculate the feature likelihoods without taking into
consideration more than one feature at a time.
This means that we assume there are no feature correlations or dependencies.
This assumption may be false for many use cases where there is at least minor
correlation between the features that should be considered for accurate modeling.
ECE 490: Introduction to ML 121
Naive Bayes Assumptions
Let's examine this assumption with an example. If we have 2 features, F1 and F2,
Bayes' rule becomes the following:
ECE 490: Introduction to ML 122
Naive Bayes Assumptions
So in case of 2 features and 2 classes, for example, we compute the following
probabilities:
ECE 490: Introduction to ML 123
Applications of Naive Bayes
ECE 490: Introduction to ML 124
Algorithms that can be used for both
regression and classification
ECE 490: Introduction to ML 125
K-Nearest Neighbors For Classification
ECE 490: Introduction to ML 126
KNNs for Classification
KNN classifies a new data instance
based on the k most similar training
examples.
Similarity, in this case, is measured by
distance.
ECE 490: Introduction to ML 127
KNNs for Classification
How do KNNs choose which class the new data point belongs to?
K is the number of nearest data points that surround the new input.
We obtain the K nearest data points by measuring the distance between the new
data point and all the other data points. The labels of the K closest data points
are used to decide which class the new data point belongs to.
ECE 490: Introduction to ML 128
KNNs for Classification
K is a hyperparameter that we set
before the training process.
We use a majority voting mechanism
to decide the class of the new data
sample.
ECE 490: Introduction to ML 129
KNNs for Classification
What do KNNs learn? What are the trainable parameters? Are we just comparing
every data point to the data points in our dataset?
KNN is an instance-based learning algorithm, meaning that it does not
explicitly learn a function or a set of parameters during training. Instead, it
memorizes the training data and makes predictions by comparing new inputs to
stored instances.
Learning in KNN is essentially data storage and indexing, which enables
efficient nearest-neighbor searches.
ECE 490: Introduction to ML 130
KNNs for Classification
ECE 490: Introduction to ML 131
KNNs for Regression
ECE 490: Introduction to ML 132
KNNs for Regression
Instead of using a majority vote to classify a new data point, KNNs for regression
averages the value of the K nearest neighbors to estimate the value of the new
data point.
New input data point
ECE 490: Introduction to ML 133
KNNs for Regression
How KNN Regression Works:
1. Distance Calculation: For a new data point, the algorithm calculates the
distance between this point and all points in the training dataset. Common
distance metrics include Euclidean, Manhattan, and Minkowski distances.
2. Identifying Neighbors: The algorithm identifies the K data points in the
training set that are closest to the new point based on the calculated
distances.
3. Prediction: The target value for the new data point is predicted by averaging
the target values of its K nearest neighbors.
ECE 490: Introduction to ML 134
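A short sketch of these three steps for KNN regression (Euclidean distance, averaging the K neighbors); replacing the average with a majority vote over labels gives the classification variant. The data and K are illustrative.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # 1. Distance calculation (Euclidean) between x_new and every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Identify the indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Prediction: average the target values of the k neighbors
    return np.mean(y_train[nearest])

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X_train, y_train, np.array([2.5]), k=2))  # average of the 2 closest
```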
Choice of K (Classification and Regression)
When K is too small (e.g., K=1 or 2)
1. The prediction is based on very few points, meaning small fluctuations in the
data can have a big impact.
2. If there’s noise in the dataset, the model might rely too much on those noisy
points, leading to inconsistent or unstable predictions.
Think of it like asking just one or two people for advice—if they have extreme
opinions, your decision might not be well-balanced.
ECE 490: Introduction to ML 135
Choice of K (Classification and Regression)
When K is too large (e.g., K=50 or 100)
● The prediction is influenced by many points, including ones that are farther
away and might not be very similar.
● The model smooths out variations, which can make it less sensitive to specific
details in the data.
● This is like averaging opinions from a very large group—while you get a
general sense of the trend, you might lose important local nuances.
ECE 490: Introduction to ML 136
Support Vector Machine for Classification
ECE 490: Introduction to ML 137
SVMs for classification
Suppose we have data to be used for binary classification.
What would you say makes the best linear separator of the two classes?
ECE 490: Introduction to ML 138
SVMs for classification
The goal of SVM is to construct the most accurate linear separator, one that will
correctly classify any new input.
ECE 490: Introduction to ML 139
SVMs for classification
But what determines the best linear separator?
In support vector machine, the best linear separator is the line that maximizes the
distance between it and the closest data points from each class.
Max distance between
separator and class A
Max distance between
separator and class B
ECE 490: Introduction to ML 140
SVMs for classification
The closest data points to the separator are called the support vectors
Support vectors
ECE 490: Introduction to ML 141
SVMs for classification
This distance, called the margin, should not only be maximized, but also
equidistant between the support vectors.
Margin
ECE 490: Introduction to ML 142
SVM inference
SVM classifies new data points by inserting the
feature values into the equation of the line. Then,
it decides the class by checking the position of the
new data point with respect to the line.
1. If w1x1 + w2x2 + b > 0. Then, the point falls
above the line and it belongs to class A.
2. Otherwise, the point falls below the line and
belongs to class B.
ECE 490: Introduction to ML 143
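In code, this decision rule is just a sign check on the separator's equation (a sketch; w1, w2, and b would come from training and the sample values are illustrative):

```python
def svm_predict(x1, x2, w1, w2, b):
    # Plug the feature values into the separator's equation
    score = w1 * x1 + w2 * x2 + b
    # Above the line -> class A, otherwise -> class B
    return "A" if score > 0 else "B"

print(svm_predict(1.0, 2.0, w1=0.8, w2=-0.5, b=0.1))
```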
Training SVM for Classification
Assuming we have two features x1 and x2, our classifier is a straight line.
The classifier equation is
w1·x1 + w2·x2 + b = 0
where w1 and w2 are the weights of x1 and x2 respectively, and b is the bias term.
We define the weight vector w = (w1, w2).
ECE 490: Introduction to ML 144
Training SVM for Classification
As we know, the goal of SVM is to maximize the value of the margin. For the linear
separator w·x + b = 0, the margin is given by:
margin = 2 / ||w||
So, maximizing the margin means minimizing ||w||, since they are inversely
proportional. To obtain a convenient quadratic optimization problem, instead of
minimizing ||w|| directly, we minimize (1/2)||w||², which makes the objective
differentiable and easier to optimize.
ECE 490: Introduction to ML 145
Training SVM for Classification
We can solve this optimization problem using Lagrange multipliers.
RECAP: The Lagrange multiplier technique lets you find the maximum or
minimum of a multivariable function (f(x,y,..)) when there is some constraint on the
input values you are allowed to use.
To ensure that all data points are correctly classified with a margin of at least 1,
the constraints are written as: yᵢ(w·xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ), with labels yᵢ ∈ {−1, +1}.
ECE 490: Introduction to ML 146
Training SVM for Classification
Now, we have the primal optimization problem: minimize (1/2)||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for all i.
Then, we introduce the Lagrange multipliers and solve for them.
ECE 490: Introduction to ML 147
Higher Dimension SVM for Classification
If we had three features, the classifier becomes a classifier plane.
ECE 490: Introduction to ML 148
Types of Margins in SVMs
Data is not perfect. Some data points may lie within the margin. Thus, we have
two types of margins:
ECE 490: Introduction to ML 149
SVM Hard Margin
Hard Margin conditions:
● The data should be linearly separable
● We select two parallel hyperplanes separating the two classes of data
● The hyperplanes are chosen so that the distance between them is as
large as possible
ECE 490: Introduction to ML 150
SVM Soft Margin
Soft Margin conditions:
● The data is not linearly separable
● We allow for errors in classification
● Finding the maximal margin means maximizing the margin between the
data points and the hyperplane
ECE 490: Introduction to ML 151
SVM Error Function
In the SVM training process, we need
to penalize two scenarios:
1. Wrongly classified data points
2. Data points that are correctly
classified but fall within the margin
ECE 490: Introduction to ML 152
SVM Error Function
Thus, we use a new loss function called Hinge Loss which considers the two
penalties that we want.
ECE 490: Introduction to ML 153
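A small sketch of the hinge loss in code, assuming labels encoded as ±1 and raw scores w·x + b (the names and values are illustrative): it is zero for points classified correctly outside the margin, and grows for margin violations and misclassifications.

```python
import numpy as np

def hinge_loss(y_true, scores):
    # y_true in {-1, +1}; scores = w . x + b (raw, unthresholded output)
    # Correct and outside the margin: y * score >= 1 -> zero loss
    # Inside the margin or misclassified: positive loss
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([+1, -1, +1])
scores = np.array([2.0, -0.5, 0.3])   # last two points are correct but inside the margin
print(hinge_loss(y_true, scores))      # penalizes margin violations and wrong labels
```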
SVM Error Function
ECE 490: Introduction to ML 154
SVM Error Function
ECE 490: Introduction to ML 155
SVM for Classification
What if our dataset is not linearly separable?
ECE 490: Introduction to ML 156
SVM for Classification
To solve this issue, we can increase the dimensionality of our dataset by adding a
new feature. This would allow us to use a linear separator. This solution is
called the "kernel trick".
ECE 490: Introduction to ML 157
SVM for Classification
The kernel trick works by creating a new feature that is a function of the
available features.
Given feature X and feature Y, we add a new feature Z where Z = X² + Y².
ECE 490: Introduction to ML 158
SVM for Classification
However, after training, we don't want to keep calculating a new feature for every
input. Thus, we map the separator back to the lower-dimensional space.
Best separating hyperplane equation: Z = constant. Since Z = X² + Y², this gives
X² + Y² = constant, the equation of a circle in 2D space.
ECE 490: Introduction to ML 159
SVM for Classification
ECE 490: Introduction to ML 160
SVM for classification
ECE 490: Introduction to ML 161
Support Vector Regressors
ECE 490: Introduction to ML 162
SVR
Support Vector Machines (SVMs) can be
adapted for regression problems in a method
known as Support Vector Regression
(SVR).
Rather than finding a hyperplane that
separates classes, SVR seeks a function that
approximates the relationship between input
features and a continuous target variable.
ECE 490: Introduction to ML 163
Training SVR
Instead of trying to fit every data point exactly, SVR introduces a margin of
tolerance ε. Errors smaller than ε are ignored.
The ε-insensitive loss function is defined as:
L_ε(y, ŷ) = 0 if |y − ŷ| ≤ ε, and |y − ŷ| − ε otherwise (equivalently, max(0, |y − ŷ| − ε)).
ECE 490: Introduction to ML 164
Decision Trees
ECE 490: Introduction to ML 165
Decision Trees
Consider that you have a 2-hour break between classes, and you’re looking for a
place to eat.
What steps would you take to eliminate potential restaurants and eventually
choose the best option?
ECE 490: Introduction to ML 166
Decision Trees
Your decision process might look like this:
- Is it close by?
  - No → Eliminated
  - Yes → Does it include a student discount?
    - No → Eliminated
    - Yes → Is it crowded?
      - Yes → Chosen
      - No → Eliminated
ECE 490: Introduction to ML 167
Decision Trees
Decision trees work in a similar manner, where the data is split at each node according to a feature-based question until a final decision (a leaf) is reached.
ECE 490: Introduction to ML 168
Thank you
ECE 490: Introduction to ML 169