R22 Machine Learning Digital Notes Final
MR22-1CS0204
II B.Tech (AI & ML) II SEM
Digital Notes
MALLAREDDY UNIVERSITY
II B.Tech II Semester
L/T/P/C: 3/0/2/4
[Reference 1]
Case Study: Predicting Customer Churn Using Machine Learning: A Case Study in the
Telecom Industry
Unit 2: Supervised Learning Algorithms
Supervised Learning: regression and classification problems, simple linear regression, multiple
linear regression, ridge regression, logistic regression, k-nearest neighbor, naive Bayes classifier,
linear discriminant analysis, support vector machine, decision trees, bias-variance trade-off,
cross-validation methods such as leave-one-out (LOO) cross-validation, k-folds cross validation,
multi-layer perceptron, feed-forward neural network
[Reference 2]
Unit 3: Unsupervised Learning Algorithms
[Reference 2&3]
[Reference 3]
Reference Textbooks:
[1] "Mathematics for Machine Learning" by Marc Peter Deisenroth, A. Aldo Faisal, and
Cheng Soon
[2] Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
[3] Introduction to Machine Learning with Python by Sebastian Raschka and Vahid
Mirjalili
Course Outcomes:
Contents
1. Introduction to machine learning
Types of Machine Learning
Examples of Machine Learning Applications
2. Essential Linear Algebra Concepts in Machine Learning
3. Essential Probability and Statistics Concepts in Machine Learning
4. Essential Calculus Concepts in Machine Learning
5. Optimization
Python code for Unit 1
Summary
Key Concepts of Machine Learning:
1. Data:
2. Training:
In supervised learning, a model learns from labeled data, where features (input
variables) are used to predict labels (output variable). For instance, in a spam
email filter, features might include the words in an email, and the label would
indicate whether the email is spam or not.
4. Algorithms:
Various machine learning algorithms exist, each suited to different types of tasks.
Common algorithms include linear regression for predicting numerical values,
decision trees for classification, and neural networks for complex tasks like image
recognition.
5. Testing and Evaluation:
After training, the model is tested on new, unseen data to evaluate its
performance. This step helps ensure that the model can generalize well to new
situations and make accurate predictions.
Types of Machine Learning:
1. Supervised Learning:
Definition: In supervised learning, the model is trained on a labeled dataset, where the input data
is paired with corresponding output labels. The goal is for the model to learn a mapping from
input features to the correct output by generalizing patterns from the training data.
Application: Once trained, the model can automatically classify incoming emails as
spam or not spam, helping filter unwanted messages.
2. Unsupervised Learning:
Definition: Unsupervised learning involves training a model on an unlabeled dataset, and the
algorithm explores the inherent structure in the data without predefined output labels. The goal is
often to find patterns, relationships, or groupings within the data.
Application: This can help businesses tailor marketing strategies for different customer
segments, providing more personalized experiences and improving overall customer
satisfaction.
3. Self-Supervised Learning:
4. Reinforcement Learning:
Application: The self-driving car learns from experiences, receiving rewards (positive)
for safe driving and penalties (negative) for traffic violations. Over time, the model
optimizes its behavior to navigate efficiently and safely.
These types of machine learning cover a broad spectrum of applications, showcasing the
versatility of these approaches in solving various real-world problems. Each type has its
strengths and is suited to different scenarios depending on the nature of the data and the desired
outcome.
Examples of Machine Learning Applications:
1. Image Recognition:
2. Natural Language Processing (NLP):
NLP models understand and generate human language, facilitating tasks like
language translation, sentiment analysis, and chatbot interactions.
3. Predictive Analytics:
4. Healthcare Diagnostics:
5. Recommendation Systems:
Machine learning continues to advance rapidly, revolutionizing industries and shaping the way
we interact with technology. As the field evolves, it holds the potential to address increasingly
complex challenges and drive innovation across various domains.
2. Essential Linear Algebra Concepts in Machine Learning:
2.1. Vectors and Vector Spaces:
Vectors: Represent data points with both magnitude and direction (e.g., features in a
dataset).
Matrix Operations: Addition, multiplication, and other operations that manipulate and
analyze data (e.g., calculating distances, projections).
Linear Transformations: Functions that map vectors to other vectors while preserving
linear relationships (e.g., rotations, scaling).
Used in various algorithms like linear regression, neural networks, and dimensionality
reduction.
Solving systems of linear equations: Finding unknown values based on their linear
relationships (e.g., fitting a line to data points).
Eigenvectors: Directions that a linear transformation only scales, without rotating
(e.g., principal axes in PCA). Eigenvalues: The corresponding scaling factors.
Additional concepts:
Norm & Inner Product: Measuring the size and angle of vectors, respectively.
Orthogonality & Basis Vectors: Independent vectors representing the entire space.
Remember, these are just the foundation. Each concept has nuances and applications specific to
different ML algorithms.
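The short NumPy sketch below illustrates these concepts; the vectors, matrix, and right-hand side are invented values chosen only for the demonstration, not data from the notes.

```python
# A minimal NumPy sketch of the concepts above; the arrays are invented values.
import numpy as np

v = np.array([3.0, 4.0])           # a vector (magnitude and direction)
w = np.array([1.0, 2.0])

print(np.linalg.norm(v))           # norm (size of v): 5.0
print(np.dot(v, w))                # inner product: 11.0

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])         # a matrix acting as a linear transformation
print(A @ v)                       # the transformed (scaled) vector: [6. 12.]

# Solving a system of linear equations A x = b
b = np.array([4.0, 9.0])
print(np.linalg.solve(A, b))       # [2. 3.]
```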
Real-world applications:
Learning resources:
By mastering these concepts, you'll unlock the power of linear algebra for building effective and
insightful machine learning models!
3. Essential Probability and Statistics Concepts in Machine Learning:
Probability:
Random Variables: Represent uncertain quantities with possible outcomes and their
associated probabilities (e.g., coin flip outcome).
Conditional Probability: Probability of one event happening given another (e.g., chance
of rain given cloudy skies).
Bayes' Theorem: Updating beliefs about a hypothesis based on new evidence (e.g., spam
filter learning from user feedback).
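As a concrete illustration, the sketch below applies Bayes' theorem to a toy spam-filter belief update; all of the probabilities are assumed values chosen only for the example.

```python
# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.2                # prior: assume 20% of mail is spam
p_word_given_spam = 0.6     # assumed likelihood of the word in spam
p_word_given_ham = 0.05     # assumed likelihood of the word in legitimate mail

# total probability of seeing the word (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # updated belief that the email is spam: 0.75
```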
Statistics:
Inferential Statistics: Drawing conclusions about populations from samples (e.g., testing
hypotheses about model accuracy).
Feature Engineering & Selection: Selecting informative features and preprocessing data
for better learning.
Decision Making & Risk Assessment: Assessing costs and benefits of different model
outputs and actions.
Anomaly Detection & Outlier Identification: Identifying unusual data points that may
indicate errors or new patterns.
Additional Concepts:
Remember, a solid understanding of probability and statistics is crucial for building robust,
reliable, and interpretable machine learning models.
4. Essential Calculus Concepts in Machine Learning:
Calculus, the study of change, plays a vital role in understanding and optimizing machine
learning models. Here are some key concepts:
1. Derivatives:
Rate of change of a function with respect to its input variable. Used for:
2. Partial Derivatives:
Rate of change of a multi-variable function with respect to one variable while holding
others constant. Used for:
4. Taylor Series:
Approximation of a function around a specific point using its derivatives. Used for:
5. Integration:
6. Convex Optimization:
Additional concepts:
Hessian Matrix: Second derivative matrix used for analyzing local optima and saddle
points.
Neural Networks: Learning complex relationships from data through gradient descent.
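To make the derivative and Taylor-series ideas concrete, here is a minimal sketch (not from the notes) that approximates a derivative with finite differences and builds a first-order Taylor approximation; the function e^x is an illustrative choice.

```python
# A sketch (not from the notes) of a numerical derivative and a first-order
# Taylor approximation; f(x) = e^x is an illustrative choice.
import math

def f(x):
    return math.exp(x)

def derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

a = 0.0
print(derivative(f, a))                  # ~1.0, the true value of f'(0)

# Taylor series to first order around a: f(x) ~ f(a) + f'(a) * (x - a)
x = 0.1
approx = f(a) + derivative(f, a) * (x - a)
print(approx, f(x))                      # 1.1 vs. e^0.1 ~ 1.10517
```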
5. Optimization
Optimization is a critical component of machine learning, where the goal is to find the
best possible parameters for a model to achieve optimal performance. Here are some
important concepts in optimization in the context of machine learning:
1. Objective Function:
Definition: The objective function, also known as the loss function or cost
function, measures the performance of a model for a given set of parameters.
2. Optimization Algorithms:
3. Gradient Descent:
5. Learning Rate:
Definition: The learning rate is a hyperparameter that controls the size of the
steps taken during optimization. It influences the convergence speed and stability
of the algorithm.
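The sketch below is a hedged illustration of gradient descent on a simple quadratic loss, showing how the learning rate scales each update; the loss function and rate are assumed values for demonstration only.

```python
# Gradient descent on the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)                   # derivative of the loss

w = 0.0                                  # initial parameter value
learning_rate = 0.1                      # step-size hyperparameter

for step in range(50):
    w = w - learning_rate * grad(w)      # move against the gradient

print(round(w, 4))                       # ~3.0: converged near the minimum
```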
7. Backpropagation:
9. Regularization:
10. Convergence Criteria:
Understanding and applying optimization concepts are crucial for efficiently training machine
learning models, especially in the context of deep learning, where optimization plays a central
role in learning hierarchical representations.
1. Linear algebra:
o Program 2: Implement a program to calculate the eigenvalues and
eigenvectors of a matrix.
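One possible solution sketch for this exercise, using NumPy's eigendecomposition (the matrix is an arbitrary example):

```python
# One possible solution sketch using NumPy; the matrix is an arbitrary example.
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)       # 5 and 2 (order may vary)
print("Eigenvectors (as columns):")
print(eigenvectors)
```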
o Program 3: Implement a program to solve a system of linear equations using
Gaussian elimination.
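A possible solution sketch, implementing textbook Gaussian elimination with back substitution but without pivoting safeguards; the system is an invented example.

```python
# A possible solution sketch: Gaussian elimination with back substitution
# (no pivoting safeguards); the system is an invented example.
import numpy as np

def gaussian_elimination(A, b):
    A = A.astype(float)
    b = b.astype(float)
    n = len(b)
    # forward elimination: zero out entries below the diagonal
    for k in range(n - 1):
        for i in range(k + 1, n):
            factor = A[i, k] / A[k, k]
            A[i, k:] -= factor * A[k, k:]
            b[i] -= factor * b[k]
    # back substitution, starting from the last row
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])   # 2x + y = 3, x + 3y = 5
b = np.array([3.0, 5.0])
print(gaussian_elimination(A, b))        # [0.8 1.4]
```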
2. Probability and statistics:
o Program 2: Implement a program to calculate the probability of a certain
event occurring.
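A possible solution sketch that computes an event probability both exactly and by Monte Carlo simulation; the die-roll event is an assumed example.

```python
# A possible solution sketch: an exact probability and a Monte Carlo estimate
# of the same assumed event (rolling more than 4 on a fair six-sided die).
import random

favorable = 2                            # outcomes 5 and 6
total = 6
print(favorable / total)                 # exact probability: 0.333...

trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) > 4)
print(hits / trials)                     # simulated estimate, ~0.33
```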
3. Calculus:
o Program 3: Implement a program to solve a differential equation.
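A possible solution sketch using SciPy's solve_ivp on an assumed first-order equation dy/dt = −2y:

```python
# A possible solution sketch with SciPy, solving the assumed equation
# dy/dt = -2y with y(0) = 1, whose exact solution is y = e^(-2t).
import numpy as np
from scipy.integrate import solve_ivp

def dydt(t, y):
    return -2 * y

solution = solve_ivp(dydt, t_span=(0, 2), y0=[1.0],
                     t_eval=np.linspace(0, 2, 5))
print(solution.t)       # evaluation times
print(solution.y[0])    # numerical values, close to exp(-2 * t)
```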
4. Optimization:
o Program 2: Implement a program to find the maximum of a function using
Newton's method.
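A possible solution sketch applying Newton's method to an assumed concave quadratic; at a maximum the first derivative is zero, so each update is x ← x − f′(x)/f″(x).

```python
# A possible solution sketch: Newton's method on the assumed concave function
# f(x) = -(x - 2)^2 + 5. At a maximum f'(x) = 0, so iterate x <- x - f'(x)/f''(x).
def f_prime(x):
    return -2 * (x - 2)          # first derivative

def f_double_prime(x):
    return -2.0                  # second derivative (constant here)

x = 0.0                          # starting guess
for _ in range(20):
    x = x - f_prime(x) / f_double_prime(x)

print(x)                         # 2.0, the maximizer
```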
o Program 3: Implement a program to solve a linear programming problem.
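A possible solution sketch using scipy.optimize.linprog on an invented two-variable problem; linprog minimizes, so the objective is negated to maximize.

```python
# A possible solution sketch with scipy.optimize.linprog, which minimizes
# c @ x subject to A_ub @ x <= b_ub; the problem below is invented.
# Maximize 3x + 2y  s.t.  x + y <= 4,  2x + y <= 6,  x, y >= 0.
from scipy.optimize import linprog

c = [-3, -2]                     # negate the objective to maximize
A_ub = [[1, 1], [2, 1]]
b_ub = [4, 6]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)                  # optimal point, here [2. 2.]
print(-result.fun)               # maximum objective value, here 10.0
```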
Summary:
Introduction to Machine Learning: Machine learning is a field of artificial intelligence that
focuses on the development of algorithms and models that enable computers to learn and make
predictions or decisions without explicit programming. It encompasses various techniques,
including supervised learning, unsupervised learning, and reinforcement learning. Machine
learning applications range from image recognition and natural language processing to
recommendation systems and autonomous vehicles.
Linear Algebra in Machine Learning: Linear algebra plays a crucial role in machine learning
as it provides the foundation for many operations and concepts. Vectors, matrices, and linear
transformations are fundamental components used in representing data, defining models, and
performing operations like matrix multiplication, eigendecomposition, and solving linear
equations. Linear algebra is essential for understanding algorithms like linear regression,
principal component analysis, and neural networks.
Probability and Statistics in Machine Learning: Probability and statistics are essential tools in
machine learning for dealing with uncertainty, making predictions, and drawing inferences from
data. Probability theory is used to model uncertainty, while statistical methods help in analyzing
and interpreting data. Concepts like probability distributions, statistical tests, confidence
intervals, and hypothesis testing are critical for understanding the behavior of models and
making informed decisions in machine learning.
Optimization in Machine Learning: Optimization involves finding the best possible values for
the parameters of a model to achieve optimal performance. In machine learning, optimization
techniques are employed to minimize or maximize objective functions, often representing the
error or cost of a model. Gradient descent is a common optimization algorithm, and variations
like stochastic gradient descent are widely used for training models. Optimization is crucial for
fine-tuning models and improving their accuracy and efficiency.
Unit 2: Supervised Learning Algorithms
Supervised Learning: regression and classification problems, simple linear regression,
multiple linear regression, ridge regression, logistic regression, k-nearest neighbor, naive
Bayes classifier, linear discriminant analysis, support vector machine, decision trees, bias-
variance trade-off, cross-validation methods such as leave-one-out (LOO) cross-validation, k-
folds cross validation, multi-layer perceptron, feed-forward neural network
Case Study: Predicting House Prices: A Comparative Study of Supervised Learning
Algorithms.
Regression is a statistical process for estimating the relationship between a dependent
(criterion) variable and one or more independent variables (predictors). Regression
analysis explains how the criterion changes in relation to changes in selected predictors:
it estimates the conditional expectation of the criterion, i.e., the average value of the
dependent variable when the independent variables are varied. Three major uses for
regression analysis are determining the strength of predictors, forecasting an effect, and
trend forecasting.
Classification is a central topic in machine learning that has to do with teaching machines how
to group together data by particular criteria. Classification is the process where computers
group data together based on predetermined characteristics — this is called supervised learning.
Classification is simply grouping things together according to similar features and attributes.
When you go to a grocery store, you can fairly accurately group the foods by food group
(grains, fruit, vegetables, meat, etc.) In machine learning, classification is all about teaching
computers to do the same.
Examples of classification are as follows:
Classifying Images, Speech Tagging, Music Identification.
A common example of classification comes with detecting spam emails. To write a program
to filter out spam emails, a computer programmer can train a machine learning algorithm with
a set of spam-like emails labelled as spam and regular emails labelled as not-spam. The idea
is to make an algorithm that can learn characteristics of spam emails from this training set so
that it can filter out spam emails when it encounters new emails.
Simple linear regression is used for predictive analysis. Linear regression is a linear
approach for modelling the relationship between the criterion (the scalar response) and
one or more predictors (explanatory variables). Linear regression focuses on the conditional
probability distribution of the response given the values of the predictors. With linear
regression, there is a danger of overfitting. The formula for simple linear regression is: Y′ = bX + A.
Ordinary Least Square (OLS) Method for Linear Regression
This section works through the ordinary least squares (OLS) method for simple linear
regression. If you are new to linear regression, it will help you understand, step by step,
how simple linear regression works.
Simple linear regression is a model with a single regressor (independent variable) x that
has a relationship with a response (dependent or target variable) y:
y = β0 + β1x + ε    (1)
Where β0: intercept
β1: slope (unknown constant)
ε: random error component
This is a line where y is the dependent variable we want to predict, x is the independent
variable, and β0 and β1 are the coefficients that we need to estimate.
Estimation of β0 and β1:
The OLS method is used to estimate β0 and β1. The OLS method seeks to minimize the sum
of the squared residuals: from the given data we calculate the distance from each data point
to the regression line, square it, and sum all of the squared errors together.
Equation (2) is the sample regression model, written in terms of the n pairs of data (yi, xi),
i = 1, 2, …, n:
yi = β0 + β1xi + εi    (2)
Let’s take a simple example. This table shows some data from a manufacturing company.
Each row in the table shows the sales for a year and the amount spent on advertising that year.
Here our target variable is sales-which we want to predict.
Linear Regression estimates that Sales = β0 + β1 x
To estimate β0 and β1 from the data, we follow these steps:
1. Calculate the mean x̄ of the x values and the mean ȳ of the y values.
2. Calculate the error (deviation) of each x from x̄ and of each y from ȳ.
3. Multiply the error for each x with the error for each y and calculate the sum of
these multiplications: Sxy = Σ(xi − x̄)(yi − ȳ)
4. Square the residual of each x value from the mean and sum these
squared values: Sxx = Σ(xi − x̄)²
The OLS estimates are then β̂1 = Sxy / Sxx and β̂0 = ȳ − β̂1x̄.
Example of Simple Linear Regression:-
In this simple example, we'll explore how to create a basic fit line, the classic case
of y = mx + b. We'll go carefully through each step, so you can see what type of question a
simple fit line can answer.
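Since the original worked table and plots are not reproduced here, the following hedged sketch fits y = mx + b by the OLS steps above on invented data:

```python
# A hedged sketch that follows the OLS steps above on invented data
# (the advertising/sales table from the notes is not reproduced here).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # e.g., advertising spend
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # e.g., sales

x_mean, y_mean = x.mean(), y.mean()
s_xy = np.sum((x - x_mean) * (y - y_mean))  # step 3: sum of cross products
s_xx = np.sum((x - x_mean) ** 2)            # step 4: sum of squared x deviations

m = s_xy / s_xx                             # slope estimate (β1)
b = y_mean - m * x_mean                     # intercept estimate (β0)
print(f"y' = {m:.3f}x + {b:.3f}")

y_pred = m * x + b
sse = np.sum((y - y_pred) ** 2)             # sum of squared errors
rmse = np.sqrt(sse / len(x))                # root mean squared error
print(f"SSE = {sse:.4f}, RMSE = {rmse:.4f}")
```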
Performance Evaluation
1. SSE-Sum of Squares Error
The error or residual is the difference between the actual value and the predicted
value. The sum of all errors can cancel out since it can contain negative signs and
give zero. So, we square all the errors and sum it up. The line which gives us the
least sum of squared errors is the best fit. The line of best fit always passes through
the point (x̄, ȳ).
In Linear Regression, the line of best fit is calculated by minimizing the error (the
distance between data points and the line).
Sum of Squares Errors is also known as Residual error or Residual sum of squares
RMSE -Root Mean Squared Error
RMSE is a measure of how spread out these residuals are.
In other words, it tells you how concentrated the data is around the line of best fit.
RMSE is calculated by taking the square root of MSE: RMSE = √MSE = √(SSE / n).
Interpretation of RMSE:
RMSE is interpreted as the standard deviation of unexplained variance (MSE).
RMSE contains the same units as the dependent variable.
Lower values of RMSE indicate a better fit.
Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the Target and predictor variables.
• The regression residuals must be normally distributed.
• Multiple Linear Regression assumes little or no multicollinearity (correlation between
the independent variables) in the data.
Polynomial Regression is a regression algorithm that models the relationship between a
dependent variable (y) and an independent variable (x) as an nth-degree polynomial:
y = b0 + b1x + b2x² + … + bnxⁿ
▪ It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.
▪ It is a linear model with some modification in order to increase the accuracy.
▪ The dataset used in Polynomial regression for training is of non-linear nature.
▪ It makes use of a linear regression model to fit the complicated and non-linear functions
and datasets.
▪ Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."
If we apply a linear model to a linear dataset, it gives a good result, as we have seen
in Simple Linear Regression, but if we apply the same model without any modification
to a non-linear dataset, it produces drastically poor output: the loss function increases,
the error rate is high, and accuracy decreases.
So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model.
Logistic Regression :-
Logistic Regression is one of the supervised machine learning algorithms used for
classification. In logistic regression, the dependent variable is categorical.
The objective of the model is, given the independent variables, what is the class
likely to be? [For binary classification, 0 or 1]
The linear output z is passed through the sigmoid (logistic) function,
sigmoid(z) = 1 / (1 + e^(−z)), which has the following properties:
1. If z → −∞, sigmoid(z) → 0
2. If z → +∞, sigmoid(z) → 1
3. If z = 0, sigmoid(z) = 0.5
Sigmoid Curve
So, if we feed the output of the linear model into the sigmoid function, it will map the
input into the range 0 to 1.
In Linear regression, the predicted value of y is calculated by using the below
equation.
1. If y = ŷ, the error should be zero.
2. The error should be very high for misclassification.
3. The error should be greater than or equal to zero.
Let’s check whether these properties hold good for the log loss or binary cross-entropy
function, defined as L = −[y ln ŷ + (1 − y) ln(1 − ŷ)].
14
The error tends to be very high for misclassified data points.
3. The error should be greater than or equal to zero.
→ y is either 0 or 1
→ ŷ is always between 0 and 1
→ ln ŷ is negative and ln (1-ŷ) is negative
→ negative sign before the expression is included to make the error positive
[ In linear regression least-squares method, we will be squaring the error]
So, the error will be always greater than or equal to zero.
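The quick sketch below checks these three properties of binary cross-entropy numerically on assumed predictions:

```python
# A quick numerical check of the three properties on assumed predictions.
import numpy as np

def log_loss(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(log_loss(1, 0.999))   # correct, confident prediction -> error near 0
print(log_loss(1, 0.01))    # misclassification -> very high error (~4.6)
print(log_loss(0, 0.01))    # correct prediction of class 0 -> error near 0
```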
To interpret the model coefficient, we need to know the terms odds, log odds, odds
ratio.
Odds
Odds is defined as the probability of an event occurring divided by the probability
of the event not occurring.
Log odds (Logit Function)
Log odds = ln(p / (1 − p))
After applying the sigmoid function, we know that
So, we can convert the logistic regression as a linear function by using log odds.
Odds Ratio
Odds Ratio is the ratio of two odds
Odds Ratio in Logistic Regression
The odds ratio of an independent variable in logistic regression describes how the odds
change with a one-unit increase in that particular variable, keeping all the other
independent variables constant.
β1 → change in log-odds associated with a one-unit increase in variable X1.
The odds ratio for variable X1 is the exponential of β1: OR = e^β1.
Ridge Regression:-
Ridge regression is a model tuning method that is used to analyse any data that
suffers from multicollinearity. This method performs L2 regularization. When the
issue of multicollinearity occurs, least-squares estimates are unbiased but their
variances are large, so the predicted values can be far away from the actual values.
Ridge regression adds an L2 penalty to the least-squares cost: Loss = Σ(yi − ŷi)² + λ Σ βj².
Lambda (λ) is the penalty term; in the ridge function it is denoted by an alpha parameter.
So, by changing the value of alpha, we control the penalty term: the higher the value of
alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.
• It shrinks the parameters. Therefore, it is used to prevent multicollinearity
• It reduces the model complexity by coefficient shrinkage
For any type of regression machine learning model, the usual regression equation forms the
base which is written as:
Y = XB + e
Where Y is the dependent variable, X represents the independent variables, B is the regression
coefficients to be estimated, and e represents the errors, or residuals.
Once we add the lambda function to this equation, the variance that is not evaluated by the
general model is considered. After the data is ready and identified to be part of L2
regularization, there are steps that one can undertake.
Standardization:
In ridge regression, the first step is to standardize the variables (both dependent and
independent) by subtracting their means and dividing by their standard deviations. This causes
a challenge in notation since we must somehow indicate whether the variables in a particular
formula are standardized or not. As far as standardization is concerned, all ridge regression
calculations are based on standardized variables. When the final regression coefficients are
displayed, they are adjusted back into their original scale.
However, the ridge trace is on a standardized scale.
Bias and variance trade-off is generally complicated when it comes to building ridge regression
models on an actual dataset. However, the general trend to remember is: as the penalty λ
increases, bias increases and variance decreases.
The assumptions of ridge regression are the same as those of linear regression: linearity,
constant variance, and independence. However, as ridge regression does not provide
confidence limits, the errors need not be assumed to be normally distributed.
Example:
We shall consider a data set on Food restaurants trying to find the best combination of food
items to improve their sales in a particular region.
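The restaurant dataset itself is not reproduced in these notes, so the following hedged sketch demonstrates ridge regression on synthetic data instead, showing how increasing alpha shrinks the coefficients:

```python
# A hedged scikit-learn sketch of ridge regression on synthetic data (the
# restaurant dataset above is not reproduced), showing coefficient shrinkage.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # three predictors
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_std = StandardScaler().fit_transform(X)        # standardize, as ridge assumes

for alpha in [0.01, 1.0, 100.0]:                 # alpha plays the role of lambda
    model = Ridge(alpha=alpha).fit(X_std, y)
    print(alpha, np.round(model.coef_, 3))       # larger alpha -> smaller coefficients
```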
K-Nearest Neighbor(KNN) :-
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on
the similarity. This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an
action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, and we want to know whether it is a cat or a dog. For this identification, we can use
the KNN algorithm, as it works on a similarity measure. Our KNN model will find the
features of the new image most similar to those of the cat and dog images and, based on
the most similar features, will put it in either the cat or the dog category.
K-NN Algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
• Firstly, we will choose the number of neighbors; here we choose k = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as: d = √((x2 − x1)² + (y2 − y1)²)
• By calculating the Euclidean distance, we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
• As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
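A hedged scikit-learn sketch of the same idea, using invented 2-D points with k = 5 as in the example above:

```python
# A hedged scikit-learn sketch with invented 2-D points and k = 5 as above.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2], [6, 5], [7, 7], [8, 6]]   # feature pairs
y = ["A", "A", "A", "B", "B", "B"]                     # category labels

model = KNeighborsClassifier(n_neighbors=5)            # k = 5
model.fit(X, y)

# 3 of the 5 nearest neighbors belong to category A, so the point is labeled A
print(model.predict([[2, 1]]))                         # -> ['A']
```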
Below are some points to remember while selecting the value of K in the K-NN algorithm:
• There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
• A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
• Large values for K can smooth out noise, but very large values may cause the model to miss local patterns.
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
• Always needs to determine the value of K, which may sometimes be complex.
• The computation cost is high because of calculating the distance between the data points
for all the training samples.
Naïve Bayes Classifier:
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be
described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple, without
depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
• The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play on a particular day
according to the weather conditions. To solve this problem, we need to follow the below
steps:
Problem: If the weather is sunny, then the Player should play or not?
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 × 0.71 / 0.35 ≈ 0.60
P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 × 0.29 / 0.35 ≈ 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
There are three types of Naive Bayes Model, which are given below:
• Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word
is present or not in a document. This model is also well known for document classification
tasks.
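A minimal scikit-learn sketch of a Gaussian Naïve Bayes classifier; since the weather/"Play" table is not reproduced here, an invented single-feature dataset stands in for it:

```python
# A hedged sketch of Gaussian Naive Bayes with scikit-learn; the weather/"Play"
# table is not reproduced here, so an invented one-feature dataset stands in.
from sklearn.naive_bayes import GaussianNB

X = [[25.0], [27.0], [30.0], [15.0], [12.0], [10.0]]   # e.g., temperature
y = ["Yes", "Yes", "Yes", "No", "No", "No"]            # play or not

model = GaussianNB()
model.fit(X, y)

print(model.predict([[26.0]]))         # -> ['Yes'] (warm day)
print(model.predict_proba([[13.0]]))   # class probabilities for a cold day
```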
Linear Discriminant Analysis (LDA):
Linear Discriminant Analysis is a supervised technique used both for classification and for
reducing the number of features. For example, suppose we have two classes that we need to
separate efficiently. Classes can have multiple features. Using only a single feature to classify
them may result in some overlapping, as shown in the below figure. So, we keep on increasing
the number of features for proper classification.
Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane,
there’s no straight line that can separate the two classes of the data points completely. Hence,
in this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a
1D graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories and
hence, reducing the 2D graph into a 1D graph.
In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D
graph such that it maximizes the distance between the means of the two classes and minimizes
the variation within each class. In simple terms, this newly generated axis increases the
separation between the data points of the two classes. After generating this new axis using the
above-mentioned criteria, all the data points of the classes are plotted on this new axis and are
shown in the figure given below.
• Logistic Regression is one of the most popular classification algorithms that perform
well for binary classification but falls short in the case of multiple classification
problems with well-separated classes. At the same time, LDA handles these quite
efficiently.
• LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
• LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.
But LDA also fails in some cases, such as when the means of the distributions are shared;
in this case, LDA cannot create a new axis that makes both classes linearly separable.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are
used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of different
variables on LDA.
Some of the common Real-world Applications of Linear discriminant Analysis are given
below:
• Face Recognition
Face recognition is a popular application of computer vision, where each face is
represented as the combination of a number of pixel values. In this case, LDA is used
to minimize the number of features to a manageable number before going through the
classification process. It generates a new template in which each dimension consists of
a linear combination of pixel values. If a linear combination is generated using Fisher's
linear discriminant, then it is called Fisher's face.
• Medical
In the medical field, LDA has a great application in classifying the patient disease on
the basis of various parameters of patient health and the medical treatment which is
going on. On such parameters, it classifies disease as mild, moderate, or severe. This
classification helps the doctors in either increasing or decreasing the pace of the
treatment.
• Customer Identification
In customer identification, LDA is currently being applied. With the help of
LDA, we can easily identify and select the features that specify the group of
customers who are likely to purchase a specific product in a shopping mall. This can be
helpful when we want to identify a group of customers who mostly purchase a product
in a shopping mall.
• For Predictions
LDA can also be used for making predictions and thus in decision making. For example,
"Will you buy this product?" will give a predicted result in one of two possible classes:
buying or not buying.
• In Learning
Nowadays, robots are being trained for learning and talking to simulate human work,
and it can also be considered a classification problem. In this case, LDA builds similar
groups on the basis of different parameters, including pitches, frequencies, sound,
tunes, etc.
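A hedged sketch of LDA with scikit-learn, using synthetic two-class data to show both the 2-D-to-1-D projection and the classification described above:

```python
# A hedged scikit-learn sketch of LDA on synthetic two-class data, showing the
# 2-D-to-1-D projection and classification described above.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
class0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)          # project the 2-D data onto one axis

print(X_1d.shape)                       # (100, 1)
print(lda.predict([[0.2, 0.1], [2.9, 3.2]]))   # -> [0 1]
```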
Support Vector Machine (SVM):
Support Vector Machine is a supervised learning algorithm that finds the best decision
boundary (hyperplane) separating the classes.
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a model that
can accurately identify whether it is a cat or dog, so such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it
can learn about different features of cats and dogs, and then we test it with this strange creature.
So as support vector creates a decision boundary between these two data (cat and dog) and
choose extreme cases (support vectors), it will see the extreme case of cat and dog. On the basis
of the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
The dimensions of the hyperplane depend on the number of features in the dataset: if there
are 2 features (as shown in the image), the hyperplane will be a straight line, and if there
are 3 features, the hyperplane will be a two-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the hyperplane and the data points.
Support Vectors
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
as linearly separable data, and classifier is used called as Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which
means if a dataset cannot be classified by using a straight line, then such data is termed
as non-linear data and classifier used is called as Non-linear SVM classifier.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair (x1, x2) of coordinates in either green or blue.
Consider the below image:
As this is 2-D space, just by using a straight line we can easily separate these two classes.
But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from
both the classes. These points are called support vectors. The distance between the vectors and
the hyperplane is called as margin. And the goal of SVM is to maximize this margin. The
hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So, to separate these data points, we need to add one more dimension. For linear data, we have
used the two dimensions x and y, so for non-linear data, we will add a third dimension z. It can
be calculated as: z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If
we convert it back to 2-D space with z = 1, it becomes a circumference of radius 1.
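A hedged sketch of a non-linear SVM in scikit-learn; rather than adding z = x² + y² by hand, the RBF kernel performs an equivalent implicit lift (the circular toy data is invented):

```python
# A hedged sketch of a non-linear SVM in scikit-learn: the RBF kernel performs
# an implicit lift much like z = x^2 + y^2 above; the circular data is invented.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # inner circle vs. outer ring

model = SVC(kernel="rbf")               # non-linear decision boundary
model.fit(X, y)

print(model.predict([[0.1, 0.1], [1.8, 1.8]]))      # -> [1 0]
```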
Decision Tree: -
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
• A decision tree can contain categorical data (YES/NO) as well as numeric data.
• Below diagram explains the general structure of a decision tree:
Use of Decision Trees:-
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
• Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
• The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies:
• Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
In a decision tree, to predict the class of a given dataset, the algorithm starts from the
root node of the tree. The algorithm compares the value of the root attribute with the record's
(real dataset's) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
• Step-3: Divide S into subsets that contain possible values for the best attributes.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot classify the
nodes further; the final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not. To solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider the
below diagram:
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. With this measurement, we can easily select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
1. Information Gain:
• Information gain is the measurement of the change in entropy after a dataset is
segmented based on an attribute; it tells how much information a feature gives about a
class. It can be calculated as:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = total number of samples, P(yes) = probability of yes, and P(no) = probability of no.
2. Gini Index:
• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini
index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
• Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj²
Pruning:
Pruning is a process of deleting unnecessary nodes from a tree in order to obtain the optimal
decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the learning
tree without reducing accuracy is known as Pruning. There are mainly two types of tree
pruning technology used: Cost Complexity Pruning and Reduced Error Pruning.
Advantages of the Decision Tree:
• It is simple to understand, as it follows the same process a human follows while
making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.
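A hedged sketch of a CART decision tree with scikit-learn, using the Gini index as the attribute selection measure; the numeric encoding of the job-offer example is an assumption made for illustration:

```python
# A hedged CART sketch with scikit-learn using the Gini index; the numeric
# encoding of the job-offer example is an assumption made for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: salary (thousands), distance from office (km), cab facility (1 = yes)
X = [[60, 5, 1], [80, 20, 0], [75, 8, 1], [40, 3, 1], [90, 30, 1], [55, 25, 0]]
y = ["Accept", "Decline", "Accept", "Decline", "Accept", "Decline"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print(tree.predict([[70, 10, 1]]))      # e.g., ['Accept']
```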
Bias-Variance trade-off: -
It is important to understand prediction errors (bias and variance) when it comes to accuracy in
any machine-learning algorithm. There is a trade-off between a model's ability to minimize
bias and its ability to minimize variance, and finding the right balance guides choices such as
the value of a regularization constant. A proper understanding of these errors helps to avoid
overfitting and underfitting of a data set while training the algorithm.
Bias: The bias is the difference between the values predicted by the machine learning model
and the correct values. High bias gives a large error in both training and testing data. It is
recommended that an algorithm should always be low-biased to avoid the problem of
underfitting. With high bias, the prediction is in a straight-line format and thus does not fit
the data set accurately. Such fitting is known as underfitting of the data. This happens when
the hypothesis is too simple or linear in nature. Consider the graph given below for an
example of such a situation.
Variance: - The variability of model prediction for a given data point, which tells us the spread
of our data, is called the variance of the model. A model with high variance has a very
complex fit to the training data and thus is not able to fit accurately on data it hasn't
seen before. As a result, such models perform very well on training data but have high error
rates on test data. When a model has high variance, it is said to overfit the data.
Overfitting is fitting the training set accurately via a complex curve and a high-order
hypothesis, but it is not the solution, as the error with unseen data is high. While training a
model, variance should be kept low.
The high variance data looks as follows:
If the algorithm is too simple (a hypothesis with a linear equation), then it may have high bias
and low variance and thus be error-prone. If the algorithm fits too complex a model (a
hypothesis with a high-degree equation), then it may have high variance and low bias. In the
latter condition, the model will not perform well on new entries. There is something between
both of these conditions, known as the Trade-off, or Bias-Variance Trade-off. This trade-off
in complexity is why there is a trade-off between bias and variance: an algorithm can't be
more complex and less complex at the same time. For the graph, the perfect trade-off looks like this.
We try to optimize the value of the total error for the model by using the Bias-Variance
Tradeoff.
The best fit will be given by the hypothesis on the trade-off point. The error to complexity
graph to show trade-off is given as –
This is referred to as the best point chosen for the training of the algorithm which gives low
error in training as well as testing data.
Introduction to Cross-Validation
A simple way to evaluate a model is the train-test split, where the dataset is divided once into
a training set and a test set. However, the train-test split method has certain limitations. When
the dataset is small, the method is prone to high variance: due to the random partition, the
results can be entirely different for different test sets. Why? Because in some partitions,
samples that are easy to classify get into the test set, while in others, the test set receives the
'difficult' ones.
To deal with this issue, we use cross-validation to evaluate the performance of a machine
learning model. In cross-validation, we don’t divide the dataset into training and test sets only
once. Instead, we repeatedly partition the dataset into smaller groups and then average the
performance in each group. That way, we reduce the impact of partition randomness on the
results.
Many cross-validation techniques define different ways to divide the dataset at hand. We'll
discuss the two most frequently used: the k-fold and the leave-one-out methods.
1. Leave-One-Out (LOO) Cross-Validation
In leave-one-out cross-validation, each sample in turn serves as a single-item test set while the
model is trained on all the remaining samples; for a dataset of six samples, the model is
therefore trained and evaluated six times. The final performance estimate is the average of the
six individual scores:
Overall Score = (Score1 + Score2 + Score3 + Score4 + Score5 + Score6) / 6
2. K-Fold Cross-Validation
In k-fold cross-validation, we first divide our dataset into k equally sized subsets. Then, we
repeat the train-test method k times such that each time one of the k subsets is used as a
test set and the rest k−1 subsets are used together as a training set. Finally, we compute the
model's performance estimate by averaging the scores over the k trials.
For example, let's suppose that we have a dataset S = {x1, x2, x3, x4, x5, x6} containing 6
samples and that we want to perform a 3-fold cross-validation.
Then, we train and evaluate our machine-learning model 3 times. Each time, two subsets form
the training set, while the remaining one acts as the test set. In our example:
Finally, the overall performance is the average of the model’s performance scores on those
three test sets:
Leave-one-out is simply k-fold cross-validation with k equal to n, the number of samples. In a
large dataset, it is sufficient to use fewer than n folds, since the test folds are large enough for
the estimates to be sufficiently precise.
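A minimal scikit-learn sketch of both methods; the iris dataset and the logistic regression model are illustrative choices.

```python
# A minimal scikit-learn sketch of both cross-validation methods; the iris
# data and logistic regression model are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)
print("3-fold scores:", scores, "mean:", scores.mean())

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # n single-item folds
print("LOO mean score:", loo_scores.mean())
```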
Multi-Layer Perceptron (MLP):
A multi-layer perceptron, also known as an MLP, consists of fully connected dense layers that
transform any input dimension to the desired dimension. A multi-layer perceptron is a neural
network that has multiple layers. To create a neural network, we combine neurons together so
that the outputs of some neurons are inputs of other neurons.
A multi-layer perceptron has one input layer and for each input, there is one neuron (or node),
it has one output layer with a single node for each output and it can have any number of hidden
layers and each hidden layer can have any number of nodes. A multilayer perceptron (MLP)
Neural network belongs to the feedforward neural network. It is an Artificial Neural Network
in which all nodes are interconnected with nodes of different layers.
The word Perceptron was first defined by Frank Rosenblatt in his perceptron program.
Perceptron is a basic unit of an artificial neural network that defines the artificial neuron in the
neural network. It is a supervised learning algorithm that contains nodes’ values, activation
functions, inputs, and node weights to calculate the output.
The Multilayer Perceptron (MLP) Neural Network works only in the forward direction. All
nodes are fully connected to the network. Each node passes its value to the coming node only
in the forward direction. The MLP neural network uses a Backpropagation algorithm to
increase the accuracy of the training model.
Input Layer
The number of input nodes depends on the number of dataset features. Each input vector
variable is distributed to each of the nodes of the hidden layer.
Hidden Layer
It is the heart of all Artificial neural networks. This layer comprises all computations of the
neural network. The edges of the hidden layer have weights multiplied by the node values. This
layer uses the activation function.
There can be one or two hidden layers in the model. The number of hidden-layer nodes should
be chosen carefully: too few nodes in the hidden layer make the model unable to work
efficiently with complex data, while too many nodes will result in an overfitting problem.
Output Layer
This layer gives the estimated output of the neural network. The number of nodes in the output
layer depends on the type of problem. For a single target variable, use one node; for an N-class
classification problem, the network uses N nodes in the output layer.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid
activation function takes real values as input and converts them to numbers between 0 and 1
using the sigmoid formula.
Advantages of MLP:
1. MultiLayer Perceptron Neural Networks can easily work with non-linear problems.
2. It can handle complex problems while dealing with large datasets.
3. Developers use this model to deal with the fitness problem of Neural Networks.
4. It has a higher accuracy rate and reduces prediction error by using backpropagation.
5. After training the model, the Multilayer Perceptron Neural Network quickly predicts
the output.
Disadvantages of MLP:
1. This Neural Network involves heavy computation, which sometimes increases the
overall cost of the model.
2. The model will perform well only when it is trained perfectly.
3. Due to this model’s tight connections, the number of parameters and node redundancy
increases.
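A minimal scikit-learn sketch of an MLP classifier trained on the classic XOR problem, which no single-layer model can solve; the hidden-layer size and solver are assumed choices.

```python
# A minimal scikit-learn MLP sketch on the XOR problem, which no single-layer
# model can solve; the hidden-layer size and solver are assumed choices.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                        # XOR: not linearly separable

mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=0)
mlp.fit(X, y)

print(mlp.predict(X))                   # ideally [0 1 1 0]
```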
Feed-Forward Neural Network:
A feed-forward neural network is an artificial neural network in which the connections
between nodes never form a cycle. A recurrent neural network, in which some routes are
cycled, is the polar opposite of a feed-forward network. The feed-forward model is the basic
type of neural network because the input is only processed in one direction. The data always
flows in one direction and never backwards/opposite.
The simplest form of a feed-forward neural network is a single-layer perceptron. A sequence of inputs enter the
layer and are multiplied by the weights in this model. The weighted input values are then
summed together to form a total. If the sum of the values is more than a predetermined
threshold, which is normally set at zero, the output value is usually 1, and if the sum is less
than the threshold, the output value is usually -1. The single-layer perceptron is a popular feed-
forward neural network model that is frequently used for classification. Single-layer
perceptrons can also contain machine learning features.
The neural network can compare the outputs of its nodes with the desired values using a
property known as the delta rule, allowing the network to alter its weights through training to
create more accurate output values. This training and learning procedure results in gradient
descent. The technique of updating weights in multi-layered perceptrons is virtually the same,
however, the process is referred to as back-propagation. In such circumstances, the output
values provided by the final layer are used to alter each hidden layer inside the network.
Unit 3: Unsupervised Learning Algorithms
Unsupervised Learning: clustering algorithms, k-means/k-medoid, hierarchical clustering, top-
down, bottom-up: single-linkage, multiple linkage, dimensionality reduction, principal
component analysis.
Case Study: Customer Segmentation for E-commerce using Unsupervised Learning
Contents
Unsupervised learning
Clustering algorithms
K-Means algorithm
Hierarchical clustering
Top-Down (Divisive) and Bottom-Up (Agglomerative) Hierarchical Clustering
Single-Linkage and Multiple-Linkage
Dimensionality reduction
Case Study: Customer Segmentation for E-commerce using Unsupervised Learning
Unsupervised learning
Unsupervised learning is a type of machine learning where the algorithm is given data without
explicit instructions on what to do with it. The system tries to learn the patterns and relationships
within the data on its own. Two core families of unsupervised learning are clustering and
dimensionality reduction; several related approaches are described below.
1. Clustering:
K-Means Clustering: This algorithm partitions data into 'k' clusters based on
similarity. For example, in customer segmentation, you can use K-means to group
customers with similar purchasing behavior.
Hierarchical Clustering: It creates a tree of clusters, where the root is a single
cluster containing all data points, and the leaves are individual data points. This
can be used in taxonomy or gene expression analysis.
2. Dimensionality Reduction:
Principal Component Analysis (PCA): PCA is used to reduce the number of
features in a dataset while retaining its essential information. It's often applied in
image compression or feature extraction for machine learning models.
t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is used for
visualizing high-dimensional data in two or three dimensions. It's useful for
exploring the structure of data and finding patterns. For instance, visualizing
similarities between words in natural language processing.
3. Association Rule Learning:
Apriori Algorithm: This algorithm is used to discover associations between
different items in a dataset. For example, in retail, it can identify relationships like
"Customers who buy product A are likely to buy product B."
4. Generative Models:
Generative Adversarial Networks (GANs): GANs consist of a generator and a
discriminator that are trained together. GANs can generate new data instances that
resemble the training data. They are used in image and video generation tasks.
5. Autoencoders:
Variational Autoencoders (VAE): VAEs are a type of autoencoder that learns a
probabilistic mapping between the data space and a latent space. They are used
for generating new data points and can be applied to image and text generation.
Unsupervised learning is particularly valuable when you have a large amount of unlabeled data
and want to explore the underlying structure or relationships within it. It is widely used in
various domains, including pattern recognition, anomaly detection, and feature learning.
Clustering algorithms
Clustering algorithms in unsupervised learning aim to group similar data points together into
clusters or segments based on certain criteria. The goal is to discover hidden patterns or
structures within the data. There are various clustering algorithms, each with its own approach
and characteristics. Here are a few commonly used clustering algorithms:
1. K-Means Clustering:
Objective: Partition the data into 'k' clusters based on similarity.
Process: It starts by randomly selecting 'k' centroids (cluster centers) and assigns
each data point to the nearest centroid. Then, it recalculates the centroids based on
the mean of the data points in each cluster. This process iterates until
convergence.
Example: Customer segmentation in marketing based on purchasing behavior.
2. Hierarchical Clustering:
Objective: Create a tree-like structure (dendrogram) of clusters.
Process: It begins with each data point as a separate cluster and merges the
closest clusters iteratively until all points belong to a single cluster. The resulting
dendrogram can be cut at different levels to obtain clusters of varying sizes.
Example: Taxonomy in biology or organizational hierarchy.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Objective: Identify clusters based on the density of data points.
Process: It defines clusters as dense regions separated by areas of lower point
density. It classifies points as core points, border points, or outliers (noise) based
on their density and proximity to other points.
Example: Anomaly detection in network traffic where unusual patterns represent
potential security threats.
4. Mean Shift:
Objective: Discover modes or peaks of high-density regions.
Process: It iteratively shifts the center of a kernel until it converges to a high-
density region. The algorithm is adaptive and does not require specifying the
number of clusters beforehand.
Example: Image segmentation where pixels with similar colors form clusters.
5. Agglomerative Clustering:
Objective: Build clusters by successively merging or agglomerating data points.
Process: It starts with each data point as a singleton cluster and merges the closest
pairs iteratively until a stopping criterion is met. The result is a dendrogram that
can be cut to form clusters.
Example: Social network analysis to identify communities within a network.
6. Gaussian Mixture Model (GMM):
Objective: Model the data as a mixture of several Gaussian distributions.
Process: It assumes that the data is generated by a mixture of several Gaussian
distributions. The algorithm estimates the parameters of these distributions,
including means and covariances, to identify clusters.
Example: Speech and handwriting recognition where multiple patterns contribute
to the observed data.
K-Means algorithm
The K-Means algorithm is a popular unsupervised machine learning algorithm used for
clustering data. The goal of K-Means is to partition a dataset into 'k' clusters, where each data
point belongs to the cluster with the nearest mean. Here's a detailed explanation of the K-Means
algorithm:
Key Concepts:
Unsupervised Learning: It works with unlabeled data, finding patterns on its own.
Clustering: It groups similar data points together into distinct clusters.
Centroids: Each cluster has a central point (centroid), representing the "average" of data
points within it.
Distance Metric: It measures how close data points are to each other, usually using
Euclidean distance.
Algorithm Steps:
Step 1: Initialization
Input: Dataset with 'n' data points and the desired number of clusters 'k'.
Process:
Randomly select 'k' data points from the dataset as initial centroids.
Step 2: Assignment
Input: Initial centroids.
Process:
For each data point in the dataset, calculate the Euclidean distance to each
centroid.
Assign the data point to the cluster associated with the nearest centroid.
Step 3: Update
Input: Assigned clusters.
Process:
Recalculate the centroids of each cluster by computing the mean of all data points
in that cluster.
Pseudo code:
1. Randomly initialize k centroids: c_1, c_2, ..., c_k
2. Repeat until convergence:
   a. Assign each data point x_i to the cluster with the nearest centroid:
      j = argmin_j ||x_i - c_j||^2
   b. Update each centroid to the mean of the data points assigned to it:
      c_j = (1/|cluster j|) * Σ_{x_i ∈ cluster j} x_i
Program 1: Implement a program to cluster a dataset using K-means clustering.
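A minimal sketch of such a program, assuming scikit-learn and the Iris dataset (any tabular dataset would work the same way):

# K-means clustering on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                                   # 150 samples, 4 features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                         # cluster index per sample
print("Cluster sizes:", [list(labels).count(c) for c in range(3)])
print("Centroids:\n", kmeans.cluster_centers_)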
Example 3: Breast Cancer Wisconsin (Diagnostic) dataset
Program 2: Implement a program to calculate the elbow method for K-means clustering.
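A minimal sketch, assuming scikit-learn and a synthetic dataset: the elbow method plots the inertia (within-cluster sum of squared distances) for a range of k values and looks for the "elbow" where the curve flattens.

# Elbow method: inertia vs. number of clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()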
Program 3: Implement a program to calculate the silhouette coefficient for K-means
clustering.
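A minimal sketch, assuming scikit-learn: the silhouette coefficient ranges from -1 to +1, with higher values indicating better-separated clusters.

# Silhouette coefficient for different values of k
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
for k in range(2, 7):                     # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette coefficient = {silhouette_score(X, labels):.3f}")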
Example:
Imagine a dataset of customer spending habits. K-means could group customers into clusters
based on similar spending patterns, helping businesses tailor marketing strategies.
Key Points:
Simple and Efficient: K-means is popular due to its simplicity and efficiency.
Sensitivity to Initialization: It can get stuck in local optima, so multiple runs with
different initializations are often recommended.
Non-Spherical Clusters and Outliers: It struggles with non-spherical cluster shapes or
outliers.
Choosing k: Selecting the optimal number of clusters is often challenging.
Applications:
Customer Segmentation: Identifying customer groups with similar characteristics.
Image Compression: Grouping similar pixels for compression.
Anomaly Detection: Finding unusual data points that don't fit into clusters.
Document Clustering: Grouping text documents based on content similarity.
Gene Expression Analysis: Identifying patterns in gene expression data.
Hierarchical clustering
Hierarchical clustering is a method used in unsupervised machine learning to group similar data
points into clusters in a hierarchical manner. It builds a tree-like structure, called a dendrogram,
to represent the relationships between clusters. The two main approaches to hierarchical
clustering are agglomerative (bottom-up) and divisive (top-down).
Example:
Let's consider a small dataset for illustrative purposes:
Data Points: A, B, C, D, E
Distances:
- Dist(A, B) = 2
- Dist(A, C) = 3
- Dist(A, D) = 4
- Dist(A, E) = 5
- Dist(B, C) = 1
- Dist(B, D) = 6
- Dist(B, E) = 7
- Dist(C, D) = 8
- Dist(C, E) = 9
- Dist(D, E) = 10
Agglomerative Hierarchical Clustering (bottom-up, single linkage):
1. Step 1: Start with each point as its own cluster: {A}, {B}, {C}, {D}, {E}.
2. Step 2: Merge the closest pair, B and C (distance 1): {A}, {B, C}, {D}, {E}.
3. Step 3: Merge {A} with {B, C} (single-linkage distance min(2, 3) = 2): {A, B, C}, {D}, {E}.
4. Step 4: Merge {A, B, C} with {D} (single-linkage distance min(4, 6, 8) = 4): {A, B, C, D}, {E}.
5. Step 5: Merge {A, B, C, D} with {E} (single-linkage distance min(5, 7, 9, 10) = 5), leaving a single cluster.
Dendrogram Interpretation:
The dendrogram would show the hierarchy of merging clusters at different heights, and we can
choose the height at which we want to cut the dendrogram to obtain clusters.
Divisive Hierarchical Clustering:
1. Step 1: Treat the entire dataset as a single cluster: {A, B, C, D, E}.
2. Step 2: Split the cluster into two: {A, B, C}, {D, E}.
3. Step 3: Split the cluster further: {A, B}, {C}, {D, E}.
4. Step 4: Split the cluster further: {A}, {B}, {C}, {D, E}.
5. Step 5: Split the cluster further: {A}, {B}, {C}, {D}, {E}.
Dendrogram Interpretation:
Similar to agglomerative clustering, the dendrogram in divisive clustering represents the
hierarchy of splitting clusters.
In practice, the choice of distance metric and linkage criteria (how to measure the distance
between clusters) influences the results of hierarchical clustering.
Top-Down (Divisive) and Bottom-Up (Agglomerative) Hierarchical
Clustering:
In the context of unsupervised learning algorithms, specifically hierarchical clustering, the terms
"top-down" and "bottom-up" refer to two different approaches for building the hierarchical
structure, and "single-linkage" and "multiple linkage" refer to different strategies for measuring
the distance between clusters.
1. Top-Down (Divisive) Hierarchical Clustering:
This approach starts with the entire dataset as one cluster and recursively splits it
into smaller clusters until each data point forms a singleton cluster.
At each step, the algorithm selects a cluster and divides it into two.
2. Bottom-Up (Agglomerative) Hierarchical Clustering:
This approach starts with each data point as a singleton cluster and merges the
closest clusters until only one cluster remains.
At each step, the algorithm merges the two closest clusters.
Single-Linkage and Multiple-Linkage:
1. Single-Linkage: The distance between two clusters is defined as the minimum distance between any single pair of points, one from each cluster. It can reveal elongated clusters but is sensitive to noise and "chaining."
2. Multiple Linkage: The distance between two clusters is computed from many (or all) pairs of points rather than a single closest pair; common examples are complete linkage (the maximum pairwise distance) and average linkage (the mean of all pairwise distances). These criteria tend to produce more compact clusters.
Dimensionality reduction
Dimensionality reduction is a technique used in machine learning and data analysis to reduce the
number of input features or variables while preserving the important information in the data.
This is often done to mitigate the curse of dimensionality, improve computational efficiency, and avoid overfitting. Here are some popular algorithms for dimensionality reduction:
Key Algorithms:
1. Principal Component Analysis (PCA) (1901):
Identifies orthogonal directions of maximum variance in the data.
Projects data onto these principal components, creating lower-dimensional
representations.
Widely used due to simplicity and effectiveness.
2. Linear Discriminant Analysis (LDA) (1936):
Supervised method that considers class labels during projection.
Maximizes class separability in the reduced space.
Often used for classification tasks.
3. Factor Analysis (1904):
Assumes underlying latent variables explain observed correlations between features.
Models these latent factors to reduce dimensionality.
Commonly used in social sciences and psychometrics.
4. t-Distributed Stochastic Neighbor Embedding (t-SNE) (2008):
Non-linear technique for visualizing high-dimensional data in 2D or 3D.
Preserves local structure while revealing global patterns.
Popular for visual exploration of complex datasets.
5. Singular Value Decomposition (SVD) (1873):
Matrix factorization technique with various applications, including dimensionality
reduction.
Decomposes a matrix into three matrices: U, Σ, and V*.
Truncated SVD approximates the original matrix with fewer dimensions.
6. Autoencoders (1980s - 1990s):
Neural networks trained to reconstruct their input data.
Learn compressed representations in the hidden layers.
Flexible for non-linear dimensionality reduction.
7. Independent Component Analysis (ICA) (1990s):
Finds independent components within the data.
Useful for signal separation and blind source separation.
8. Random Projection (2006):
Projects data onto a lower-dimensional subspace using random matrices.
Often computationally efficient and preserves pairwise distances well.
9. Uniform Manifold Approximation and Projection (UMAP) (2018):
Non-linear technique preserving global structure while revealing local patterns.
Often used for visualization and clustering, where it frequently outperforms t-SNE.
Choosing the Right Algorithm:
Data characteristics: linear vs. non-linear relationships, noise levels, etc.
Purpose: visualization, compression, classification, feature selection, etc.
Interpretability requirements: some algorithms yield more interpretable results.
Computational efficiency: consider algorithm speed and scalability.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique widely
used in machine learning and data analysis. Its primary goal is to transform high-
dimensional data into a lower-dimensional representation while retaining as much of the
original variability as possible. PCA achieves this by identifying the directions (principal
components) in the data along which the variance is maximized.
Key Concepts:
1. Covariance Matrix:
PCA starts by computing the covariance matrix of the input data. The covariance
matrix captures the relationships between different features, indicating how they
vary together.
2. Eigenvalues and Eigenvectors:
PCA then calculates the eigenvalues and corresponding eigenvectors of the
covariance matrix. Eigenvectors represent the directions of maximum variance,
and eigenvalues indicate the magnitude of variance along those directions.
3. Principal Components:
The eigenvectors become the principal components of the data. These are the new
coordinate axes in the transformed space. The first principal component
corresponds to the direction of maximum variance, the second to the second-
highest variance, and so on.
4. Explained Variance:
Each eigenvalue represents the amount of variance captured by its corresponding
principal component. The total variance of the data remains constant, but PCA
allows for prioritizing the most important dimensions.
5. Dimensionality Reduction:
The principal components are ranked by their corresponding eigenvalues. By
selecting the top-k principal components (where k is the desired reduced
dimensionality), you can create a lower-dimensional representation of the data.
Steps in PCA:
1. Standardization:
Standardize the input features (subtract mean and divide by the standard
deviation) to ensure that each feature contributes equally to the analysis.
2. Covariance Matrix Calculation:
Calculate the covariance matrix of the standardized data.
3. Eigendecomposition:
Perform eigendecomposition on the covariance matrix to obtain eigenvalues and
eigenvectors.
4. Select Principal Components:
Rank the eigenvalues in descending order and choose the top-k eigenvectors to
form the principal components matrix.
5. Data Transformation:
Project the original data onto the selected principal components to obtain the
lower-dimensional representation.
Program 1: Implement a program to perform principal component analysis on a dataset.
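A minimal sketch following the steps above (standardize, then project), assuming scikit-learn and the Iris dataset:

# PCA: reduce 4 standardized features to 2 principal components
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)   # step 1: standardization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                       # steps 2-5 handled internally
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)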
Program 3: Implement a program to calculate the singular value decomposition for a
dataset.
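A minimal sketch using NumPy on a small random matrix (the data is illustrative):

# Singular value decomposition and a truncated (rank-2) approximation
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))    # toy data matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)    # A = U @ diag(S) @ Vt
A_rank2 = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]     # keep the top-2 singular values
print("Singular values:", S)
print("Rank-2 reconstruction error:", np.linalg.norm(A - A_rank2))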
Hierarchical clustering:
Program 1: Implement a program to perform hierarchical clustering on a dataset.
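A minimal sketch, assuming SciPy and a synthetic dataset:

# Hierarchical clustering: single-linkage merge history and dendrogram
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=42)
Z = linkage(X, method="single")          # single-linkage merge history
dendrogram(Z)
plt.title("Hierarchical clustering dendrogram (single linkage)")
plt.show()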
Program 2: Implement a program to calculate the agglomerative clustering algorithm for a
hierarchical clustering.
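A minimal sketch, assuming scikit-learn's AgglomerativeClustering:

# Bottom-up (agglomerative) clustering with average linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agg.fit_predict(X)
print("Cluster sizes:", [list(labels).count(c) for c in range(3)])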
Program 3: Implement a program to calculate the divisive clustering algorithm for a hierarchical
clustering.
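scikit-learn has no classical divisive algorithm, so a common stand-in, sketched here under that assumption, is top-down recursive bisection with 2-means:

# Divisive (top-down) clustering approximated by recursive 2-means bisection
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisect(X, indices, depth, max_depth=2):
    # Recursively split the points at `indices` into two sub-clusters.
    if depth == max_depth or len(indices) < 2:
        return [indices]
    halves = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X[indices])
    left, right = indices[halves == 0], indices[halves == 1]
    return bisect(X, left, depth + 1) + bisect(X, right, depth + 1)

X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
clusters = bisect(X, np.arange(len(X)), depth=0)
print("Cluster sizes:", [len(c) for c in clusters])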
Anomaly detection:
Program 1: Implement a program to detect anomalies in a dataset using the isolation forest
algorithm.
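A minimal sketch on synthetic data, assuming scikit-learn (IsolationForest labels inliers +1 and anomalies -1):

# Isolation forest anomaly detection on synthetic 2-D data
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, size=(200, 2))       # inliers
X_outliers = rng.uniform(-6, 6, size=(10, 2))    # scattered anomalies
X = np.vstack([X_normal, X_outliers])

iso = IsolationForest(contamination=0.05, random_state=42)
pred = iso.fit_predict(X)                        # +1 = inlier, -1 = anomaly
print("Detected anomalies:", int((pred == -1).sum()))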
Program 2: Implement a program to detect anomalies in a dataset using the one-class support
vector machine.
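A minimal sketch, assuming scikit-learn: the one-class SVM is trained only on "normal" data and then flags test points that fall outside the learned boundary.

# One-class SVM: train on normal data, flag outliers at test time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_train = rng.normal(0, 1, size=(200, 2))        # assumed-normal training data
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),
                    rng.uniform(-6, 6, size=(5, 2))])

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print("Predictions (+1 inlier, -1 anomaly):", ocsvm.predict(X_test))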
Program 3: Implement a program to evaluate the performance of an anomaly detection
algorithm.
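A minimal sketch, assuming labeled data are available for evaluation (here a synthetic imbalanced dataset where class 1 plays the role of the anomaly):

# Evaluate an anomaly detector with precision, recall, and F1
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
pred = IsolationForest(contamination=0.05, random_state=42).fit_predict(X)
y_pred = (pred == -1).astype(int)      # map -1 (anomaly) to positive label 1

print("Precision:", precision_score(y, y_pred))
print("Recall:   ", recall_score(y, y_pred))
print("F1 score: ", f1_score(y, y_pred))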
Case Study: Customer Segmentation for E-commerce using Unsupervised
Learning
Introduction:
In this case study, we will explore how unsupervised learning techniques, specifically clustering
algorithms, can be applied to perform customer segmentation for an e-commerce business.
Customer segmentation helps businesses understand the diverse needs and behaviors of their
customers, enabling personalized marketing strategies, product recommendations, and improved
customer experiences.
Objective:
The goal is to identify distinct groups of customers based on their purchasing behavior,
preferences, and engagement with the e-commerce platform. This segmentation can provide
valuable insights for targeted marketing campaigns, product recommendations, and tailored
services.
Dataset:
Assume we have a dataset with the following features:
1. Customer ID: Unique identifier for each customer.
2. Purchase History: Information about the products purchased, including frequency,
recency, and monetary value.
3. Website Engagement: Data related to customer engagement, such as time spent on the
website, number of visits, etc.
4. Demographic Information: Age, gender, location, etc.
Steps:
1. Data Preprocessing:
Handle missing values, if any.
Standardize or normalize numerical features.
Encode categorical variables.
2. Exploratory Data Analysis (EDA):
Understand the distribution of each feature.
Explore correlations between features.
Identify outliers and decide whether to handle or remove them.
3. Feature Engineering:
Create relevant features for analysis.
Combine or transform features if needed.
4. Unsupervised Learning (Clustering):
Apply clustering algorithms such as K-means, hierarchical clustering, or
DBSCAN.
Choose the appropriate number of clusters based on the data and business
understanding.
Analyze and interpret the results of clustering.
5. Customer Segmentation:
Identify and label the segments created by the clustering algorithm.
Analyze the characteristics of each segment.
Understand the differences and similarities between segments.
6. Business Insights:
Derive actionable insights for marketing, product development, and customer
engagement strategies based on the identified segments.
Tailor marketing campaigns to address the specific needs of each segment.
Optimize product recommendations and pricing strategies.
7. Evaluation:
Evaluate the effectiveness of customer segmentation by monitoring key
performance indicators (KPIs) over time.
Refine the segmentation approach if necessary.
Tools and Technologies:
Python for data preprocessing, analysis, and visualization.
Scikit-learn or other machine learning libraries for clustering algorithms.
Matplotlib and Seaborn for data visualization.
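As a minimal sketch of steps 1, 4, and 5 above, assuming a hypothetical RFM-style feature table (the column names and values are illustrative only):

# Customer segmentation: scale features, cluster, and characterize segments
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.DataFrame({
    "recency_days":   [5, 40, 3, 60, 7, 90],
    "frequency":      [20, 2, 25, 1, 18, 1],
    "monetary_value": [500, 40, 650, 20, 480, 15],
})
X = StandardScaler().fit_transform(df)                       # preprocessing
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
df["segment"] = labels                                       # label the segments
print(df.groupby("segment").mean())                          # characterize segments
print("Silhouette:", silhouette_score(X, labels))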
Conclusion:
Customer segmentation using unsupervised learning provides businesses with a powerful tool to
enhance customer understanding and optimize marketing strategies. By leveraging the insights
gained from segmentation, e-commerce businesses can foster customer loyalty, improve
customer satisfaction, and drive overall business success.
Unit 4: Ensemble Algorithms
Bagging, Boosting, Random forests, AdaBoost,
Gradient boosting.
Case Study: Credit Card Fraud Detection:
Ensemble Learning
Ensemble Methods
4.1 Introduction
• In broad terms, using ensemble methods is about combining models into an ensemble
such that the ensemble has better performance than an individual model on average.
• The main categories of ensemble methods involve (1) voting schemes among high-
variance models to prevent "outlier" predictions and overfitting, and (2) boosting
"weak learners" into "strong learners."
Figure 1: Illustration of unanimity, majority, and plurality voting.
Figure 2: Illustration of the majority voting concept. Here, assume n different classifiers, {h_1, h_2, ..., h_n}, where h_i(x) = ŷ_i: each classifier makes a prediction y_i on new data, and the final prediction y_f is obtained by voting.
In lecture 2 (Nearest Neighbor Methods), we learned that the majority (or plurality) voting can be expressed more simply as the mode:

\hat{y} = \mathrm{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\}
The following illustration demonstrates why majority voting can be effective (under certain
assumptions).
• Given are n independent classifiers (h_1, ..., h_n) with a base error rate ε;
• Here, independent means that the errors are uncorrelated;
• Assume a binary classification task (there are two unique class labels).
Assuming the error rate is better than random guessing (i.e., lower than 0.5 for binary
classification),
the error of the ensemble can be computed using a binomial probability distribution since
the ensemble makes a wrong prediction if more than 50% of the n classifiers make a wrong
prediction.
The probability that we make a wrong prediction via the ensemble, if k classifiers predict the same (wrong) class label (where k > ⌈n/2⌉ because of majority voting), is then

P(k) = \binom{n}{k} \epsilon^{k} (1 - \epsilon)^{n-k},    (3)

where \binom{n}{k} is the binomial coefficient

\binom{n}{k} = \frac{n!}{(n-k)! \, k!}.    (4)
However, we need to consider all cases k ∈ {⌈n/2⌉, ..., n} (the cumulative probability distribution) to compute the ensemble error:

\epsilon_{\mathrm{ens}} = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} \epsilon^{k} (1 - \epsilon)^{n-k}.    (5)
Consider the following example with n = 11 and ε = 0.25, where the ensemble error decreases substantially compared to the error rate of the individual models:

\epsilon_{\mathrm{ens}} = \sum_{k=6}^{11} \binom{11}{k} 0.25^{k} (1 - 0.25)^{11-k} = 0.034.    (6)
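This number is easy to verify in Python (a small sketch using only the standard library):

# Ensemble error for n=11 independent classifiers with base error 0.25
from math import comb

n, eps = 11, 0.25
ens_error = sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
                for k in range(6, n + 1))    # k = ceil(n/2) = 6, ..., n
print(f"Ensemble error: {ens_error:.3f}")    # approx. 0.034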
Figure 3: Ensemble error rate as a function of the base error rate; the ensemble error falls below the base error whenever the base error is lower than 0.5.
Instead of a simple (hard) majority vote, we can also weight the predicted class-membership probabilities of the individual classifiers (soft voting):

\hat{y} = \arg\max_{j} \sum_{i=1}^{n} w_i \, p_{i,j},    (7)
where p_{i,j} is the predicted class-membership probability for class label j by the i-th classifier. Here, w_i is an optional weighting parameter. If we set

w_i = 1/n, \quad \forall i \in \{1, \ldots, n\},

then all probabilities are weighted uniformly.
To illustrate this, let us assume we have a binary classification problem with class labels j ∈ {0, 1} and an ensemble of three classifiers h_i (i ∈ {1, 2, 3}):

h_1(x) \to [0.9, 0.1], \quad h_2(x) \to [0.8, 0.2], \quad h_3(x) \to [0.4, 0.6].    (8)
We can then calculate the class-membership probabilities of the ensemble as the (uniformly weighted) averages

p(j = 0 \mid x) = \frac{0.9 + 0.8 + 0.4}{3} = 0.7, \quad p(j = 1 \mid x) = \frac{0.1 + 0.2 + 0.6}{3} = 0.3,

so the soft-voting ensemble predicts class label ŷ = 0.
4.4 Bagging
• Bagging relies on a concept similar to majority voting but uses the same learning
algorithm (typically a decision tree algorithm) to fit models on different subsets of the
training data (bootstrap samples).
• In a nutshell, a bootstrap sample is a sample of size n drawn with replacement from
an original training set D with |D| = n. Consequently, some training examples are
duplicated in each bootstrap sample, and some other training examples do not appear
in a given bootstrap sample at all (usually, we refer to these examples as the "out-of-bag
sample"). This is illustrated in Figure 5. We will revisit bootstrapping later in the
model evaluation lectures.
• In the limit, approximately 63.2% of the training examples in a given bootstrap sample
are unique. Consequently, about 36.8% of the examples from the original training set
do not appear in a given bootstrap sample at all. The justification is illustrated below
in Eq. 11 to Eq. 13.
• Bagging can improve the accuracy of unstable models that tend to overfit.
Algorithm 1 Bagging
1: Let n be the number of bootstrap samples
2: for i = 1 to n do
3:     Draw a bootstrap sample of size m, D_i
4:     Train base classifier h_i on D_i
5: ŷ = mode{h_1(x), ..., h_n(x)}
Figure 5: Illustration of bootstrap sampling: each bootstrap sample (e.g., Bootstrap 1: x8, x6, x2, x9, x5, x8, x1, ...) is drawn with replacement from the ten training examples x1, ..., x10, so some examples appear multiple times while others are left out-of-bag.
The probability that a particular training example is not chosen in any of the n draws is

P(\text{not chosen}) = \left(1 - \frac{1}{n}\right)^{n},    (11)

which, as n grows, approaches

\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = e^{-1} \approx 0.368.    (12)

Vice versa, we can then compute the probability that a sample is chosen as

P(\text{chosen}) = 1 - \left(1 - \frac{1}{n}\right)^{n} \approx 0.632.    (13)
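A minimal sketch comparing a single decision tree with a bagging ensemble, assuming scikit-learn and the breast cancer dataset:

# Bagging: many trees on bootstrap samples, combined by voting
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=42)
bag = BaggingClassifier(tree, n_estimators=100, random_state=42)
print("Single tree accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("Bagging accuracy:    ", cross_val_score(bag, X, y, cv=5).mean())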
Figure 8: The concept of bagging. Here, assume n different classifiers, {h_1, h_2, ..., h_n}, where h_i(x) = ŷ_i: each classifier is trained on its own bootstrap sample T_1, ..., T_n of the training set, the predictions y_1, ..., y_n on new data are combined by voting, and the final prediction is y_f.
Figure: Illustration contrasting low bias (accurate) and high bias (not accurate) models.
• One can say that individual, unpruned decision trees "have high variance" (in this
context, the individual decision trees "tend to overfit"); a bagging model has a lower
variance than the individual trees and is less prone to overfitting. Again, the bias and
variance decomposition will be discussed in more detail in the next lecture.
4.6 Boosting
• There are two broad categories of boosting: Adaptive boosting and gradient boosting.
• Adaptive and gradient boosting rely on the same concept of boosting “weak learners”
(such as decision tree stumps) to “strong learners.”
Figure: Illustration of boosting, where each hypothesis h_2(x), ..., h_m(x) is fit on a newly weighted training sample. The ensemble prediction after m rounds is

h_m(x) = \operatorname{sign}\left(\sum_{j=1}^{m} w_j h_j(x)\right) \quad \text{for } h_j(x) \in \{-1, 1\}.
Intuitively, we can outline the general boosting procedure for AdaBoost as follows:
• Loop:
– Apply weak learner to weighted training examples (instead of orig. training set,
may draw bootstrap samples with weighted probability)
– Increase weight for misclassified examples
• AdaBoost (short for "adaptive boosting") was initially described by Freund and Schapire
in 1997.
The early stopping criterion "if ε_r > 1/2 then stop" in Figure 11 assumes a binary classification setting. Otherwise, it can be generalized to ε_r > 1 − (1/c), where c is the number of unique class labels in the training set. For the general case of c classes, you need to extend the algorithm on line 10 by replacing

\alpha_r := \log[(1 - \epsilon_r)/\epsilon_r]

with

\alpha_r := \log[(1 - \epsilon_r)/\epsilon_r] + \log(c - 1).
Figure 12: Illustration of the AdaBoost algorithm for three iterations on a toy dataset. The size of the symbols (circles represent training examples from one class, and triangles represent training examples from another class) is proportional to the weighting of the training examples at each round. The 4th subpanel shows a combination/ensemble of the hypotheses from subpanels 1-3.
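A minimal sketch of AdaBoost with decision-tree stumps as the weak learners, assuming scikit-learn and the breast cancer dataset:

# AdaBoost: boosting decision stumps into a strong learner
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1, random_state=42)   # weak learner
ada = AdaBoostClassifier(stump, n_estimators=100, learning_rate=0.5,
                         random_state=42)
print("AdaBoost accuracy:", cross_val_score(ada, X, y, cv=5).mean())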
For the sake of simplicity, the following paragraphs will illustrate gradient boosting using
a regression (not classification) example. At the end of this section, we will briefly look
at how we can apply gradient boosting to decision tree classifiers; we won't go into too
much detail about classification specifics here, because we haven't covered (log-)likelihood
maximization and logistic regression yet.
Suppose you have the following dataset (Table 1):

Table 1: Example dataset for illustrating gradient boosting.

x1 = #Rooms   x2 = City   x3 = Age   y = Price
5             Boston      30         1.50
10            Madison     20         0.50
6             Lansing     20         0.25
5             Waunakee    10         0.10
Step 1. The first step is to construct the root node. As we remember from the decision tree
lecture, making a prediction based on all training examples at a given node in a decision tree
regressor is basically just computing the average target value. Hence, step 1 is computing
the prediction
\hat{y}_1 = \frac{1}{n} \sum_{i=1}^{n} y^{(i)} = 0.5875
Step 2. For the second step, we first compute the so-called pseudo residuals. These pseudo
residuals are basically the difference between the true target value and the predicted target
value. For the regression case, the residual based on the output from step 1 is simply
r1 = y1 − yˆ1 .
Note that we use the term ”pseudo” residual, because there are different ways to compute
these differences and the term ”residual” has a specific meaning in regression analysis.
Now, after computing the residuals for all training examples, we can add a new column to
our dataset, which we refer to as r1 . This is shown in Table 2.
Table 2: Example dataset for illustrating gradient boosting, including the residuals after Step 1.

x1 = #Rooms   x2 = City   x3 = Age   y = Price   r1 = Residual
5             Boston      30         1.50        1.50 - 0.5875 = 0.9125
10            Madison     20         0.50        0.50 - 0.5875 = -0.0875
6             Lansing     20         0.25        0.25 - 0.5875 = -0.3375
5             Waunakee    10         0.10        0.10 - 0.5875 = -0.4875

The next part of step 2 is to fit a new decision tree, based on x1, ..., xm, to these residuals. Note that we usually limit the depth of this decision tree (e.g., by setting a max depth, which is a hyperparameter). An example decision tree fit to these residuals may look like the one shown in Figure 13.
Figure 13: Example decision tree fit to the residuals. The root splits on Age >= 30; the "Yes" branch predicts 0.9125. The "No" branch splits on #Rooms >= 10, predicting -0.0875 for "Yes" and -0.4125 (the average of the leaf residuals -0.3375 and -0.4875) for "No".
Step 3. In this third step, we combine the tree from step 1 (the root node) and the tree
from step 2. For example, if we were to make a prediction for "Lansing" based on the dataset
in Table 1, the prediction would be

\hat{y} = 0.5875 + \alpha \cdot (-0.4125).
Here, 0.5875 is the value predicted by the root node from step 1. And −0.4125 is the residual
(r1 value) from step 2. The coefficient α is a learning rate or step size parameter in the
range [0, 1] – it helps to not add the full residual but a rescaled version (a smaller ”step”).
A typical value for α would be 0.01 or 0.1, but this is also a hyperparameter that needs to
be tuned manually. Empirically, choosing an alpha value that is too large will likely result
in a gradient boosting model with high variance (a model that tends to overfit).
After step 3, we would now go back to step 2 and repeat the procedure (step 2 and step
3) T times. In each round, we take the predictions from step 3 and compute a new set
of residuals for fitting the next tree. For example, for the "Lansing" example above with
α = 0.1, the new residual r_2 would be

r_2 = 0.25 - \left(0.5875 + 0.1 \cdot (-0.4125)\right) = 0.25 - 0.54625 = -0.29625.
We would do this for all training examples and add a new column r2 to Table 2. Then, we
would fit a tree on these residuals, and so forth.
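Before looking at the general algorithm, here is a minimal sketch of this manual boosting loop on the toy housing data, using only the two numeric features for simplicity:

# Gradient boosting by hand: root node + T shallow trees fit to residuals
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[5, 30], [10, 20], [6, 20], [5, 10]])   # #rooms, age
y = np.array([1.5, 0.5, 0.25, 0.1])                   # price
alpha, T = 0.1, 50                                    # learning rate, rounds

pred = np.full_like(y, y.mean())                      # step 1: root node (0.5875)
for t in range(T):                                    # steps 2-3, repeated
    residuals = y - pred                              # pseudo residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred = pred + alpha * tree.predict(X)             # damped update

print("Final training predictions:", pred.round(3))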
The general algorithm is summarized below.
Here, ri,t is the pseudo residual of the t-th tree and i-th example.
In a regression case, the differentiable loss function in Line 3 could be the squared error

\mathrm{SSE} = \frac{1}{2}\left(y^{(i)} - h(x^{(i)})\right)^{2}.

(We add the 1/2 scaling factor because it cancels nicely in the partial derivative.) Via the chain rule, we can differentiate it as follows, with respect to the prediction:
Algorithm 1 Gradient Boosting
1: Initialize T: the number of trees (gradient boosting rounds)
2: Initialize D: the training dataset, {⟨x^(i), y^(i)⟩}_{i=1}^{n}
3: Choose L(y^(i), h(x^(i))), a differentiable loss function
4: Step 1: Initialize the model h_0(x) = argmin_{ŷ} Σ_{i=1}^{n} L(y^(i), ŷ)  [root node]
5: Step 2:
6: for t = 1 to T do
7:     A. Compute the pseudo residuals r_{i,t} = −[∂L(y^(i), h(x^(i))) / ∂h(x^(i))]_{h(x) = h_{t−1}(x)}, for i = 1 to n
8:     B. Fit a tree to the r_{i,t} values, and create terminal nodes R_{j,t} for j = 1, ..., J_t
9:     C. for j = 1 to J_t do
10:        ŷ_{j,t} = argmin_{ŷ} Σ_{x^(i) ∈ R_{j,t}} L(y^(i), h_{t−1}(x^(i)) + ŷ)
11:    D. Update h_t(x) = h_{t−1}(x) + α Σ_{j=1}^{J_t} ŷ_{j,t} I(x ∈ R_{j,t})
12: Step 3: Return h_T(x)
\frac{\partial}{\partial h(x^{(i)})} \, \frac{1}{2}\left(y^{(i)} - h(x^{(i)})\right)^{2} = 2 \cdot \frac{1}{2}\left(y^{(i)} - h(x^{(i)})\right) \cdot (0 - 1)    (14)

= -\left(y^{(i)} - h(x^{(i)})\right).    (15)
We can apply a similar concept for fitting a gradient boosting model for classification. The
main algorithm is the same, except that we use a different loss function (minimizing the
negative log-likelihood). Also, there are minor details for scaling the ”pseudo residuals.”
However, since we haven't talked about logistic regression and negative log-likelihood optimization yet, this is out of scope at this point.
4.7.1 Overview
• Random forests are among the most widely used machine learning algorithms, probably
due to their relatively good performance "out of the box" and ease of use (not much
tuning is required to get good results).
• In the context of bagging, random forests are relatively easy to understand conceptu-
ally: the random forest algorithm can be understood as bagging with decision trees,
but instead of growing the decision trees by basing the splitting criterion on the com-
plete feature set, we use random feature subsets.
• To summarize, in random forests, we fit decision trees on different bootstrap samples,
and in addition, for each decision tree, we select a random subset of features at each
node to decide upon the optimal split; while the size of the feature subset to consider
at each node is a hyperparameter that we can tune, a “rule-of-thumb” suggestion is
to use NumFeatures = log2 m + 1.
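A minimal sketch, assuming scikit-learn and the breast cancer dataset; the max_features argument controls the size of the random feature subset considered at each split, and max_features="log2" roughly matches the log2(m) + 1 rule of thumb above:

# Random forest: bagging with per-node random feature subsets
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, max_features="log2",
                            random_state=42)
print("Random forest accuracy:", cross_val_score(rf, X, y, cv=5).mean())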
4.7.2 Does random forest select a subset of features for every tree or every
node?
Earlier random decision forests by Tin Kam Ho used the "random subspace method," where
each tree got a random subset of features.
“The essence of the method is to build multiple trees in randomly selected sub-
spaces of the feature space.” – Tin Kam Ho
However, a few years later, Leo Breiman described the procedure of selecting different subsets
of features for each node (while a tree was given the full set of features) — Leo Breiman's
formulation has become the "trademark" random forest algorithm that we typically refer to
these days when we speak of "random forest:"
• The reason why random forests may work better in practice than a regular bagging
model, for example, may be explained by the additional randomization that further
diversifies the individual trees (i.e., decorrelates them).
• In Breiman's random forest paper, the upper bound of the generalization error is given
as

\mathrm{PE} \leq \frac{\bar{\rho} \cdot (1 - s^{2})}{s^{2}},    (16)
where ρ̄ is the average correlation among trees and s measures the strength of the trees as
classifiers, i.e., the average predictive performance in terms of the classifiers' margin. We
do not need to get into the details of how ρ̄ and s are calculated to get an intuition for
their relationship: the lower the correlation, the lower the error; similarly, the higher the
strength of the ensemble of trees, the lower the error. So, randomization of the feature
subspaces may decrease the "strength" of the individual trees, but at the same time it
reduces the correlation among the trees. Compared to bagging, random forests may then be
at a sweet spot where the decrease in correlation and the decrease in strength combine for a
better net result.
While random forests are naturally less interpretable than individual decision trees, where
we can trace a decision via a set of rules, it is possible (and common) to compute the so-called
"feature importance" of the inputs; that is, we can infer how important a feature is for
the overall prediction. However, this is a topic that will be discussed later in the "Feature
Selection" lecture.
4.7.5 Extremely Randomized Trees (ExtraTrees)
• A few years after random forests were developed, an even "more random" procedure
was developed, called Extremely Randomized Trees (ExtraTrees).
• Compared to regular random forests, the ExtraTrees algorithm selects a random feature
at each decision tree node for splitting; hence, it is very fast because there is no
information gain computation and feature comparison step.
• Intuitively, one might say that ExtraTrees have another "random component" (compared
to random forests) that further reduces the correlation among trees; however, it
might decrease the strength of the individual trees (if you think back to the generalization
error bound discussed in the previous section on random forests).
4.8 Stacking
4.8.1 Overview
• In general, in stacking, we have “base learners” that learn from the initial training
set, and the resulting models then make predictions that serve as input features to a
“meta-learner.”
Figure 15: Illustration of the basic concept of stacking, which is relatively similar to the voting classifier at the beginning of this lecture. Note that here, in contrast to majority voting, we have a meta-classifier that takes the predictions of the models produced by the base learners (h_1, ..., h_n) as inputs.
The problem with the naive stacking algorithm outlined in Fig. 14 and Fig. 15 is that it is highly prone to overfitting: if the base learners overfit, the meta-classifier relies heavily on the predictions made by those base classifiers. A better alternative is to use stacking with k-fold cross-validation or leave-one-out cross-validation.
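A minimal sketch of stacking with internal cross-validation, using scikit-learn's StackingClassifier (the base learners and dataset are illustrative choices):

# Stacking: base-learner predictions, produced via 5-fold CV, feed a meta-classifier
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
base_learners = [("rf", RandomForestClassifier(random_state=42)),
                 ("knn", KNeighborsClassifier())]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)   # CV predictions reduce the overfitting above
print("Stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())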
Figure 16: Illustration of k-fold cross-validation (here, k = 5). (A) In each of the k iterations, one fold serves as the validation fold and the remaining folds as training folds, yielding k performance estimates. (B, C) In each iteration, the learning algorithm fits a model to the training-fold data under the chosen hyperparameter values, and the model's predictions on the validation-fold data are compared against the validation-fold labels to measure performance.

Figure 17: Illustration of leave-one-out cross-validation, which is a special case of k-fold cross-validation, where k = n (and n is the number of examples in the training set).
Figure 19: Illustration of stacking with cross-validation: in each of the k iterations, the level-1 predictions (y_1, ..., y_n) of the base models on the held-out fold are collected and serve as inputs to the meta-classifier, which produces the final prediction y_f.
Title: Enhancing Credit Card Fraud Detection Using Ensemble Learning: A Case Study
Abstract:
Credit card fraud poses a significant challenge to financial institutions, requiring advanced
techniques for timely detection and prevention. This case study explores the application of
ensemble learning methods to enhance the accuracy and robustness of credit card fraud detection
systems. Ensemble learning combines multiple models to create a stronger, more reliable
predictive system. In this study, we leverage various algorithms such as Random Forest,
Gradient Boosting, and Bagging, and evaluate their performance in comparison to individual
models.
1. Introduction:
Credit card fraud is a prevalent issue affecting both financial institutions and cardholders.
Traditional methods of fraud detection often struggle to keep up with evolving tactics employed
by fraudsters. Ensemble learning, which combines the strengths of multiple models, offers a
promising solution to enhance fraud detection accuracy.
2. Dataset:
A comprehensive dataset containing historical credit card transactions is used for training and
testing the models. The dataset includes features such as transaction amount, location, time, and
various transaction patterns. A subset of the data contains instances of fraudulent transactions,
providing a balanced representation of both classes.
3. Methodology:
Several ensemble learning algorithms are implemented and compared, including Random Forest,
Gradient Boosting, and Bagging. Each algorithm is fine-tuned to optimize performance and
reduce false positives and false negatives.
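A minimal sketch of such a comparison on a synthetic, deliberately imbalanced dataset standing in for real transaction data (all numbers are illustrative):

# Compare ensemble methods by cross-validated ROC-AUC on imbalanced data
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, BaggingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.98],
                           random_state=42)   # ~2% minority ("fraud") class
models = {"Random Forest": RandomForestClassifier(random_state=42),
          "Gradient Boosting": GradientBoostingClassifier(random_state=42),
          "Bagging": BaggingClassifier(random_state=42)}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    print(f"{name}: ROC-AUC = {auc:.3f}")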
4. Feature Engineering:
To improve the effectiveness of the models, feature engineering techniques are applied to extract
meaningful information from the dataset. This includes normalizing transaction amounts,
encoding categorical variables, and extracting temporal features.
5. Model Evaluation:
The ensemble models are evaluated using standard metrics such as accuracy, precision, recall, F1
score, and ROC-AUC. The evaluation process involves cross-validation and testing on a separate
dataset to ensure generalizability.
6. Results:
The results of the ensemble models are compared with those of individual models and traditional
fraud detection methods. The ensemble learning approach is expected to demonstrate superior
performance in terms of accuracy and robustness.
7. Interpretability:
An analysis of the interpretability of the ensemble models is conducted to provide insights into
the decision-making process. Understanding the factors contributing to fraud detection can aid in
model validation and trust-building.
8. Conclusion:
The case study concludes with a summary of the findings, highlighting the effectiveness of
ensemble learning in credit card fraud detection. Recommendations for implementing these
models in real-world scenarios and potential avenues for future research are also discussed.
Unit 5: Machine Learning in Practice
Data preprocessing, Model selection, Evaluation, Deployment, Ethics of machine learning
Data preprocessing
Data preprocessing is a crucial step in the machine learning pipeline that involves cleaning,
organizing, and transforming raw data into a format suitable for training and evaluating machine
learning models. The goal is to enhance the quality of the data, address potential issues, and
prepare it for effective model training. Here are key aspects of data preprocessing in the context
of machine learning projects:
1. Handling Missing Data:
Identify and handle missing values in the dataset. This may involve imputing
missing values using statistical measures (mean, median, mode) or advanced
techniques such as interpolation.
2. Dealing with Outliers:
Detect and address outliers that could significantly impact model performance.
This may involve removing outliers or transforming them to bring them within an
acceptable range.
3. Data Cleaning:
Clean the dataset by addressing inconsistencies, errors, or discrepancies. This
includes fixing typos, standardizing formats, and ensuring data consistency.
4. Feature Scaling:
Normalize or scale features to ensure that all features contribute equally to model
training. Common methods include Min-Max scaling or standardization (Z-score
normalization).
5. Encoding Categorical Variables:
Convert categorical variables into a format suitable for machine learning
algorithms. This often involves one-hot encoding, where categorical variables are
transformed into binary vectors.
6. Handling Imbalanced Data:
Address class imbalances in classification tasks to prevent models from being
biased towards the majority class. Techniques include oversampling,
undersampling, or using specialized algorithms for imbalanced data.
7. Feature Engineering:
Create new features or transform existing ones to capture more relevant
information for the model. Feature engineering can improve a model's ability to
understand complex relationships within the data.
8. Data Splitting:
Split the dataset into training, validation, and test sets. The training set is used to
train the model, the validation set helps tune hyperparameters, and the test set
evaluates the model's performance on unseen data.
9. Handling Time-Series Data:
For time-series data, handle temporal aspects such as time gaps, missing values,
and seasonality. This may involve resampling, lag features, or other time-specific
preprocessing steps.
10. Documentation and Logging:
Document all steps taken during data preprocessing and maintain a record of
transformations applied. Logging helps in reproducing results, debugging, and
ensuring transparency in the machine learning pipeline.
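A minimal sketch tying several of these steps together with scikit-learn (the column names and values are hypothetical):

# Preprocessing pipeline: impute, scale numeric columns, one-hot encode categoricals
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"age": [25, None, 47, 35],
                   "income": [40000, 52000, None, 61000],
                   "city": ["Boston", "Madison", "Boston", "Lansing"]})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),   # missing data
                    ("scale", StandardScaler())])                   # feature scaling
preprocess = ColumnTransformer([("num", numeric, ["age", "income"]),
                                ("cat", OneHotEncoder(), ["city"])])  # encoding
print(preprocess.fit_transform(df))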
Model Selection in Machine Learning Projects: Picking the Right Tool for the Job
Choosing the right model for your ML project is like picking the perfect tool for a DIY project.
You wouldn't use a screwdriver for hammering, and you wouldn't use a chainsaw for trimming
nails! Similarly, in the vast toolbox of ML algorithms, selecting the best one for your specific
problem is crucial for success. Let's break down the practical side of model selection:
Why is it important?
Imagine throwing every tool you own at a problem! It's probably going to be messy and
inefficient. Likewise, using the wrong model in your project can lead to:
Poor performance: Inaccurate predictions, weak generalizability, and wasted resources.
Overfitting: Memorizing the training data but failing on unseen examples.
Underfitting: Failing to capture the underlying patterns in the data.
Factors to Consider:
Problem type: Regression, classification, clustering, etc. Each requires different types of
models.
Data characteristics: Size, dimensionality, complexity, distribution. Some models work
better with specific data types.
Project constraints: Training time, computational resources, model
interpretability. Different models have different demands.
Business needs: Accuracy, explainability, real-time predictions. Prioritize metrics
relevant to your goals.
Common Model Selection Techniques:
Hold-out validation: Split your data into training and testing sets, train models on the
training set, and evaluate them on the unseen testing set.
K-fold cross-validation: Randomly split your data into k folds; train models on k-1
folds and test on the remaining fold; repeat for all k folds and average the results.
Grid search and hyperparameter tuning: Explore different combinations of model
parameters (hyperparameters) to find the optimal configuration for your data.
Model comparison metrics: Accuracy, precision, recall, F1-score, AUC-ROC. Choose
metrics that align with your project goals.
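A minimal sketch combining hold-out validation, k-fold cross-validation, and grid search with scikit-learn (the dataset and parameter grid are illustrative choices):

# Hold-out split + 5-fold grid search over random forest hyperparameters
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=5, scoring="accuracy")   # 5-fold CV per configuration
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Held-out test accuracy:", search.score(X_test, y_test))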
Real-world Example:
Building a model to predict house prices. You might consider:
Regression models: Linear regression, decision trees, or gradient boosting for continuous
price prediction.
Feature engineering: Create features like "distance to amenities" or "average price in the
neighborhood".
Model comparison: Use k-fold cross-validation to compare performance of different
models on the same data.
Choose the model: Select the one with the best average score on the chosen metric
(e.g., mean squared error) while considering additional factors like interpretability for
explaining price variations.
Tips for Effective Selection:
Start simple: Don't jump into complex models too quickly. Consider baseline models for
initial comparisons.
Iterate and refine: Keep trying different models, hyperparameters, and feature
engineering techniques.
Understand the trade-offs: No model is perfect. Balance accuracy with factors like
training time, interpretability, and resource constraints.
Visualize your results: Use plots and charts to understand model behavior and compare
performance.
Remember, model selection is an art, not a science. There's no one-size-fits-all solution. By
carefully considering your specific project context and applying these practical tips, you can
choose the right tool for the job and build successful ML models that deliver real value.
Model evaluation
Model evaluation is a critical aspect of machine learning projects that involves assessing the
performance and generalization ability of a trained model. It helps determine how well the model
is likely to perform on new, unseen data. Here are key considerations and techniques for model
evaluation in the context of machine learning projects:
1. Metrics Selection:
Choose appropriate evaluation metrics based on the nature of the problem.
Common metrics include accuracy, precision, recall, F1 score for classification
tasks, and mean squared error, R-squared for regression tasks. The choice of
metric depends on the project goals and the specific characteristics of the data.
2. Confusion Matrix:
For classification problems, construct a confusion matrix to visualize the model's
performance in terms of true positives, true negatives, false positives, and false
negatives. This aids in understanding the types of errors the model makes.
3. Cross-Validation:
Implement cross-validation techniques, such as k-fold cross-validation or leave-
one-out cross-validation, to robustly assess model performance. This helps
mitigate the impact of variability in the training and test datasets.
4. Learning Curves:
Analyze learning curves to understand how the model's performance changes with
respect to the amount of training data. This helps identify issues like overfitting or
underfitting.
5. ROC Curve and AUC-ROC:
For binary classification problems, plot Receiver Operating Characteristic (ROC)
curves and calculate the Area Under the Curve (AUC-ROC). This provides
insights into the trade-off between sensitivity and specificity.
6. Precision-Recall Curve:
For imbalanced classification problems, examine precision-recall curves, which
illustrate the precision-recall trade-off at different decision thresholds.
7. Feature Importance:
Assess the importance of features in the model to understand which features
contribute most to the predictions. This can be done using techniques such as
permutation importance or feature importance plots.
8. Model Comparison:
Compare the performance of multiple models to select the best-performing one.
This can involve statistical tests or visualizations to highlight differences in
performance.
9. Bias and Fairness Evaluation:
Evaluate the model for bias and fairness, especially in sensitive applications.
Analyze the model's behavior across different demographic groups to ensure
equitable outcomes.
10. Deployment Considerations:
Consider the practical implications of deploying the model, including the
potential impact on end-users and the broader environment. Evaluate the model's
robustness to changes in the input data distribution.
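A minimal sketch of several of these techniques with scikit-learn (the dataset and model are illustrative):

# Confusion matrix, precision/recall/F1 report, and ROC-AUC on a hold-out set
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

print(confusion_matrix(y_te, model.predict(X_te)))        # TN/FP/FN/TP counts
print(classification_report(y_te, model.predict(X_te)))   # precision/recall/F1
print("ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))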
Model deployment
Model deployment in the context of machine learning projects refers to the process of taking a
trained machine learning model and integrating it into a production environment where it can be
used to make predictions on new, unseen data. It involves several steps to ensure that the model
functions effectively, reliably, and securely in real-world applications. Here's an overview of the
key considerations in model deployment:
1. Model Serialization:
Serialize the trained model to a format that can be easily loaded and utilized by
the deployment environment. Common serialization formats include pickle,
joblib, or formats compatible with specific deployment frameworks.
2. Scalability:
Ensure that the deployed model can scale to handle varying workloads and data
volumes. This may involve using scalable infrastructure, containerization (e.g.,
Docker), or cloud-based solutions.
3. API Development:
Create an API (Application Programming Interface) to expose the model's
functionality, allowing other software applications to interact with and make
predictions using the model. RESTful APIs are commonly used for this purpose.
4. Input Validation:
Implement robust input validation to ensure that the incoming data meets the
model's expectations. This includes checking data types, ranges, and handling
missing values appropriately.
5. Security Measures:
Implement security measures to protect the deployed model from potential
vulnerabilities. This may involve encryption of data in transit, access controls, and
regular security audits.
6. Monitoring and Logging:
Set up monitoring systems to track the model's performance in real-time. Logging
should capture relevant information, including input data, predictions, and any
issues that may arise during deployment.
7. Versioning:
Establish version control for the deployed models to track changes and updates.
This ensures that different versions of the model can be easily managed, rolled
back, or upgraded without disrupting the system.
8. Model A/B Testing:
Implement A/B testing methodologies to evaluate the performance of different
model versions in a live environment. This allows for data-driven decisions
regarding model improvements or changes.
9. Downtime Mitigation:
Plan for and mitigate downtime during updates or maintenance. This may involve
deploying redundant instances, load balancing, or using strategies like canary
releases to minimize service interruptions.
10. Documentation:
Create comprehensive documentation for the deployed model, including
information on the API, input requirements, output format, and any specific
considerations for users or developers interacting with the model.
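A minimal sketch of serialization plus a prediction API; the Flask endpoint is a hypothetical illustration, and a production deployment would add the input validation, security, logging, and versioning measures described above:

# Serialize a trained model with joblib and expose it via a minimal Flask API
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42).fit(X, y)
joblib.dump(model, "model.joblib")                 # step 1: serialization

from flask import Flask, request, jsonify          # hypothetical API layer

app = Flask(__name__)
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # validate inputs in production
    return jsonify({"prediction": int(model.predict([features])[0])})

# app.run()  # uncomment to serve locally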
Model deployment is a critical phase that bridges the gap between the development of machine
learning models and their practical application in real-world scenarios. Effective deployment
ensures that the benefits of the trained model are realized in production while maintaining
performance, reliability, and security. It requires collaboration between data scientists, software
engineers, and DevOps teams to create a seamless and efficient deployment pipeline.