
Machine Learning

MR22-1CS0204
II B.Tech (AI & ML) II SEM
Digital Notes
MALLAREDDY UNIVERSITY

II B.Tech II Semester
L/T/P/C

3/0/2/4

(MR22-1CS0204) MACHINE LEARNING


PREREQUISITES:

Mathematical Foundations for Machine Learning, Data Mining


Course Objectives:

o Understand the mathematical foundations of machine learning.

o Be able to apply supervised learning algorithms to solve real-world problems.

o Be able to apply unsupervised learning algorithms to solve real-world problems.

o Be able to apply ensemble algorithms to improve the performance of machine learning models.

o Be able to deploy machine learning models in production.

Unit 1: Mathematical Foundations of Machine Learning

Introduction to machine learning, Linear algebra, Probability and statistics, Calculus, Optimization

[Reference 1]

Case Study: Predicting Customer Churn Using Machine Learning: A Case Study in the
Telecom Industry

Unit 2: Supervised Learning Algorithms

Supervised Learning: regression and classification problems, simple linear regression, multiple
linear regression, ridge regression, logistic regression, k-nearest neighbor, naive Bayes classifier,
linear discriminant analysis, support vector machine, decision trees, bias-variance trade-off,
cross-validation methods such as leave-one-out (LOO) cross-validation, k-folds cross validation,
multi-layer perceptron, feed-forward neural network

Case Study: Predicting House Prices: A Comparative Study of Supervised Learning Algorithms

[Reference 2]
Unit 3: Unsupervised Learning Algorithms

Unsupervised Learning: clustering algorithms, k-means/k-medoid, hierarchical clustering, top-down, bottom-up: single-linkage, multiple linkage, dimensionality reduction, principal component analysis.

Case study: Customer Segmentation for E-commerce: An Unsupervised Learning

[Reference 2&3]

Unit 4: Ensemble Algorithms

Bagging, Boosting, Random forests, AdaBoost, Gradient boosting.

Case Study: Credit Card Fraud Detection: Ensemble Learning

[Reference 2&3]

Unit 5: Machine Learning in Practice

Data preprocessing, Model selection, Evaluation, Deployment, Ethics of machine learning

Case Study: Predictive Maintenance in Manufacturing: A Machine Learning Case Study

[Reference 3]

Reference Textbooks:

[1] "Mathematics for Machine Learning" by Marc Peter Deisenroth, A. Aldo Faisal, and
Cheng Soon

[2] Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron

[3] Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido

Course Outcomes:

Upon completion of this course, students will be able to:

 Explain the basic concepts of machine learning.

 Implement supervised learning algorithms in Python.

 Implement unsupervised learning algorithms in Python.

 Implement ensemble algorithms in Python.

 Deploy machine learning models in production.


Unit 1: Mathematical Foundations of Machine Learning

Contents
1. Introduction to machine learning
Types of Machine Learning
Examples of Machine Learning Applications
2. Essential Linear Algebra Concepts in Machine Learning
3. Essential Probability and Statistics Concepts in Machine Learning
4. Essential Calculus Concepts in Machine Learning
5. Optimization
Python code for Unit 1
Summary

Introduction to machine learning, Linear algebra, Probability and statistics, Calculus, Optimization

1. Introduction to machine learning


Machine Learning (ML) is a transformative field within the broader domain of artificial
intelligence (AI) that empowers computers to learn from data and improve their performance
over time without explicit programming. It involves the development of algorithms and models
that enable computers to recognize patterns, make predictions, and take actions based on data.

Key Concepts of Machine Learning:

1. Data:

 At the heart of machine learning is data. ML algorithms require large amounts of data to learn patterns and make accurate predictions. The quality and quantity of data significantly impact the performance of machine learning models.

2. Training:

 During the training phase, a machine learning model is exposed to a labeled dataset, where it learns to identify patterns and relationships. The model adjusts its parameters iteratively to minimize the difference between its predictions and the actual outcomes.

3. Features and Labels:

 In supervised learning, a model learns from labeled data, where features (input variables) are used to predict labels (output variable). For instance, in a spam email filter, features might include the words in an email, and the label would indicate whether the email is spam or not.
4. Algorithms:

 Various machine learning algorithms exist, each suited to different types of tasks.
Common algorithms include linear regression for predicting numerical values,
decision trees for classification, and neural networks for complex tasks like image
recognition.

5. Testing and Evaluation:

 After training, the model is tested on new, unseen data to evaluate its
performance. This step helps ensure that the model can generalize well to new
situations and make accurate predictions.

Types of Machine Learning:


1. Supervised Learning:

Definition: In supervised learning, the model is trained on a labeled dataset, where the input data
is paired with corresponding output labels. The goal is for the model to learn a mapping from
input features to the correct output by generalizing patterns from the training data.

Real-time Example: Email Spam Classification

 Input Features: Words, phrases, and metadata of an email.

 Output Label: Spam or Not Spam.

 Training: The model is trained on a dataset of labeled emails, learning to identify patterns associated with spam or non-spam characteristics.

 Application: Once trained, the model can automatically classify incoming emails as
spam or not spam, helping filter unwanted messages.

2. Unsupervised Learning:

Definition: Unsupervised learning involves training a model on an unlabeled dataset, and the
algorithm explores the inherent structure in the data without predefined output labels. The goal is
often to find patterns, relationships, or groupings within the data.

Real-time Example: Customer Segmentation

 Input Data: Customer purchase history, demographics, and behavior.

 Task: Group customers into segments based on similarities.

 Application: This can help businesses tailor marketing strategies for different customer
segments, providing more personalized experiences and improving overall customer
satisfaction.

3. Self-Supervised Learning:

Definition: Self-supervised learning is a type of unsupervised learning where the model generates its own labels from the input data, creating a pseudo-labeled dataset. The model learns by predicting parts of the input data based on the remaining parts.

Real-time Example: Word Embeddings

 Input Data: Text corpus (sentences or paragraphs).

 Task: Predict missing words in sentences.

 Application: The model learns contextual representations of words, capturing semantic relationships. These embeddings can be used in various natural language processing tasks, such as sentiment analysis or language translation.

4. Reinforcement Learning:

Definition: Reinforcement learning involves an agent interacting with an environment and learning to make decisions by receiving feedback in the form of rewards or penalties. The agent's goal is to maximize cumulative rewards over time.

Real-time Example: Autonomous Driving

 Agent: Self-driving car.

 Environment: Roads, traffic, pedestrians.

 Task: Learn to navigate safely and reach a destination.

 Application: The self-driving car learns from experiences, receiving rewards (positive)
for safe driving and penalties (negative) for traffic violations. Over time, the model
optimizes its behavior to navigate efficiently and safely.

These types of machine learning cover a broad spectrum of applications, showcasing the
versatility of these approaches in solving various real-world problems. Each type has its
strengths and is suited to different scenarios depending on the nature of the data and the desired
outcome.

Examples of Machine Learning Applications:
1. Image Recognition:

 ML models can classify and recognize objects in images, enabling applications such as facial recognition, self-driving cars, and medical image analysis.

2. Natural Language Processing (NLP):

 NLP models understand and generate human language, facilitating tasks like language translation, sentiment analysis, and chatbot interactions.

3. Predictive Analytics:

 ML is used in business for predicting customer behavior, stock prices, and equipment failures, helping organizations make informed decisions.

4. Healthcare Diagnostics:

 ML models analyze medical data to assist in diagnosing diseases, predicting patient outcomes, and personalizing treatment plans.

5. Recommendation Systems:

 ML algorithms power recommendation engines in platforms like Netflix and Amazon, suggesting products or content based on user preferences.

Machine learning continues to advance rapidly, revolutionizing industries and shaping the way we interact with technology. As the field evolves, it holds the potential to address increasingly complex challenges and drive innovation across various domains.
2. Essential Linear Algebra Concepts in Machine Learning:
2.1. Vectors and Vector Spaces:

 Vectors: Represent data points with both magnitude and direction (e.g., features in a
dataset).

 Vector Spaces: Collections of vectors closed under addition and scalar multiplication, defining the "playground" for operations.

2.2. Matrices and Operations:

 Matrices: Two-dimensional arrays of numbers used to represent relationships between vectors (e.g., coefficients in a linear model).

 Matrix Operations: Addition, multiplication, and other operations that manipulate and
analyze data (e.g., calculating distances, projections).

2.3. Linear Transformations:

 Functions that map vectors to other vectors while preserving linear relationships
(e.g., rotations, scaling).

 Used in various algorithms like linear regression, neural networks, and dimensionality
reduction.

2.4. Systems of Linear Equations and Optimization:

 Solving systems of linear equations: Finding unknown values based on their linear
relationships (e.g., fitting a line to data points).

 Optimization problems: Finding the best parameters for a model by minimizing or maximizing a function (e.g., cost function).

2.5. Eigenvalues and Eigenvectors:

 Eigenvalues: Scaling factors by which a vector is transformed by a linear transformation (e.g., principal components in PCA).

 Eigenvectors: Directions along which the transformation has the same scaling factor
(e.g., principal axes in PCA).

 Used in dimensionality reduction, feature extraction, and other techniques.

Additional concepts:

 Norm & Inner Product: Measuring the size and angle of vectors, respectively.

 Orthogonality & Basis Vectors: Independent vectors representing the entire space.

 Singular Value Decomposition (SVD): Decomposing matrices for dimensionality reduction and data analysis.

Remember, these are just the foundation. Each concept has nuances and applications specific to
different ML algorithms.

Real-world applications:

 Linear Regression: Predicting continuous target variables based on features.

 Classification: Categorizing data points based on features.

 Dimensionality Reduction: Simplifying complex data by preserving relevant information.

 Recommender Systems: Predicting user preferences based on past interactions.

 Natural Language Processing: Extracting meaning from text data.

Learning resources:

 3Blue1Brown's Linear Algebra series: https://m.youtube.com/watch?v=fNk_zzaMoSs

 MIT OpenCourseware: https://ocw.mit.edu/courses/18-06sc-linear-algebra-fall-2011/

 Stanford CS229: https://online.stanford.edu/courses/cs229-machine-learning

By mastering these concepts, you'll unlock the power of linear algebra for building effective and
insightful machine learning models!

3. Essential Probability and Statistics Concepts in Machine Learning:
Probability:

 Random Variables: Represent uncertain quantities with possible outcomes and their
associated probabilities (e.g., coin flip outcome).

 Probability Distributions: Describe the distribution of probabilities across possible outcomes (e.g., normal distribution for height).

 Conditional Probability: Probability of one event happening given another (e.g., chance
of rain given cloudy skies).

 Bayes' Theorem: Updating beliefs about a hypothesis based on new evidence (e.g., spam
filter learning from user feedback).

Statistics:

 Descriptive Statistics: Summarizing data trends, like mean, median, standard deviation, and variance.

 Inferential Statistics: Drawing conclusions about populations from samples (e.g., testing
hypotheses about model accuracy).

 Regression Analysis: Modeling relationships between variables (e.g., predicting house prices based on features).

 Correlation & Causation: Distinguishing between association and cause-and-effect relationships.

 Dimensionality Reduction: Transforming data into lower dimensions while preserving information (e.g., PCA).

Applications in Machine Learning:

 Model Building & Training: Choosing appropriate algorithms, evaluating model performance, and preventing overfitting.

 Feature Engineering & Selection: Selecting informative features and preprocessing data
for better learning.

 Uncertainty Quantification & Error Analysis: Understanding confidence levels in predictions and identifying potential biases.

 Decision Making & Risk Assessment: Assessing costs and benefits of different model
outputs and actions.

 Anomaly Detection & Outlier Identification: Identifying unusual data points that may
indicate errors or new patterns.

Additional Concepts:

 Central Limit Theorem: Approximating the distribution of averages with a normal distribution for large samples.

 Maximum Likelihood Estimation: Finding model parameters that maximize the probability of observed data.

 Markov Chain Monte Carlo (MCMC): Sampling from complex probability distributions through iterative simulations.

 Bayesian Networks: Representing relationships between variables using directed acyclic graphs.

Remember, a solid understanding of probability and statistics is crucial for building robust,
reliable, and interpretable machine learning models.

4. Essential Calculus Concepts in Machine Learning:
Calculus, the study of change, plays a vital role in understanding and optimizing machine
learning models. Here are some key concepts:

1. Derivatives:

 Rate of change of a function with respect to its input variable. Used for:

o Gradient Descent: Optimizing model parameters by finding the direction of steepest decrease in the loss function.

o Backpropagation: Propagating errors backwards through neural networks to adjust weights.

o Feature Importance: Understanding how changes in features affect the model output.

2. Partial Derivatives:

 Rate of change of a multi-variable function with respect to one variable while holding
others constant. Used for:

o Optimizing models with multiple parameters.

o Analyzing interactions between features in complex models.

3. Minima and Maxima:

 Points where the derivative of a function equals zero. Used for:

o Finding optimal solutions for optimization problems.

o Identifying potential errors in models (e.g., saddle points).

4. Taylor Series:

 Approximation of a function around a specific point using its derivatives. Used for:

o Building local approximations of complex functions for optimization.

o Understanding how small changes in inputs affect outputs.

5. Integration:

 Summation of a function over a given interval. Used for:

o Calculating areas under curves (e.g., loss surfaces).

o Computing expectations of probability distributions.

6. Convex Optimization:

 Finding the minimum of a convex function. Used for:

o Training many machine learning models efficiently.

o Guaranteeing convergence to optimal solutions.

Additional concepts:

 Hessian Matrix: Second derivative matrix used for analyzing local optima and saddle
points.

 Lagrange Multipliers: Optimizing functions under constraints.

 Numerical Methods: Approximation techniques for solving calculus problems computationally.

Machine Learning Applications:

 Regression Analysis: Fitting models to data and understanding relationships between variables.

 Classification: Predicting categories of data points.

 Support Vector Machines: Finding optimal hyperplanes for classification.

 Neural Networks: Learning complex relationships from data through gradient descent.

 Bayesian Inference: Updating beliefs about parameters based on new evidence.

5. Optimization
Optimization is a critical component of machine learning, where the goal is to find the
best possible parameters for a model to achieve optimal performance. Here are some
important concepts in optimization in the context of machine learning:

1. Objective Function:

 Definition: The objective function, also known as the loss function or cost function, measures the performance of a model for a given set of parameters.

 Relevance to Machine Learning: Optimization aims to minimize (or maximize) the objective function, guiding the learning process toward the most effective model parameters.

2. Optimization Algorithms:

 Definition: Optimization algorithms are methods used to update the parameters of a model iteratively to minimize the objective function.

 Relevance to Machine Learning: Popular optimization algorithms include Gradient Descent, Stochastic Gradient Descent (SGD), Adam, RMSprop, and others, which play a crucial role in training machine learning models.

3. Gradient Descent:

 Definition: Gradient Descent is an iterative optimization algorithm that uses the gradient of the objective function to update model parameters in the direction of steepest decrease.

 Relevance to Machine Learning: Gradient Descent is widely used in training neural networks and other machine learning models to minimize the loss function.

4. Stochastic Gradient Descent (SGD):

 Definition: SGD is a variant of Gradient Descent that uses a random subset (mini-batch) of training data to update parameters, introducing stochasticity to the optimization process.

 Relevance to Machine Learning: SGD accelerates optimization on large datasets and is a common choice in training deep learning models.

5. Learning Rate:

 Definition: The learning rate is a hyperparameter that controls the size of the
steps taken during optimization. It influences the convergence speed and stability
of the algorithm.

 Relevance to Machine Learning: Choosing an appropriate learning rate is crucial, as too small a rate may lead to slow convergence, while too large a rate may cause oscillations or divergence.

6. Hyperparameter Tuning:

 Definition: Hyperparameter tuning involves optimizing the hyperparameters of a machine learning model, such as the learning rate, regularization strength, or the number of hidden units.

 Relevance to Machine Learning: Proper tuning improves model performance and generalization, impacting the optimization process.

7. Backpropagation:

 Definition: Backpropagation is a key algorithm in training neural networks. It computes the gradient of the loss function with respect to the model's parameters.

 Relevance to Machine Learning: Backpropagation enables efficient optimization by propagating errors backward through the network to update weights and biases.

8. Convex and Non-convex Optimization:

 Definition: Convex optimization involves functions with a single minimum (or maximum) point, making them relatively straightforward to optimize. Non-convex optimization involves functions with multiple local optima.

 Relevance to Machine Learning: Many machine learning problems involve non-convex optimization, and finding a global minimum can be challenging. Gradient-based methods may get stuck in local minima.

9. Regularization:

 Definition: Regularization techniques penalize complex models by adding a regularization term to the objective function, preventing overfitting.

 Relevance to Machine Learning: Regularization is essential in optimization to balance model complexity and fit to the training data, improving generalization to unseen data.

10. Convergence Criteria:

 Definition: Convergence criteria define when an optimization algorithm should stop iterating, typically based on changes in the objective function or parameters.

 Relevance to Machine Learning: Setting appropriate convergence criteria ensures that the optimization process stops when the model has reached a satisfactory state.

Understanding and applying optimization concepts are crucial for efficiently training machine
learning models, especially in the context of deep learning, where optimization plays a central
role in learning hierarchical representations.

Python code for Unit 1


1. Linear algebra:

o Program 1: Implement a program to calculate the dot product of two vectors.
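A minimal NumPy sketch of such a program (the example vectors are illustrative):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: the sum of element-wise products
dot_manual = sum(x * y for x, y in zip(a, b))
dot_numpy = np.dot(a, b)

print(dot_manual, dot_numpy)  # both print 32.0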

o Program 2: Implement a program to calculate the eigenvalues and
eigenvectors of a matrix.
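One possible implementation with NumPy (the matrix is illustrative):

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are the eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):")
print(eigenvectors)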

o Program 3: Implement a program to solve a system of linear equations using
Gaussian elimination.
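A sketch of Gaussian elimination with partial pivoting (the example system is illustrative):

import numpy as np

def gaussian_elimination(A, b):
    # Forward elimination with partial pivoting, then back substitution
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        # Swap in the row with the largest pivot to improve numerical stability
        p = np.argmax(np.abs(A[k:, k])) + k
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]])
b = np.array([8.0, -11.0, -3.0])
print(gaussian_elimination(A, b))  # expected: [ 2.  3. -1.]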

2. Probability and statistics:

o Program 1: Implement a program to generate a random variable from a given probability distribution.
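A minimal sketch using NumPy's random generator (the distribution parameters are illustrative):

import numpy as np

rng = np.random.default_rng(seed=42)

# Draw samples from a few common probability distributions
print(rng.normal(loc=0.0, scale=1.0, size=5))   # Gaussian
print(rng.uniform(low=0.0, high=1.0, size=5))   # Uniform
print(rng.binomial(n=10, p=0.5, size=5))        # Binomial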

o Program 2: Implement a program to calculate the probability of a certain
event occurring.
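One way to do this is a Monte Carlo estimate; the event here (two dice summing to 7) is illustrative:

import numpy as np

rng = np.random.default_rng(seed=0)
trials = 100_000

# Event: the sum of two fair dice equals 7 (exact probability 6/36 = 1/6)
dice = rng.integers(1, 7, size=(trials, 2))
estimate = np.mean(dice.sum(axis=1) == 7)
print(f"Estimated P(sum = 7): {estimate:.4f} (exact: {1/6:.4f})")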

o Program 3: Implement a program to conduct a hypothesis test.
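A sketch of a one-sample t-test with SciPy (the sample and null hypothesis are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=52.0, scale=10.0, size=40)

# H0: population mean = 50 vs. H1: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Reject H0 at the 5% level" if p_value < 0.05 else "Fail to reject H0")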

3. Calculus:

o Program 1: Implement a program to calculate the derivative of a function.
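A symbolic sketch with SymPy (the function is illustrative):

import sympy as sp

x = sp.symbols("x")
f = x**3 + 2 * x**2 - 5 * x + 1

# Symbolic derivative, then evaluation at x = 2
f_prime = sp.diff(f, x)
print(f_prime)             # 3*x**2 + 4*x - 5
print(f_prime.subs(x, 2))  # 15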

o Program 2: Implement a program to calculate the integral of a function.
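A matching SymPy sketch for integration (again with an illustrative function):

import sympy as sp

x = sp.symbols("x")
f = x**2

print(sp.integrate(f, x))          # indefinite integral: x**3/3
print(sp.integrate(f, (x, 0, 3)))  # definite integral from 0 to 3: 9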

o Program 3: Implement a program to solve a differential equation.
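A numerical sketch with SciPy's solve_ivp (the equation dy/dt = -2y is illustrative):

import numpy as np
from scipy.integrate import solve_ivp

def dydt(t, y):
    return -2 * y  # dy/dt = -2y, with exact solution y = exp(-2t)

sol = solve_ivp(dydt, t_span=(0, 2), y0=[1.0], t_eval=np.linspace(0, 2, 5))
print(sol.y[0])            # numerical solution at the requested times
print(np.exp(-2 * sol.t))  # exact values for comparison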

4. Optimization:

o Program 1: Implement a program to find the minimum of a function using gradient descent.
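A minimal gradient descent sketch (the function f(x) = (x - 3)^2 is illustrative):

def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    # Repeatedly step against the gradient to move toward a minimum
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2(x - 3) and its minimum at x = 3
minimum = gradient_descent(grad=lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # close to 3.0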

o Program 2: Implement a program to find the maximum of a function using
Newton's method.
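A sketch of Newton's method applied to f'(x) = 0 (the function is illustrative):

def newton_optimize(f_prime, f_double_prime, x0, steps=20):
    # Newton iteration on the first derivative finds a stationary point
    x = x0
    for _ in range(steps):
        x -= f_prime(x) / f_double_prime(x)
    return x

# Maximize f(x) = -(x - 2)^2 + 5: f'(x) = -2(x - 2), f''(x) = -2
x_max = newton_optimize(lambda x: -2 * (x - 2), lambda x: -2.0, x0=0.0)
print(x_max)  # 2.0, and f'' < 0 confirms a maximum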

o Program 3: Implement a program to solve a linear programming problem.
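A sketch using SciPy's linprog (the objective and constraints are illustrative):

from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective.
c = [-3, -2]
A_ub = [[1, 1], [1, 3]]
b_ub = [4, 6]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)     # optimal (x, y)
print(-result.fun)  # maximum objective value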

Summary:
Introduction to Machine Learning: Machine learning is a field of artificial intelligence that
focuses on the development of algorithms and models that enable computers to learn and make
predictions or decisions without explicit programming. It encompasses various techniques,
including supervised learning, unsupervised learning, and reinforcement learning. Machine
learning applications range from image recognition and natural language processing to
recommendation systems and autonomous vehicles.

Linear Algebra in Machine Learning: Linear algebra plays a crucial role in machine learning
as it provides the foundation for many operations and concepts. Vectors, matrices, and linear
transformations are fundamental components used in representing data, defining models, and
performing operations like matrix multiplication, eigendecomposition, and solving linear
equations. Linear algebra is essential for understanding algorithms like linear regression,
principal component analysis, and neural networks.

Probability and Statistics in Machine Learning: Probability and statistics are essential tools in
machine learning for dealing with uncertainty, making predictions, and drawing inferences from
data. Probability theory is used to model uncertainty, while statistical methods help in analyzing
and interpreting data. Concepts like probability distributions, statistical tests, confidence
intervals, and hypothesis testing are critical for understanding the behavior of models and
making informed decisions in machine learning.

Calculus in Machine Learning: Calculus is fundamental to machine learning, especially in the optimization of models. Gradient descent, a widely used optimization algorithm, relies on
concepts from calculus, such as derivatives and partial derivatives. Calculus helps in finding the
minimum or maximum points of a function, which is crucial for optimizing model parameters. It
is also used in understanding the rate of change in functions and the optimization of loss
functions in training machine learning models.

Optimization in Machine Learning: Optimization involves finding the best possible values for
the parameters of a model to achieve optimal performance. In machine learning, optimization
techniques are employed to minimize or maximize objective functions, often representing the
error or cost of a model. Gradient descent is a common optimization algorithm, and variations
like stochastic gradient descent are widely used for training models. Optimization is crucial for
fine-tuning models and improving their accuracy and efficiency.

In summary, a strong foundation in machine learning requires a deep understanding of linear algebra for data representation and transformations, probability and statistics for modeling
uncertainty, calculus for optimization, and optimization techniques for training and improving
machine learning models. Together, these concepts provide a comprehensive toolkit for
developing and understanding advanced machine learning algorithms.

Unit 2: Supervised Learning Algorithms
Supervised Learning: regression and classification problems, simple linear regression,
multiple linear regression, ridge regression, logistic regression, k-nearest neighbor, naive
Bayes classifier, linear discriminant analysis, support vector machine, decision trees, bias-
variance trade-off, cross-validation methods such as leave-one-out (LOO) cross-validation, k-
folds cross validation, multi-layer perceptron, feed-forward neural network
Case Study: Predicting House Prices: A Comparative Study of Supervised Learning
Algorithms.

Regression is a statistical process for estimating the relationships between a dependent variable (criterion) and one or more independent variables (predictors). Regression analysis explains changes in the criterion in relation to changes in selected predictors: it models the conditional expectation of the criterion, i.e., the average value of the dependent variable when the independent variables are varied. Three major uses for regression analysis are determining the strength of predictors, forecasting an effect, and trend forecasting.

Classification is a central topic in machine learning concerned with teaching machines how to group data by particular criteria. Because the computer groups data based on predetermined characteristics, classification is a form of supervised learning. Informally, classification is grouping things together according to similar features and attributes. When you go to a grocery store, you can fairly accurately group the foods by food group (grains, fruit, vegetables, meat, etc.); in machine learning, classification is about teaching computers to do the same.
Examples of classification include classifying images, speech tagging, and music identification.
A common example of classification is detecting spam emails. To write a program to filter out spam emails, a programmer can train a machine learning algorithm with a set of spam-like emails labelled as spam and regular emails labelled as not-spam. The idea is to build an algorithm that learns the characteristics of spam emails from this training set so that it can filter out spam when it encounters new emails.
Simple linear regression is used for predictive analysis. Linear regression is a linear approach for modelling the relationship between the criterion (the scalar response) and the predictors (explanatory variables). Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. With linear regression, there is a danger of overfitting. The formula for simple linear regression is: Y' = bX + A.

Ordinary Least Square (OLS) Method for Linear Regression
This section walks through the ordinary least squares (OLS) method for simple linear regression. If you are new to linear regression, it will help you understand, step by step, how simple linear regression works.

Simple linear regression is a model with a single regressor (independent variable) x that has a relationship with a response (dependent or target variable) y given by

y = β0 + β1x + ε    (1)

where
β0: intercept
β1: slope (unknown constant)
ε: random error component
This is a line where y is the dependent variable we want to predict, x is the independent variable, and β0 and β1 are the coefficients that we need to estimate.
Estimation of β0 and β1 :
The OLS method is used to estimate β0 and β1. The OLS method seeks to minimize the sum
of the squared residuals. This means from the given data we calculate the distance from each
data point to the regression line, square it, and the sum of all of the squared errors together.

From equation (1) we may write

yi = β0 + β1xi + εi,   i = 1, 2, ..., n    (2)

Equation (2) is a sample regression model, written in terms of the n pairs of data (yi, xi), i = 1, 2, ..., n.

Example of Ordinary Least Square Method:

Let's take a simple example. This table shows some data from a manufacturing company. Each row in the table shows the sales for a year and the amount spent on advertising that year. Here our target variable is sales, which we want to predict.
Linear regression estimates that Sales = β0 + β1x

Estimating the Slope ( β1):

1. Calculate the mean value of x and y

2. Calculate the error of each variable from the mean

3. Multiply the error for each x with the error for each y and calculate the sum of
these multiplications

4. Square the residual of each x value from the mean and sum of these
squared values

Now we have all the values to calculate the slope: β1 = 221014.5833 / 8698.694 = 25.41
Estimating the Intercept (β0):
β0 = mean(y) - β1 * mean(x)
We already know all the values needed:
β0 = 968.5 - (25.41 * 37.8333) = 7.239
Now we have the coefficients for our simple linear regression: y = 7.239 + 25.41x
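The same calculation can be scripted. A minimal NumPy sketch of OLS by hand (the data below is illustrative, not the table from the notes):

import numpy as np

# Illustrative advertising-spend (x) and sales (y) data
x = np.array([23.0, 26.0, 30.0, 34.0, 43.0, 48.0, 52.0, 57.0, 58.0])
y = np.array([651.0, 762.0, 856.0, 1063.0, 1190.0, 1298.0, 1421.0, 1440.0, 1518.0])

# OLS estimates: slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(f"y = {beta0:.3f} + {beta1:.3f} * x")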

Example of Simple Linear Regression:-
In this very simple example, we'll explore how to create a very simple fit line, the classic case
of y=mx+b. We'll go carefully through each step, so you can see what type of question a simple
fit line can answer.

Performance Evaluation
1. SSE: Sum of Squares Error
The error or residual is the difference between the actual value and the predicted
value. The sum of all errors can cancel out since it can contain negative signs and
give zero. So, we square all the errors and sum it up. The line which gives us the
least sum of squared errors is the best fit. The line of best fit always goes through x̅
and ȳ.
In Linear Regression, the line of best fit is calculated by minimizing the error (the
distance between data points and the line).
Sum of Squares Errors is also known as Residual error or Residual sum of squares

2. SSR: Sum of Squares due to Regression

SSR is also known as Regression Error or Explained Error.
It is the sum of the squared differences between the predicted values and the mean of the dependent variable ȳ.

3. SST: Sum of Squares Total

SST (Total Error) = Sum of Squares Error (SSE) + Sum of Squares due to Regression (SSR).
The total variability of the data set is equal to the variability explained by the regression line (SSR) plus the unexplained variability (SSE), known as error or residuals.

RMSE: Root Mean Squared Error
RMSE is a measure of how spread out the residuals are; in other words, it tells you how concentrated the data is around the line of best fit.
RMSE is calculated by taking the square root of MSE.
Interpretation of RMSE:
RMSE is interpreted as the standard deviation of the unexplained variance (MSE).
RMSE has the same units as the dependent variable.
Lower values of RMSE indicate a better fit.
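A minimal sketch computing MSE and RMSE (the values are illustrative):

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

mse = np.mean((y_true - y_pred) ** 2)  # mean of squared residuals
rmse = np.sqrt(mse)                    # same units as the dependent variable
print(rmse)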

Multiple Linear Regression:-


Multiple Linear Regression is an extension of Simple Linear regression as it takes
more than one predictor variable to predict the response variable.
Multiple Linear Regression is one of the important regression algorithms which
models the linear relationship between a single dependent continuous variable and
more than one independent variable.
It is a statistical model that utilizes two or more quantitative and qualitative
explanatory variables (x1,..., xp) to predict a quantitative dependent variable Y.
Caution: have at least two quantitative explanatory variables (rule of thumb)
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.

Assumptions for Multiple Linear Regression:

• A linear relationship should exist between the Target and predictor variables.
• The regression residuals must be normally distributed.
• Multiple Linear Regression assumes little or no multicollinearity (correlation between the independent variables) in the data.
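A minimal scikit-learn sketch of multiple linear regression (the engine-size/cylinder data is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: [engine size, number of cylinders] -> CO2 emission
X = np.array([[2.0, 4], [2.4, 4], [1.5, 4], [3.5, 6], [3.5, 6], [3.7, 6], [3.7, 6]])
y = np.array([196.0, 221.0, 136.0, 255.0, 244.0, 230.0, 232.0])

model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("Prediction for [2.5, 4]:", model.predict([[2.5, 4]]))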

Polynomial Regression is a regression algorithm that models the relationship between a
dependent(y) and independent variable(x) as nth degree polynomial.

y = b0 + b1x1 + b2x1² + b3x1³ + ... + bnx1ⁿ

▪ It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.
▪ It is a linear model with some modification in order to increase the accuracy.
▪ The dataset used in Polynomial regression for training is of non-linear nature.
▪ It makes use of a linear regression model to fit the complicated and non-linear functions
and datasets.
▪ Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."

Need for Polynomial Regression:-

If we apply a linear model to a linear dataset, it gives a good result, as we have seen in Simple Linear Regression. But if we apply the same model, without any modification, to a non-linear dataset, the output is drastically worse: the loss function increases, the error rate is high, and accuracy decreases.

So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model.
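A minimal scikit-learn sketch: degree-2 polynomial features feeding a linear model (the noisy quadratic data is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative non-linear data: y is roughly 0.5x^2 + x + 2 plus noise
rng = np.random.default_rng(seed=0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 2 + x[:, 0] + 2 + rng.normal(scale=0.3, size=50)

# Polynomial features convert the linear model into polynomial regression
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))  # close to 0.5*4 + 2 + 2 = 6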

Logistic Regression:-
Logistic Regression is one of the supervised machine learning algorithms used for classification. In logistic regression, the dependent variable is categorical.
The objective of the model is: given the independent variables, what is the class likely to be? [For binary classification, 0 or 1]

In logistic regression for binary classification, we will predict the output as 0 or 1.


Example:

1. Diabetic (1) or not (0)


2. Spam (1) or Ham (0)
3. Malignant(1) or not (0)

In linear regression, the output prediction is continuous, so if we fit a linear model it won't restrict the predicted output to between 0 and 1. So, we have to transform the linear model into an S-curve using the sigmoid function, which converts the input to a value between 0 and 1.
Sigmoid Function:-
The sigmoid function is used to convert the input into the range (0, 1).

1. If z → -∞, sigmoid(z) → 0
2. If z → ∞ , sigmoid(z) → 1
3. If z=0, sigmoid(z)=0.5
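A minimal sketch of the sigmoid function, checking the three properties above:

import numpy as np

def sigmoid(z):
    # Maps any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0, 0.5, ~1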

Sigmoid Curve

So, if we feed the linear model into the sigmoid function, it will convert its output into the range 0 to 1.
In linear regression, the predicted value of y is calculated as ŷ = β0 + β1x.
In logistic regression, ŷ is p(y=1|x); that is, ŷ provides an estimate of the probability that y=1 given a particular set of values for the independent variables x. If the predicted value is close to 1, we can be more certain that the data point belongs to class 1. If the predicted value is close to 0, we can be more certain that the data point belongs to class 0.
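A minimal scikit-learn sketch of binary logistic regression (the breast-cancer dataset here is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # estimated P(y=1 | x)
print(probs[:5])
print("Test accuracy:", model.score(X_test, y_test))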
How to determine the best fit sigmoid curve?
Cost Function
Why not least squares as cost function?
In logistic regression, the actual y value will be 0 or 1. The predicted y value ŷ will
be between 0 and 1.
In the least-squares method, the error is calculated by subtracting actual y and
predicted y value and squaring them
Error =(y-ŷ)²
If we calculate the squared error for a misclassified data point, say y=0 with ŷ close to 1, the error is at most 1.
The cost incurred is very small even for misclassified data points. This is one of the reasons least squares is not used as a cost function for logistic regression.

Cost Function: Log Loss (Binary Cross Entropy)

Log loss, or binary cross entropy, is used as the cost function for logistic regression.

Let's check some properties the classification cost function should satisfy:

1. If y = ŷ, the error should be zero.
2. The error should be very high for misclassification.
3. The error should be greater than or equal to zero.

Let’s check whether these properties hold good for the log loss or binary cross-
entropy function.

1. If y=ŷ, the error should be zero.

Case 1: y=0 and ŷ=0 or close to 0

Case 2: y=1 and ŷ=1 or close to 1.

Recall that ln 1 = 0 and ln 0 = -∞; in both cases above, the loss is therefore zero or near zero.


2. The error should be very high for misclassification

Case 1: y=1 and ŷ=0 or close to 0

Case 2: y=0 and ŷ=1 or close to 1

14
The error tends to be very high for misclassified data points.
3. The error should be greater than or equal to zero.

Error = -{y ln ŷ + (1-y) ln (1-ŷ)}

→ y is either 0 or 1
→ ŷ is always between 0 and 1
→ ln ŷ is negative and ln (1-ŷ) is negative
→ the negative sign before the expression is included to make the error positive
[In the linear regression least-squares method, we square the error instead]
So, the error will always be greater than or equal to zero.
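A minimal sketch of the log-loss calculation for a single prediction (the values are illustrative):

import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    # Binary cross entropy; clipping avoids ln(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(log_loss(1, 0.99))  # near 0: correct, confident prediction
print(log_loss(1, 0.01))  # large: confident misclassification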

Interpreting Model coefficient

To interpret the model coefficient, we need to know the terms odds, log odds, odds
ratio.

Odds, Log Odds, Odds Ratio

Odds
Odds is defined as the probability of an event occurring divided by the probability
of the event not occurring.

Example: odds of getting a 1 while rolling a fair die: P(1) = 1/6, so odds = (1/6) / (5/6) = 1/5.

Log odds (Logit Function)
Log odds = ln(p/(1-p))
After applying the sigmoid function, we know that p = 1 / (1 + e^-(β0 + β1x)).
From this equation, the odds can be written as p/(1-p) = e^(β0 + β1x), so

Log odds = ln(p/(1-p)) = β0 + β1x

So, we can convert logistic regression into a linear function by using log odds.

Odds Ratio
Odds Ratio is the ratio of two odds

Interpreting Logistic Regression Coefficient

Logistic Regression model


β0 → Log odds is β0 when X is zero.
β1 → Change in log-odds associated with
variable X1.
If X1 is a numerical variable, β1 indicates that, for every one-unit increase in X1, the log odds increase by β1.
If X1 is a binary categorical variable, β1 indicates the change in log odds for X1=1 relative to X1=0.
How to get Odds Ratio from the model coefficient?

Odds Ratio in Logistic Regression
The odds ratio of an independent variable in logistic regression measures how the odds change with a one-unit increase in that particular variable, keeping all the other independent variables constant.
Since β1 is the change in log-odds associated with variable X1, the odds ratio for variable X1 is the exponential of β1: OR = e^β1.

Ridge Regression:-
Ridge regression is a model tuning method used to analyse data that suffers from multicollinearity. This method performs L2 regularization. When multicollinearity occurs, least-squares estimates are unbiased but their variances are large, which results in predicted values being far away from the actual values.

The cost function for ridge regression:

min ||Y - Xθ||² + λ||θ||²

Lambda is the penalty term. The λ given here is denoted by the alpha parameter in the ridge function, so by changing the value of alpha we control the penalty term. The higher the value of alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.
• It shrinks the parameters. Therefore, it is used to prevent multicollinearity
• It reduces the model complexity by coefficient shrinkage
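A minimal scikit-learn sketch showing the shrinkage effect of alpha (the correlated data is illustrative):

import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data with two highly correlated features (multicollinearity)
rng = np.random.default_rng(seed=0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=100)])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

# alpha plays the role of lambda: the larger alpha is, the stronger the shrinkage
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)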

Ridge Regression Models:

For any type of regression machine learning model, the usual regression equation forms the
base which is written as:

Y = XB + e

Where Y is the dependent variable, X represents the independent variables, B is the regression coefficients to be estimated, and e represents the errors, i.e. the residuals.

Once we add the lambda penalty to this equation, variance that is not evaluated by the general model is taken into account. After the data is prepared and identified as suitable for L2 regularization, there are steps one can undertake.

Standardization:

In ridge regression, the first step is to standardize the variables (both dependent and
independent) by subtracting their means and dividing by their standard deviations. This causes
a challenge in notation since we must somehow indicate whether the variables in a particular
formula are standardized or not. As far as standardization is concerned, all ridge regression
calculations are based on standardized variables. When the final regression coefficients are
displayed, they are adjusted back into their original scale.
However, the ridge trace is on a standardized scale.

Bias and variance trade-off:

The bias-variance trade-off is generally complicated when it comes to building ridge regression models on an actual dataset. However, the general trend to remember is:

1. The bias increases as λ increases.


2. The variance decreases as λ increases.

Assumptions of Ridge Regressions:

The assumptions of ridge regression are the same as that of linear regression: linearity, constant
variance, and independence. However, as ridge regression does not provide confidence limits,
the distribution of errors to be normal need not be assumed.
Example:
We shall consider a data set on Food restaurants trying to find the best combination of food
items to improve their sales in a particular region.
K-Nearest Neighbor (KNN):-

• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data and available cases
and put the new case into the category that is most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on its similarity to that data. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
• Example: Suppose, we have an image of a creature that looks similar to cat and dog,
but we want to know either it is a cat or dog. So for this identification, we can use the
KNN algorithm, as it works on a similarity measure. Our KNN model will find the
similar features of the new data set to the cats and dogs images and based on the most
similar features it will put it in either cat or dog category.

Use of K-NN Algorithm:-


Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1, so this data point will lie in which of these categories. To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class
of a particular dataset.

Consider the below diagram:

K-NN Algorithm:

• Step-1: Select the number K of the neighbors


• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these k neighbors, count the number of the data points in each category.
• Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
• Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

• Firstly, we will choose the number of neighbors, so we will choose the k=5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:

• By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

• As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.

How to select the value of K in the K-NN Algorithm:

Below are some points to remember while selecting the value of K in the K-NN algorithm:

• There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy and subject to the effects of outliers in the model.
• Large values for K are generally good, but they may run into some difficulties.

Advantages of KNN Algorithm:

• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

• The value of K always needs to be determined, which may be complex at times.
• The computation cost is high because of calculating the distance between the data points
for all the training samples.

Naïve Bayes Classifier: -


• Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training dataset.
• Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.

The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be
described as:

• Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Examples of Naïve Bayes Algorithm: -


1. Spam filtration
2. Sentiment analysis
3. Classifying articles

Working of Naïve Bayes' Classifier can be understood with the help of the
below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71

So, P(Yes|Sunny) = 0.3 * 0.71 / 0.35 ≈ 0.61

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35

So, P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
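A small sketch reproducing the hand calculation above (the probabilities come from the frequency tables in the example):

# Values taken from the frequency/likelihood tables above
p_sunny_given_yes, p_yes = 0.3, 0.71
p_sunny_given_no, p_no = 0.5, 0.29
p_sunny = 0.35

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))  # 0.61 0.41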

Advantages of Naïve Bayes Classifier:

• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naïve Bayes Classifier:

• It is used for Credit Scoring.


• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

• Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as sports, politics, or education. The classifier uses the frequency of words as the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.

Linear Discriminant Analysis (LDA): -


Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for classification
tasks in machine learning. It is a technique used to find a linear combination of features that
best separates the classes in a dataset. And it is one of the commonly used dimensionality
reduction techniques in machine learning to solve more than two-class classification problems.
It is also known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis
(DFA).
LDA can be used to project the features of higher dimensional space into lower-dimensional
space in order to reduce resources and dimensional costs. It is also considered a pre-processing
step for modelling differences in ML and applications of pattern classification.
LDA works by projecting the data onto a lower-dimensional space that maximizes the
separation between the classes. It does this by finding a set of linear discriminants that
maximize the ratio of between-class variance to within-class variance. In other words, it finds
the directions in the feature space that best separate the different classes of data.
LDA assumes that the data has a Gaussian distribution and that the covariance matrices of the
different classes are equal. It also assumes that the data is linearly separable, meaning that a
linear decision boundary can accurately classify the different classes.

For example, we have two classes and we need to separate them efficiently. Classes can have
multiple features. Using only a single feature to classify them may result in some overlapping
as shown in the below figure. So, we will keep on increasing the number of features for proper
classification.

Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane,
there’s no straight line that can separate the two classes of the data points completely. Hence,
in this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a
1D graph in order to maximize the separability between the two classes.

Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories and
hence, reducing the 2D graph into a 1D graph.

Two criteria are used by LDA to create a new axis:

1. Maximize the distance between means of the two classes.


2. Minimize the variation within each class.

In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D
graph such that it maximizes the distance between the means of the two classes and minimizes
the variation within each class. In simple terms, this newly generated axis increases the
separation between the data points of the two classes. After generating this new axis using the
above-mentioned criteria, all the data points of the classes are plotted on this new axis and are
shown in the figure given below.
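A minimal scikit-learn sketch of LDA used both as a classifier and to project onto discriminant axes (the wine dataset here is illustrative):

from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

# Project the 13-feature data onto 2 discriminant axes, then classify
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)
print(X_projected.shape)            # (178, 2)
print("Training accuracy:", lda.score(X, y))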

Uses or Advantages of LDA: -

• Logistic Regression is one of the most popular classification algorithms that perform
well for binary classification but falls short in the case of multiple classification
problems with well-separated classes. At the same time, LDA handles these quite
efficiently.
• LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
• LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.

Drawback of Linear Discriminant Analysis (LDA):-

But LDA also fails in cases where the means of the class distributions are shared. In such cases,
LDA cannot create a new axis that makes the classes linearly separable.

To overcome such problems, we use non-linear Discriminant analysis in machine learning.

Extensions to LDA:

1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are
used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of different
variables on LDA.

Some of the common Real-world Applications of Linear discriminant Analysis are given
below:

• Face Recognition
Face recognition is a popular application of computer vision, where each face is
represented as a combination of a large number of pixel values. In this case, LDA is used
to reduce the number of features to a manageable number before the
classification process. It generates a new template in which each dimension consists of
a linear combination of pixel values. If the linear combinations are generated using Fisher's
linear discriminant, the result is called a Fisher's face.

• Medical
In the medical field, LDA is widely used to classify a patient's disease as mild, moderate,
or severe on the basis of various health parameters and the treatment in progress. This
classification helps doctors decide whether to increase or decrease the pace of the
treatment.
• Customer Identification
LDA is also applied in customer identification: it helps identify and select the features
that characterize the group of customers most likely to purchase a specific product,
for example in a shopping mall.
• For Predictions
LDA can also be used for making predictions and hence for decision making. For example,
"will you buy this product?" yields a prediction for one of two possible classes:
buying or not buying.
• In Learning
Nowadays, robots are being trained to learn and talk so as to simulate human work,
which can also be treated as a classification problem. In this case, LDA builds similar
groups on the basis of different parameters such as pitch, frequency, sound, and
tune.

Support Vector Machine (SVM):-


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some dog-like features, and we want a model that
can accurately identify whether it is a cat or a dog. Such a model can be created using the
SVM algorithm. We first train our model with many images of cats and dogs so that it
can learn their different features, and then we test it on this strange creature.
The SVM creates a decision boundary between the two classes (cat and dog) and
chooses the extreme cases (support vectors) of each class. On the basis
of the support vectors, it will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary for classifying the
data points. This best boundary is known as the hyperplane of SVM.

The dimension of the hyperplane depends on the number of features in the dataset: if there
are 2 features (as shown in the image), the hyperplane is a straight line, and if there
are 3 features, the hyperplane is a two-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum
distance between the hyperplane and the nearest data points of each class.

Support Vectors

The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.

Types of SVM

SVM can be of two types:

• Linear SVM: Linear SVM is used for linearly separable data: if a dataset
can be classified into two classes using a single straight line, it is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a
dataset cannot be classified using a straight line, it is termed
non-linear data, and the classifier used is called the Non-linear SVM classifier.

SVM working process:

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair (x1, x2) of coordinates in either green or blue.
Consider the below image:

As this is a 2-D space, just by using a straight line we can easily separate these two classes.
But there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from
both the classes. These points are called support vectors. The distance between the vectors and
the hyperplane is called as margin. And the goal of SVM is to maximize this margin. The
hyperplane with maximum margin is called the optimal hyperplane.
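A minimal sketch of a linear SVM, assuming a linearly separable two-class toy dataset (make_blobs is an illustrative stand-in for the green/blue points in the figures):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1.0)  # linear kernel -> straight-line hyperplane
clf.fit(X, y)

print(clf.support_vectors_)        # the extreme points that define the margin
print(clf.predict([[3.0, -4.0]]))  # classify a new data point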

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So, to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as: z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-D space, the hyperplane looks like a plane parallel to the x-axis. If we convert
it back to 2-D space with z = 1, then it becomes:

Hence, we get a circumference of radius 1 in case of non-linear data.
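A minimal sketch of this idea, assuming concentric-circle toy data (make_circles): adding the explicit feature z = x² + y² makes the classes linearly separable, and an RBF kernel achieves a comparable lifting implicitly:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit third dimension z = x^2 + y^2: a plane can now split the classes
z = (X ** 2).sum(axis=1)
X3d = np.column_stack([X, z])
print(SVC(kernel="linear").fit(X3d, y).score(X3d, y))  # close to 1.0

# The kernel trick performs a similar mapping implicitly
print(SVC(kernel="rbf").fit(X, y).score(X, y))         # close to 1.0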

Decision Tree: -
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.

• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
• A decision tree can contain categorical data (YES/NO) as well as numeric data.
• Below diagram explains the general structure of a decision tree:

Use of Decision Trees:-

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:

• Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
• The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terms: -

• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that splits into sub-nodes is called the parent node, and the
sub-nodes are its child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes
and move further. It continues the process until it reaches the leaf node of the tree. The complete
process can be better understood using the below algorithm:

• Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
• Step-3: Divide the S into subsets that contains possible values for the best attributes.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; these final nodes are called leaf nodes.
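A minimal sketch of these steps using scikit-learn's CART implementation (the iris dataset and depth limit are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; "gini" is the default
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree))                        # the learned decision rules
print("accuracy:", tree.score(X_test, y_test))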

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider the
below diagram:

Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems, there is a technique
called the Attribute Selection Measure (ASM). With this measure, we can easily select the
best attribute for the nodes of the tree. There are two popular ASM techniques:

• Information Gain
• Gini Index

1. Information Gain:

• Information gain is the measurement of changes in entropy after the segmentation of a


dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision
tree.
• A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:

Information Gain= Entropy(S)- [(Weighted Avg) *Entropy (each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes)·log₂ P(yes) − P(no)·log₂ P(no)

Where,

• S= Total number of samples


• P(yes)= probability of yes
• P(no)= probability of no

2. Gini Index:

• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
• An attribute with the low Gini index should be preferred as compared to the high Gini
index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
• Gini index can be calculated using the below formula:

Gini Index = 1 − Σⱼ (Pⱼ)²
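A minimal sketch computing both measures for a binary node (the probabilities are made-up illustrative values):

import numpy as np

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no), with 0*log(0) := 0
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

def gini(p_yes, p_no):
    # Gini Index = 1 - sum_j P_j^2
    return 1 - (p_yes ** 2 + p_no ** 2)

print(entropy(0.5, 0.5), gini(0.5, 0.5))  # maximal impurity: 1.0 and 0.5
print(entropy(1.0, 0.0), gini(1.0, 0.0))  # pure node: 0.0 and 0.0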

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, and a too-small tree may not capture all the
important patterns of the dataset. Therefore, a technique that decreases the size of the learning
tree without reducing accuracy is known as Pruning. There are mainly two types of tree
pruning techniques used:

• Cost Complexity Pruning


• Reduced Error Pruning.

Advantages of the Decision Tree

• It is simple to understand, as it follows the same process a human follows when
making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

• The decision tree contains lots of layers, which makes it complex.


• It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
• For more class labels, the computational complexity of the decision tree may increase.

Bias-Variance trade-off: -

It is important to understand prediction errors (bias and variance) when it comes to accuracy in
any machine-learning algorithm. There is a trade-off between a model's ability to minimize
bias and its ability to minimize variance, and managing this trade-off (for example, when
selecting the value of a regularization constant) is key to building a good model. A proper
understanding of these errors helps to avoid overfitting and underfitting of a dataset while
training the algorithm.

Bias: The bias is the difference between the values predicted by the machine-learning model
and the correct values. High bias gives a large error on training as well as testing data.
It is recommended that an algorithm always be low-biased to avoid the problem of underfitting.
With high bias, the predictions take a straight-line form and thus do not fit the data in the
dataset accurately. Such fitting is known as Underfitting of Data. It happens when the
hypothesis is too simple or linear in nature, for example a straight-line hypothesis of the form
h(x) = θ₀ + θ₁·x. Consider the graph given below for an example of such a situation.

High Bias in the Model

Variance: The variability of model predictions for a given data point, which tells us the spread
of our data, is called the variance of the model. A model with high variance fits the training
data with a very complex curve and is therefore not able to fit accurately on data it has not
seen before. As a result, such models perform very well on training data but have high error
rates on test data. When a model has high variance, it is said to be Overfitting the Data.
Overfitting means fitting the training set accurately via a complex curve and high-order
hypothesis, but it is not a good solution because the error on unseen data is high. While
training a model, variance should be kept low.

The high variance data looks as follows:

High Variance in the Model

Bias Variance Trade-off

If the algorithm is too simple (a hypothesis with a linear equation), it may be in a high-bias,
low-variance condition and thus be error-prone. If the algorithm fits too complex a model (a
hypothesis with a high-degree equation), it may be in a high-variance, low-bias condition; in
the latter case, new entries will not perform well. There is a sweet spot between these two
conditions, known as the Trade-off or Bias-Variance Trade-off. An algorithm cannot be more
complex and less complex at the same time, which is why there is a trade-off between bias and
variance. For the graph, the perfect trade-off looks like this:

We try to optimize the value of the total error for the model by using the Bias-Variance
Tradeoff.

Total Error = Bias² + Variance + Irreducible Error

The best fit is given by the hypothesis at the trade-off point. The error-versus-complexity
graph showing this trade-off is given below:

Region for the Least Value of Total Error

This is referred to as the best point chosen for the training of the algorithm which gives low
error in training as well as testing data.

The train-test split method works by partitioning the dataset once into a training set and a test set.

Introduction to Cross-Validation

However, the train-test split method has certain limitations. When the dataset is small, the
method is prone to high variance. Due to the random partition, the results can be entirely
different for different test sets. Why? Because in some partitions, samples that are easy to
classify get into the test set, while in others, the test set receives the 'difficult' ones.

To deal with this issue, we use cross-validation to evaluate the performance of a machine
learning model. In cross-validation, we don’t divide the dataset into training and test sets only
once. Instead, we repeatedly partition the dataset into smaller groups and then average the
performance in each group. That way, we reduce the impact of partition randomness on the
results.

Many cross-validation techniques define different ways to divide the dataset at hand. We'll
discuss the two most frequently used: the k-fold and the leave-one-out methods.

1. Leave-One-Out (LOO) cross-validation: In leave-one-out (LOO) cross-validation, we
train our machine-learning model n times, where n is our dataset's size. Each time, only one
sample is used as the test set while the rest are used to train our model.
LOO is the extreme case of k-fold where k = n. If we apply LOO to an example dataset
S = {x1, x2, x3, x4, x5, x6}, we'll have 6 test subsets:
S1={x1}, S2={x2}, S3={x3}, S4={x4}, S5={x5}, S6={x6}
Iterating over them, we use S∖Si as the training data in iteration i = 1, 2, ..., 6 and evaluate
the model on Si.

The final performance estimate is the average of the six individual scores:

Overall Score = (Score1 + Score2 + Score3 + Score4 + Score5 + Score6) / 6

2. K-Fold Cross-Validation

In k-fold cross-validation, we first divide our dataset into k equally sized subsets. Then, we
repeat the train-test method k times such that each time one of the k subsets is used as a
test set and the rest k-1 subsets are used together as a training set. Finally, we compute the
estimate of the model’s performance estimate by averaging the scores over the k trials.

For example, let’s suppose that we have a dataset S={ x1, x2, x3, x4, x5, x6}containing 6 samples
and that we want to perform a 3-fold cross-validation.

First, we divide S into 3 subsets randomly.

S1={x1, x2 }, S2={x3, x4}, S3={x5, x6}

Then, we train and evaluate our machine-learning model 3 times. Each time, two subsets form
the training set, while the remaining one acts as the test set. In our example:

Finally, the overall performance is the average of the model’s performance scores on those
three test sets:

Overall Score = (Score1 + Score2 + Score3) / 3
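A minimal sketch of both schemes with scikit-learn (the dataset and model are illustrative choices; cv=KFold(3) mirrors the 3-fold example above and cv=LeaveOneOut() the LOO scheme):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 3-fold: three models, each tested on one third of the data
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=3, shuffle=True, random_state=0))
print("3-fold overall score:", scores.mean())

# LOO: n models, each tested on a single held-out sample
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOO overall score:", scores.mean())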

Comparison of LOO & K-Fold:


An important factor when choosing between the k-fold and the LOO cross-validation methods
is the size of the dataset. When the size is small, LOO is more appropriate since it will use
more training samples in each iteration. That will enable our model to learn better
representations.
Conversely, we use k-fold cross-validation to train a model on a large dataset since LOO
trains n models, one per sample in the data. When our dataset contains a lot of samples, training
so many models will take too long. So, the k-fold cross-validation is more appropriate. Also,

in a large dataset, it is sufficient to use less than n folds since the test folds are large enough for
the estimates to be sufficiently precise.

Multi-Layer Perceptron Neural Network:

A multi-layer perceptron is also known as an MLP. It consists of fully connected dense layers,
which transform any input dimension into the desired dimension. A multi-layer perceptron is a
neural network that has multiple layers. To create a neural network, we combine neurons together
so that the outputs of some neurons are inputs of other neurons.

A multi-layer perceptron has one input layer and for each input, there is one neuron (or node),
it has one output layer with a single node for each output and it can have any number of hidden
layers and each hidden layer can have any number of nodes. A multilayer perceptron (MLP)
Neural network belongs to the feedforward neural network. It is an Artificial Neural Network
in which all nodes are interconnected with nodes of different layers.

The word Perceptron was first defined by Frank Rosenblatt in his perceptron program.
Perceptron is a basic unit of an artificial neural network that defines the artificial neuron in the
neural network. It is a supervised learning algorithm that contains nodes’ values, activation
functions, inputs, and node weights to calculate the output.

The Multilayer Perceptron (MLP) Neural Network works only in the forward direction. All
nodes are fully connected to the network. Each node passes its value to the coming node only
in the forward direction. The MLP neural network uses a Backpropagation algorithm to
increase the accuracy of the training model.

Structure of MultiLayer Perceptron Neural Network


This network has three main layers that combine to form a complete Artificial Neural Network.
These layers are as follows:
Input Layer
It is the initial or starting layer of the Multilayer perceptron. It takes input from the training
data set and forwards it to the hidden layer. There are n input nodes in the input layer. The

number of input nodes depends on the number of dataset features. Each input vector variable
is distributed to each of the nodes of the hidden layer.

Hidden Layer

It is the heart of all Artificial neural networks. This layer comprises all computations of the
neural network. The edges of the hidden layer have weights multiplied by the node values. This
layer uses the activation function.

There can be one or more hidden layers in the model. The number of hidden-layer nodes should
be chosen carefully: too few nodes make the model unable to work efficiently with
complex data, while too many nodes will result in an overfitting problem.

Output Layer

This layer gives the estimated output of the neural network. The number of nodes in the output
layer depends on the type of problem: for a single target variable, use one node; for an
N-class classification problem, the ANN uses N nodes in the output layer.

Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid
activation function takes real values as input and converts them to numbers between 0 and 1
using the sigmoid formula σ(x) = 1 / (1 + e^(−x)).

Steps of MultiLayer Perceptron Neural Network

• The input node represents the feature of the dataset.


• Each input node passes the vector input value to the hidden layer.
• In the hidden layer, each edge has a weight that is multiplied by the corresponding input
value. All the products arriving at a hidden node are summed together to generate its output.
• The activation function is used in the hidden layer to identify the active nodes.
• The output is passed to the output layer.
• Calculate the difference between predicted and actual output at the output layer.
• The model uses backpropagation after calculating the predicted output.
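A minimal sketch of these steps with scikit-learn's MLPClassifier (the layer size, activation, and dataset are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 10 nodes; activation="logistic" is the sigmoid described
# above, and fit() updates the weights via backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))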

Advantages of MultiLayer Perceptron Neural Network

1. MultiLayer Perceptron Neural Networks can easily work with non-linear problems.
2. It can handle complex problems while dealing with large datasets.
3. Developers use this model to deal with the fitting problems of neural networks.
4. It has a higher accuracy rate and reduces prediction error by using backpropagation.
5. After training the model, the Multilayer Perceptron Neural Network quickly predicts
the output.

Disadvantages of MultiLayer Perceptron Neural Network

1. This Neural Network consists of large computation, which sometimes increases the
overall cost of the model.
2. The model will perform well only when it is trained perfectly.
3. Due to this model’s tight connections, the number of parameters and node redundancy
increases.

Feed-Forward Neural Network:

A Feed-Forward Neural Network is an artificial neural network in which the connections between
nodes do not form a cycle. A recurrent neural network, in which some routes are cycled, is its
polar opposite. The feed-forward model is the basic type of
neural network because the input is only processed in one direction: the data always flows in
one direction and never backwards.

The simplest feed-forward neural network is a single-layer perceptron. In this model, a sequence
of inputs enters the layer and is multiplied by the weights. The weighted input values are then
summed together to form a total. If the sum of the values is above a predetermined
threshold, which is normally set at zero, the output value is usually 1, and if the sum is below
the threshold, the output value is usually -1. The single-layer perceptron is a popular
feed-forward neural network model that is frequently used for classification.

The neural network can compare the outputs of its nodes with the desired values using a
rule known as the delta rule, allowing the network to adjust its weights through training to
produce more accurate output values. This training and learning procedure amounts to gradient
descent. The technique for updating weights in multi-layer perceptrons is virtually the same;
however, the process is referred to as back-propagation. In such circumstances, the output
values produced by the final layer are used to adjust the weights of each hidden layer inside the network.
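A minimal NumPy sketch of this single-layer model: a weighted sum thresholded at zero gives a +1/−1 output, and a delta-rule style update nudges the weights toward the desired values (the toy data and learning rate are purely illustrative):

import numpy as np

def forward(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else -1   # threshold the weighted sum at zero

def delta_update(x, y_true, w, b, eta=0.1):
    y_pred = forward(x, w, b)
    w = w + eta * (y_true - y_pred) * x        # shift weights toward the target
    b = b + eta * (y_true - y_pred)
    return w, b

w, b = np.zeros(2), 0.0
for _ in range(10):                            # a few passes over two toy points
    w, b = delta_update(np.array([1.0, 1.0]), 1, w, b)
    w, b = delta_update(np.array([-1.0, -1.0]), -1, w, b)
print(forward(np.array([2.0, 2.0]), w, b))     # -> 1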

Unit 3: Unsupervised Learning Algorithms
Unsupervised Learning: clustering algorithms, k-means/k-medoid, hierarchical clustering, top-
down, bottom-up: single-linkage, multiple linkage, dimensionality reduction, principal
component analysis.
Case study: Customer Segmentation for E-commerce: An Unsupervised Learning

Contents
Unsupervised learning
Clustering algorithms
K-Means algorithm
Hierarchical clustering
Top-Down (Divisive) and Bottom-Up (Agglomerative) Hierarchical Clustering
Single-Linkage and Multiple-Linkage
Dimensionality reduction
Case Study: Customer Segmentation for E-commerce using Unsupervised Learning

Unsupervised learning
Unsupervised learning is a type of machine learning where the algorithm is given data without
explicit instructions on what to do with it. The system tries to learn the patterns and relationships
within the data on its own. The two main types of unsupervised learning are clustering and
dimensionality reduction, though related families of techniques, such as association rule
learning, generative models, and autoencoders, are often grouped with them.
1. Clustering:
 K-Means Clustering: This algorithm partitions data into 'k' clusters based on
similarity. For example, in customer segmentation, you can use K-means to group
customers with similar purchasing behavior.
 Hierarchical Clustering: It creates a tree of clusters, where the root is a single
cluster containing all data points, and the leaves are individual data points. This
can be used in taxonomy or gene expression analysis.
2. Dimensionality Reduction:
 Principal Component Analysis (PCA): PCA is used to reduce the number of
features in a dataset while retaining its essential information. It's often applied in
image compression or feature extraction for machine learning models.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is used for
visualizing high-dimensional data in two or three dimensions. It's useful for
exploring the structure of data and finding patterns. For instance, visualizing
similarities between words in natural language processing.
3. Association Rule Learning:
 Apriori Algorithm: This algorithm is used to discover associations between
different items in a dataset. For example, in retail, it can identify relationships like
"Customers who buy product A are likely to buy product B."

4. Generative Models:
 Generative Adversarial Networks (GANs): GANs consist of a generator and a
discriminator that are trained together. GANs can generate new data instances that
resemble the training data. They are used in image and video generation tasks.
5. Autoencoders:
 Variational Autoencoders (VAE): VAEs are a type of autoencoder that learns a
probabilistic mapping between the data space and a latent space. They are used
for generating new data points and can be applied to image and text generation.
Unsupervised learning is particularly valuable when you have a large amount of unlabeled data
and want to explore the underlying structure or relationships within it. It is widely used in
various domains, including pattern recognition, anomaly detection, and feature learning.

Clustering algorithms
Clustering algorithms in unsupervised learning aim to group similar data points together into
clusters or segments based on certain criteria. The goal is to discover hidden patterns or
structures within the data. There are various clustering algorithms, each with its own approach
and characteristics. Here are a few commonly used clustering algorithms:
1. K-Means Clustering:
 Objective: Partition the data into 'k' clusters based on similarity.
 Process: It starts by randomly selecting 'k' centroids (cluster centers) and assigns
each data point to the nearest centroid. Then, it recalculates the centroids based on
the mean of the data points in each cluster. This process iterates until
convergence.
 Example: Customer segmentation in marketing based on purchasing behavior.
2. Hierarchical Clustering:
 Objective: Create a tree-like structure (dendrogram) of clusters.
 Process: It begins with each data point as a separate cluster and merges the
closest clusters iteratively until all points belong to a single cluster. The resulting
dendrogram can be cut at different levels to obtain clusters of varying sizes.
 Example: Taxonomy in biology or organizational hierarchy.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
 Objective: Identify clusters based on the density of data points.
 Process: It defines clusters as dense regions separated by areas of lower point
density. It classifies points as core points, border points, or outliers (noise) based
on their density and proximity to other points.
 Example: Anomaly detection in network traffic where unusual patterns represent
potential security threats.
4. Mean Shift:
 Objective: Discover modes or peaks of high-density regions.
 Process: It iteratively shifts the center of a kernel until it converges to a high-
density region. The algorithm is adaptive and does not require specifying the
number of clusters beforehand.
 Example: Image segmentation where pixels with similar colors form clusters.

5. Agglomerative Clustering:
 Objective: Build clusters by successively merging or agglomerating data points.
 Process: It starts with each data point as a singleton cluster and merges the closest
pairs iteratively until a stopping criterion is met. The result is a dendrogram that
can be cut to form clusters.
 Example: Social network analysis to identify communities within a network.
6. Gaussian Mixture Model (GMM):
 Objective: Model the data as a mixture of several Gaussian distributions.
 Process: It assumes that the data is generated by a mixture of several Gaussian
distributions. The algorithm estimates the parameters of these distributions,
including means and covariances, to identify clusters.
 Example: Speech and handwriting recognition where multiple patterns contribute
to the observed data.

K-Means algorithm
The K-Means algorithm is a popular unsupervised machine learning algorithm used for
clustering data. The goal of K-Means is to partition a dataset into 'k' clusters, where each data
point belongs to the cluster with the nearest mean. Here's a detailed explanation of the K-Means
algorithm:
Key Concepts:
 Unsupervised Learning: It works with unlabeled data, finding patterns on its own.
 Clustering: It groups similar data points together into distinct clusters.
 Centroids: Each cluster has a central point (centroid), representing the "average" of data
points within it.
 Distance Metric: It measures how close data points are to each other, usually using
Euclidean distance.
Algorithm Steps:
Step 1: Initialization
 Input: Dataset with 'n' data points and the desired number of clusters 'k'.
 Process:
 Randomly select 'k' data points from the dataset as initial centroids.
Step 2: Assignment
 Input: Initial centroids.
 Process:
 For each data point in the dataset, calculate the Euclidean distance to each
centroid.
 Assign the data point to the cluster associated with the nearest centroid.

Step 3: Update
 Input: Assigned clusters.
 Process:
 Recalculate the centroids of each cluster by computing the mean of all data points
in that cluster.

Step 4: Convergence Check


 Input: Updated centroids.
 Process:
 Repeat the assignment and update steps iteratively until convergence.
 Convergence occurs when the centroids no longer change significantly or after a
fixed number of iterations.
Step 5: Result
 Output: Final clusters.
 Process:
 Once convergence is reached, the algorithm stops, and each data point is assigned
to a specific cluster.

Pseudo code:
1. Randomly initialize k centroids: c_1, c_2, ..., c_k
2. Repeat until convergence:
a. For each data point x_i, assign it to the cluster with the closest centroid:
j = argmin_k ||x_i - c_k||^2
b. Update centroids: c_k = (1/|cluster k|) * Σ x_i for all i in cluster k
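A minimal NumPy rendering of this pseudocode (random toy data and fixed k=3; the simple initialization and convergence test are illustrative, and the sketch assumes no cluster empties out):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
k = 3

centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1
for _ in range(100):                                        # step 2
    # 2a: assign each point to the cluster with the closest centroid
    dists = ((X[:, None, :] - centroids) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # 2b: recompute each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):               # convergence check
        break
    centroids = new_centroids
print(centroids)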

Program 1: Implement a program to cluster a dataset using K-means clustering.

Example 1: iris flowers
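The original code listing is not reproduced here; a minimal scikit-learn sketch consistent with this example might look as follows:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)    # labels are ignored: unsupervised setting

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])                   # cluster assignments of the first samples
print(kmeans.cluster_centers_)       # the three learned centroids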

Output:

Example 2: Pima Indians Diabetes Database

Output:

Example 3: Breast Cancer Wisconsin (Diagnostic) dataset

Output:

Program 2: Implement a program to calculate the elbow method for K-means clustering.
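The original listing is not shown; a hedged sketch of the elbow method plots the inertia (within-cluster sum of squares) against k and looks for the "elbow" where further clusters stop helping:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")    # the bend in the curve suggests a good k
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster SSE)")
plt.show()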

Output:

Program 3: Implement a program to calculate the silhouette coefficient for K-means
clustering.
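The original listing is not shown; a hedged sketch using scikit-learn's silhouette_score, which ranges from −1 to 1 and rewards points that sit well inside their own cluster:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

for k in range(2, 7):    # the silhouette is defined for at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")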

Output:

Example:
Imagine a dataset of customer spending habits. K-means could group customers into clusters
based on similar spending patterns, helping businesses tailor marketing strategies.
Key Points:
 Simple and Efficient: K-means is popular due to its simplicity and efficiency.
 Sensitivity to Initialization: It can get stuck in local optima, so multiple runs with
different initializations are often recommended.
 Non-Spherical Clusters and Outliers: It struggles with non-spherical cluster shapes or
outliers.
 Choosing k: Selecting the optimal number of clusters is often challenging.
Applications:
 Customer Segmentation: Identifying customer groups with similar characteristics.
 Image Compression: Grouping similar pixels for compression.
 Anomaly Detection: Finding unusual data points that don't fit into clusters.
 Document Clustering: Grouping text documents based on content similarity.
 Gene Expression Analysis: Identifying patterns in gene expression data.

Hierarchical clustering
Hierarchical clustering is a method used in unsupervised machine learning to group similar data
points into clusters in a hierarchical manner. It builds a tree-like structure, called a dendrogram,
to represent the relationships between clusters. The two main approaches to hierarchical
clustering are agglomerative (bottom-up) and divisive (top-down).

Agglomerative Hierarchical Clustering:


1. Start: Treat each data point as a singleton cluster, and compute the pairwise distances
between all clusters.
2. Merge: Find the two closest clusters and merge them into a new cluster. Update the
distance matrix.
3. Repeat: Repeat step 2 until only a single cluster remains, forming a dendrogram.
4. Dendrogram Interpretation: The dendrogram can be cut at different heights to obtain
clusters at different levels of granularity.
Divisive Hierarchical Clustering:
1. Start: Treat the entire dataset as a single cluster.
2. Split: Identify the cluster with the maximum dissimilarity and split it into two clusters.
3. Repeat: Repeat step 2 until each data point forms a singleton cluster, creating a
dendrogram.
4. Dendrogram Interpretation: The dendrogram can be cut at different heights to obtain
clusters at different levels of granularity

Example:
Let's consider a small dataset for illustrative purposes:
Data Points: A, B, C, D, E
Distances:
- Dist(A, B) = 2
- Dist(A, C) = 3
- Dist(A, D) = 4
- Dist(A, E) = 5
- Dist(B, C) = 1
- Dist(B, D) = 6
- Dist(B, E) = 7
- Dist(C, D) = 8
- Dist(C, E) = 9
- Dist(D, E) = 10

Agglomerative Hierarchical Clustering:


1. Step 1: Treat each data point as a singleton cluster: {A}, {B}, {C}, {D}, {E}.
2. Step 2: Merge the closest clusters: {A}, {B, C}, {D}, {E} (Dist(B, C) = 1 is the smallest distance).
3. Step 3: Merge the closest clusters: {A, B, C}, {D}, {E} (single-linkage distance Dist(A, B) = 2).
4. Step 4: Merge the closest clusters: {A, B, C, D}, {E} (single-linkage distance Dist(A, D) = 4).
5. Step 5: Merge the remaining clusters: {A, B, C, D, E}.

Dendrogram Interpretation:
The dendrogram would show the hierarchy of merging clusters at different heights, and we can
choose the height at which we want to cut the dendrogram to obtain clusters.
Divisive Hierarchical Clustering:
1. Step 1: Treat the entire dataset as a single cluster: {A, B, C, D, E}.
2. Step 2: Split the cluster into two: {A, B, C}, {D, E}.
3. Step 3: Split the cluster further: {A, B}, {C}, {D, E}.
4. Step 4: Split the cluster further: {A}, {B}, {C}, {D, E}.
5. Step 5: Split the cluster further: {A}, {B}, {C}, {D}, {E}.
Dendrogram Interpretation:
Similar to agglomerative clustering, the dendrogram in divisive clustering represents the
hierarchy of splitting clusters.
In practice, the choice of distance metric and linkage criteria (how to measure the distance
between clusters) influences the results of hierarchical clustering.

Top-Down (Divisive) and Bottom-Up (Agglomerative) Hierarchical
Clustering:
In the context of unsupervised learning algorithms, specifically hierarchical clustering, the terms
"top-down" and "bottom-up" refer to two different approaches for building the hierarchical
structure, and "single-linkage" and "multiple linkage" refer to different strategies for measuring
the distance between clusters.
Top-Down (Divisive) and Bottom-Up (Agglomerative) Hierarchical Clustering:
1. Top-Down (Divisive) Hierarchical Clustering:
 This approach starts with the entire dataset as one cluster and recursively splits it
into smaller clusters until each data point forms a singleton cluster.
 At each step, the algorithm selects a cluster and divides it into two.
2. Bottom-Up (Agglomerative) Hierarchical Clustering:

 This approach starts with each data point as a singleton cluster and merges the
closest clusters until only one cluster remains.
 At each step, the algorithm merges the two closest clusters.

Single-Linkage and Multiple-Linkage:


1. Single-Linkage (Nearest-Neighbor Linkage):
 In single-linkage hierarchical clustering, the distance between two clusters is
defined as the shortest distance between any two points in the two clusters.
 The linkage criterion is based on the minimum distance between points in
different clusters.
2. Complete-Linkage (Farthest-Neighbor Linkage):
 In complete-linkage hierarchical clustering, the distance between two clusters is
defined as the longest distance between any two points in the two clusters.
 The linkage criterion is based on the maximum distance between points in
different clusters.
3. Average-Linkage:
 In average-linkage hierarchical clustering, the distance between two clusters is
defined as the average distance between all pairs of points in the two clusters.
 The linkage criterion is based on the average distance between points in different
clusters.
4. Ward's Method:
 Ward's method is another linkage criterion that minimizes the variance within
clusters when merging them.
 It tends to produce more balanced and compact clusters.


Dimensionality reduction
Dimensionality reduction is a technique used in machine learning and data analysis to reduce the
number of input features or variables while preserving the important information in the data.
This is often done to mitigate the curse of dimensionality, improve computational efficiency, and
avoid overfitting. Here are some popular algorithms for dimensionality reduction:
Key Algorithms:
1. Principal Component Analysis (PCA) (1901):
 Identifies orthogonal directions of maximum variance in the data.
 Projects data onto these principal components, creating lower-dimensional
representations.
 Widely used due to simplicity and effectiveness.
2. Linear Discriminant Analysis (LDA) (1936):
 Supervised method that considers class labels during projection.
 Maximizes class separability in the reduced space.
 Often used for classification tasks.
3. Factor Analysis (1904):
 Assumes underlying latent variables explain observed correlations between features.
 Models these latent factors to reduce dimensionality.
 Commonly used in social sciences and psychometrics.
4. t-Distributed Stochastic Neighbor Embedding (t-SNE) (2008):
 Non-linear technique for visualizing high-dimensional data in 2D or 3D.
 Preserves local structure while revealing global patterns.
 Popular for visual exploration of complex datasets.
5. Singular Value Decomposition (SVD) (1873):
 Matrix factorization technique with various applications, including dimensionality
reduction.
 Decomposes a matrix into three matrices: U, Σ, and V*.
 Truncated SVD approximates the original matrix with fewer dimensions.
6. Auto encoders (1980s - 1990s):
 Neural networks trained to reconstruct their input data.
 Learn compressed representations in the hidden layers.
 Flexible for non-linear dimensionality reduction.
7. Independent Component Analysis (ICA) (1990s):
 Finds independent components within the data.
 Useful for signal separation and blind source separation.
8. Random Projection (2006):
 Projects data onto a lower-dimensional subspace using random matrices.
 Often computationally efficient and preserves pairwise distances well.
9. Uniform Manifold Approximation and Projection (UMAP) (2018):
 Non-linear technique preserving global structure while revealing local patterns.
 Often used for visualization and clustering, often outperforming t-SNE.
Choosing the Right Algorithm:
 Data characteristics: linear vs. non-linear relationships, noise levels, etc.
 Purpose: visualization, compression, classification, feature selection, etc.
 Interpretability requirements: some algorithms yield more interpretable results.
 Computational efficiency: consider algorithm speed and scalability.

Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique widely
used in machine learning and data analysis. Its primary goal is to transform high-
dimensional data into a lower-dimensional representation while retaining as much of the
original variability as possible. PCA achieves this by identifying the directions (principal
components) in the data along which the variance is maximized.
Key Concepts:
1. Covariance Matrix:
 PCA starts by computing the covariance matrix of the input data. The covariance
matrix captures the relationships between different features, indicating how they
vary together.
2. Eigenvalues and Eigenvectors:
 PCA then calculates the eigenvalues and corresponding eigenvectors of the
covariance matrix. Eigenvectors represent the directions of maximum variance,
and eigenvalues indicate the magnitude of variance along those directions.
3. Principal Components:
 The eigenvectors become the principal components of the data. These are the new
coordinate axes in the transformed space. The first principal component
corresponds to the direction of maximum variance, the second to the second-
highest variance, and so on.
4. Explained Variance:
 Each eigenvalue represents the amount of variance captured by its corresponding
principal component. The total variance of the data remains constant, but PCA
allows for prioritizing the most important dimensions.
5. Dimensionality Reduction:
 The principal components are ranked by their corresponding eigenvalues. By
selecting the top-k principal components (where k is the desired reduced
dimensionality), you can create a lower-dimensional representation of the data.
Steps in PCA:
1. Standardization:
 Standardize the input features (subtract mean and divide by the standard
deviation) to ensure that each feature contributes equally to the analysis.
2. Covariance Matrix Calculation:
 Calculate the covariance matrix of the standardized data.
3. Eigen decomposition:
 Perform Eigen decomposition on the covariance matrix to obtain eigenvalues and
eigenvectors.
4. Select Principal Components:
 Rank the eigenvalues in descending order and choose the top-k eigenvectors to
form the principal components matrix.
5. Data Transformation:
 Project the original data onto the selected principal components to obtain the
lower-dimensional representation.
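A minimal NumPy walkthrough of these five steps (the iris dataset and k=2 retained components are illustrative choices):

import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# 1. Standardization
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(X_std.T)

# 3. Eigendecomposition (eigh suits symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Select the top-k principal components (eigh returns ascending order)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]

# 5. Project the data onto the principal components
X_pca = X_std @ W
print(X_pca.shape)                      # (150, 2)
print(eigvals[order] / eigvals.sum())   # explained variance ratios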

Program 1: Implement a program to perform principal component analysis on a dataset.
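The original listing is not reproduced; a hedged scikit-learn equivalent (iris as an illustrative dataset, two retained components):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # step 1: standardize

pca = PCA(n_components=2)                   # steps 2-5 handled internally
X_pca = pca.fit_transform(X_std)

print(X_pca[:5])
print("explained variance ratio:", pca.explained_variance_ratio_)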

Output:


Program 2: Implement a program to calculate the covariance matrix for a dataset.
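The original listing is not shown; a hedged sketch with NumPy (np.cov expects variables in rows, hence the transpose):

import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
cov_matrix = np.cov(X.T)    # 4 features -> a 4x4 covariance matrix
print(cov_matrix)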

Program 3: Implement a program to calculate the singular value decomposition for a
dataset.
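The original listing is not shown; a hedged sketch with np.linalg.svd that decomposes X into U, Σ, and Vᵀ and verifies the reconstruction:

import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
U, S, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD

print(U.shape, S.shape, Vt.shape)               # (150, 4) (4,) (4, 4)
print(np.allclose(X, U @ np.diag(S) @ Vt))      # True: X is recovered exactly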

Hierarchical clustering:
Program 1: Implement a program to perform hierarchical clustering on a dataset.
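The original listing is not shown; a hedged SciPy sketch that builds the linkage matrix and draws the dendrogram discussed above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

Z = linkage(X, method="ward")    # "single", "complete", "average" also work
dendrogram(Z)
plt.xlabel("sample index")
plt.ylabel("merge distance")
plt.show()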

Program 2: Implement a program to calculate the agglomerative clustering algorithm for a
hierarchical clustering.
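The original listing is not shown; a hedged scikit-learn sketch (linkage="single" corresponds to the nearest-neighbor linkage described earlier):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

agg = AgglomerativeClustering(n_clusters=3, linkage="single")
labels = agg.fit_predict(X)      # bottom-up merging until 3 clusters remain
print(labels[:20])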

Program 3: Implement a program to calculate the divisive clustering algorithm for a hierarchical
clustering.
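Common libraries offer no direct divisive implementation, so a hedged sketch can approximate it with bisecting k-means: start from one cluster and repeatedly split the largest cluster in two, top-down (the stopping rule here is a simple cluster count):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
labels = np.zeros(len(X), dtype=int)    # start: the whole dataset is one cluster

for _ in range(2):                      # split twice -> 3 clusters in total
    biggest = np.bincount(labels).argmax()          # pick the largest cluster
    idx = np.where(labels == biggest)[0]
    halves = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(X[idx])
    labels[idx[halves == 1]] = labels.max() + 1     # split it into two
print(np.bincount(labels))              # sizes of the resulting clusters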

Output:

Anomaly detection:

Program 1: Implement a program to detect anomalies in a dataset using the isolation forest
algorithm.
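The original listing is not shown; a hedged sketch on synthetic data (IsolationForest flags points with short average isolation paths; predictions are +1 for normal and −1 for anomalous):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # a dense normal cluster
               rng.uniform(-6, 6, size=(10, 2))])   # scattered anomalies

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
pred = iso.predict(X)
print("flagged anomalies:", int(np.sum(pred == -1)))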

Output:

Program 2: Implement a program to detect anomalies in a dataset using the one-class support
vector machine.
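The original listing is not shown; a hedged sketch in which a one-class SVM learns the boundary of the "normal" region (nu bounds the fraction of training points treated as outliers):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))       # normal data only
X_test = np.array([[0.1, 0.2], [5.0, 5.0]])     # one normal point, one outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(oc_svm.predict(X_test))                   # expected roughly [ 1 -1]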

Output:

Program 3: Implement a program to evaluate the performance of an anomaly detection
algorithm.
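The original listing is not shown; a hedged sketch assuming ground-truth anomaly labels are available, so the detector can be scored like a binary classifier (precision and recall suit the heavy class imbalance):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(-6, 6, size=(10, 2))])
y_true = np.array([0] * 200 + [1] * 10)         # 1 marks a true anomaly

pred = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
y_pred = (pred == -1).astype(int)               # map -1 (anomaly) to label 1
print(classification_report(y_true, y_pred,
                            target_names=["normal", "anomaly"]))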

Case Study: Customer Segmentation for E-commerce using Unsupervised
Learning
Introduction:
In this case study, we will explore how unsupervised learning techniques, specifically clustering
algorithms, can be applied to perform customer segmentation for an e-commerce business.
Customer segmentation helps businesses understand the diverse needs and behaviors of their
customers, enabling personalized marketing strategies, product recommendations, and improved
customer experiences.
Objective:
The goal is to identify distinct groups of customers based on their purchasing behavior,
preferences, and engagement with the e-commerce platform. This segmentation can provide
valuable insights for targeted marketing campaigns, product recommendations, and tailored
services.
Dataset:
Assume we have a dataset with the following features:
1. Customer ID: Unique identifier for each customer.
2. Purchase History: Information about the products purchased, including frequency,
recency, and monetary value.
3. Website Engagement: Data related to customer engagement, such as time spent on the
website, number of visits, etc.
4. Demographic Information: Age, gender, location, etc.
Steps:
1. Data Preprocessing:
 Handle missing values, if any.
 Standardize or normalize numerical features.
 Encode categorical variables.
2. Exploratory Data Analysis (EDA):
 Understand the distribution of each feature.
 Explore correlations between features.
 Identify outliers and decide whether to handle or remove them.
3. Feature Engineering:
 Create relevant features for analysis.
 Combine or transform features if needed.
4. Unsupervised Learning (Clustering):
 Apply clustering algorithms such as K-means, hierarchical clustering, or
DBSCAN.
 Choose the appropriate number of clusters based on the data and business
understanding.
 Analyze and interpret the results of clustering.
5. Customer Segmentation:
 Identify and label the segments created by the clustering algorithm.
 Analyze the characteristics of each segment.
 Understand the differences and similarities between segments.
6. Business Insights:
 Derive actionable insights for marketing, product development, and customer
engagement strategies based on the identified segments.

 Tailor marketing campaigns to address the specific needs of each segment.
 Optimize product recommendations and pricing strategies.
7. Evaluation:
 Evaluate the effectiveness of customer segmentation by monitoring key
performance indicators (KPIs) over time.
 Refine the segmentation approach if necessary.
Tools and Technologies:
 Python for data preprocessing, analysis, and visualization.
 Scikit-learn or other machine learning libraries for clustering algorithms.
 Matplotlib and Seaborn for data visualization.
Conclusion:
Customer segmentation using unsupervised learning provides businesses with a powerful tool to
enhance customer understanding and optimize marketing strategies. By leveraging the insights
gained from segmentation, e-commerce businesses can foster customer loyalty, improve
customer satisfaction, and drive overall business success.

Unit 4: Ensemble Algorithms
Bagging, Boosting, Random forests, AdaBoost,
Gradient boosting.
Case Study: Credit Card Fraud Detection:
Ensemble Learning

Ensemble Methods

4.1 Introduction
• In broad terms, using ensemble methods is about combining models to an ensemble
such that the ensemble has a better performance than an individual model on average.
• The main categories of ensemble methods involve (1) voting schemes among high-variance
models to prevent "outlier" predictions and overfitting, and (2) boosting "weak learners"
to become "strong learners."

4.2 Majority Voting


• We will use the term "majority" throughout this section in the context of voting to
refer to both majority and plurality voting.
• Plurality: the mode, i.e., the class that receives the most votes; for binary classification,
majority and plurality are the same.

Figure 1: Illustration of unanimity, majority, and plurality voting.
Figure 2: Illustration of the majority voting concept: a training set is used to fit n different
classification models {h1, h2, ..., hn}; their predictions y1, y2, ..., yn on new data are combined
by voting into a final prediction yf. Here, hi(x) = ŷi.

As we saw for nearest-neighbor methods, the majority (or plurality) vote can be expressed
more simply as the mode:

ŷf = mode{h1(x), h2(x), ..., hn(x)}.  (1)

The following illustration demonstrates why majority voting can be effective (under certain
assumptions).

• Given are n independent classifiers (h1, ..., hn) with a base error rate ε;
• Here, independent means that the errors are uncorrelated;
• Assume a binary classification task (there are two unique class labels).

Assuming the error rate is better than random guessing (i.e., lower than 0.5 for binary
classification),

∀i ∈ {1, 2, ..., n}: εᵢ < 0.5,  (2)

the error of the ensemble can be computed using a binomial probability distribution since
the ensemble makes a wrong prediction if more than 50% of the n classifiers make a wrong
prediction.
The probability that the ensemble makes a wrong prediction, where k classifiers predict
the same (wrong) class label (with k > ⌈n/2⌉ because of majority voting), is then

P(k) = C(n, k) · ε^k · (1 − ε)^(n−k),  (3)

where C(n, k) is the binomial coefficient

C(n, k) = n! / ((n − k)! k!).  (4)
However, we need to consider all cases k ∈ {⌈n/2⌉, ..., n} (the cumulative probability
distribution) to compute the ensemble error

ε_ens = Σ_{k=⌈n/2⌉}^{n} C(n, k) · ε^k · (1 − ε)^(n−k).  (5)

Consider the following example with n = 11 and ε = 0.25, where the ensemble error decreases
substantially compared to the error rate of the individual models:

ε_ens = Σ_{k=6}^{11} C(11, k) · 0.25^k · (1 − 0.25)^(11−k) = 0.034.  (6)

Figure 3: Ensemble error rate as a function of the base error rate: for base error rates below
0.5 the ensemble error falls below the base error, and for base error rates above 0.5 it rises above it.
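A minimal Python check of Eq. 5 and Eq. 6 (math.comb requires Python 3.8+):

import math

def ensemble_error(n, eps):
    k_start = math.ceil(n / 2)
    # sum over all majority-vote failure cases k = ceil(n/2), ..., n
    return sum(math.comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(k_start, n + 1))

print(ensemble_error(11, 0.25))   # ~0.034, matching Eq. 6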

4.3 Soft Majority Voting

For well-calibrated classifiers, we can also use the predicted class-membership probabilities
to infer the class label,

ŷ = arg max_j Σ_{i=1}^{n} wᵢ · p_{i,j},  (7)

where p_{i,j} is the predicted class-membership probability for class label j by the i-th
classifier. Here, wᵢ is an optional weighting parameter. If we set wᵢ = 1/n for all
i ∈ {1, ..., n}, then all probabilities are weighted uniformly.
To illustrate this, let us assume we have a binary classification problem with class labels
j ∈ {0, 1} and an ensemble of three classifiers hᵢ (i ∈ {1, 2, 3}) with weights
w₁ = 0.2, w₂ = 0.2, w₃ = 0.6:

h1(x) → [0.9, 0.1],  h2(x) → [0.8, 0.2],  h3(x) → [0.4, 0.6].  (8)

We can then calculate the weighted class-membership probabilities as follows:

p(j = 0 | x) = 0.2·0.9 + 0.2·0.8 + 0.6·0.4 = 0.58,
p(j = 1 | x) = 0.2·0.1 + 0.2·0.2 + 0.6·0.6 = 0.42.  (9)

The predicted class label is then

ŷ = arg max_j {p(j = 0 | x), p(j = 1 | x)} = 0.  (10)
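A minimal NumPy check of Eqs. 8-10:

import numpy as np

probas = np.array([[0.9, 0.1],    # h1(x)
                   [0.8, 0.2],    # h2(x)
                   [0.4, 0.6]])   # h3(x)
weights = np.array([0.2, 0.2, 0.6])

avg = weights @ probas            # -> [0.58, 0.42], matching Eq. 9
print(avg, "predicted class:", avg.argmax())   # class 0, matching Eq. 10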

4.4 Bagging
• Bagging relies on a concept similar to majority voting but uses the same learning
algorithm (typically a decision tree algorithm) to fit models on different subsets of the
training data (bootstrap samples).
• In a nutshell, a bootstrap sample is a sample of size n drawn with replacement from
an original training set D with |D| = n. Consequently, some training examples are
duplicated in each bootstrap sample, and some other training examples do not appear
in a given bootstrap sample at all (usually, we refer to these examples as the "out-of-bag
sample"). This is illustrated in Figure 5. We will revisit bootstrapping later in the
model evaluation lectures.
• In the limit, there are approx. 63.2% unique examples in a given bootstrap sample.
Consequently, approx. 36.8% of the examples from the original training set won't appear
in a given bootstrap sample at all. The justification is illustrated below in Eq. 11 to Eq. 13.
• Bagging can improve the accuracy of unstable models that tend to overfit.

Algorithm 1: Bagging
1: Let n be the number of bootstrap samples
2: for i = 1 to n do
3:     Draw a bootstrap sample of size m, Dᵢ
4:     Train base classifier hᵢ on Dᵢ
5: ŷ = mode{h1(x), ..., hn(x)}

Figure 4: The bagging algorithm.

Figure 5: Illustration of bootstrap sampling: each bootstrap sample is drawn with replacement
from the original dataset {x1, ..., x10}, so some examples repeat in the training set, while the
examples never drawn form the corresponding test (out-of-bag) set.


   • If we sample from a uniform distribution, we can compute the probability that a given
     example from a dataset of size n is not drawn as a bootstrap sample as

         P(not chosen) = (1 − 1/n)^n,    (11)

     which is asymptotically equivalent to

         1/e ≈ 0.368 as n → ∞.    (12)

     Vice versa, we can then compute the probability that an example is chosen as

         P(chosen) = 1 − (1 − 1/n)^n ≈ 0.632.    (13)

Figure 6: Proportion of unique training examples in a bootstrap sample.
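
A quick simulation (a sketch of ours, not part of the derivation above) confirms the 0.632
figure empirically:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    n = 100_000
    bootstrap_idx = rng.integers(0, n, size=n)   # draw n indices with replacement
    print(np.unique(bootstrap_idx).size / n)     # approx. 0.632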


Training example index | Bagging round 1 | Bagging round 2 | ...
          1            |        2        |        7        | ...
          2            |        2        |        3        | ...
          3            |        1        |        2        | ...
          4            |        3        |        1        | ...
          5            |        7        |        1        | ...
          6            |        2        |        7        | ...
          7            |        4        |        7        | ...
                       |       h1        |       h2        | ... hn

Figure 7: Illustration of bootstrapping in the context of bagging: each bagging round draws
a bootstrap sample of training example indices, and a classifier h_i is fit to each sample.

Figure 8: The concept of bagging. Bootstrap samples T1, T2, ..., Tn are drawn from the
training set; a classification model h_i is fit to each sample; the models' predictions
y1, y2, ..., yn on new data are combined via voting into the final prediction y_f. Here,
assume n different classifiers, {h1, h2, ..., hn}, where hi(x) = ŷi.

4.5 Bias and Variance Intuition


• “Bias and variance” will be discussed in more detail in the next lecture, where we
will decompose loss functions into their variance and bias components and see how it
relates to overfitting and underfitting.
Figure 9: Bias and variance intuition (dartboard analogy): low variance corresponds to
precise predictions and high variance to imprecise ones; low bias corresponds to accurate
predictions and high bias to inaccurate ones.

   • One can say that individual, unpruned decision trees "have high variance" (in this
     context, the individual decision trees "tend to overfit"); a bagging model has a lower
     variance than the individual trees and is less prone to overfitting – again, bias and
     variance decomposition will be discussed in more detail next lecture.

4.6 Boosting
• There are two broad categories of boosting: Adaptive boosting and gradient boosting.
• Adaptive and gradient boosting rely on the same concept of boosting “weak learners”
(such as decision tree stumps) to “strong learners.”

   • Here, a decision tree stump refers to a decision tree that only has one level, i.e.,
     h(x) = s(1[x_k ≥ t]), where s ∈ {−1, 1} assigns a class label to each side of the
     split threshold t, and k ∈ {1, ..., K} (K is the number of features).
   • Boosting is an iterative process, where the training set is reweighted, at each iteration,
     based on the mistakes a weak learner made (i.e., misclassifications); the two approaches,
     adaptive and gradient boosting, differ mainly regarding how the weights are updated
     and how the classifiers are combined.
General Boosting: starting from the original training sample, each round j fits a weak
learner h_j(x) to a reweighted training sample, yielding h_1(x), h_2(x), ..., h_m(x). The
boosted classifier combines the weak learners as

    h_m(x) = sign( Σ_{j=1}^{m} w_j h_j(x) )   for h(x) ∈ {−1, 1},

or, in the multiclass case,

    h_m(x) = arg max_i Σ_{j=1}^{m} w_j I[h_j(x) = i]   for h(x) = i, i ∈ {1, ..., n}.

Figure 10: A general outline of the boosting procedure for m iterations.

Intuitively, we can outline the general boosting procedure behind AdaBoost as follows:

• Initialize a weight vector with uniform weights

• Loop:

– Apply weak learner to weighted training examples (instead of orig. training set,
may draw bootstrap samples with weighted probability)
– Increase weight for misclassified examples

• (Weighted) majority voting on trained classifiers

4.6.1 AdaBoost (Adaptive Boosting)

   • AdaBoost (short for "adaptive boosting") was initially described by Freund and Schapire
     in 1997.

Figure 11: AdaBoost algorithm.


In Figure 11, 1(h_r(i) ≠ y_i) refers to the 0/1 loss,

    1(h_r(i) ≠ y_i) = 0 if h_r(i) = y_i, and 1 if h_r(i) ≠ y_i.

The early stopping criterion "if ε_r > 1/2 then stop" in Figure 11 assumes a binary classifi-
cation setting. Otherwise, it can be generalized to ε_r > 1 − (1/c), where c is the number of
unique class labels in the training set. For the general case of c classes, you need to extend
the algorithm on line 10 by replacing

    α_r := log[(1 − ε_r)/ε_r]

with

    α_r := log[(1 − ε_r)/ε_r] + log(c − 1),

as described in "Multi-class AdaBoost".
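
As a usage illustration, here is a minimal AdaBoost sketch with scikit-learn (assumed
available), using decision tree stumps (max_depth=1) as weak learners; scikit-learn's
AdaBoostClassifier implements the SAMME multi-class generalization referenced above, and
the parameter estimator is named base_estimator in versions before 1.2:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # a decision tree stump
        n_estimators=100,                               # boosting rounds
        random_state=1,
    )
    ada.fit(X, y)
    print(ada.score(X, y))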


1 2

x2 x2

x1 x1

3 4

x2 x2

x1 x1

Figure 12: Illustration of the AdaBoost algorithm for three iterations on a toy dataset. The size
of the symbols (circles shall represent training examples from one class, and triangles shall represent
training examples from another class) is proportional to the weighting of the training examples at
each round. The 4th subpanel shows a combination/ensemble of the hypotheses from subpanels
1-3.

4.6.2 Gradient Boosting

Similar to AdaBoost, gradient boosting is an ensemble method that combines multiple
weak learners (such as decision tree stumps) into a strong learner (here, a weak learner is
an algorithm that produces models with a low predictive performance; vice versa, a strong
learner is a learning algorithm that produces models with a high predictive performance).
Also, similar to AdaBoost (and in contrast to bagging), gradient boosting is a sequential
(rather than parallel) algorithm – it is powerful but rather expensive to train.
In contrast to AdaBoost, we do not adjust the weights for the training examples that have
been either misclassified or correctly classified. Also, we do not compute a weight for each
model in the sequential gradient boosting ensemble. Instead, in gradient boosting, we op-
timize a differentiable loss function (e.g., mean-squared-error for regression or negative log-
likelihood for classification) of a weak learner via consecutive rounds of boosting. The
output of gradient boosting is an additive model of multiple weak learners (in contrast to
AdaBoost, we do not apply majority voting to the ensemble of models).
Note that in gradient boosting, we may use decision trees (the most common base learning
algorithm for gradient boosting is in fact a decision tree algorithm), but we do not necessarily
use decision tree stumps. The first model in gradient boosting is basically just a decision
tree’s root node (on which we do majority voting in classification or averaging in regression).
The subsequent models in gradient boosting are deeper decision trees. The depth is usually
determined by the practitioner. Values like 8 or 32 are common, depending on the complexity
of the dataset/difficulty of the task – the max. tree depth is a hyperparameter to be tuned.
Conceptually, the main idea behind gradient boosting can be summarized via the following
three steps:
1. Construct a base tree (just the root node)

2. Build next tree based on errors of the previous tree


3. Combine tree from step 1 with trees from step 2. Go to step 2.

For the sake of simplicity, the following paragraphs will illustrate gradient boosting using
a regression (not classification) example. At the end of this section, we will briefly look
at how we can apply gradient boosting to decision tree classifiers – we won't go into too
much detail about classification specifics here, because we haven't covered (log-)likelihood
maximization and logistic regression, yet.
Suppose you have the following dataset (Table 1):

x1 = # Rooms   x2 = City     x3 = Age   y = Price (in million US Dollars)

     5         Boston        30         1.5
    10         Madison       20         0.5
     6         Lansing       20         0.25
     5         Waunakee      10         0.1

Table 1: Example dataset for illustrating gradient boosting.

Step 1. The first step is to construct the root node. As we remember from the decision tree
lecture, making a prediction based on all training examples at a given node in a decision tree
regressor is basically just computing the average target value. Hence, step 1 is computing
the prediction

    ŷ_1 = (1/n) Σ_{i=1}^{n} y^(i) = 0.5875

given the dataset in Table 1.

Step 2. For the second step, we first compute the so-called pseudo residuals. These pseudo
residuals are basically the difference between the true target value and the predicted target
value. For the regression case, the residual based on the output from step 1 is simply

    r_1^(i) = y^(i) − ŷ_1.

Note that we use the term ”pseudo” residual, because there are different ways to compute
these differences and the term ”residual” has a specific meaning in regression analysis.
Now, after computing the residuals for all training examples, we can add a new column to
our dataset, which we refer to as r1 . This is shown in Table 2.

x1 = # Rooms   x2 = City    x3 = Age   y = Price   r1 = Residual

     5         Boston       30         1.5         1.5 − 0.5875 = 0.9125
    10         Madison      20         0.5         0.5 − 0.5875 = −0.0875
     6         Lansing      20         0.25        0.25 − 0.5875 = −0.3375
     5         Waunakee     10         0.1         0.1 − 0.5875 = −0.4875

Table 2: Example dataset for illustrating gradient boosting, including the residuals after Step 1.
The next part of step 2 is to fit a new decision tree to the residuals. Note that we usually
limit the depth of this decision tree (e.g., setting a max depth, which is a hyperparameter).
An example decision tree fit to these residuals may look as shown in Figure 13: the root
splits on "Age >= 30" (leaf value 0.9125 for the "yes" branch); the "no" branch splits on
"# Rooms >= 10" (leaf value −0.0875 for "yes", and the remaining leaf averages the residuals
−0.3375 and −0.4875 to −0.4125).


Figure 13: A decision tree fit to the residuals r1 . Note that in the left-most node, the residuals
are averaged as it is typical in regression trees where we have multiple training examples at a given
node.

Step 3. In this third step, we combine the tree from step 1 (the root node) and the tree
from step 2. For example, if we were to make a prediction for ”Lansing” based on the dataset
in Table 1, the prediction would be

ŷLansing = 0.5875 + α × (−0.4125).

Here, 0.5875 is the value predicted by the root node from step 1. And −0.4125 is the residual
(r1 value) from step 2. The coefficient α is a learning rate or step size parameter in the
range [0, 1] – it helps to not add the full residual but a rescaled version (a smaller ”step”).
A typical value for α would be 0.01 or 0.1, but this is also a hyperparameter that needs to
be tuned manually. Empirically, choosing an alpha value that is too large will likely result
in a gradient boosting model with high variance (a model that tends to overfit).
After step 3, we would now go back to step 2 and repeat the procedure (step 2 and step
3) T times. In each round, we take the predictions from step 3 and compute a new set
of residuals for fitting the next tree. For example, for the ”Lansing” example above with
alpha = 0.1, the new residual r2 would be

r2,Lansing = yLansing − ŷLansing = 0.25 − 0.5875 + 0.1 × (−0.4125) = −0.37875.

We would do this for all training examples and add a new column r2 to Table 2. Then, we
would fit a tree on these residuals, and so forth.
The general algorithm is summarized below.
Here, r_{i,t} denotes the pseudo residual of the i-th training example in the t-th round.

Algorithm 1 Gradient Boosting
 1: Initialize T: the number of trees (gradient boosting rounds)
 2: Initialize D: the training dataset, {⟨x^(i), y^(i)⟩}_{i=1}^{n}
 3: Choose L(y^(i), h(x^(i))), a differentiable loss function
 4: Step 1: Initialize model h_0(x) = argmin_ŷ Σ_{i=1}^{n} L(y^(i), ŷ)   [root node]
 5: Step 2:
 6: for t = 1 to T do
 7:    A. Compute pseudo residuals r_{i,t} = −[∂L(y^(i), h(x^(i))) / ∂h(x^(i))] evaluated at h(x) = h_{t−1}(x), for i = 1 to n
 8:    B. Fit a tree to the r_{i,t} values, creating terminal regions R_{j,t} for j = 1, ..., J_t
 9:    C. for j = 1 to J_t do
10:          ŷ_{j,t} = argmin_ŷ Σ_{x^(i) ∈ R_{j,t}} L(y^(i), h_{t−1}(x^(i)) + ŷ)
11:    D. Update h_t(x) = h_{t−1}(x) + α Σ_{j=1}^{J_t} ŷ_{j,t} I[x ∈ R_{j,t}]
12: Step 3: Return h_T(x)

In a regression case, the differentiable loss function in Line 3 could be the squared error

    L(y^(i), h(x^(i))) = (1/2) (y^(i) − h(x^(i)))².

(We add a 1/2 scaling factor because it cancels nicely in the partial derivative.) Via the
chain rule, we can differentiate it as follows, with respect to the prediction:

    ∂/∂h(x^(i)) [ (1/2) (y^(i) − h(x^(i)))² ] = (1/2) · 2 · (y^(i) − h(x^(i))) · (0 − 1)    (14)
                                              = −(y^(i) − h(x^(i))).                        (15)

Here, the expression −(y^(i) − h(x^(i))) is the negative residual.




Note that for the final prediction, we combine all T trees:

    h_T(x) = h_0(x) + α ŷ_{j,1} + ... + α ŷ_{j,T}.

We can apply a similar concept for fitting a gradient boosting model for classification. The
main algorithm is the same, except that we use a different loss function (minimizing the
negative log-likelihood). Also, there are minor details for scaling the ”pseudo residuals.”
However, since we haven’t talked about logistic regression and negative log-likelihood opti-
mization, this is out-of-scope at this point.
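
To make steps 1-3 concrete, here is a minimal regression sketch for the toy data in Table 1
(scikit-learn and NumPy assumed available; the city column is omitted for simplicity, and
T = 100 and α = 0.1 are arbitrary illustrative choices):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.array([[5, 30], [10, 20], [6, 20], [5, 10]])  # rooms, age
    y = np.array([1.5, 0.5, 0.25, 0.1])                  # price

    alpha, T = 0.1, 100
    pred = np.full(y.shape, y.mean())          # step 1: root-node prediction (0.5875)
    for t in range(T):
        residuals = y - pred                   # step 2: pseudo residuals (squared-error loss)
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred = pred + alpha * tree.predict(X)  # step 3: add the rescaled tree output
    print(pred)                                # approaches y as T grows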

4.7 Random Forests

4.7.1 Overview

   • Random forests are among the most widely used machine learning algorithms, probably
     due to their relatively good performance "out of the box" and ease of use (not much
     tuning is required to get good results).
• In the context of bagging, random forests are relatively easy to understand conceptu-
ally: the random forest algorithm can be understood as bagging with decision trees,
but instead of growing the decision trees by basing the splitting criterion on the com-
plete feature set, we use random feature subsets.
   • To summarize, in random forests, we fit decision trees on different bootstrap samples,
     and in addition, for each decision tree, we select a random subset of features at each
     node to decide upon the optimal split; while the size of the feature subset to consider
     at each node is a hyperparameter that we can tune, a "rule-of-thumb" suggestion is
     to use NumFeatures = log2(m) + 1. A minimal usage sketch follows this list.
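
A minimal scikit-learn sketch (library assumed available); here, max_features plays the
role of the feature-subset size discussed above:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(
        n_estimators=100,     # number of trees, each fit to a bootstrap sample
        max_features="sqrt",  # size of the random feature subset per split
        random_state=1,
    )
    rf.fit(X, y)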
4.7.2 Does random forest select a subset of features for every tree or every
node?

Earlier random decision forests by Tin Kam Ho used the "random subspace method," where
each tree got a random subset of features.

“The essence of the method is to build multiple trees in randomly selected sub-
spaces of the feature space.” – Tin Kam Ho

However, a few years later, Leo Breiman described the procedure of selecting different subsets
of features for each node (while a tree was given the full set of features) — Leo Breiman's
formulation has become the "trademark" random forest algorithm that we typically refer to
these days when we speak of "random forest":

    "... random forest with random features is formed by selecting at random, at
    each node, a small group of input variables to split on."

4.7.3 Generalization Error

• The reason why random forests may work better in practice than a regular bagging
model, for example, may be explained by the additional randomization that further
diversifies the individual trees (i.e., decorrelates them).
• In Breiman’s random forest paper, the upper bound of the generalization error is given
as

    PE ≤ ρ̄ (1 − s²) / s²,    (16)

where ρ̄ is the average correlation among trees and s measures the strength of the trees as
classifiers, i.e., their average predictive performance in terms of the classifiers' margin. We
do not need to get into the details of how ρ̄ and s are calculated to get an intuition for their
relationship: the lower the correlation, the lower the error, and the higher the strength of
the ensemble of trees, the lower the error. Randomizing the feature subspaces may decrease
the "strength" of the individual trees, but at the same time it reduces the correlation among
them. Compared to bagging, random forests may thus hit a sweet spot where the trade-off
between correlation and strength yields a better net result.

4.7.4 Feature Importance via Random Forests

While random forests are naturally less interpretable than individual decision trees, where
we can trace a decision via a set of rules, it is possible (and common) to compute the so-called
"feature importance" of the inputs – that means, we can infer how important a feature is for
the overall prediction. However, this is a topic that will be discussed later in the "Feature
Selection" lecture.
4.7.5 Extremely Randomized Trees (ExtraTrees)

   • A few years after random forests were developed, an even "more random" procedure
     was proposed, called Extremely Randomized Trees (ExtraTrees).
   • Compared to regular random forests, the ExtraTrees algorithm selects a random fea-
     ture at each decision tree node for splitting; hence, it is very fast because there is no
     information gain computation and feature comparison step. A minimal usage sketch
     follows this list.
   • Intuitively, one might say that ExtraTrees have another "random component" (com-
     pared to random forests) to further reduce the correlation among trees – however, it
     might decrease the strength of the individual trees (if you think back to the general-
     ization error bound discussed in the previous section on random forests).
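
A minimal usage sketch with scikit-learn (assumed available); note that scikit-learn's
ExtraTreesClassifier draws random split thresholds per candidate feature and then picks
the best of these candidates, a slightly less extreme randomization than described above:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import ExtraTreesClassifier

    X, y = load_iris(return_X_y=True)
    et = ExtraTreesClassifier(n_estimators=100, random_state=1)
    et.fit(X, y)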

4.8 Stacking

4.8.1 Overview

   • Stacking is a special case of ensembling where we combine an ensemble of models
     through a so-called meta-classifier.

• In general, in stacking, we have “base learners” that learn from the initial training
set, and the resulting models then make predictions that serve as input features to a
“meta-learner.”

4.8.2 Naive Stacking

Algorithm 1 "Naive" Stacking
 1: Input: Training set D = {⟨x^[1], y^[1]⟩, ..., ⟨x^[n], y^[n]⟩}
 2: Output: Ensemble classifier h_E
 3:
 4: Step 1: Learn base-classifiers
 5: for t ← 1 to T do
 6:    Fit base model h_t on D
 7: Step 2: Construct new dataset D′ from D
 8: for i ← 1 to n do
 9:    Add ⟨x′^[i], y^[i]⟩ to the new dataset, where x′^[i] = {h_1(x^[i]), ..., h_T(x^[i])}
10: Step 3: Learn meta-classifier h_E on D′
11: return h_E

Figure 14: Outline of the naive stacking algorithm.


Figure 15: Basic concept of stacking, which is relatively similar to the voting classifier at the
beginning of this lecture: base learners h1, ..., hn are fit to the training set, and their predictions
y1, ..., yn on new data serve as inputs to a meta-classifier that produces the final prediction y_f.
Note that here, in contrast to majority voting, a meta-classifier (rather than a vote) combines
the base models' predictions.

The problem with the naive stacking algorithm outlined in Fig. 14 and Fig. 15 is that it
tends to overfit heavily: if the base learners overfit, the meta-classifier relies on exactly
these overfit predictions. A better alternative is to use stacking with k-fold cross-validation
or leave-one-out cross-validation.

Figure 16: Illustration of k-fold cross-validation (here, k = 5). (A) The training data is split
into k folds; in each of the k iterations, one fold serves as the validation fold and the remaining
k − 1 folds as training folds, and the overall performance is the average of the k per-fold
performance estimates. (B, C) In each iteration, the learning algorithm fits a model to the
training-fold data given some hyperparameter values, and the model's predictions on the
validation fold are compared to the validation-fold labels to compute the performance.


Training evaluation

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

Figure 17: Illustration of leave-one-out cross-validation, which is a special case of k -fold cross-
validation, where k = n (where n is the number of examples in the training set).

4.8.3 Stacking with Cross-Validation

• The use of cross-validation (or leave-one-out cross-validation) is highly recommended


for performing stacking, to avoid overfitting.
Algorithm 1 Stacking with cross-validation
 1: Input: Training set D = {⟨x^[1], y^[1]⟩, ..., ⟨x^[n], y^[n]⟩}
 2: Output: Ensemble classifier h_E
 3:
 4: Step 1: Learn base-classifiers
 5: Construct new dataset D′ = {}
 6: Randomly split D into k equal-size subsets: D = {D_1, ..., D_k}
 7: for j ← 1 to k do
 8:    for t ← 1 to T do
 9:       Fit base model h_t on D \ D_j
10:    for each example ⟨x^[i], y^[i]⟩ in D_j do
11:       Add ⟨x′^[i], y^[i]⟩ to D′, where x′^[i] = {h_1(x^[i]), ..., h_T(x^[i])}
12: Step 2: Learn meta-classifier h_E on D′
13: return h_E

Figure 18: Stacking with k -fold cross-validation

Figure 19: Illustration of stacking with cross-validation. The training set is split into
training folds and a validation fold; in each of the k iterations, the base classifiers h1, ..., hn
are fit to the training folds and produce level-1 predictions y1, ..., yn for the validation fold.
The collected level-1 predictions from all k iterations are then used to train the meta-
classifier, which produces the final prediction y_f.
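
In code, scikit-learn's StackingClassifier (library assumed available) mirrors the algorithm
of Figure 18: it trains the meta-classifier on out-of-fold predictions of the base learners:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=1)),
                    ("svm", SVC(probability=True, random_state=1))],
        final_estimator=LogisticRegression(),  # the meta-classifier h_E
        cv=5,  # k-fold CV used to build the level-1 training data D'
    )
    stack.fit(X, y)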
Title: Enhancing Credit Card Fraud Detection Using Ensemble Learning: A Case Study

Abstract:

Credit card fraud poses a significant challenge to financial institutions, requiring advanced
techniques for timely detection and prevention. This case study explores the application of
ensemble learning methods to enhance the accuracy and robustness of credit card fraud detection
systems. Ensemble learning combines multiple models to create a stronger, more reliable
predictive system. In this study, we leverage various algorithms such as Random Forest,
Gradient Boosting, and Bagging, and evaluate their performance in comparison to individual
models.

1. Introduction:

Credit card fraud is a prevalent issue affecting both financial institutions and cardholders.
Traditional methods of fraud detection often struggle to keep up with evolving tactics employed
by fraudsters. Ensemble learning, which combines the strengths of multiple models, offers a
promising solution to enhance fraud detection accuracy.

2. Dataset:

A comprehensive dataset containing historical credit card transactions is used for training and
testing the models. The dataset includes features such as transaction amount, location, time, and
various transaction patterns. A subset of the data contains instances of fraudulent transactions,
providing a balanced representation of both classes.

3. Ensemble Learning Algorithms:

Several ensemble learning algorithms are implemented and compared, including Random Forest,
Gradient Boosting, and Bagging. Each algorithm is fine-tuned to optimize performance and
reduce false positives and false negatives.

4. Feature Engineering:

To improve the effectiveness of the models, feature engineering techniques are applied to extract
meaningful information from the dataset. This includes normalizing transaction amounts,
encoding categorical variables, and extracting temporal features.

5. Model Evaluation:

The ensemble models are evaluated using standard metrics such as accuracy, precision, recall, F1
score, and ROC-AUC. The evaluation process involves cross-validation and testing on a separate
dataset to ensure generalizability.

6. Results:
The results of the ensemble models are compared with those of individual models and traditional
fraud detection methods. The ensemble learning approach is expected to demonstrate superior
performance in terms of accuracy and robustness.

7. Interpretability:

An analysis of the interpretability of the ensemble models is conducted to provide insights into
the decision-making process. Understanding the factors contributing to fraud detection can aid in
model validation and trust-building.

8. Conclusion:

The case study concludes with a summary of the findings, highlighting the effectiveness of
ensemble learning in credit card fraud detection. Recommendations for implementing these
models in real-world scenarios and potential avenues for future research are also discussed.
Unit 5: Machine Learning in Practice
Data preprocessing, Model selection, Evaluation, Deployment, Ethics of machine learning

Data preprocessing
Data preprocessing is a crucial step in the machine learning pipeline that involves cleaning,
organizing, and transforming raw data into a format suitable for training and evaluating machine
learning models. The goal is to enhance the quality of the data, address potential issues, and
prepare it for effective model training. Here are key aspects of data preprocessing in the context
of machine learning projects:
1. Handling Missing Data:
 Identify and handle missing values in the dataset. This may involve imputing
missing values using statistical measures (mean, median, mode) or advanced
techniques such as interpolation.
2. Dealing with Outliers:
 Detect and address outliers that could significantly impact model performance.
This may involve removing outliers or transforming them to bring them within an
acceptable range.
3. Data Cleaning:
 Clean the dataset by addressing inconsistencies, errors, or discrepancies. This
includes fixing typos, standardizing formats, and ensuring data consistency.
4. Feature Scaling:
 Normalize or scale features to ensure that all features contribute equally to model
training. Common methods include Min-Max scaling or standardization (Z-score
normalization).
5. Encoding Categorical Variables:
 Convert categorical variables into a format suitable for machine learning
algorithms. This often involves one-hot encoding, where categorical variables are
transformed into binary vectors.
6. Handling Imbalanced Data:
 Address class imbalances in classification tasks to prevent models from being
biased towards the majority class. Techniques include oversampling,
undersampling, or using specialized algorithms for imbalanced data.
7. Feature Engineering:
 Create new features or transform existing ones to capture more relevant
information for the model. Feature engineering can improve a model's ability to
understand complex relationships within the data.
8. Data Splitting:
 Split the dataset into training, validation, and test sets. The training set is used to
train the model, the validation set helps tune hyperparameters, and the test set
evaluates the model's performance on unseen data.
9. Handling Time-Series Data:
 For time-series data, handle temporal aspects such as time gaps, missing values,
and seasonality. This may involve resampling, lag features, or other time-specific
preprocessing steps.
10. Documentation and Logging:
 Document all steps taken during data preprocessing and maintain a record of
transformations applied. Logging helps in reproducing results, debugging, and
ensuring transparency in the machine learning pipeline.
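
A minimal preprocessing sketch with scikit-learn (assumed available); the column names
below are hypothetical placeholders for a real dataset's features:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_features = ["age", "income"]  # hypothetical numeric columns
    categorical_features = ["city"]       # hypothetical categorical column

    preprocess = ColumnTransformer([
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),  # handle missing data
            ("scale", StandardScaler()),                   # feature scaling
        ]), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ])
    # preprocess can be chained with an estimator in a single Pipeline.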

Model Selection in Machine Learning Projects: Picking the Right Tool for the Job
Choosing the right model for your ML project is like picking the perfect tool for a DIY project.
You wouldn't use a screwdriver for hammering, and you wouldn't use a chainsaw for trimming
nails! Similarly, in the vast toolbox of ML algorithms, selecting the best one for your specific
problem is crucial for success. Let's break down the practical side of model selection:
Why is it important?
Imagine throwing every tool you own at a problem! It's probably going to be messy and
inefficient. Likewise, using the wrong model in your project can lead to:
 Poor performance: Inaccurate predictions, weak generalizability, and wasted resources.
 Overfitting: Memorizing the training data but failing on unseen examples.
 Underfitting: Failing to capture the underlying patterns in the data.
Factors to Consider:
 Problem type: Regression, classification, clustering, etc. Each requires different types of
models.
 Data characteristics: Size, dimensionality, complexity, distribution. Some models work
better with specific data types.
 Project constraints: Training time, computational resources, model
interpretability. Different models have different demands.
 Business needs: Accuracy, explainability, real-time predictions. Prioritize metrics
relevant to your goals.
Common Model Selection Techniques:
 Hold-out validation: Split your data into training and testing sets, train models on the
training set, and evaluate them on the unseen testing set.
 K-fold cross-validation: Randomly split your data into k folds, train models on k-1
folds, and test on the remaining fold, repeat for all k folds, and average the results.
 Grid search and hyperparameter tuning: Explore different combinations of model
parameters (hyperparameters) to find the optimal configuration for your data.
 Model comparison metrics: Accuracy, precision, recall, F1-score, AUC-ROC. Choose
metrics that align with your project goals.
Real-world Example:
Building a model to predict house prices (a code sketch comparing candidate models follows this list). You might consider:
 Regression models: Linear regression, decision trees, or gradient boosting for continuous
price prediction.
 Feature engineering: Create features like "distance to amenities" or "average price in the
neighborhood".
 Model comparison: Use k-fold cross-validation to compare performance of different
models on the same data.
 Choose the model: Select the one with the best average score on the chosen metric
(e.g., mean squared error) while considering additional factors like interpretability for
explaining price variations.
Tips for Effective Selection:
 Start simple: Don't jump into complex models too quickly. Consider baseline models for
initial comparisons.
 Iterate and refine: Keep trying different models, hyperparameters, and feature
engineering techniques.
 Understand the trade-offs: No model is perfect. Balance accuracy with factors like
training time, interpretability, and resource constraints.
 Visualize your results: Use plots and charts to understand model behavior and compare
performance.
Remember, model selection is an art, not a science. There's no one-size-fits-all solution. By
carefully considering your specific project context and applying these practical tips, you can
choose the right tool for the job and build successful ML models that deliver real value.

Model evaluation
Model evaluation is a critical aspect of machine learning projects that involves assessing the
performance and generalization ability of a trained model. It helps determine how well the model
is likely to perform on new, unseen data. Here are key considerations and techniques for model
evaluation in the context of machine learning projects; a short metrics sketch follows the list:
1. Metrics Selection:
 Choose appropriate evaluation metrics based on the nature of the problem.
Common metrics include accuracy, precision, recall, F1 score for classification
tasks, and mean squared error, R-squared for regression tasks. The choice of
metric depends on the project goals and the specific characteristics of the data.
2. Confusion Matrix:
 For classification problems, construct a confusion matrix to visualize the model's
performance in terms of true positives, true negatives, false positives, and false
negatives. This aids in understanding the types of errors the model makes.
3. Cross-Validation:
 Implement cross-validation techniques, such as k-fold cross-validation or leave-
one-out cross-validation, to robustly assess model performance. This helps
mitigate the impact of variability in the training and test datasets.
4. Learning Curves:
 Analyze learning curves to understand how the model's performance changes with
respect to the amount of training data. This helps identify issues like overfitting or
underfitting.
5. ROC Curve and AUC-ROC:
 For binary classification problems, plot Receiver Operating Characteristic (ROC)
curves and calculate the Area Under the Curve (AUC-ROC). This provides
insights into the trade-off between sensitivity and specificity.
6. Precision-Recall Curve:
 For imbalanced classification problems, examine precision-recall curves, which
illustrate the precision-recall trade-off at different decision thresholds.
7. Feature Importance:
 Assess the importance of features in the model to understand which features
contribute most to the predictions. This can be done using techniques such as
permutation importance or feature importance plots.
8. Model Comparison:
 Compare the performance of multiple models to select the best-performing one.
This can involve statistical tests or visualizations to highlight differences in
performance.
9. Bias and Fairness Evaluation:
 Evaluate the model for bias and fairness, especially in sensitive applications.
Analyze the model's behavior across different demographic groups to ensure
equitable outcomes.
10. Deployment Considerations:
 Consider the practical implications of deploying the model, including the
potential impact on end-users and the broader environment. Evaluate the model's
robustness to changes in the input data distribution.
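
A short sketch of the basic metrics (scikit-learn assumed available), on hypothetical
labels and predictions:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    y_true = [0, 1, 1, 0, 1, 0, 1, 1]  # hypothetical ground-truth labels
    y_pred = [0, 1, 0, 0, 1, 1, 1, 1]  # hypothetical model predictions

    print(confusion_matrix(y_true, y_pred))
    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1 score :", f1_score(y_true, y_pred))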
Model deployment
Model deployment in the context of machine learning projects refers to the process of taking a
trained machine learning model and integrating it into a production environment where it can be
used to make predictions on new, unseen data. It involves several steps to ensure that the model
functions effectively, reliably, and securely in real-world applications. Here's an overview of the
key considerations in model deployment; a minimal serving sketch follows the list:
1. Model Serialization:
 Serialize the trained model to a format that can be easily loaded and utilized by
the deployment environment. Common serialization formats include pickle,
joblib, or formats compatible with specific deployment frameworks.
2. Scalability:
 Ensure that the deployed model can scale to handle varying workloads and data
volumes. This may involve using scalable infrastructure, containerization (e.g.,
Docker), or cloud-based solutions.
3. API Development:
 Create an API (Application Programming Interface) to expose the model's
functionality, allowing other software applications to interact with and make
predictions using the model. RESTful APIs are commonly used for this purpose.
4. Input Validation:
 Implement robust input validation to ensure that the incoming data meets the
model's expectations. This includes checking data types, ranges, and handling
missing values appropriately.
5. Security Measures:
 Implement security measures to protect the deployed model from potential
vulnerabilities. This may involve encryption of data in transit, access controls, and
regular security audits.
6. Monitoring and Logging:
 Set up monitoring systems to track the model's performance in real-time. Logging
should capture relevant information, including input data, predictions, and any
issues that may arise during deployment.
7. Versioning:
 Establish version control for the deployed models to track changes and updates.
This ensures that different versions of the model can be easily managed, rolled
back, or upgraded without disrupting the system.
8. Model A/B Testing:
 Implement A/B testing methodologies to evaluate the performance of different
model versions in a live environment. This allows for data-driven decisions
regarding model improvements or changes.
9. Downtime Mitigation:
 Plan for and mitigate downtime during updates or maintenance. This may involve
deploying redundant instances, load balancing, or using strategies like canary
releases to minimize service interruptions.
10. Documentation:
 Create comprehensive documentation for the deployed model, including
information on the API, input requirements, output format, and any specific
considerations for users or developers interacting with the model.
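
A minimal serving sketch (joblib and Flask assumed available); "model.joblib" is a
hypothetical file produced earlier with joblib.dump:

    import joblib
    from flask import Flask, jsonify, request

    model = joblib.load("model.joblib")  # hypothetical serialized model file
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json().get("features")
        if not isinstance(features, list):          # basic input validation
            return jsonify({"error": "'features' must be a list"}), 400
        prediction = model.predict([features])[0]
        return jsonify({"prediction": float(prediction)})

    if __name__ == "__main__":
        app.run()  # in production, serve behind a WSGI server such as gunicorn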
Model deployment is a critical phase that bridges the gap between the development of machine
learning models and their practical application in real-world scenarios. Effective deployment
ensures that the benefits of the trained model are realized in production while maintaining
performance, reliability, and security. It requires collaboration between data scientists, software
engineers, and DevOps teams to create a seamless and efficient deployment pipeline.

Ethics in machine learning


Ethics in machine learning refers to the responsible and fair development, deployment, and use
of machine learning models and algorithms. In the context of machine learning projects,
addressing ethical considerations is crucial to ensure that the technology benefits society without
causing harm or reinforcing biases. Here are key aspects of ethics in machine learning projects:
1. Fairness and Bias:
 Assess and mitigate biases in the training data and model predictions. Pay
attention to potential disparities across different demographic groups to avoid
reinforcing or exacerbating existing societal biases.
2. Transparency:
 Strive for transparency in model development and decision-making processes.
Clearly communicate how models work, the data used, and the potential impact of
model predictions. Transparent models enhance trust and accountability.
3. Accountability:
 Establish accountability for the impact of machine learning models. Clearly
define roles and responsibilities, and hold individuals and organizations
accountable for the ethical implications of their models.
4. Data Privacy:
 Prioritize data privacy by implementing measures to protect sensitive information.
Comply with data protection regulations, and consider anonymization or
encryption techniques to safeguard user privacy.
5. Informed Consent:
 Obtain informed consent from individuals whose data is used for training models.
Clearly communicate the purposes of data collection and how the data will be
used, giving users the option to opt in or opt out.
6. Robustness and Reliability:
 Ensure that machine learning models are robust and reliable across different
contexts. Consider potential adversarial attacks and implement measures to
enhance model robustness in real-world scenarios.
7. Interpretability:
 Design models that are interpretable and understandable by humans.
Interpretability aids in understanding model decisions, allowing stakeholders to
identify and rectify potential biases or ethical concerns.
8. Continual Monitoring:
 Implement ongoing monitoring of deployed models to identify and address ethical
issues that may arise as the model interacts with new data in a live environment.
9. Stakeholder Engagement:
 Involve diverse stakeholders, including affected communities, in the decision-
making process. Gather input from individuals who may be impacted by the use
of machine learning models to ensure a more comprehensive understanding of
potential ethical implications.
10. Societal Impact Assessment:
 Conduct a societal impact assessment to anticipate and evaluate the broader
consequences of deploying machine learning models. This includes considering
potential economic, social, and environmental impacts.
Ethics in machine learning is an evolving field, and it requires a proactive and collaborative
effort from data scientists, developers, policymakers, and other stakeholders. By incorporating
ethical considerations into the entire machine learning lifecycle, practitioners can help build and
deploy models that contribute positively to society while minimizing risks and negative
consequences. Regularly reassessing ethical practices and staying informed about evolving
ethical guidelines is essential in the rapidly changing landscape of machine learning technology.
