Machine Learning (ML) is a branch of Artificial Intelligence (AI) that works on algorithm development and statistical models that allow computers to learn from data and make predictions or decisions without being explicitly programmed.

How does Machine Learning Work?

The machine learning process includes project setup, data preparation, modeling, and deployment. The following figure demonstrates the common working process of machine learning, which follows a set of steps to perform the task.

Sequential Process flow of Machine Learning

Data Collection − Data collection is the initial step in the machine learning process. In this stage, data is collected from different sources such as databases, text files, pictures, sound files, or web scraping.

Data Pre-processing − This is a key step in the machine learning process. It involves deleting duplicate data, fixing errors, managing missing data (either by eliminating it or filling it in), and adjusting and formatting the data.

Choosing the Right Model − The next step is to select a machine learning model. Once the data is prepared, it is applied to ML models such as linear regression, decision trees, or neural networks, whichever is selected to implement.

Training the Model − This step involves training the model on the data so it can make better predictions.

Evaluating the Model − Once the model is trained, it has to be tested on new data that it has not seen during training.

Hyperparameter Tuning and Optimization − After evaluating the model, you may need to adjust its hyperparameters to make it more efficient.

Predictions and Deployment − When the model has been trained and optimized, it is ready to estimate new data. This is done by feeding new data to the model and using its output for decision-making or other analysis.

Types of Machine Learning

1. Supervised Machine Learning − A type of machine learning that trains the model using labeled datasets to predict outcomes.

2. Unsupervised Machine Learning − A type of machine learning that learns patterns and structures within the data without human supervision.

3. Semi-supervised Learning − A type of machine learning that is neither fully supervised nor fully unsupervised. Semi-supervised learning algorithms fall between supervised and unsupervised learning methods.

4. Reinforcement Machine Learning − A type of machine learning that is similar to supervised learning but does not use sample data to train the algorithm. The model learns by trial and error.

Several machine learning algorithms are commonly used. These include:

Neural Networks − They work like the human brain, with many connected nodes. They help to find patterns and are used in language processing, image and speech recognition, and image generation.

Linear Regression − It predicts numbers based on past data. For example, it helps estimate house prices in an area.

Logistic Regression − It predicts "yes/no" answers and is useful for spam detection and quality control.

Clustering − It groups similar data without instructions and helps to find patterns that humans might miss.

Decision Trees − They help to classify data and predict numbers using a tree-like structure. They are easy to check and understand.

Random Forests − They combine multiple decision trees to improve predictions.

Importance of Machine Learning

Data Processing − Machine learning is useful for analyzing large amounts of data from social media, sensors, and other sources, helping to reveal patterns and insights that improve decision-making.

Data-Driven Insights − Machine learning algorithms find trends and connections in big data that humans might miss, which helps make better decisions and predictions.

Automation − Machine learning automates repetitive tasks, reducing errors and saving time.

Personalization − Machine learning is useful for analyzing user preferences to provide personalized recommendations in e-commerce, social media, and streaming services.
Predictive Analytics − Machine learning models use past data to predict future outcomes, which helps with sales forecasts, risk management, and demand planning.

Pattern Recognition − Machine learning is useful for pattern recognition in image processing, speech recognition, and natural language processing.

Finance − Machine learning is used in credit scoring, fraud detection, and algorithmic trading.

Retail − Machine learning helps to enhance recommendation systems, supply chain management, and customer service.

Fraud Detection & Cybersecurity − Machine learning detects fraudulent transactions and security threats in real time.

Continuous Improvement − Machine learning models are updated regularly with new data, which allows them to adapt and improve over time.

Applications of Machine Learning

Machine learning is used in various fields. Some of the most common applications include:

Speech Recognition − Machine learning is used to convert spoken language into text using natural language processing (NLP).

Customer Service − Several chatbots are useful for reducing human interaction and providing better support on websites and social media, handling FAQs, giving recommendations, and assisting in e-commerce.

Computer Vision − It helps computers analyze images and videos in order to take action.

Recommendation Engines − ML recommendation engines suggest products, movies, or content based on user behavior.

Robotic Process Automation (RPA) − RPA uses AI to automate repetitive tasks and reduce manual work.

Automated Stock Trading − AI-driven trading platforms make rapid trades to optimize stock portfolios without human intervention.

How does Machine Learning work?

Decision Process − Based on the input data and output labels provided to the model, it produces a logic about the pattern it has identified.

Cost Function − It is the measure of error between the expected value and the predicted value.

Optimization Process − The cost function can be minimized by adjusting the weights at the training stage.

Machine Learning Vs. Deep Learning

Deep learning is a sub-field of machine learning. The actual difference between them is the way the algorithm learns.

In machine learning, computers learn from large datasets using algorithms to perform tasks like prediction and recommendation, whereas deep learning uses a complex structure of algorithms modeled on the human brain.

Deep learning models are more effective than machine learning models for complex problems.

Machine Learning Vs. Generative AI

Machine learning and Generative AI are different branches with different applications. While machine learning is used for predictive analysis and decision-making, Generative AI focuses on creating content, including realistic images and videos, from existing patterns.

What is Machine Learning Life Cycle?

The machine learning life cycle is an iterative process that moves from a business problem to a machine learning solution.

ML Life Cycle

Problem Definition

The first step in the machine learning life cycle is to identify the problem you want to solve. As this step lays the foundation for building a machine learning model, the problem definition has to be clear and concise. This stage involves understanding the business problem, defining the problem statement, and identifying the success criteria for the machine learning model.

Data Preparation

Data preparation is the process of preparing data for analysis by performing data exploration, feature engineering, and feature selection.
Data exploration involves visualizing and understanding the data, while feature engineering involves creating new features from the existing data. The data preparation process includes collecting data, preprocessing data, and feature engineering & feature selection.

Let's discuss each step involved in the data preparation phase of the machine learning life cycle.

1. Data Collection

After the problem statement is analyzed, the next step is collecting data. This involves gathering data from various sources, which is given as raw material to the machine learning model. A few factors that are considered while collecting data are −

Relevance and usefulness − The data collected has to be relevant to the problem statement, and should also be useful enough to train the machine learning model efficiently.

Quality and Quantity − The quality and quantity of the data collected directly impact the performance of the machine learning model.

Variety − Make sure that the data collected is diverse, so that the model can be trained on multiple scenarios to recognize the patterns.

There are various sources from which the data can be collected, like surveys, existing databases, and online platforms like Kaggle.

2. Data Preprocessing

The data collected often might be unstructured and messy, which negatively affects the outcomes; hence preprocessing the data is important to improve the accuracy and performance of the machine learning model. Issues that have to be addressed are missing values, duplicate data, invalid data, and noise.

3. Analyzing Data

After the data is all sorted, it is time to understand the data that has been collected. The data is visualized and statistically summarized to gain insights. Various tools like Power BI and Tableau are used to visualize data, which helps in understanding the patterns and trends in the data.

4. Feature Engineering and Selection

A 'feature' is an individual measurable quantity, preferably one observed while the machine learning model is being trained. Feature engineering is the process of creating new features or enhancing the existing ones to accurately capture the patterns and trends in the data. Feature selection is the process of picking the features that are consistent and most relevant to the problem statement.

Model Development

In the model development phase, the machine learning model is built using the prepared data. The model building process involves selecting the appropriate machine learning algorithm, training the algorithm, tuning its hyperparameters, and evaluating the performance of the model using cross-validation techniques. This phase mainly consists of three steps: model selection, model training, and model evaluation. Let's discuss these three steps in detail −

1. Model Selection

Model selection is a crucial step in the machine learning workflow. The decision of choosing a model depends on basic factors like the characteristics of the data, the complexity of the problem, the desired outcomes, and how well the model aligns with the defined problem.

2. Model Training

In this process, the algorithm is fed a preprocessed dataset to identify and understand the patterns and relationships in the specified features. Consistent training of a model by adjusting parameters improves the prediction rate and enhances accuracy.

3. Model Evaluation

In model evaluation, the performance of the machine learning model is evaluated using a set of evaluation metrics. These metrics measure the accuracy, precision, recall, and F1 score of the model.

Model Deployment

In the model deployment phase, we deploy the machine learning model into production. This process involves integrating the tested model with existing systems to make it available to users, management, or other purposes. It also involves testing the model in a real-world scenario. Two important factors that have to be checked before deploying are whether the model is portable, i.e., the software can be transferred from one machine to another, and scalable, i.e., the model need not be redesigned to maintain performance.
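As a concrete illustration of the model selection, training, and evaluation steps described above, the following minimal sketch trains a small classifier with scikit-learn and reports the metrics just mentioned (accuracy, precision, recall, F1). It assumes scikit-learn is installed and uses its built-in breast cancer dataset purely as a stand-in for a real business problem.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Data preparation: load a ready-made dataset and hold out a test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model selection and training: fit a decision tree on the training split
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Model evaluation: score the unseen test data with the metrics discussed above
y_pred = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))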
Monitor and Maintenance

Monitoring in machine learning involves techniques to measure the model performance metrics and to detect issues in the models. Sometimes, when an issue detected in the designed model cannot be solved by training it with new data, the issue becomes the new problem statement.

Best Fit Line in Linear Regression

In linear regression, the best-fit line is the straight line that most accurately represents the relationship between the independent variable (input) and the dependent variable (output). It is the line that minimizes the difference between the actual data points and the predicted values from the model.

2. Equation of the Best-Fit Line

For simple linear regression (with one independent variable), the best-fit line is represented by the equation

y = mx + b

Where:

y is the predicted value (dependent variable)

x is the input (independent variable)

m is the slope of the line (how much y changes when x changes)

b is the intercept (the value of y when x = 0)

The best-fit line is the one that optimizes the values of m (slope) and b (intercept) so that the predicted y values are as close as possible to the actual data points.

3. Minimizing the Error: The Least Squares Method

To find the best-fit line, we use a method called Least Squares. The idea behind this method is to minimize the sum of squared differences between the actual values (data points) and the predicted values from the line. These differences are called residuals. The formula for a residual is:

Residual = yᵢ − ŷᵢ

Where:

yᵢ is the actual observed value

ŷᵢ is the predicted value from the line for that data point

The least squares method minimizes the sum of the squared residuals:

Sum of squared errors (SSE) = Σ(yᵢ − ŷᵢ)²

This method ensures that the line best represents the data, i.e., the sum of the squared differences between the predicted values and the actual values is as small as possible.

Hypothesis function in Linear Regression

In linear regression, the hypothesis function is the equation used to make predictions about the dependent variable based on the independent variables. For a simple case with one independent variable, the hypothesis function is:

h(x) = β₀ + β₁x

Where:

h(x) (or ŷ) is the predicted value of the dependent variable (y).

x is the independent variable.

β₀ is the intercept, representing the value of y when x is 0.

β₁ is the slope, indicating how much y changes for each unit change in x.

For multiple linear regression (with more than one independent variable), the hypothesis function expands to:

h(x₁, x₂, ..., xₖ) = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ

Where:

x₁, x₂, ..., xₖ are the independent variables.

β₀ is the intercept.

β₁, β₂, ..., βₖ are the coefficients, representing the influence of each respective independent variable on the predicted output.

Types of Linear Regression

1. Simple Linear Regression

Simple linear regression is used when we want to predict a target value (dependent variable) using only one input feature (independent variable). It assumes a straight-line relationship between the two.

Formula: ŷ = θ₀ + θ₁x

Where:

ŷ is the predicted value

x is the input (independent variable)

θ₀ is the intercept (value of ŷ when x = 0)

θ₁ is the slope or coefficient (how much ŷ changes with one unit of x)

Example: Predicting a person's salary (y) based on their years of experience (x).
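A short sketch of the ideas above: the snippet below fits ŷ = θ₀ + θ₁x by ordinary least squares using NumPy on a small made-up salary-versus-experience dataset (the numbers are illustrative, not real data).

import numpy as np

# Illustrative data: years of experience (x) and salary in thousands (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([35, 40, 48, 55, 60, 68, 74, 81], dtype=float)

# Least squares estimates of the slope (theta1) and intercept (theta0)
x_mean, y_mean = x.mean(), y.mean()
theta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
theta0 = y_mean - theta1 * x_mean

y_pred = theta0 + theta1 * x
sse = np.sum((y - y_pred) ** 2)   # the sum of squared residuals being minimized

print(f"theta0 (intercept) = {theta0:.2f}")
print(f"theta1 (slope)     = {theta1:.2f}")
print(f"SSE                = {sse:.2f}")
print(f"Predicted salary for 10 years of experience: {theta0 + theta1 * 10:.1f}")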
2. Multiple Linear Regression

Multiple linear regression involves more than one independent variable and one dependent variable. The equation for multiple linear regression is:

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ

where:

ŷ is the predicted value

x₁, x₂, …, xₙ are the independent variables

θ₁, θ₂, …, θₙ are the coefficients (weights) corresponding to each predictor

θ₀ is the intercept.

The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent variables.

Cost function for Linear Regression

As discussed earlier for the best-fit line in linear regression, it is not easy to obtain in real-life cases, so we need to calculate the errors that affect it. In linear regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values ŷᵢ and the actual values yᵢ. The purpose is to determine the optimal values for the intercept θ₁ and the coefficient of the input feature θ₂ that provide the best-fit line for the given data points. The linear equation expressing this relationship is ŷᵢ = θ₁ + θ₂xᵢ.

The MSE cost function is calculated as:

Cost function (J) = (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²

Using the MSE function, the iterative process of gradient descent is applied to update the values of θ₁ and θ₂. This ensures that the MSE value converges to the global minimum, signifying the most accurate fit of the linear regression line to the dataset.

Decision Tree in Machine Learning

A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes, and leaf nodes. It works like a flowchart, helping to make decisions step by step, where:

Internal nodes represent attribute tests

Branches represent attribute values

Leaf nodes represent final decisions or predictions.

Information Gain and Gini Index in Decision Tree

1. Information Gain

Information Gain tells us how useful a question (or feature) is for splitting data into groups. It measures how much the uncertainty decreases after the split. A good question creates clearer groups, and the feature with the highest Information Gain is chosen to make the decision.

Gain(S, A) = Entropy(S) − Σ_{v∈A} (|S_v| / |S|) · Entropy(S_v)

Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.

Example: For the set X = {a, a, a, b, b, b, b, b}

Total instances: 8
Instances of a: 3
Instances of b: 5

Entropy H(X) = −[(3/8) log₂(3/8) + (5/8) log₂(5/8)] = −[0.375(−1.415) + 0.625(−0.678)] = −(−0.53 − 0.424) = 0.954

2. Gini Index

The Gini Index is a metric that measures how often a randomly chosen element would be incorrectly identified. An attribute with a lower Gini index should be preferred. The formula for the Gini Index is given by:

Gini = 1 − Σᵢ₌₁ⁿ pᵢ²

Types of Decision Tree Algorithms

There are six different decision tree algorithms, as shown in the diagram; they are listed below.

1. ID3 (Iterative Dichotomiser 3)

ID3 is a classic decision tree algorithm commonly used for classification tasks. It works by greedily choosing the feature that maximizes the information gain at each node.

Entropy: It measures impurity in the dataset. Denoted by H(D) for a dataset D, it is calculated using the formula:

H(D) = −Σᵢ₌₁ⁿ pᵢ log₂(pᵢ)

Information gain: It quantifies the reduction in entropy after splitting the dataset on a feature:

Information Gain = H(D) − Σ_{v=1}^{V} (|D_v| / |D|) H(D_v)

2. C4.5

C4.5 uses a modified version of information gain called the gain ratio to reduce the bias towards features with many values. The gain ratio normalizes the information gain by the split information, which measures the amount of data required to describe an attribute's values:

Gain Ratio = Information Gain / Split Information
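To make the splitting criteria above concrete, here is a small Python sketch that computes entropy, the Gini index, and information gain for label sequences; it reproduces the worked example of the set {a, a, a, b, b, b, b, b}. The helper functions are illustrative, not taken from any particular library.

import numpy as np
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    # Gini = 1 - sum(p^2) over the class proportions
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent, subsets):
    # Entropy of the parent minus the weighted entropy of the subsets after a split
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

X = list("aaabbbbb")               # 3 a's and 5 b's, as in the worked example
print(round(entropy(X), 3))        # ~0.954
print(round(gini(X), 3))           # ~0.469
# Splitting X into a pure 'a' group and a pure 'b' group gives the maximum possible gain
print(round(information_gain(X, [list("aaa"), list("bbbbb")]), 3))  # ~0.954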
3. CART (Classification and Regression Trees)

CART is a widely used decision tree algorithm for both classification and regression tasks. The feature that minimizes the Gini impurity is selected for splitting at each node. The formula is:

Gini(D) = 1 − Σᵢ₌₁ⁿ pᵢ²

where pᵢ is the probability of class i in dataset D.

4. CHAID (Chi-Square Automatic Interaction Detection)

CHAID uses chi-square tests to determine the best splits, especially for categorical variables. This approach is particularly useful for analyzing large datasets with many categorical features. The chi-square statistic formula is:

X² = Σ (Oᵢ − Eᵢ)² / Eᵢ

Where:

Oᵢ represents the observed frequency

Eᵢ represents the expected frequency in each category.

5. MARS (Multivariate Adaptive Regression Splines)

MARS is an extension of the CART algorithm. It uses splines to model non-linear relationships between variables.

Basis Functions: Each basis function in MARS is a simple piecewise linear function defined over a range of the predictor variable. The function is described as:

h(x) = x − t if x > t, and t − x if x ≤ t

Where

x is a predictor variable

t is the knot.

Knot Function: The knots are the points where the piecewise linear functions connect. MARS places these knots to best represent the data's non-linear structure.

Support Vector Machine (SVM) Algorithm

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It tries to find the best boundary, known as a hyperplane, that separates the different classes in the data. It is useful when you want to do binary classification, like spam vs. not spam or cat vs. dog.

The main goal of SVM is to maximize the margin between the two classes. The larger the margin, the better the model performs on new and unseen data.

Key Concepts of Support Vector Machine

Hyperplane: A decision boundary separating different classes in feature space, represented by the equation wx + b = 0 in linear classification.

Support Vectors: The closest data points to the hyperplane, crucial for determining the hyperplane and margin in SVM.

Margin: The distance between the hyperplane and the support vectors. SVM aims to maximize this margin for better classification performance.

Kernel: A function that maps data to a higher-dimensional space, enabling SVM to handle non-linearly separable data.

Hard Margin: A maximum-margin hyperplane that perfectly separates the data without misclassifications.

Soft Margin: Allows some misclassifications by introducing slack variables, balancing margin maximization and misclassification penalties when data is not perfectly separable.

C: A regularization term balancing margin maximization and misclassification penalties. A higher C value forces a stricter penalty for misclassifications.

Hinge Loss: A loss function penalizing misclassified points or margin violations; it is combined with regularization in SVM.

Dual Problem: Involves solving for the Lagrange multipliers associated with the support vectors, facilitating the kernel trick and efficient computation.

Mathematical Computation of SVM

Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training dataset consisting of input feature vectors X and their corresponding class labels Y.

The equation for the linear hyperplane can be written as:

wᵀx + b = 0

Where:

w is the normal vector to the hyperplane (the direction perpendicular to it).

b is the offset or bias term, representing the distance of the hyperplane from the origin along the normal vector w.

Distance from a Data Point to the Hyperplane

The distance between a data point xᵢ and the decision boundary can be calculated as:

dᵢ = (wᵀxᵢ + b) / ||w||

where ||w|| represents the Euclidean norm of the weight vector w.

Linear SVM Classifier

The predicted label of a data point is:

ŷ = 1 if wᵀx + b ≥ 0, and 0 otherwise

Where ŷ is the predicted label of a data point.

Optimization Problem for SVM

For a linearly separable dataset, the goal is to find the hyperplane that maximizes the margin between the two classes while ensuring that all data points are correctly classified. This leads to the following optimization problem:

minimize over w, b:  (1/2) ||w||²

Subject to the constraint:
yᵢ(wᵀxᵢ + b) ≥ 1  for i = 1, 2, 3, ⋯, m

Where:

yᵢ is the class label (+1 or −1) for each training instance.

xᵢ is the feature vector for the i-th training instance.

m is the total number of training instances.

The condition yᵢ(wᵀxᵢ + b) ≥ 1 ensures that each data point is correctly classified and lies outside the margin.

Soft Margin in Linear SVM Classifier

In the presence of outliers or non-separable data, the SVM allows some misclassification by introducing slack variables ζᵢ. The optimization problem is modified as:

minimize over w, b:  (1/2) ||w||² + C Σᵢ₌₁ᵐ ζᵢ

Subject to the constraints:

yᵢ(wᵀxᵢ + b) ≥ 1 − ζᵢ  and  ζᵢ ≥ 0  for i = 1, 2, …, m

Where:

C is a regularization parameter that controls the trade-off between margin maximization and the penalty for misclassifications.

ζᵢ are slack variables that represent the degree of violation of the margin by each data point.

Dual Problem for SVM

The dual problem involves maximizing over the Lagrange multipliers associated with the support vectors. This transformation allows solving the SVM optimization using kernel functions for non-linear classification. The dual objective function is given by:

maximize over α:  Σᵢ₌₁ᵐ αᵢ − (1/2) Σᵢ₌₁ᵐ Σⱼ₌₁ᵐ αᵢ αⱼ tᵢ tⱼ K(xᵢ, xⱼ)

Where:

αᵢ are the Lagrange multipliers associated with the i-th training sample.

tᵢ is the class label for the i-th training sample.

K(xᵢ, xⱼ) is the kernel function that computes the similarity between data points xᵢ and xⱼ. The kernel allows SVM to handle non-linear classification problems by mapping data into a higher-dimensional space.

The dual formulation optimizes the Lagrange multipliers αᵢ, and the support vectors are those training samples where αᵢ > 0.

SVM Decision Boundary

Once the dual problem is solved, the decision function is given by:

f(x) = Σᵢ₌₁ᵐ αᵢ tᵢ K(xᵢ, x) + b

Where x is the test data point and b is the bias term. Finally, the bias term b is determined by the support vectors, which satisfy:

tᵢ(wᵀxᵢ − b) = 1  ⇒  b = wᵀxᵢ − tᵢ

Where xᵢ is any support vector.

This completes the mathematical framework of the Support Vector Machine algorithm, which allows for both linear and non-linear classification using the dual problem and the kernel trick.

Types of Support Vector Machine

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided into two main parts:

Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of different classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data points into their respective classes.

Non-Linear SVM: Non-linear SVM can be used to classify data when it cannot be separated into two classes by a straight line (in the case of 2D). By using kernel functions, non-linear SVMs can handle non-linearly separable data.

Advantages of Support Vector Machine (SVM)

1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making it suitable for image classification and gene expression analysis.

2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM effectively handles nonlinear relationships.

3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers, enhancing robustness in spam detection and anomaly detection.

4. Binary and Multiclass Support: SVM is effective for both binary and multiclass classification, making it suitable for applications such as text classification.

5. Memory Efficiency: It focuses on support vectors, making it memory efficient compared to other algorithms.

Disadvantages of Support Vector Machine (SVM)

1. Slow Training: SVM can be slow for large datasets, affecting performance in data mining tasks.

2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters like C requires careful tuning, impacting SVM algorithms.

3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes, limiting its effectiveness in real-world scenarios.

4. Limited Interpretability: The complexity of the hyperplane in higher dimensions makes SVM less interpretable than other models.

5. Feature Scaling Sensitivity: Proper feature scaling is essential; otherwise, SVM models may perform poorly.
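A minimal scikit-learn sketch of the ideas above: it scales the features first (because of the scaling sensitivity just mentioned), then trains an SVC with an RBF kernel and a chosen C, and reports test accuracy. The dataset (iris) and the hyperparameter values are illustrative assumptions, not a recommendation.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale the features, then fit a soft-margin SVM with an RBF kernel.
# C sets the misclassification penalty; gamma shapes the RBF kernel.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm_clf.fit(X_train, y_train)

print("Test accuracy:", svm_clf.score(X_test, y_test))
print("Support vectors per class:", svm_clf.named_steps["svc"].n_support_)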
What is Kernel?

Instead of explicitly computing the transformation, the kernel computes the dot product of data points in the higher-dimensional space directly. This helps a model find patterns in complex data by implicitly transforming the data into a higher-dimensional space where it becomes easier to separate different classes or detect relationships.

Popular kernel functions in SVM

Radial Basis Function (RBF): Captures patterns in data by measuring the distance between points, and is ideal for circular or spherical relationships. It is widely used as it creates a flexible decision boundary.

Linear Kernel: Works for data that is linearly separable, without complex transformations.

Polynomial Kernel: Models more complex relationships using polynomial equations.

Sigmoid Kernel: Mimics neural network behavior using the sigmoid function and is suitable for specific non-linear problems.

K-Nearest Neighbor (KNN) Algorithm

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally used for classification, but it can also be used for regression tasks. It works by finding the "k" closest data points (neighbors) to a given input and makes a prediction based on the majority class (for classification) or the average value (for regression).

What is 'K' in K Nearest Neighbour?

In the k-Nearest Neighbours algorithm, k is just a number that tells the algorithm how many nearby points (neighbors) to look at when it makes a decision.

Example: Imagine you're deciding which fruit a new item is based on its shape and size. You compare it to fruits you already know.

If k = 3, the algorithm looks at the 3 closest fruits to the new one.

If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple, because most of its neighbors are apples.

Distance Metrics Used in KNN Algorithm

To identify the nearest neighbors, we use the distance metrics below:

1. Euclidean Distance

Euclidean distance is defined as the straight-line distance between two points in a plane or space.

distance(x, Xᵢ) = √( Σⱼ₌₁ᵈ (xⱼ − Xᵢⱼ)² )

2. Manhattan Distance

This is the total distance you would travel if you could only move along horizontal and vertical lines, like a grid or city streets. It's also called "taxicab distance" because a taxi can only drive along the grid-like streets of a city.

d(x, y) = Σᵢ₌₁ⁿ |xᵢ − yᵢ|

3. Minkowski Distance

Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan distances as special cases.

d(x, y) = ( Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)

From the formula above, when p = 2 it becomes the same as the Euclidean distance formula, and when p = 1 it turns into the Manhattan distance formula.

Relationship Between KNN Decision Boundaries and Voronoi Diagrams

In two-dimensional space, the decision boundaries of KNN can be visualized as Voronoi diagrams. Here's how:

KNN Boundaries: The decision boundary for KNN is determined by the regions where the predicted class changes based on the nearest neighbors; for small values of k these boundaries follow the Voronoi diagram boundaries of the training points.

Voronoi Diagram as a Special Case: When k = 1, KNN's decision boundaries directly correspond to the Voronoi diagram of the training points. Each region in the Voronoi diagram represents the area where a given training point is the closest one.
Ensemble Learning

Ensemble learning is a method where we use many small models instead of just one. Each of these models may not be very strong on its own, but when we put their results together, we get a better and more accurate answer. It's like asking a group of people for advice instead of just one person: each one might be a little wrong, but together they usually give a better answer.

Types of Ensemble Learning in Machine Learning

There are three main types of ensemble methods:

1. Bagging (Bootstrap Aggregating): Models are trained independently on different random subsets of the training data. Their results are then combined, usually by averaging (for regression) or voting (for classification). This helps reduce variance and prevents overfitting.

2. Boosting: Models are trained one after another. Each new model focuses on fixing the errors made by the previous ones. The final prediction is a weighted combination of all models, which helps reduce bias and improve accuracy.

3. Stacking (Stacked Generalization): Multiple different models (often of different types) are trained, and their predictions are used as inputs to a final model, called a meta-model. The meta-model learns how to best combine the predictions of the base models, aiming for better performance than any individual model.

1. Bagging Algorithm

A Bagging classifier can be used for both regression and classification tasks. Here is an overview of the Bagging classifier algorithm:

Bootstrap Sampling: Divides the original training data into 'N' subsets by randomly sampling rows with replacement, so some rows may appear in more than one subset. This step ensures that the base models are trained on diverse subsets of the data and that there is no class imbalance.

Base Model Training: For each bootstrapped sample, we train a base model independently on that subset of data. These weak models are trained in parallel to increase computational efficiency and reduce time consumption. We can use different base learners, i.e., different ML models, to bring variety and robustness.

Prediction Aggregation: To make a prediction on testing data, combine the predictions of all base models. For classification tasks this can be majority voting or weighted majority voting, while for regression it involves averaging the predictions.

Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of particular base models during the bootstrapping method. These "out-of-bag" samples can be used to estimate the model's performance without the need for cross-validation.

Final Prediction: After aggregating the predictions from all the base models, Bagging produces a final prediction for each instance.

2. Boosting Algorithm

Boosting is an ensemble technique that combines multiple weak learners to create a strong learner. Weak models are trained in series such that each next model tries to correct the errors of the previous model, until the entire training dataset is predicted correctly. One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting). Here is an overview of the Boosting algorithm:

Initialize Model Weights: Begin with a single weak learner and assign equal weights to all training examples.

Train Weak Learner: Train the weak learner on this weighted dataset.

Sequential Learning: Boosting works by training models sequentially, where each model focuses on correcting the errors of its predecessor. Boosting typically uses a single type of weak learner, like decision trees.

Weight Adjustment: Boosting assigns weights to training data points. Misclassified examples receive higher weights in the next iteration so that the next models pay more attention to them.

Benefits of Ensemble Learning in Machine Learning

Ensemble learning is a versatile approach that can be applied to machine learning models for:

Reduction in Overfitting: By aggregating the predictions of multiple models, ensembles can reduce the overfitting that individual complex models might exhibit.

Improved Generalization: It generalizes better to unseen data by minimizing variance and bias.

Increased Accuracy: Combining multiple models gives higher predictive accuracy.

Robustness to Noise: It mitigates the effect of noisy or incorrect data points by averaging out the predictions from diverse models.

Flexibility: It can work with diverse models, including decision trees, neural networks, and support vector machines, making it highly adaptable.

Bias-Variance Tradeoff: Techniques like bagging reduce variance, while boosting reduces bias, leading to better overall performance.
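As a rough sketch of the bagging and boosting procedures described above, the snippet below trains a BaggingClassifier and an AdaBoostClassifier from scikit-learn on the same illustrative synthetic dataset and compares their test accuracy. The dataset and hyperparameters are arbitrary demonstration choices, and the estimator argument name assumes scikit-learn 1.2 or newer (older versions call it base_estimator).

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

# Bagging: many trees trained on bootstrap samples, predictions combined by voting.
# oob_score=True reuses the out-of-bag samples to estimate performance.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                            oob_score=True, random_state=7)
bagging.fit(X_train, y_train)

# Boosting (AdaBoost): shallow trees trained sequentially, reweighting misclassified points.
boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=7)
boosting.fit(X_train, y_train)

print("Bagging  test accuracy:", bagging.score(X_test, y_test))
print("Bagging  OOB estimate :", bagging.oob_score_)
print("Boosting test accuracy:", boosting.score(X_test, y_test))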
Clustering in Machine Learning

Clustering is an unsupervised machine learning technique that groups similar data points together into clusters based on their characteristics, without using any labeled data. The objective is to ensure that data points within the same cluster are more similar to each other than to those in different clusters, enabling the discovery of natural groupings and hidden patterns in complex datasets.

Goal: Discover the natural grouping or structure in unlabeled data without predefined categories.

How: Data points are assigned to clusters based on similarity or distance measures.

Similarity Measures: Can include Euclidean distance, cosine similarity, or other metrics depending on the data type and clustering method.

Output: Each group is assigned a cluster ID, representing shared characteristics within the cluster.

Types of Clustering

Let's see the types of clustering.

1. Hard Clustering: In hard clustering, each data point strictly belongs to exactly one cluster; no overlap is allowed. This approach assigns a clear membership, making it easier to interpret and use for definitive segmentation tasks.

Example: If clustering customer data into 2 segments, each customer belongs fully to either Cluster 1 or Cluster 2, without partial memberships.

Use cases: Market segmentation, customer grouping, document clustering.

Limitations: Cannot represent ambiguity or overlap between groups; boundaries are crisp.

2. Soft Clustering: Soft clustering assigns each data point a probability or degree of membership to multiple clusters simultaneously, allowing data points to partially belong to several groups.

Example: A data point may have a 70% membership in Cluster 1 and 30% in Cluster 2, reflecting uncertainty or overlap in group characteristics.

Use cases: Situations with overlapping class boundaries, fuzzy categories like customer personas or medical diagnosis.

Benefits: Captures ambiguity in data and models gradual transitions between clusters.

Types of Clustering Methods

Clustering methods can be classified on the basis of how they form clusters.

1. Centroid-based Clustering (Partitioning Methods)

Centroid-based clustering organizes data points around central prototypes called centroids, where each cluster is represented by the mean (or medoid) of its members. The number of clusters is specified in advance, and the algorithm allocates points to the nearest centroid, making this technique efficient for spherical and similarly sized clusters but sensitive to outliers and initialization.

Algorithms:

K-means: Iteratively assigns points to the nearest centroid and recalculates centroids to minimize intra-cluster variance.

K-medoids: Similar to K-means but uses actual data points (medoids) as centers; robust to outliers.

2. Density-based Clustering

Density-based clustering defines clusters as contiguous regions of high data density separated by areas of lower density. This approach can identify clusters of arbitrary shapes, handles noise well, and does not require predefining the number of clusters, though its effectiveness depends on the chosen density parameters.

Algorithms:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points with sufficient neighbors; labels sparse points as noise.

OPTICS (Ordering Points To Identify Clustering Structure): Extends DBSCAN to handle varying densities.

3. Connectivity-based Clustering (Hierarchical Clustering)

Connectivity-based (or hierarchical) clustering builds nested groupings of data by evaluating how data points are connected to their neighbors. It creates a dendrogram, a tree-like structure that reflects relationships at various granularity levels, and does not require specifying the number of clusters in advance, but it can be computationally intensive.

Approaches:

Agglomerative (Bottom-up): Start with each point as a cluster; iteratively merge the closest clusters.

Divisive (Top-down): Start with one cluster; iteratively split it into smaller clusters.

4. Distribution-based Clustering

Distribution-based clustering assumes data is generated from a mixture of probability distributions, such as Gaussian distributions, and assigns points to clusters based on statistical likelihood. This method supports clusters with flexible shapes and overlaps, but it usually requires specifying the number of distributions.

Algorithm:

Gaussian Mixture Model (GMM): Fits data as a weighted mixture of Gaussian distributions; assigns data points based on likelihood.

5. Fuzzy Clustering

Fuzzy clustering extends traditional methods by allowing each data point to belong to multiple clusters with varying degrees of membership. This approach captures ambiguity and soft boundaries in the data and is particularly useful when clusters overlap or boundaries are not clear-cut.

Algorithm:

Fuzzy C-Means: Similar to K-means but with fuzzy memberships that are updated iteratively.

Use Cases

Customer Segmentation: Grouping customers based on behavior or demographics for targeted marketing and personalized services.

Anomaly Detection: Identifying outliers or fraudulent activities in finance, network security, and sensor data.

Image Segmentation: Dividing images into meaningful parts for object detection, medical diagnostics, or computer vision tasks.

Recommendation Systems: Clustering user preferences to recommend movies, products, or content tailored to different groups.

Market Basket Analysis: Discovering products frequently bought together to optimize store layouts and promotions.
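A small sketch of two of the clustering families above, using scikit-learn on synthetic 2-D blobs (the data and parameter values are illustrative): K-means needs the number of clusters up front, while DBSCAN infers it from density and can mark points as noise.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic 2-D data with three groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Centroid-based: the number of clusters (k = 3) is specified in advance
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Density-based: eps and min_samples define "dense"; the label -1 means noise
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

print("K-means cluster sizes:", [list(kmeans_labels).count(c) for c in set(kmeans_labels)])
print("DBSCAN clusters found:", len(set(dbscan_labels) - {-1}))
print("DBSCAN noise points  :", list(dbscan_labels).count(-1))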
REINFORCE Algorithm

REINFORCE is a method used in reinforcement learning to improve how decisions are made. It learns by trying actions and then adjusting the chances of those actions based on the total reward received afterward.

How REINFORCE Works

The REINFORCE algorithm works in the following steps:

Collect Episodes: The agent interacts with the environment for a fixed number of steps or until an episode is complete, following the current policy. This generates a trajectory consisting of states, actions, and rewards.

Calculate Returns: For each time step t, calculate the return Gₜ, which is the total reward obtained from time t onwards. Typically, this is the discounted sum of rewards:

Gₜ = Σ_{k=t}^{T} γ^{k−t} Rₖ

Where γ is the discount factor, T is the final time step of the episode, and Rₖ is the reward received at time step k.

Policy Gradient Update: The policy parameters θ are updated using the following formula:

θₜ₊₁ = θₜ + α ∇θ log πθ(aₜ | sₜ) Gₜ

Where:

α is the learning rate.

πθ(aₜ | sₜ) is the probability of taking action aₜ at state sₜ, according to the policy.

Gₜ is the return, or cumulative reward, obtained from time step t onwards.

The gradient ∇θ log πθ(aₜ | sₜ) represents how much the policy probability for action aₜ at state sₜ should be adjusted based on the obtained return.

Repeat: This process is repeated for several episodes, iteratively updating the policy in the direction of higher rewards.

Advantages of REINFORCE

Easy to Understand: REINFORCE is simple and easy to use, and a good way to start learning about how to improve decisions in reinforcement learning.

Directly Improves Decisions: It works by directly improving the way actions are chosen, which is helpful when there are many possible actions or choices.

Good for Tasks with Clear Endings: It works well when tasks have a clear finish and the agent gets a total reward at the end.

Challenges of REINFORCE

High Variance: One of the major issues with REINFORCE is its high variance. The gradient estimate is based on a single trajectory, and the return Gₜ can fluctuate significantly, making the learning process noisy and slow.

Sample Inefficiency: Since REINFORCE requires complete episodes to update the policy, it tends to be sample-inefficient. The agent may have to spend a lot of time trying things out before it gets helpful feedback to learn from.

Convergence Issues: Because the results can be very random and learning is slow, REINFORCE needs a lot of practice before it learns a good way to act.

Variants of REINFORCE

Several modifications to the original REINFORCE algorithm have been proposed to address its high variance:

Baseline: By subtracting a baseline value (typically the value function V(s)) from the return Gₜ, the variance of the gradient estimate can be reduced without affecting the expected gradient. This results in a variant known as REINFORCE with a baseline. The update rule becomes:

θₜ₊₁ = θₜ + α ∇θ log πθ(aₜ | sₜ) (Gₜ − bₜ)

Where bₜ is the baseline, such as the expected reward from state sₜ.

Actor-Critic: A method that uses two parts to learn better: the actor and the critic. The actor chooses what action to take, while the critic checks how good that action was and gives feedback. This helps to make learning more stable and faster by reducing random mistakes.

Applications of REINFORCE

REINFORCE has been applied in several domains:

Robotics: REINFORCE helps robots learn how to do things like picking up objects or moving around. The robot tries different actions and learns from what works well and what does not.

Game AI: It is used to teach game-playing agents, as in video games or board games like chess. The agent learns by playing the game many times and figuring out which moves led to a win.

Self-driving cars: REINFORCE can help improve how self-driving cars decide to drive safely and efficiently by rewarding good driving decisions.
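Below is a compact, self-contained sketch of the update steps described above, assuming a tabular softmax policy on a toy single-state, two-action environment. Everything here is an illustrative assumption; real implementations usually use a neural-network policy and an environment library such as Gymnasium.

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)           # logits of a softmax policy over two actions
alpha, gamma = 0.1, 0.99      # learning rate and discount factor

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run_episode(policy_probs, length=3):
    # Toy episode: a single state where action 0 pays reward 1 and action 1 pays 0
    actions, rewards = [], []
    for _ in range(length):
        a = rng.choice(2, p=policy_probs)
        actions.append(a)
        rewards.append(1.0 if a == 0 else 0.0)
    return actions, rewards

for episode in range(500):
    probs = softmax(theta)                      # policy used to generate this episode
    actions, rewards = run_episode(probs)
    T = len(rewards)
    grad = np.zeros_like(theta)
    for t in range(T):
        # Return G_t: discounted sum of rewards from step t to the end of the episode
        G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        # Gradient of log softmax(theta)[a_t] is one_hot(a_t) - probs
        g = -probs.copy()
        g[actions[t]] += 1.0
        grad += g * G_t
    theta += alpha * grad                       # theta <- theta + alpha * sum_t grad_log_pi * G_t

print("Learned action probabilities:", softmax(theta))  # typically ends up strongly favoring action 0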
Principal Component Analysis (PCA)

PCA (Principal Component Analysis) is a dimensionality reduction technique used in data analysis and machine learning. It helps you reduce the number of features in a dataset while keeping the most important information. It transforms your original features into new features; these new features don't overlap with each other, and the first few keep most of the important differences found in the original data.

Step 1: Standardize the Data

Different features may have different units and scales, like salary vs. age. To compare them fairly, PCA first standardizes the data by making each feature have:

A mean of 0

A standard deviation of 1

Z = (X − μ) / σ

where:

μ is the mean of the independent features, μ = {μ₁, μ₂, ⋯, μₘ}

σ is the standard deviation of the independent features, σ = {σ₁, σ₂, ⋯, σₘ}

Step 2: Calculate the Covariance Matrix

Next, PCA calculates the covariance matrix to see how features relate to each other, i.e., whether they increase or decrease together. The covariance between two features x₁ and x₂ is:

cov(x₁, x₂) = Σᵢ₌₁ⁿ (x₁ᵢ − x̄₁)(x₂ᵢ − x̄₂) / (n − 1)

Where:

x̄₁ and x̄₂ are the mean values of features x₁ and x₂

n is the number of data points

The value of the covariance can be positive, negative, or zero.

Step 3: Find the Principal Components

PCA identifies the new axes where the data spreads out the most:

1st Principal Component (PC1): The direction of maximum variance (most spread).

2nd Principal Component (PC2): The next best direction, perpendicular to PC1, and so on.

These directions come from the eigenvectors of the covariance matrix, and their importance is measured by eigenvalues. For a square matrix A, an eigenvector X (a non-zero vector) and its corresponding eigenvalue λ satisfy:

AX = λX

This means:

When A acts on X, it only stretches or shrinks X by the scalar λ.

The direction of X remains unchanged; hence eigenvectors define the "stable directions" of A.

Eigenvalues help rank these directions by importance.

Step 4: Pick the Top Directions & Transform Data

After calculating the eigenvalues and eigenvectors, PCA ranks them by the amount of information they capture. We then:

1. Select the top k components that capture most of the variance, for example 95%.

2. Transform the original dataset by projecting it onto these top components.

Advantages of Principal Component Analysis

1. Multicollinearity Handling: Creates new, uncorrelated variables to address issues when original features are highly correlated.

2. Noise Reduction: Eliminates components with low variance, enhancing data clarity.

3. Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing.

4. Outlier Detection: Identifies unusual data points by showing which ones deviate significantly in the reduced space.

Disadvantages of Principal Component Analysis

1. Interpretation Challenges: The new components are combinations of the original variables, which can be hard to explain.

2. Data Scaling Sensitivity: Requires proper scaling of the data before application, or the results may be misleading.

3. Information Loss: Reducing dimensions may lose some important information if too few components are kept.

4. Assumption of Linearity: Works best when relationships between variables are linear, and may struggle with non-linear data.

5. Computational Complexity: Can be slow and resource-intensive on very large datasets.

6. Risk of Overfitting: Using too many components or working with a small dataset might lead to models that don't generalize well.
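To tie the four steps together, here is a minimal NumPy sketch that standardizes a small illustrative dataset, computes the covariance matrix and its eigenvectors, and projects the data onto the top k components. Up to sign conventions, this mirrors what a library implementation such as sklearn.decomposition.PCA would produce.

import numpy as np

# Illustrative data: 6 samples, 3 correlated features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# Step 1: standardize (mean 0, standard deviation 1 per feature)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvalues/eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]               # sort directions by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top k directions and project the data onto them
k = 2
explained = eigvals[:k].sum() / eigvals.sum()
X_reduced = Z @ eigvecs[:, :k]

print("Explained variance ratio of top", k, "components:", round(explained, 3))
print("Reduced data shape:", X_reduced.shape)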