MACHINE LEARNING
BY: ASST. PROF. SWETA RATHOD (CSE-IITE)
CONTENT
❏ Loss Functions and Generalization
❏ Parametric v/s Non-parametric Methods
❏ Evaluating Machine Learning Algorithms & Model
Selection
❏ Statistical Learning Theory
❏ Ensemble Learning
Loss Functions and Generalization
A loss function is a mathematical function that quantifies the difference
between predicted and actual values in a machine learning model.
It measures the model's performance and guides the optimization process
by providing feedback on how well it fits the data.
n/m — number of training samples
i — ith training sample in a data set
y(i) — Actual value for the ith training sample
y_hat(i) — Predicted value for the ith training sample
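One common loss under this notation is the mean squared error, MSE = (1/n) Σ (y(i) − y_hat(i))². A minimal Python sketch (illustrative, not from the slides; the sample values are assumptions):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: average of (y(i) - y_hat(i))**2 over the n samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical actual vs. predicted values for four training samples
print(mse_loss([3.0, 5.0, 2.5, 7.0], [2.8, 5.4, 2.0, 6.5]))  # 0.175
```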
Generalization refers to your model's ability to adapt properly to new,
previously unseen data, drawn from the same distribution as the one used
to create the model.
Parametric v/s Non-parametric Methods
Parametric Methods:
❏ Use a fixed number of parameters to build the model.
❏ Require less data than non-parametric methods.
❏ Possibility of underfitting the data set.
❏ Computationally faster than non-parametric methods because of comparatively less data.
❏ Examples: Logistic Regression, Naïve Bayes model, etc. Non-tech example: travel expense (train), outing expense.
Non-Parametric Methods:
❏ Use a flexible number of parameters to build the model.
❏ Require much more data than parametric methods.
❏ Possibility of overfitting the data set.
❏ Computationally slower than parametric methods because of comparatively more data.
❏ Examples: KNN, Decision Tree model, etc. Non-tech example: travel expense (bus/train/car/flight, etc.).
Evaluating Machine Learning Algorithms &
Model Selection
❏ Evaluating Machine Learning Algorithms through
❏ Accuracy
❏ Error rate
❏ Model selection depends on
❏ Nature of the data
❏ Sample size
❏ Intended application of the results.
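A minimal sketch of the two evaluation measures above, accuracy and error rate (the label arrays here are assumed purely for illustration):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # actual labels
y_pred = np.array([1, 0, 0, 1, 0, 1])   # a model's predictions

accuracy = np.mean(y_true == y_pred)    # fraction predicted correctly
error_rate = 1 - accuracy               # fraction predicted incorrectly

print("Accuracy  :", accuracy)          # 5/6 ≈ 0.833
print("Error rate:", error_rate)        # 1/6 ≈ 0.167
```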
Statistical Learning Theory
❏ Measures for quantitative data are
❏ Mean
❏ Median
❏ Mode
❏ Percentile
❏ Range
❏ Bias
❏ Variance
❏ Standard deviation
❏ Measures of distribution shape are
❏ Skewness
❏ Kurtosis
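A short NumPy/SciPy sketch (the data vector is an assumed example) computing the measures listed above; bias is omitted because it needs a known true parameter to compare an estimator against:

```python
import numpy as np
from scipy import stats

data = np.array([12, 15, 15, 18, 20, 22, 25, 30, 35, 50])  # assumed sample

values, counts = np.unique(data, return_counts=True)
print("Mean     :", np.mean(data))
print("Median   :", np.median(data))
print("Mode     :", values[np.argmax(counts)])
print("90th pct :", np.percentile(data, 90))
print("Range    :", np.ptp(data))           # max - min
print("Variance :", np.var(data, ddof=1))   # sample variance
print("Std. dev.:", np.std(data, ddof=1))
print("Skewness :", stats.skew(data))
print("Kurtosis :", stats.kurtosis(data))
```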
Ensemble Methods
❏ Ensemble learning is a machine learning paradigm where multiple
models (often called “weak learners”) are trained to solve the
same problem and combined to get better results.
❏ Ensemble learning in machine learning combines multiple
individual models to create a stronger, more accurate predictive
model. By leveraging the diverse strengths of different models,
ensemble learning aims to mitigate errors, enhance performance,
and increase the overall robustness of predictions, leading to
improved results across various tasks in machine learning and
data analysis.
Ensemble Methods
❏ Why Use Ensemble Techniques?
❏ Reduce Overfitting: By averaging multiple models, ensemble
techniques can smooth out irregularities that may cause
overfitting in a single model.
❏ Improve Accuracy: Combining multiple weak models leads to a
more accurate prediction.
❏ Increase Stability: Ensembles are more robust to variations in
the training data and less sensitive to noise.
Ensemble Methods : Types
❏ There are mainly four types of ensemble methods
1. Voting
2. Bagging
3. Boosting
4. Stacking
❏ Base learners / weak learners form the first level of an ensemble learning
architecture, and each one of them is trained to make individual predictions.
❏ Meta learners / strong learners, on the other hand, sit in the second level,
and they are trained on the outputs of the base learners.
Ensemble Methods : Types
❏ Types of Ensemble Learning:
❏ Bagging: A homogeneous weak learners’ model in which the learners are trained
independently of each other in parallel, and their outputs are combined (e.g.,
averaged) to determine the final model prediction. [Example: Random Forest]
❏ Boosting: Also a homogeneous weak learners’ model, but it works differently
from Bagging. Here, the learners are trained sequentially and adaptively, each
one improving on the predictions of the previous ones.
[Example: AdaBoost, Gradient Boosting, Extreme Gradient Boosting (XGBoost)]
Ensemble Methods : Types
❏ Bootstrap Aggregating, also known as bagging, is a machine
learning ensemble meta-algorithm designed to improve the stability
and accuracy of machine learning algorithms used in statistical
classification and regression.
❏ It decreases the variance and helps to avoid overfitting.
❏ It is usually applied to decision tree methods.
Ensemble Methods : Voting
❏ The Voting technique involves combining predictions from multiple
models and making a final decision based on the majority vote. In
voting, the base models are always of different types.
❏ There are three main types of voting:
1. Hard Voting
2. Soft Voting
3. Weighted Voting
Ensemble Methods : Voting
1. Hard Voting: Final prediction based on majority votes.
2. Soft Voting: Each model provides a probability estimate for each
class, and the final prediction is based on the average probabilities
3. Weighted Voting: In weighted voting, some models are assumed to
have more skill than others, and those models are given more weight
when making the prediction.
Ensemble Methods : Voting
❏ Example : Suppose we have three classifiers: Model 1, Model 2,
and Model 3, each trained on a dataset to predict whether an email
is spam or not.
Ensemble Methods : Voting
❏ Model 1 predicts “Spam”
❏ Model 2 predicts “Spam”
❏ Model 3 predicts “Not Spam”
❏ In hard voting, the majority class is “Spam,” so the ensemble’s
final prediction is “Spam.”
Ensemble Methods : Voting
❏ In soft voting, if the models provide probability estimates:
❏ Model 1: Prob(Spam) = 0.8, Prob(Not Spam) = 0.2
❏ Model 2: Prob(Spam) = 0.9, Prob(Not Spam) = 0.1
❏ Model 3: Prob(Spam) = 0.3, Prob(Not Spam) = 0.7
❏ The average probabilities are calculated, and since the class with
the highest average probability is “Spam,” the final prediction is
“Spam.”
Ensemble Methods : Voting
In Weighted Voting, each model’s vote is given a specific weight, and the final
prediction is based on the weighted sum of the votes.
❏ Model A: Prob(Spam) = 0.8, Prob(Not Spam) = 0.2 (Weight = 0.5)
❏ Model B: Prob(Spam) = 0.9, Prob(Not Spam) = 0.1 (Weight = 0.2)
❏ Model C: Prob(Spam) = 0.6, Prob(Not Spam) = 0.4 (Weight = 0.3)
The weighted average probabilities for each class are calculated as follows:
❏ Weighted Prob(Spam) = (0.5 * 0.8) + (0.2 * 0.9) + (0.3 * 0.6) = 0.76
❏ Weighted Prob(Not Spam) = (0.5 * 0.2) + (0.2 * 0.1) + (0.3 * 0.4) = 0.24
The final prediction is “Spam” because it has the higher weighted probability.
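A minimal NumPy sketch (illustrative) reproducing the arithmetic of the weighted-voting example above, with a plain soft-voting average shown for comparison:

```python
import numpy as np

# Rows = models A, B, C; columns = [P(Spam), P(Not Spam)]
probs = np.array([[0.8, 0.2],
                  [0.9, 0.1],
                  [0.6, 0.4]])
weights = np.array([0.5, 0.2, 0.3])
classes = np.array(["Spam", "Not Spam"])

soft_avg = probs.mean(axis=0)          # soft voting: plain average of probabilities
weighted_avg = weights @ probs         # weighted voting: weighted average

print("Soft voting     ->", classes[soft_avg.argmax()], soft_avg)
print("Weighted voting ->", classes[weighted_avg.argmax()], weighted_avg)  # [0.76, 0.24]
```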
Ensemble Methods : Bagging
❏ Bagging is also known as Bootstrap Aggregation. It is based on parallel
processing that involves training multiple instances of the same base
model on different subsets of the training data, promoting diversity
among the base models.
❏ In bagging, the dataset is divided into N different samples, and N
models are trained using these samples, resulting in a final prediction
that is typically obtained by aggregating the predictions of individual
models.
Ensemble Methods : Bagging
❏ Bagging is an ensemble technique that aims to reduce variance and prevent
overfitting by training multiple models independently and then averaging their
predictions. Bagging works by generating different training datasets through
bootstrapping and training a separate model on each of these datasets.
❏ Bootstrap aggregation, or bagging, is a popular ensemble learning technique used in
machine learning to improve the accuracy and stability of classification and
regression models. The basic idea behind bagging is to train multiple models using
random subsets of the training data and then combine their predictions to reduce
variance and improve the overall accuracy of the model.
❏ Bootstrapping
❏ During bootstrapping, we can sample from both the rows and the columns of our
dataset, depending on the ensemble method being used.
Ensemble Methods : Bagging
❏ For regression problems, this often involves taking the average of
the predictions, while for classification, it may involve a majority
vote.
❏ Random Forest is a popular algorithm that leverages bagging with
decision trees as base models.
Ensemble Methods : Bagging [Random Forest]
❏ Bootstrap Aggregating
❏ Rows are resampled with replacement during bootstrapping.
❏ Majority Voting / Aggregating (majority voting in the case of
classification and mean/median in the case of regression).
❏ Decision Tree: if expanded to its full depth, a single tree has
low bias and high variance. Using a bagging technique like Random
Forest, we get lower variance than a single decision tree (due to
aggregation / majority voting).
❏ Random Forest is a classifier that contains
a number of decision trees on various
subsets of the given dataset and takes the
average to improve the predictive
accuracy of that dataset.
Ensemble Methods : Bagging
❏ Random forests do not require a validation dataset. Most random forests use a
technique called out-of-bag-evaluation (OOB evaluation) to evaluate the
quality of the model. OOB evaluation treats the training set as if it were
the test set of a cross-validation.
❏ Each decision tree in a random forest is typically trained on ~67% of the
training examples. Therefore, each decision tree does not see ~33% of the
training examples.
❏ The core idea of OOB-evaluation is as follows:
❏ To evaluate the random forest on the training set.
❏ For each example, only use the decision trees that did not see the
example during training.
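A hedged scikit-learn sketch of OOB evaluation (the dataset is synthetic, used only for illustration): setting oob_score=True scores each training example using only the trees that did not see it during training.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real training set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

# Accuracy estimated from out-of-bag samples, with no separate validation set
print("OOB accuracy estimate:", forest.oob_score_)
```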
Ensemble Methods : Bagging
❏ Bagging (Bootstrap Aggregating) is an ensemble learning technique
designed to improve the accuracy and stability of machine learning
algorithms.
❏ It involves the following steps:
❏ Data Sampling: Creating multiple subsets of the training dataset using
bootstrap sampling (random sampling with replacement).
❏ Model Training: Training a separate model on each subset of the data.
❏ Aggregation: Combining the predictions from all individual models
(averaged for regression or majority voting for classification) to produce the
final output.
❏ Example: Random Forests (an extension of bagging applied to decision trees).
Ensemble Methods : Bagging
Bagging technique general workflow
Ensemble Methods : Bagging
How Bagging Works?
1. Bootstrap Sampling: Randomly sample data from the training
set with replacement to create multiple subsets (bootstrap
samples). Some data points may be repeated, while others
might be omitted in each sample.
2. Train Models: A separate model (usually a weak learner like a
decision tree) is trained on each bootstrap sample.
3. Aggregate Predictions: For classification tasks, the predictions
from all the models are combined using majority voting. For
regression tasks, the models’ predictions are averaged.
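A from-scratch sketch of these three steps (bootstrap sampling, training one tree per sample, majority-vote aggregation); the synthetic dataset and the decision-tree base learner are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
models = []
for _ in range(25):                                    # 25 bootstrap samples
    idx = rng.integers(0, len(X_train), len(X_train))  # rows sampled with replacement
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Majority vote across the trees (labels are 0/1, so a mean above 0.5 means class 1 wins)
all_preds = np.array([m.predict(X_test) for m in models])
majority = (all_preds.mean(axis=0) > 0.5).astype(int)
print("Bagged accuracy:", np.mean(majority == y_test))
```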
Ensemble Methods : Bagging
Key Benefits:
● Reduces Variance: By averaging multiple predictions, bagging
reduces the variance of the model and helps prevent overfitting.
● Improves Accuracy: Combining multiple models usually leads to
better performance than individual models.
Ensemble Methods : Bagging
Advantages of Bagging:
● Reduces variance: By averaging the predictions of multiple models,
bagging reduces the model’s variance, making it less likely to
overfit.
● Parallelizable: Since each model is trained independently, bagging
can be parallelized easily.
Disadvantages of Bagging:
● Less effective on bias: Bagging primarily reduces variance but
does not significantly reduce bias if the individual models are weak.
Ensemble Methods : Boosting
❏ Boosting improves a model’s predictive accuracy and performance
by converting multiple weak learners into a single strong learning
model. Machine learning models can be weak learners or strong learners.
❏ Weak learners
❏ Weak learners have low prediction accuracy, similar to random
guessing. They are prone to overfitting—that is, they can't
classify data that varies too much from their original dataset.
❏ For example : if you train the model to identify cats as an animal
with pointed ears, it might fail to recognize a cat whose ears are
curled.
Ensemble Methods : Boosting
❏ Strong learners
❏ Strong learners have higher prediction accuracy. Boosting converts a
system of weak learners into a single strong learning system.
❏ For example : to identify the cat image, it combines a weak learner that
guesses for pointed ears and another learner that guesses for cat-shaped
eyes. After analyzing the animal image for pointed ears, the system
analyzes it once again for cat-shaped eyes. This improves the system's
overall accuracy.
❏ Boosting is another ensemble technique that focuses on reducing both bias
and variance by training models sequentially, where each subsequent model
attempts to correct the errors of the previous ones. Unlike bagging, where
models are trained independently, boosting builds models iteratively.
Ensemble Methods : Boosting
❏ Boosting follows a sequential approach, where the prediction of the
current model influences the next one. Each model in the boosting
process iteratively concentrates on observations that were
misclassified by the previous model. This iterative focus on the
difficult-to-classify instances helps improve the overall
predictive performance of the ensemble.
❏ Popular boosting algorithms include AdaBoost, Gradient Boosting,
and XGBoost.
Ensemble Methods : Boosting
❏ AdaBoost: Short for Adaptive Boosting, it was one of the first
successful boosting algorithms. It uses decision trees as weak
learners and dynamically adjusts weights based on errors.
❏ Gradient Boosting: This family of algorithms builds models in a
stage-wise fashion, optimizing an objective function using gradient
descent. It’s highly versatile and can handle various tasks.
❏ XGBoost: Extreme Gradient Boosting is a highly efficient and
scalable implementation of gradient boosting. It’s known for its
speed, accuracy, and ability to handle large datasets.
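A hedged scikit-learn sketch of the first two algorithms above (XGBoost lives in the separate xgboost package and is not shown); the synthetic data and parameter values are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# AdaBoost: decision stumps by default, sample weights adjusted after each round
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Gradient Boosting: trees added stage-wise to follow the gradient of the loss
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=1).fit(X_train, y_train)

print("AdaBoost accuracy         :", accuracy_score(y_test, ada.predict(X_test)))
print("Gradient Boosting accuracy:", accuracy_score(y_test, gbm.predict(X_test)))
```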
Ensemble Methods : Boosting
❏ What is Boosting?
❏ Boosting is another ensemble learning technique that focuses on creating a
strong model by combining several weak models.
❏ It involves the following steps:
❏ Sequential Training: Training models sequentially, each one trying to correct
the errors made by the previous models.
❏ Weight Adjustment: Each instance in the training set is weighted. Initially, all
instances have equal weights. After each model is trained, the weights of
misclassified instances are increased so that the next model focuses more on
difficult cases.
❏ Model Combination: Combining the predictions from all models to produce the
final output, typically by weighted voting or weighted averaging.
Ensemble Methods : Boosting
❏ How boosting works?
1. Train a base model on the entire dataset.
2. Identify and assign weights to misclassified instances.
3. Train the next model, emphasizing misclassified instances based
on weights.
4. Update instance weights, giving higher importance to previously
misclassified samples.
5. Repeat steps 3–4 until the desired number of models is reached.
6. Combine predictions by giving higher weight to models with better
performance.
7. Achieve a boosted ensemble model with improved accuracy,
particularly on challenging instances.
Ensemble Methods : Boosting(AdaBoost)
Instance   F1     F2    Assigned Weight   Output
1          1000   450   1/5               YES
2          2000   475   1/5               YES
3          3000   545   1/5               NO
4          4000   756   1/5               NO
5          5000   898   1/5               YES
Ensemble Methods : Boosting(AdaBoost)
❏ How weights are increased in Boosting:
❏ Step 1: Initialize weights.
❏ Step 2: Train weak learners.
❏ Step 3: Calculate the total error.
❏ Step 4: Calculate the weak learner's weight (performance).
❏ Step 5: Calculate the new instance weights (NW):
NW = old weight × e^(±performance)
❏ REPEAT from Step 2.
❏ Step 6: Final prediction.
Ensemble Methods : Boosting(AdaBoost)
❏ Step 1: Assign equal weights to all observations.
Sample weight = 1/N, where N = total number of records.
❏ Step 2: Classify random samples using stumps.
❏ Step 3: Calculate the total error.
Total Error = sum of the weights of the misclassified records.
❏ Step 4: Calculate the performance of the stump.
Performance = ½ × ln((1 − Total Error) / Total Error)
❏ Step 5: Update the weights of the correctly and incorrectly classified points.
New weight for misclassified records = weight × e^(Performance)
New weight for correctly classified records = weight × e^(−Performance)
❏ Step 6: Normalize the updated weights and repeat for the next iteration.
❏ Step 7: Final Prediction.
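A small numeric sketch of Steps 1–5 using the five equally weighted records from the earlier table; which record the first stump misclassifies is an assumption made for illustration:

```python
import numpy as np

n = 5
weights = np.full(n, 1 / n)                                    # Step 1: sample weight = 1/N
misclassified = np.array([False, False, True, False, False])  # assumed stump errors

total_error = weights[misclassified].sum()                    # Step 3: 0.2
performance = 0.5 * np.log((1 - total_error) / total_error)   # Step 4: ~0.693

# Step 5: raise the weights of misclassified records, lower the rest, then normalize
new_w = np.where(misclassified,
                 weights * np.exp(performance),
                 weights * np.exp(-performance))
new_w /= new_w.sum()

print("Stump performance:", round(performance, 3))   # 0.693
print("Updated weights  :", np.round(new_w, 3))      # [0.125 0.125 0.5 0.125 0.125]
```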
Ensemble Methods : Boosting
❏ Example of Boosting:
❏ Imagine a boosting model is trained to predict house prices. The
first weak learner might underperform on large houses, but the
second weak learner focuses on improving predictions for these
large houses. The process continues, with each new model
focusing on areas where previous models made errors.
Ensemble Methods : Boosting
❏ Key Benefits:
❏ Reduces Bias: By focusing on hard-to-classify instances, boosting
reduces bias and improves the overall model accuracy.
❏ Produces Strong Predictors: Combining weak learners leads to a
strong predictive model.
Ensemble Methods : Boosting
Advantages of Boosting:
❏ Reduces bias and variance: Boosting reduces both the bias and variance
of the model, resulting in highly accurate predictions.
❏ Works well with weak learners: Even weak models, like shallow decision
trees, can be combined through boosting to create a powerful predictor.
Disadvantages of Boosting:
❏ Sensitive to outliers: Since boosting focuses on correcting errors, it can
place too much emphasis on noisy data points or outliers.
❏ Computationally expensive: Boosting is a sequential process, making it
slower than parallelizable methods like bagging.
Ensemble Methods : Stacking
❏ Stacking is an ensemble learning technique that begins by
generating first-level individual learners from the training dataset,
utilizing various learning algorithms.
❏ The predictions from the base learners are stacked together and
are used as the input to train the meta learner to produce more
robust predictions. The meta learner is then used to make final
predictions.
❏ Stacking (Stacked Generalization) is an ensemble learning
technique that aims to combine multiple models to improve
predictive performance.
Ensemble Methods : Stacking
❏ It combines predictions from multiple (base-level) models to build a
new model (meta-model). This meta-model is used for making
predictions on the test set.
❏ It involves the following steps:
1. Base Models: Training multiple models (level-0 models) on the
same dataset.
❏ Base-level algorithms are trained on the complete training
dataset using k-fold cross-validation.
Ensemble Methods : Stacking
2. Meta-Model: Training a new model (level-1 or meta-model) to
combine the predictions of the base models. Using the predictions
of the base models as input features for the meta-model.
❏ The meta-model is trained on the combined predictions of all
base-level models, which serve as its input features.
❏ Stacking is useful when the results of the individual algorithms are
very different.
Ensemble Methods : Stacking
Key Benefits:
● Leverages Model Diversity: By combining different types of models, stacking
can capture a wide range of patterns in the data.
● Improves Performance: The meta-model learns how to best combine the
predictions from the base models, often leading to improved performance over
individual models.
Example Process:
● Train several base models (e.g., decision trees, neural networks, SVMs) on the
training data.
● Use the predictions of these base models to create a new dataset.
● Train a meta-model (e.g., linear regression, logistic regression) on this new
dataset to make the final predictions.
Ensemble Methods : Stacking
❏ How stacking works?
1. We use initial training data to train m-number of algorithms.
2. Using the output of each algorithm, we create a new training set.
3. Using the new training set, we create a meta-model algorithm.
4. Using the results of the meta-model, we make the final prediction.
The results are combined using weighted averaging.
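A hedged scikit-learn sketch of this workflow with StackingClassifier, which generates the base models' predictions via internal cross-validation; the choice of base and meta learners and the synthetic data are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Level-0 base learners (deliberately different model families)
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=7)),
    ("svm", SVC(probability=True, random_state=7)),
]

# Level-1 meta-learner trained on the base learners' cross-validated predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)
print("Stacking test accuracy:", stack.score(X_test, y_test))
```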
Advantages of Ensemble Learning
❏ Improved Accuracy: Combining multiple models often results in
higher predictive performance compared to individual models.
❏ Robustness: Ensembles are less affected by outliers and data
noise, enhancing model generalization.
❏ Reduced Overfitting: Ensemble methods help mitigate overfitting
by combining diverse models trained on different subsets of data.
❏ Versatility: Applicable to various machine learning problems,
including classification and regression.
Disadvantages of Ensemble Learning
❏ Complexity: Ensembles may introduce complexity, making them
more challenging to interpret and analyze compared to individual
models.
❏ Computational Resources: Training multiple models can be
computationally intensive, especially for large datasets.
Comparison of Ensemble Learning Methods
Random Forest Algorithm