
Selection of model

How to Train ML Models


At last, it’s time to build our models!
It might seem like it took us a while to get here, but professional data scientists actually spend the bulk of their time on the steps leading up to this one:
1. Exploring the data.
2. Cleaning the data.
3. Engineering new features.
Again, that’s because better data beats fancier algorithms.
In this lesson, you'll learn how to set up the entire modeling process to maximize performance
while safeguarding against overfitting. We will swap algorithms in and out and automatically
find the best parameters for each one.

Split Dataset
Let’s start with a crucial but sometimes overlooked step: Spending your data.
Think of your data as a limited resource.
You can spend some of it to train your model (i.e. feed it to the algorithm).
You can spend some of it to evaluate (test) your model.
But you can’t reuse the same data for both!
If you evaluate your model on the same data you used to train it, your model could be very
overfit and you wouldn’t even know! A model should be judged on its ability to predict new,
unseen data.
Therefore, you should have separate training and test subsets of your dataset.

Training sets are used to fit and tune your models. Test sets are put aside as "unseen" data to
evaluate your models.
You should always split your data before doing anything else.
This is the best way to get reliable estimates of your models’ performance.
After splitting your data, don’t touch your test set until you’re ready to choose your final
model!
Comparing test vs. training performance allows us to avoid overfitting... If the model performs
very well on the training data but poorly on the test data, then it’s overfit.
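Here is a minimal sketch of this split using scikit-learn (X and y are placeholders for an already-prepared feature matrix and target vector):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)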
What are Hyperparameters?
So far, we’ve been casually talking about "tuning" models, but now it’s time to treat the topic
more formally.
When we talk of tuning models, we specifically mean tuning hyperparameters.
There are two types of parameters in machine learning algorithms.
The key distinction is that model parameters can be learned directly from the training data while
hyperparameters cannot.

Model parameters
Model parameters are learned attributes that define individual models.
e.g. regression coefficients
e.g. decision tree split locations
They can be learned directly from the training data

Hyperparameters
Hyperparameters express "higher-level" structural settings for algorithms.
e.g. strength of the penalty used in regularized regression
e.g. the number of trees to include in a random forest
They are decided before fitting the model because they can't be learned from the data
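For instance, with scikit-learn's random forest (a sketch; X_train and y_train are placeholders from the earlier split), the number of trees is a hyperparameter chosen before training, while the split thresholds inside each tree are model parameters learned from the data:

from sklearn.ensemble import RandomForestClassifier

# n_estimators and max_depth are hyperparameters: set before training.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)

# The split locations inside each tree are model parameters: learned during fit().
model.fit(X_train, y_train)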

What is Cross-Validation?
Next, it’s time to introduce a concept that will help us tune our models: cross-validation.
Cross-validation is a method for getting a reliable estimate of model performance using only
your training data.
There are several ways to cross-validate. The most common one, 10-fold cross-validation,
breaks your training data into 10 equal parts (a.k.a. folds), essentially creating 10 miniature
train/test splits.
These are the steps for 10-fold cross-validation:
1. Split your data into 10 equal parts, or "folds".
2. Train your model on 9 folds (e.g. the first 9 folds).
3. Evaluate it on the 1 remaining "hold-out" fold.
4. Perform steps (2) and (3) 10 times, each time holding out a different fold.
5. Average the performance across all 10 hold-out folds.
The average performance across the 10 hold-out folds is your final performance estimate, also
called your cross-validated score. Because you created 10 mini train/test splits, this score is usually pretty reliable.
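A minimal sketch of 10-fold cross-validation with scikit-learn (assuming X_train and y_train from the earlier split; the choice of logistic regression here is just illustrative):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# cv=10 creates the 10 folds; each score comes from one hold-out fold.
scores = cross_val_score(model, X_train, y_train, cv=10)
print("Cross-validated score:", scores.mean())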

Select Winning Model


By now, you'll have 1 "best" model for each algorithm that has been tuned through cross-
validation.
Most importantly, you've only used the training data so far.
Now it’s time to evaluate each model and pick the best one, Hunger Games style.
Because you've saved your test set as a truly unseen dataset, you can now use it to get a reliable estimate of each model's performance.
There are a variety of performance metrics you could choose from. We won't spend too much time on them here, but in general:
For regression tasks, we recommend Mean Squared Error (MSE) or Mean Absolute Error (MAE). (Lower values are better.)
For classification tasks, we recommend Area Under the ROC Curve (AUROC). (Higher values are better.)
The process is very straightforward:
1. For each of your models, make predictions on your test set.
2. Calculate performance metrics using those predictions and the "ground truth" target variable
from the test set
Finally, use these questions to help you pick the winning model:
Which model had the best performance on the test set? (performance)
Does it perform well across various performance metrics? (robustness)
Did it also have (one of) the best cross-validated scores from the training set? (consistency)
Does it solve the original business problem? (win condition)
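As a small sketch of this final evaluation step for a classifier (tuned_model is a placeholder for any model already tuned via cross-validation and able to produce probability estimates; X_test and y_test are the held-out test data):

from sklearn.metrics import accuracy_score, roc_auc_score

# Predictions on the truly unseen test set.
test_preds = tuned_model.predict(X_test)
test_probs = tuned_model.predict_proba(X_test)[:, 1]

# Compare predictions with the ground-truth labels from the test set.
print("Accuracy:", accuracy_score(y_test, test_preds))
print("AUROC:", roc_auc_score(y_test, test_probs))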
Validation strategies can be broadly divided into 2 categories: Holdout validation and cross
validation.

Holdout validation
Within holdout validation we have 2 choices: Single holdout and repeated holdout.

a) Single Holdout
Implementation
The basic idea is to split our data into a training set and a holdout test set. Train the model on the
training set and then evaluate model performance on the test set. We take only a single holdout—
hence the name. Let’s walk through the steps:

Step 1: Split the labelled data into 2 subsets (train and test).

Step 2: Choose a learning algorithm. (For ex: Random Forest). Fix values of hyperparameters.
Train the model to learn the parameters.

Step 3: Predict on the test data using the trained model. Choose an appropriate metric for
performance estimation (ex: accuracy for a classification task). Assess predictive performance by
comparing predictions and ground truth.

Step 4: If the performance estimate computed in the previous step is satisfactory, combine the
train and test subset to train the model on the full data with the same hyper-parameters.
Overfitting
Overfitting is one of the greatest concerns in predictive analytics and machine learning.
Overfitting refers to a situation where the model chosen to fit the training data fits too well, and
essentially captures all of the noise, outliers, and so on.
The consequence of this is that the model will fit the training data very well, but will not
accurately predict cases not represented by the training data, and therefore will not generalize
well to unseen data. This means that the model performance will be better with the training data
than with the test data.

A model is said to have high variance when it leans more towards overfitting, and conversely has
high bias when it doesn’t fit the data well enough. A high variance model will tend to be quite
flexible and overly complex, while a high bias model will tend to be very opinionated and overly
simplified. A good example of a high bias model is fitting a straight line to very nonlinear data.
In both cases, the model will not make very accurate predictions on new data. The ideal situation
is to find a model that is not overly biased, nor does it have a high variance. Finding this balance
is one of the key skills of a data scientist.

Overfitting can occur for many reasons. A common one is that the training data consists of many
features relative to the number of observations or data points. In this case, the data is relatively
wide as compared to long. To address this problem, reducing the number of features can help, or
finding more data if possible. The downside to reducing features is that you lose potentially
valuable information. Another option is to use a technique called regularization, which will be
discussed later in this series.
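As a brief illustration of the regularization option mentioned above (a sketch using ridge regression in scikit-learn; the data variables are placeholders and alpha is the penalty-strength hyperparameter):

from sklearn.linear_model import Ridge

# Larger alpha = stronger penalty on coefficient size = simpler, less overfit model.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print("Test R^2:", ridge.score(X_test, y_test))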

Overview of Bias and Variance


A machine learning model's performance is considered good based on its predictions and how well it generalizes to an independent test dataset. When we have an input x, we apply a function f to x to predict an output y. The difference between the actual output and the predicted output is the error. Our goal with a machine learning algorithm is to generate a model which minimizes the error on the test dataset, since models are assessed based on their prediction error on new test data.

The error in our model is the sum of reducible and irreducible error.


Irreducible Error
Errors that cannot be reduced no matter what algorithm you apply are called irreducible error. It is usually caused by unknown variables that may be having an influence on the output variable.
Reducible error has two components: bias and variance.
The presence of bias or variance causes overfitting or underfitting of the data.
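For squared-error loss, this decomposition is commonly written as:
Expected Test Error = Bias² + Variance + Irreducible Error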
Bias:
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn.
Bias is used to allow the machine learning model to learn in a simplified manner; ideally, the simplest model that can still learn the dataset and predict correctly on it is the best model.
Bias measures how far the predicted values are from the actual values. If the average predicted values are far off from the actual values, then the bias is high.
Generally, linear algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm's bias.
• Low Bias: Suggests less assumptions about the form of the target function.
• High-Bias: Suggests more assumptions about the form of the target function.
Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.
Examples of high-bias machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.

Variance Error
Variance is the amount that the estimate of the target function will change if different training
data was used. The target function is estimated from the training data by a machine learning
algorithm, so we should expect the algorithm to have some variance. Ideally, it should not
change too much from one training dataset to the next, meaning that the algorithm is good at
picking out the hidden underlying mapping between the inputs and the output variables.
Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. This means that the specifics of the training data influence the number and types of parameters used to characterize the mapping function.

• Low Variance: Suggests small changes to the estimate of the target function with changes to
the training dataset.
• High Variance: Suggests large changes to the estimate of the target function with changes to
the training dataset.
Generally, nonlinear machine learning algorithms that have a lot of flexibility have a high
variance. For example, decision trees have a high variance, that is even higher if the trees are not
pruned before use.
Examples of low-variance machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.

Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

Low Bias — High Variance:


A low bias and high variance problem is overfitting. Models trained on different datasets capture the idiosyncrasies of their respective datasets, so they will predict differently. However, if we average their results, we will get a fairly accurate prediction.

High Bias — Low Variance:


The predictions will be similar to one another but on average, they are inaccurate.

Bias and variance using bulls-eye diagram

In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. Such models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. These kinds of models, such as linear and logistic regression, are too simple to capture the complex patterns in the data.

Characteristics of a biased model


A biased model will have the following characteristics:
• Underfitting: A model with high bias is simpler than it should be and hence tends to
underfit the data. In other words, the model fails to learn and acquire the intricate patterns of
the dataset.
• Low Training Accuracy: A biased model will not fit the Training Dataset properly and
hence will have low training accuracy (or high training loss).
• Inability to solve complex problems: A Biased model is too simple and hence is often
incapable of learning complex features and solving relatively complex problems.

Characteristics of a model with Variance


A model with high Variance will have the following characteristics:

• Overfitting: A model with high Variance will have a tendency to be overly complex. This
causes the overfitting of the model.
• Low Testing Accuracy: A model with high Variance will have very high training accuracy (or very low training loss), but it will have a low testing accuracy (or a high testing loss).
• Overcomplicating simpler problems: A model with high variance tends to be overly
complex and ends up fitting a much more complex curve to a relatively simpler data. The
model is thus capable of solving complex problems but incapable of solving simple
problems efficiently.
Bias-Variance Trade-Off
The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn, the algorithm should achieve good prediction performance.
You can see a general trend in the examples above:
• Linear machine learning algorithms often have a high bias but a low variance.
• Nonlinear machine learning algorithms often have a low bias but a high variance.
The parameterization of machine learning algorithms is often a battle to balance out bias and
variance.
The relationship between bias and variance in statistical learning is such that:
• Increasing bias will decrease variance.
• Increasing variance will decrease bias.
There is a trade-off at play between these two concerns, and the models we choose and the way we choose to configure them are finding different balances in this trade-off for our problem.
In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance trade-off, and the resulting U-shape in the test error, can make this a difficult task.
In the graph shown below, the green dotted line represents variance, the blue dotted line represents bias, and the red solid line represents the error in the prediction of the concerned model.
• Since bias is high for a simpler model and decreases with an increase in model complexity,
the line representing bias exponentially decreases as the model complexity increases.
• Similarly, Variance is high for a more complex model and is low for simpler models. Hence,
the line representing variance increases exponentially as the model complexity increases.
• Finally, it can be seen that on either side, the generalization error is quite high. Both high
bias and high variance lead to a higher error rate.
• The optimal complexity of the model is right in the middle, where the bias and variance curves intersect. This part of the graph produces the least error and is preferred.
• Also, as discussed earlier, the model underfits for high-bias situations and overfits for high-variance situations.

Detection of Bias and Variance of a model


In model building, it is imperative to have the knowledge to detect whether the model is suffering from high bias or high variance. The methods to detect high bias and high variance are given below:

1. Detection of High Bias:


• The model suffers from a very High Training Error.
• The Validation error is similar in magnitude to the training error.
• The model is underfitting.

2. Detection of High Variance:


• The model suffers from a very Low Training Error.
• The Validation error is very high when compared to the training error.
• The model is overfitting.
A graphical method to Detect a model suffering from High Bias and Variance is shown below:

The graph shows the change in error rate with respect to model complexity for training and
validation error.
• The left portion of the graph suffers from High Bias. This can be seen as the training error is
quite high along with the validation error. In addition to that, model complexity is quite low.
• The right portion of the graph suffers from High Variance. This can be seen as the training
error is very low, yet the validation error is very high and starts increasing with increasing
model complexity.
Evaluation of machine learning algorithms
Performance Metrics for Classification problems in Machine Learning
Evaluating your machine learning algorithm is an essential part of any project. Your model may give you satisfying results when evaluated using one metric, say accuracy_score, but may give poor results when evaluated against other metrics such as logarithmic_loss. Most of the time we use classification accuracy to measure the performance of our model; however, it is not enough to truly judge our model.

Classification Accuracy

Classification Accuracy is what we usually mean, when we use the term accuracy. It is the ratio
of number of correct predictions to the total number of input samples.
It works well only if there are equal number of samples belonging to each class.
For example, consider that there are 98% samples of class A and 2% samples of class B in our
training set. Then our model can easily get 98% training accuracy by simply predicting every
training sample belonging to class A.
When the same model is tested on a test set with 60% samples of class A and 40% samples of
class B, then the test accuracy would drop down to 60%.
The real problem arises when the cost of misclassifying the minority class samples is very high. If we deal with a rare but fatal disease, the cost of failing to diagnose the disease of a sick person is much higher than the cost of sending a healthy person for more tests.
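A small sketch of this pitfall (a trivial "classifier" that always predicts the majority class; the numbers mirror the 98%/2% example above):

from sklearn.metrics import accuracy_score

# 98 samples of class "A" and 2 of class "B"; the model blindly predicts "A".
y_true = ["A"] * 98 + ["B"] * 2
y_pred = ["A"] * 100

# Accuracy looks excellent (0.98) even though class "B" is never detected.
print(accuracy_score(y_true, y_pred))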

Logarithmic Loss

Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well for multiclass classification. When working with Log Loss, the classifier must assign a probability to each class for every sample. Suppose there are N samples belonging to M classes; then the Log Loss is calculated as below:

Log Loss = -(1/N) * Σ_i Σ_j [ y_ij * log(p_ij) ]

where,
y_ij indicates whether sample i belongs to class j or not
p_ij indicates the probability of sample i belonging to class j
Log Loss has no upper bound and it exists on the range [0, ∞). Log Loss nearer to 0 indicates
higher accuracy, whereas if the Log Loss is away from 0 then it indicates lower accuracy.
In general, minimising Log Loss gives greater accuracy for the classifier.
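A minimal sketch of computing Log Loss with scikit-learn's log_loss (the labels and predicted probabilities below are made up for illustration):

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
# Predicted probability of class 1 for each sample.
y_prob = [0.1, 0.9, 0.8, 0.3]

# Values closer to 0 indicate more confident, better-calibrated correct predictions.
print(log_loss(y_true, y_prob))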

Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the output can
be of two or more type of classes. A confusion matrix is nothing but a table with two dimensions
viz “Actual” and “Predicted” and furthermore, both the dimensions have “True Positives (TP)”,
“True Negatives (TN)”, “False Positives (FP)”, “False Negatives (FN)” as shown below –

Terms associated with Confusion matrix:

1. True Positives (TP): True positives are the cases when the actual class of the data point was
1(True) and the predicted is also 1(True)
Ex: The case where a person is actually having cancer(1) and the model classifying his case as
cancer(1) comes under True positive.

2. True Negatives (TN): True negatives are the cases when the actual class of the data point was 0 (False) and the predicted is also 0 (False).
Ex: The case where a person NOT having cancer and the model classifying his case as Not
cancer comes under True Negatives.

3. False Positives (FP): False positives are the cases when the actual class of the data point was
0(False) and the predicted is 1(True). False is because the model has predicted incorrectly and
positive because the class predicted was a positive one. (1)
Ex: A person NOT having cancer and the model classifying his case as cancer comes under
False Positives.

4. False Negatives (FN): False negatives are the cases when the actual class of the data point
was
1(True) and the predicted is 0(False). False is because the model has predicted incorrectly and
negative because the class predicted was a negative one. (0)
Ex: A person having cancer and the model classifying his case as No-cancer comes under False
Negatives.
The ideal scenario that we all want is that the model should give 0 False Positives and 0 False Negatives. But that's not the case in real life, as no model will be 100% accurate most of the time.
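A short sketch with scikit-learn's confusion_matrix (toy labels; 1 = cancer, 0 = no cancer):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1] the rows are actual classes and the columns are predictions:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)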

When to use Accuracy:

Ex: 60% of the images in our fruit data are apples and 40% are oranges.
A model which predicts whether a new image is an apple or an orange correctly 97% of the time is well described by accuracy in this example, because the classes are reasonably balanced.

When NOT to use Accuracy:


Accuracy should NEVER be used as a measure when the target variable classes in the data are dominated by one class (i.e. the classes are imbalanced).

Classification Report
This report consists of the scores of Precisions, Recall, F1 and Support. They are explained as
follows –

Precision
Precision, used in document retrieval, may be defined as the fraction of documents returned by our ML model that are actually correct (more generally, the fraction of predicted positives that are true positives). We can easily calculate it from the confusion matrix with the help of the following formula:

Precision = TP/(TP+FP)

When to use?
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction. For example: if we are building a system to predict whether we should decrease the credit limit on a particular account, we want to be very sure about our prediction, or it may result in customer dissatisfaction.
Recall or Sensitivity
Recall may be defined as the fraction of actual positives correctly returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula:
Recall = TP/(TP+FN)
When to use?
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible. For example: if we are building a system to predict whether a person has cancer or not, we want to capture the disease even if we are not very sure.
Specificity
Specificity, in contrast to recall, may be defined as the fraction of actual negatives correctly returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula:
Specificity = TN/(TN+FP)

Support
Support may be defined as the number of samples of the true response that lies in each class of
target values.

F1 Score
This score gives us the harmonic mean of precision and recall. Mathematically, the F1 score is the weighted average of precision and recall. The best value of F1 would be 1 and the worst would be 0. We can calculate the F1 score with the help of the following formula:
F1 = 2 * (precision * recall) / (precision + recall)
The F1 score gives equal relative contribution to precision and recall.

When to use?
Use it when we want a model with both good precision and recall.

Simply stated, the F1 score maintains a balance between the precision and recall of your classifier. If your precision is low, the F1 is low, and if the recall is low, again your F1 score is low.
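These scores can all be computed together with scikit-learn (a sketch reusing toy labels like the ones in the confusion matrix example above):

from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Prints precision, recall, F1 and support for each class.
print(classification_report(y_true, y_pred))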

AUC (Area Under ROC curve)


AUC (Area Under Curve) - ROC (Receiver Operating Characteristic) is a performance metric, based on varying threshold values, for classification problems. As the name suggests, ROC is a probability curve and AUC measures separability. In simple words, the AUC-ROC metric tells us about the capability of the model to distinguish between the classes. The higher the AUC, the better the model. Mathematically, the ROC curve is created by plotting TPR (True Positive Rate, i.e. Sensitivity or Recall) vs FPR (False Positive Rate, i.e. 1 - Specificity) at various threshold values, with TPR on the y-axis and FPR on the x-axis.

We can use roc_auc_score function of sklearn.metrics to compute AUC-ROC.
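For example (a sketch; in practice the scores would come from model.predict_proba on the test set):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# 1.0 is a perfect ranking, 0.5 is no better than random guessing.
print(roc_auc_score(y_true, y_scores))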


When to Use?
AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values. So, for example, if you as a marketer want to find a list of users who will respond to a marketing campaign, AUC is a good metric to use, since ranking the predictions by probability gives the order in which you will build the list of users to send the campaign to.
Another benefit of using AUC is that it is classification-threshold-invariant, like log loss. It measures the quality of the model's predictions irrespective of what classification threshold is chosen, unlike the F1 score or accuracy, which depend on the choice of threshold.
False Positive Rate or Type I Error: the number of items wrongly identified as positive out of the total true negatives: FP/(FP+TN)
False Negative Rate or Type II Error: the number of items wrongly identified as negative out of the total true positives: FN/(FN+TP)
Example 1:
Let's assume that we have a prevalence of 50% (50% hospitalized, 50% not hospitalized). Given 100 patients, we have the following break-down and performance metrics. Find the precision, recall, accuracy, sensitivity, specificity and F-measure.
Example 2:
In the second example, let's assume a prevalence of 3% (3% hospitalized, 97% not hospitalized). Find the precision, recall, accuracy, sensitivity, specificity and F-measure.
Example 3:
Suppose we have 100 credit card transactions, of which 97 are legit and 3 are fraud, and let's say we came up with a model that predicts everything as fraud.
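One way to work through Example 3, treating fraud as the positive class: TP = 3, FP = 97, TN = 0, FN = 0. Then accuracy = (3 + 0)/100 = 3%, precision = 3/(3 + 97) = 0.03, recall (sensitivity) = 3/(3 + 0) = 1.0, specificity = 0/(0 + 97) = 0, and F1 = 2*(0.03*1.0)/(0.03 + 1.0) ≈ 0.06. The perfect recall alongside near-zero precision and accuracy shows why no single metric should be trusted on its own for imbalanced data.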
Introduction to statistical learning

Note: The following topics are as covered by our instructor. The same topics, as described on the internet, are given again further below.

Inferential Statistics
Inferential statistics allows you to make inferences about the population from
the sample data.
The mean is the numerical average of all values, the median is the middle value of the sorted data set, and the mode is the most frequent value in the data set. Spread (dispersion or variability) is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance and the standard deviation.
Variance is the average of the squared differences from the mean while standard deviation is
the square root of the variance.
Standard Score or Z score: For an observed value x, the Z score finds the number of standard
deviations x is away from the mean.
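In symbols, Z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.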
Hypothesis Testing

Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tells us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol Ho. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.
Correlation

Correlation refers to a mutual relationship or association between quantitative variables. It can help in predicting one quantity from another. It often indicates the presence of a causal relationship. It is used as a basic quantity and foundation for many other modeling techniques.

Regression
Regression analysis is a set of statistical processes for estimating the
relationships among variables.
Simple Regression

This method uses a single independent variable to predict a dependent


variable by fitting the best relationship.

Multiple Regression

This method uses more than one independent variable to predict a


dependent variable by fitting the best relationship.

One simple way to understand statistical learning is to determine the association between predictors (independent variables, features) and the response (dependent variable), and to develop an accurate model that can predict the response variable (Y) on the basis of the predictor variables (X).
Y = f(X) + ɛ, where X = (X1, X2, . . ., Xp), f is an unknown function and ɛ is the random error (reducible and irreducible).

Prediction & Inference


In situations where a set of inputs X is readily available, but the output Y is not known, we often treat f as a black box (not concerned with the exact form of f), as long as it yields accurate predictions for Y. This is prediction.
There are situations where we are interested in understanding the way
that Y is affected as X change. In this situation we wish to estimate f, but
our goal is not necessarily to make predictions for Y . Here we are more
interested in understanding relationship between X and Y. Now f cannot
be treated as a black box, because we need to know its exact form. This
is inference.
Parametric & Non-parametric methods

When we make an assumption about the functional form of f and try to


estimate f by estimating the set of parameters, these methods are
called parametric methods.
f(X) = β0 + β1X1 + β2X2 + . . . + βpXp
Non-parametric methods do not make explicit assumptions about the form
of f, instead they seek an estimate of f that gets as close to the data points
as possible.
Prediction Accuracy and Model Interpretability

Of the many methods that we use for statistical learning, some are less
flexible, or more restrictive. When inference is the goal, there are clear
advantages to using simple and relatively inflexible statistical learning
methods. When we are only interested in prediction, we use flexible models
available.

Assessing Model Accuracy

There is no free lunch in statistics, which means no one method dominates


all others over all possible data sets. In the regression setting, the most
commonly-used measure is the mean squared error (MSE). In the
classification setting, the most commonly-used measure is the confusion
matrix. A fundamental property of statistical learning is that, as model flexibility increases, training error will decrease, but the test error may not.
Bias & Variance

Bias are the simplifying assumptions made by a model to make the target
function easier to learn. Parametric models have a high bias making them
fast to learn and easier to understand but generally less flexible. Decision
Trees, k-Nearest Neighbors and Support Vector Machines are low-bias
machine learning algorithms. Linear Regression, Linear Discriminant
Analysis and Logistic Regression are high-bias machine learning
algorithms.
Variance is the amount that the estimate of the target function will change
if different training data was used. Non-parametric models that have a lot
of flexibility have a high variance. Linear Regression, Linear Discriminant
Analysis and Logistic Regression are low-variance machine learning
algorithms. Decision Trees, k-Nearest Neighbors and Support Vector
Machines are high-variance machine learning algorithms.
Bias-Variance Trade-Off

The relationship between bias and variance in statistical learning is such


that:
• Increasing bias will decrease variance.
• Increasing variance will decrease bias.
There is a trade-off at play between these two concerns and the models we
choose and the way we choose to configure them are finding different
balances in this trade-off for our problem.
In both the regression and classification settings, choosing the correct
level of flexibility is critical to the success of any statistical learning
method.
Classification
Predicting a qualitative response for an observation can be referred to
as classifying that observation, since it involves assigning the observation
to a category or class.
Given a feature vector X and a qualitative response Y taking values in the
set C, the classification task is to build a function C(X) that takes as input
the feature vector X and predicts its value for Y. Often we are more
interested in estimating the probabilities that X belongs to each
category in C.
We might think that regression is perfect for classification task as well.
However, linear regression might produce probabilities less than
zero or bigger than one. Logistic regression is more appropriate.
Logistic Regression

Logistic regression models the probability that y belongs to a particular category rather than modeling the response itself. It uses the logistic function to ensure a prediction between 0 and 1. The logistic function takes the form:
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

Multiple Logistic Regression

Using a strategy similar to that employed for linear regression, multiple logistic regression can be generalized as:
p(X) = e^(β0 + β1X1 + . . . + βpXp) / (1 + e^(β0 + β1X1 + . . . + βpXp))
Maximum likelihood is also used to estimate β0, β1, …, βp in the case of multiple logistic regression.

According to the internet

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, bioinformatics, and baseball.

Introduction

The goals of learning are understanding and prediction. Learning falls into many
categories, including supervised learning, unsupervised learning, online
learning, and reinforcement learning. From the perspective of statistical learning
theory, supervised learning is best understood. Supervised learning involves
learning from a training set of data. Every point in the training set is an input-output pair, where the input maps to an output. The learning problem consists
of inferring the function that maps between the input and the output, such that
the learned function can be used to predict the output from future input.

Depending on the type of output, supervised learning problems are either


problems of regression or problems of classification. If the output takes a
continuous range of values, it is a regression problem. Using Ohm's Law as an
example, a regression could be performed with voltage as input and current as
an output. The regression would find the functional relationship between voltage and current to be governed by the resistance R, such that

U = RI (i.e., with voltage as the input, I = U/R)

Classification problems are those for which the output will be an element from a
discrete set of labels. Classification is very common for machine learning
applications. In facial recognition, for instance, a picture of a person's face would
be the input, and the output label would be that person's name. The input would
be represented by a large multidimensional vector whose elements represent
pixels in the picture.

After learning a function based on the training set data, that function is validated
on a test set of data, data that did not appear in the training set.
Regularization
In machine learning problems, a major problem that arises is that of overfitting.
Because learning is a prediction problem, the goal is not to find a function that
most closely fits the (previously observed) data, but to find one that will most
accurately predict output from future input. Empirical risk minimization runs
this risk of overfitting: finding a function that matches the data exactly but does
not predict future output well.

Overfitting is symptomatic of unstable solutions; a small perturbation in the


training set data would cause a large variation in the learned function. It can be
shown that if the stability for the solution can be guaranteed, generalization and
consistency are guaranteed as well. Regularization can solve the overfitting
problem and give the problem stability.

Regularization can be accomplished by restricting the hypothesis space H. A


common example would be restricting H to linear functions: this can be seen as
a reduction to the standard problem of linear regression. H could also be
restricted to polynomials of degree p, exponentials, or bounded functions on L1. Restriction of the hypothesis space avoids overfitting because the form of the potential functions is limited, and so it does not allow for the choice of a function that gives an empirical risk arbitrarily close to zero.

One example of regularization is Tikhonov regularization. With V denoting the loss function and H the hypothesis space, it consists of minimizing the penalized empirical risk, commonly written as

(1/n) Σᵢ V(f(xᵢ), yᵢ) + γ ||f||²_H
where γ is a fixed and positive parameter, the regularization parameter.


Tikhonov regularization ensures existence, uniqueness, and stability of the
solution.

What is Ensemble learning?


Ensemble learning techniques attempt to make the performance of predictive models better by improving their accuracy. Ensemble methods, also known as committee-based learning or learning multiple classifier systems, train multiple hypotheses to solve the same problem. One of the most common examples of ensemble modelling is the random forest, where a number of decision trees are used to predict outcomes.

An ensemble contains a number of hypotheses or learners which are usually generated from training data with the help of a base learning algorithm. Most ensemble methods use a single base learning algorithm to produce homogeneous base learners (homogeneous ensembles), while some other methods use multiple learning algorithms and thus produce heterogeneous ensembles. Ensemble methods are well known for their ability to boost weak learners. The main causes of error in learning models are noise, bias and variance.

Ensemble methods help to minimize these factors. These methods are


designed to improve the stability and the accuracy of Machine Learning
algorithms.

Single weak learner

In machine learning, no matter if we are facing a classification or a regression problem, the choice of the model is extremely important to have any chance of obtaining good results. This choice can depend on many variables of the problem: quantity of data, dimensionality of the space, distribution hypothesis...

A low bias and a low variance, although they most often vary in opposite directions, are the two most fundamental features expected of a model. Indeed, to be able to "solve" a problem, we want our model to have enough degrees of freedom to resolve the underlying complexity of the data we are working with, but we also want it to not have too many degrees of freedom, to avoid high variance and be more robust. This is the well-known bias-variance tradeoff.

In ensemble learning theory, we call weak learners (or base models) the models that can be used as building blocks for designing more complex models by combining several of them. Most of the time, these basic models do not perform very well by themselves, either because they have a high bias (low degree of freedom models, for example) or because they have too much variance to be robust (high degree of freedom models, for example). The idea of ensemble methods is then to try to reduce the bias and/or variance of such weak learners by combining several of them together in order to create a strong learner (or ensemble model) that achieves better performance.

Combine weak learners

In order to set up an ensemble learning method, we first need to select our base
models to be aggregated. Most of the time (including in the well known bagging
and boosting methods) a single base learning algorithm is used so that we have
homogeneous weak learners that are trained in different ways. The ensemble
model we obtain is then said to be “homogeneous”. However, there also exist
some methods that use different type of base learning algorithms: some
heterogeneous weak learners are then combined into an “heterogeneous
ensembles model”.

One important point is that our choice of weak learners should be coherent with
the way we aggregate these models. If we choose base models with low bias
but high variance, it should be with an aggregating method that tends to reduce
variance whereas if we choose base models with low variance but high bias, it
should be with an aggregating method that tends to reduce bias.
Simple Ensemble techniques

Taking the mode of the results

MODE: The mode is a statistical term that refers to the most frequently occurring
number found in a set of numbers.

In this technique, multiple models are used to make predictions for each data
point. The predictions by each model are considered as a separate vote. The
prediction which we get from the majority of the models is used as the final
prediction.

Taking the average of the results

In this technique, we take an average of predictions from all the models and use
it to make the final prediction.
AVERAGE = sum(Rating * Number of people) / Total number of people = ((1*5)+(2*13)+(3*45)+(4*7)+(5*2)) / 72 = 2.83, which rounded to the nearest integer would be 3.
Taking weighted average of the results
This is an extension of the averaging method. All models are assigned different weights defining the importance of each model for prediction. For instance, if about 25 of your responders are professional app developers, while the others have no prior experience in this field, then the answers by these 25 people are given more importance than those of the other people.
For example (for brevity, the scale of the example is trimmed down to 5 people):
WEIGHTED AVERAGE = (0.3*3)+(0.3*2)+(0.3*2)+(0.15*4)+(0.15*3) = 3.15, which rounded to the nearest integer would give us 3.
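A tiny sketch of these simple combination rules in plain Python (the individual model predictions and the weights below are made up for illustration):

from statistics import mode, mean

# Class predictions from three hypothetical models for one data point.
votes = ["cat", "dog", "cat"]
print(mode(votes))   # majority vote -> "cat"

# Numeric predictions (e.g. ratings) from three hypothetical models.
preds = [3, 2, 4]
print(mean(preds))   # simple average -> 3

# Weighted average: weights express how much we trust each model (they sum to 1 here).
weights = [0.5, 0.3, 0.2]
print(sum(w * p for w, p in zip(weights, preds)))   # -> 2.9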
Ensemble methods can be divided into two groups:
• sequential ensemble methods where the base learners are generated
sequentially (e.g. AdaBoost).
The basic motivation of sequential methods is to exploit the dependence between the base learners. The overall performance can be boosted by weighting previously mislabeled examples more heavily.

• parallel ensemble methods where the base learners are generated in parallel
(e.g. Random Forest).
The basic motivation of parallel methods is to exploit independence between
the base learners since the error can be reduced dramatically by averaging.
Most ensemble methods use a single base learning algorithm to produce
homogeneous base learners, i.e. learners of the same type, leading to
homogeneous ensembles.
There are also some methods that use heterogeneous learners, i.e. learners of
different types, leading to heterogeneous ensembles. In order for ensemble
methods to be more accurate than any of its individual members, the base
learners have to be as accurate as possible and as diverse as possible.

Why Use Ensemble Methods

Learning algorithms which output only a single hypothesis tend to suffer from three issues: the statistical problem, the computational problem and the representation problem. These can be partly overcome by applying ensemble methods.

Types of Ensemble Methods

BAGGing, or Bootstrap AGGregating: BAGGing gets its name because it combines Bootstrapping and Aggregation to form one ensemble model. One way to reduce the variance of an estimate is to average together multiple estimates. First, we create random samples of the training data set with replacement (subsets of the training data set). Then, we build a model (classifier or decision tree) for each sample. Finally, the results of these multiple models are combined using averaging or majority voting. Bagging is only effective when using unstable (i.e. a small change in the training set can cause a significant change in the model) non-linear models.

Bagging is the application of the Bootstrap procedure to a high-variance machine


learning algorithm, typically decision trees.

1. Suppose there are N observations and M features. A sample from observation


is selected randomly with replacement(Bootstrapping).
2. A subset of features are selected to create a model with sample of observations
and subset of features.
3. The feature from the subset that gives the best split on the training data is selected.
4. This is repeated to create many models and every model is trained in parallel.
5. Prediction is given based on the aggregation of predictions from all the
models.
For example, we can train M different trees on different subsets of the data (chosen randomly with replacement) and compute the ensemble prediction as the average of the individual trees: f(x) = (1/M) Σ_m f_m(x).



Advantages:

• Handles higher dimensionality data very well.

• Reduces over-fitting of the model.

• Maintains accuracy for missing data.

Disadvantages:

Since the final prediction is based on averaging the predictions from the subset trees, it won't give exact values for the classification and regression model.
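A minimal bagging sketch with scikit-learn (BaggingClassifier draws bootstrap samples and aggregates the trees' votes; the data variables are placeholders):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 50 decision trees, each trained on a bootstrap sample of the training set.
# In older scikit-learn versions the first argument is named base_estimator.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=0,
)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))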

Boosting-based Ensemble learning: Boosting refers to a group of algorithms that utilize weighted averages to make weak learners into stronger learners. Unlike bagging, which has each model run independently and then aggregates the outputs at the end without preference to any model, boosting is all about "teamwork". Each model that runs dictates what features the next model will focus on.

The predictions are then combined through a weighted majority vote


(classification) or a weighted sum (regression) to produce the final prediction.
The principal difference between boosting and the committee methods, such as
bagging, is that base learners are trained in sequence on a weighted version of
the data.

Boosting is a sequential technique in which, the first algorithm is trained on the


entire data set and the subsequent algorithms are built by fitting the residuals
of the first algorithm, thus giving higher weight to those observations that were
poorly predicted by the previous model.

Boosting Steps:

• Draw a random subset of training samples d1 without replacement from


the training set D to train a weak learner C1
• Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified, to train a weak learner C2
• Find the training samples d3 in the training set D on which C1 and C2
disagree to train a third weak learner C3
• Combine all the weak learners via majority voting.
The algorithm below describes the most widely used form of boosting
algorithm called AdaBoost, which stands for adaptive boosting.
Advantages:

• Supports different loss functions (for example, 'binary:logistic').
• Works well with interactions.
Disadvantages:

• Prone to over-fitting.
• Requires careful tuning of different hyper-parameters.
• Summary of differences between Bagging and Boosting
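A short AdaBoost sketch with scikit-learn (shallow decision stumps are used as the weak learners by default; the data variables are placeholders):

from sklearn.ensemble import AdaBoostClassifier

# Each successive weak learner focuses on samples the previous ones misclassified.
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boosting.fit(X_train, y_train)
print("Test accuracy:", boosting.score(X_test, y_test))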

Stacking
Stacking is an ensemble learning technique that combines multiple classification
or regression models via a meta-classifier or a meta-regressor. The base level
models are trained based on a complete training set, then the meta-model is
trained on the outputs of the base level model as features. This technique
consists of basically two phases, in the first phase, a set of base-level classifiers
is generated and in the second phase, a meta-level classifier is learned which
combines the outputs of the base-level classifiers.
The base level often consists of different learning algorithms and therefore
stacking ensembles are often heterogeneous. The algorithm below summarizes
stacking.
This model is used for making predictions on the test set. Below is a step-wise
explanation for a simple stacked ensemble:

1. The train set is split into 10 parts.


2. A base model (suppose a decision tree) is fitted on 9 parts and predictions are
made for the 10th part. This is done for each part of the train set.
3. The base model (in this case, decision tree) is then fitted on the whole train
dataset.
4. Using this model, predictions are made on the test set.
5. Steps 2 to 4 are repeated for another base model (say knn) resulting in another
set of predictions for the train set and test set.
6. The predictions from the train set are used as features to build a new model.
7. This model is used to make final predictions on the test prediction set.
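A compact stacking sketch with scikit-learn (decision tree and k-NN as base models, logistic regression as the meta-model; StackingClassifier internally follows a procedure similar to the steps above, and the data variables are placeholders):

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),  # meta-classifier trained on base outputs
    cv=10,                                 # folds used to generate the base predictions
)
stack.fit(X_train, y_train)
print("Test accuracy:", stack.score(X_test, y_test))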

Applications Of Ensemble Methods


1. Ensemble methods can be used as overall diagnostic procedures for a more
conventional model building. The larger the difference in fit quality between one
of the stronger ensemble methods and a conventional statistical model, the more
information that the conventional model is probably missing.

2. Ensemble methods can be used to evaluate the relationships between


explanatory variables and the response in conventional statistical models.
Predictors or basis functions overlooked in a conventional model may surface
with an ensemble approach.

3. With the help of the ensemble method, the selection process could be better
captured and the probability of membership in each treatment group estimated
with less bias.

4. One could use ensemble methods to implement the covariance adjustments


inherent in multiple regression and related procedures. One would "residualize" the response and the predictors of interest with ensemble methods.

Random Forest
Random Forest is one of the most popular and powerful ensemble methods used today in Machine Learning.

We first discuss the fundamental components of this ensemble learning


algorithm - Decision Trees - and then the underlying algorithm and training
procedures. We will also discuss some variations and advantages of this tool.

Decision Trees in a Nutshell


A decision tree is a Machine Learning algorithm capable of fitting complex datasets and performing both classification and regression tasks. The idea behind a tree is to search for a variable-value pair within the training set and split it in such a way that will generate the "best" two child subsets. The goal is to create branches and leaves based on an optimal splitting criterion, a process called tree growing. Specifically, at every branch or node, a conditional statement classifies the data point based on a fixed threshold in a specific variable, therefore splitting the data. To make predictions, every new instance starts at the root node (top of the tree) and moves along the branches until it reaches a leaf node where no further branching is possible.
The algorithm used to train a tree is called CART (Classification And Regression Tree). As we already mentioned, the algorithm seeks the best feature-value pair to create nodes and branches. After each split, this task is performed recursively until the maximum depth of the tree is reached or an optimal tree is found. Depending on the task, the algorithm may use a different metric (Gini impurity, information gain or mean squared error) to measure the quality of the split. It is important to mention that, due to the greedy nature of the CART algorithm, finding an optimal tree is not guaranteed and usually a reasonably good estimation will suffice.
Trees have a high risk of overfitting the training data, as well as becoming computationally complex, if they are not constrained and regularized properly during the growing stage. This overfitting implies low bias but high variance in the model. Therefore, in order to deal with this problem, we use Ensemble Learning, an approach that allows us to correct this overlearning habit and, hopefully, arrive at better, stronger results.
What is an Ensemble Method?
An ensemble method or ensemble learning algorithm consists of aggregating
multiple outputs made by a diverse set of predictors to obtain better results.
Formally, based on a set of “weak” learners we are trying to use a “strong” learner
for our model. Therefore, the purpose of using ensemble methods is: to average
out the outcome of individual predictions by diversifying the set of predictors,
thus lowering the variance, to arrive at a powerful prediction model that reduces
overfitting our training set.
In our case, a Random Forest (strong learner) is built as an ensemble of Decision
Trees (weak learners) to perform different tasks such as regression and
classification.

How are Random Forests trained?

Random Forests are trained via the bagging method. Bagging or Bootstrap
Aggregating, consists of randomly sampling subsets of the training data, fitting
a model to these smaller data sets, and aggregating the predictions. This method
allows several instances to be used repeatedly for the training stage given that
we are sampling with replacement. Tree bagging consists of sampling subsets of
the training set, fitting a Decision Tree to each, and aggregating their result.
The Random Forest method introduces more randomness and diversity by applying the bagging method to the feature space. That is, instead of searching greedily for the best predictors to create branches, it randomly samples elements of the predictor space, thus adding more diversity and reducing the variance of the trees at the cost of equal or higher bias. This process is also known as "feature bagging", and it is this powerful method that leads to a more robust model.
Let's see now how to make predictions with Random Forests. Remember that in a Decision Tree a new instance goes from the root node to the bottom until it is classified in a leaf node. In the Random Forest algorithm, each new data point goes through the same process, but now it visits all the different trees in the ensemble, which were grown using random samples of both training data and features. Depending on the task at hand, the functions used for aggregation will differ. For classification problems, it uses the mode, i.e. the most frequent class predicted by the individual trees (also known as a majority vote), whereas for regression tasks, it uses the average prediction of each tree.
Although this is a powerful and accurate method used in Machine Learning, you should always cross-validate your model, as there may be overfitting. Also, despite its robustness, the Random Forest algorithm can be slow, as it has to grow many trees during the training stage and, as we already know, this is a greedy process.
Variations
As we already specified, a Random Forest uses sampled subsets of both the
training data and the feature space, which result in high diversity and
randomness as well as low variance. Now, we can go a step further and introduce
a bit more variety by not only looking for a random predictor but also considering
random thresholds for each of these variables. Therefore, instead of looking for
the optimal pair of feature and threshold for the splitting, it uses random
samples of both to create the different branches and nodes, thus further trading
variance for bias. This ensemble is also known as Extremely Randomized Trees
or Extra-Trees. This model also trades more bias for a lower variance but it is
faster to train as it is not looking for an optimum, like the case of Random
Forests.
Additional Properties
One other important attribute of Random Forests is that they are very useful
when trying to determine feature or variable importance. Because important
features tend to be at the top of each tree and unimportant variables are located
near the bottom.
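Putting it together, a brief Random Forest sketch including the feature-importance attribute mentioned above (the data variables are placeholders):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
# One importance score per feature; higher means the feature was more useful for splits.
print("Feature importances:", forest.feature_importances_)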
