Unit 3 ML
Split Dataset
Let’s start with a crucial but sometimes overlooked step: Spending your data.
Think of your data as a limited resource.
You can spend some of it to train your model (i.e. feed it to the algorithm).
You can spend some of it to evaluate (test) your model.
But you can’t reuse the same data for both!
If you evaluate your model on the same data you used to train it, your model could be very
overfit and you wouldn’t even know! A model should be judged on its ability to predict new,
unseen data.
Therefore, you should have separate training and test subsets of your dataset.
Training sets are used to fit and tune your models. Test sets are put aside as "unseen" data to
evaluate your models.
You should always split your data before doing anything else.
This is the best way to get reliable estimates of your models’ performance.
After splitting your data, don’t touch your test set until you’re ready to choose your final
model!
Comparing test vs. training performance allows us to avoid overfitting... If the model performs
very well on the training data but poorly on the test data, then it’s overfit.
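As a concrete sketch of this first step, here is one way to do the split with scikit-learn (an assumed library; the dataset and the 80/20 ratio are purely illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Any feature matrix X and label vector y will do; iris is just a stand-in.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the "unseen" test set.
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and tune models on (X_train, y_train) only; evaluate on (X_test, y_test)
# just once, when choosing the final model.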
What are Hyperparameters?
So far, we’ve been casually talking about "tuning" models, but now it’s time to treat the topic
more formally.
When we talk of tuning models, we specifically mean tuning hyperparameters.
There are two types of parameters in machine learning algorithms.
The key distinction is that model parameters can be learned directly from the training data while
hyperparameters cannot.
Model parameters
Model parameters are learned attributes that define individual models.
e.g. regression coefficients
e.g. decision tree split locations
They can be learned directly from the training data
Hyperparameters
Hyperparameters express “higher-level” structural settings for algorithms.
e.g. strength of the penalty used in regularized regression
e.g. the number of trees to include in a random forest
They are decided before fitting the model because they can't be learned from the data
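To make the distinction concrete, here is a small sketch using scikit-learn's random forest (assumed choices; the hyperparameter values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen BEFORE fitting; they cannot be learned from the data.
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)

# Model parameters: learned FROM the training data during fit()
# (here, the split locations and leaf values inside each tree).
model.fit(X, y)
print(len(model.estimators_))  # the 100 fitted trees that hold the learned parameters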
What is Cross-Validation?
Next, it’s time to introduce a concept that will help us tune our models: cross-validation.
Cross-validation is a method for getting a reliable estimate of model performance using only
your training data.
There are several ways to cross-validate. The most common one, 10-fold cross-validation,
breaks your training data into 10 equal parts (a.k.a. folds), essentially creating 10 miniature
train/test splits.
These are the steps for 10-fold cross-validation:
1. Split your data into 10 equal parts, or "folds".
2. Train your model on 9 folds (e.g. the first 9 folds).
3. Evaluate it on the 1 remaining "hold-out" fold.
4. Perform steps (2) and (3) 10 times, each time holding out a different fold.
5. Average the performance across all 10 hold-out folds.
The average performance across the 10 hold-out folds is your final performance estimate, also
called your cross-validated score. Because you created 10 mini train/test splits, this score is
usually
pretty reliable.
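Here is a minimal sketch of those five steps using scikit-learn's cross_val_score (an assumed helper; the model and dataset are placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=10 splits the data into 10 folds, trains on 9 folds and evaluates on the
# held-out fold, repeating so that each fold is held out exactly once.
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())  # the cross-validated score (average over the 10 hold-out folds)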
Holdout validation
Within holdout validation we have 2 choices: Single holdout and repeated holdout.
a) Single Holdout
Implementation
The basic idea is to split our data into a training set and a holdout test set. Train the model on the
training set and then evaluate model performance on the test set. We take only a single holdout—
hence the name. Let’s walk through the steps:
Step 1: Split the labelled data into 2 subsets (train and test).
Step 2: Choose a learning algorithm. (For ex: Random Forest). Fix values of hyperparameters.
Train the model to learn the parameters.
Step 3: Predict on the test data using the trained model. Choose an appropriate metric for
performance estimation (ex: accuracy for a classification task). Assess predictive performance by
comparing predictions and ground truth.
Step 4: If the performance estimate computed in the previous step is satisfactory, combine the
train and test subset to train the model on the full data with the same hyper-parameters.
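The four steps can be sketched as follows (a hedged example assuming scikit-learn; the random forest, the 70/30 split and accuracy as the metric are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: split the labelled data into train and test subsets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 2: choose an algorithm, fix its hyperparameters, and learn the model parameters.
model = RandomForestClassifier(n_estimators=200, random_state=1)
model.fit(X_train, y_train)

# Step 3: predict on the test data and assess performance with the chosen metric.
print(accuracy_score(y_test, model.predict(X_test)))

# Step 4: if the estimate is satisfactory, refit on the full data with the same hyperparameters.
final_model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)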
Overfitting
Overfitting is one of the greatest concerns in predictive analytics and machine learning.
Overfitting refers to a situation where the model chosen to fit the training data fits too well, and
essentially captures all of the noise, outliers, and so on.
The consequence of this is that the model will fit the training data very well, but will not
accurately predict cases not represented by the training data, and therefore will not generalize
well to unseen data. This means that the model performance will be better with the training data
than with the test data.
A model is said to have high variance when it leans more towards overfitting, and conversely has
high bias when it doesn’t fit the data well enough. A high variance model will tend to be quite
flexible and overly complex, while a high bias model will tend to be very opinionated and overly
simplified. A good example of a high bias model is fitting a straight line to very nonlinear data.
In both cases, the model will not make very accurate predictions on new data. The ideal situation
is to find a model that is not overly biased, nor does it have a high variance. Finding this balance
is one of the key skills of a data scientist.
Overfitting can occur for many reasons. A common one is that the training data consists of many
features relative to the number of observations or data points. In this case, the data is relatively
wide as compared to long. To address this problem, reducing the number of features can help, or
finding more data if possible. The downside to reducing features is that you lose potentially
valuable information. Another option is to use a technique called regularization, which will be
discussed later in this series.
Variance Error
Variance is the amount that the estimate of the target function will change if different training
data was used. The target function is estimated from the training data by a machine learning
algorithm, so we should expect the algorithm to have some variance. Ideally, it should not
change too much from one training dataset to the next, meaning that the algorithm is good at
picking out the hidden underlying mapping between the inputs and the output variables.
Machine learning algorithms that have a high variance are strongly influenced by the specifics of
the training data. This means that the specifics of the training data influence the number and
types of parameters used to characterize the mapping function.
• Low Variance: Suggests small changes to the estimate of the target function with changes to
the training dataset.
• High Variance: Suggests large changes to the estimate of the target function with changes to
the training dataset.
Generally, nonlinear machine learning algorithms that have a lot of flexibility have a high
variance. For example, decision trees have a high variance, which is even higher if the trees are not
pruned before use.
Examples of low-variance machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
In supervised learning, underfitting happens when a model is unable to capture the underlying
pattern of the data. These models usually have high bias and low variance. Underfitting happens
when we have too little data to build an accurate model, or when we try to fit a linear model to
nonlinear data. Such models, for example linear and logistic regression, are too simple to capture
the complex patterns in the data.
• Overfitting: A model with high Variance will have a tendency to be overly complex. This
causes the overfitting of the model.
• Low Testing Accuracy: A model with high Variance will have very high training accuracy
(or very low training loss), but it will have low testing accuracy (or a high testing loss).
• Overcomplicating simpler problems: A model with high variance tends to be overly
complex and ends up fitting a much more complex curve to a relatively simpler data. The
model is thus capable of solving complex problems but incapable of solving simple
problems efficiently.
Bias-Variance Trade-Off
The goal of any supervised machine learning algorithm is to achieve low bias and low variance.
In turn, the algorithm should achieve good prediction performance.
You can see a general trend in the examples above:
• Linear machine learning algorithms often have a high bias but a low variance.
• Nonlinear machine learning algorithms often have a low bias but a high variance.
The parameterization of machine learning algorithms is often a battle to balance out bias and
variance.
The relationship between bias and variance in statistical learning is such that:
• Increasing bias will decrease variance.
• Increasing variance will decrease bias.
There is a trade-off at play between these two concerns: the models we choose, and the way we
choose to configure them, strike different balances in this trade-off for our problem.
In both the regression and classification settings, choosing the correct level of flexibility is
critical to the success of any statistical learning method. The bias-variance trade-off, and the
resulting U-shape in the test error, can make this a difficult task.
In the graph shown below, the green dotted line represents variance, the blue dotted line
represents bias, and the red solid line represents the error in the prediction of the concerned
model.
• Since bias is high for a simpler model and decreases with an increase in model complexity,
the line representing bias exponentially decreases as the model complexity increases.
• Similarly, Variance is high for a more complex model and is low for simpler models. Hence,
the line representing variance increases exponentially as the model complexity increases.
• Finally, it can be seen that on either side, the generalization error is quite high. Both high
bias and high variance lead to a higher error rate.
• The optimal complexity of the model is right in the middle, where the bias and variance
curves intersect. This part of the graph produces the least error and is preferred.
• Also, as discussed earlier, the model underfits in high-bias situations and overfits in
high-variance situations.
The graph shows the change in error rate with respect to model complexity for training and
validation error.
• The left portion of the graph suffers from High Bias. This can be seen as the training error is
quite high along with the validation error. In addition to that, model complexity is quite low.
• The right portion of the graph suffers from High Variance. This can be seen as the training
error is very low, yet the validation error is very high and starts increasing with increasing
model complexity.
Evaluation of machine learning algorithms
Performance Metrics for Classification problems in Machine Learning
Evaluating your machine learning algorithm is an essential part of any project. Your model may
give satisfying results when evaluated using one metric, say accuracy_score, but poor results
when evaluated against other metrics such as logarithmic_loss. Most of the time we use
classification accuracy to measure the performance of our model, but it alone is not enough to
truly judge our model.
Classification Accuracy
Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio
of the number of correct predictions to the total number of input samples.
It works well only if there are roughly equal numbers of samples belonging to each class.
For example, consider that there are 98% samples of class A and 2% samples of class B in our
training set. Then our model can easily get 98% training accuracy by simply predicting every
training sample belonging to class A.
When the same model is tested on a test set with 60% samples of class A and 40% samples of
class B, then the test accuracy would drop down to 60%.
The real problem arises when the cost of misclassifying the minority class samples is very high.
If we deal with a rare but fatal disease, the cost of failing to diagnose a sick person is much
higher than the cost of sending a healthy person for more tests.
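The pitfall can be reproduced with a tiny sketch (assuming scikit-learn; the 98/2 split mirrors the example above):

import numpy as np
from sklearn.metrics import accuracy_score

# 98 samples of class A (label 0) and 2 samples of class B (label 1).
y_true = np.array([0] * 98 + [1] * 2)

# A "model" that blindly predicts class A for every sample...
y_pred = np.zeros(100, dtype=int)

# ...still reaches 98% accuracy while never detecting class B.
print(accuracy_score(y_true, y_pred))  # 0.98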
Logarithmic Loss
Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well for
multiclass classification. When working with Log Loss, the classifier must assign a probability to
each class for every sample. Suppose there are N samples belonging to M classes; then the Log
Loss is calculated as below:
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
where,
y_ij, indicates whether sample i belongs to class j or not
p_ij, indicates the probability of sample i belonging to class j
Log Loss has no upper bound and exists on the range [0, ∞). A Log Loss nearer to 0 indicates
higher accuracy, whereas a Log Loss further from 0 indicates lower accuracy.
In general, minimising Log Loss gives greater accuracy for the classifier.
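As a rough sketch, the same quantity can be computed with scikit-learn's log_loss (an assumed helper); each row of the probability matrix must sum to 1:

from sklearn.metrics import log_loss

# True classes for N = 4 samples drawn from M = 3 classes.
y_true = [0, 1, 2, 1]

# p_ij: predicted probability of sample i belonging to class j.
y_prob = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.6, 0.1],
]

# Lower is better; a perfect classifier would approach 0.
print(log_loss(y_true, y_prob))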
Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the output can
be two or more classes. A confusion matrix is simply a table with two dimensions, “Actual” and
“Predicted”, whose cells count the “True Positives (TP)”, “True Negatives (TN)”, “False
Positives (FP)” and “False Negatives (FN)”, as described below –
1. True Positives (TP): True positives are the cases when the actual class of the data point was
1 (True) and the prediction is also 1 (True).
Ex: The case where a person is actually having cancer(1) and the model classifying his case as
cancer(1) comes under True positive.
2. True Negatives (TN): True negatives are the cases when the actual class of the data point was
0 (False) and the prediction is also 0 (False).
Ex: The case where a person NOT having cancer and the model classifying his case as Not
cancer comes under True Negatives.
3. False Positives (FP): False positives are the cases when the actual class of the data point was
0 (False) and the prediction is 1 (True). False is because the model has predicted incorrectly, and
positive because the class predicted was the positive one (1).
Ex: A person NOT having cancer and the model classifying his case as cancer comes under
False Positives.
4. False Negatives (FN): False negatives are the cases when the actual class of the data point
was 1 (True) and the prediction is 0 (False). False is because the model has predicted incorrectly,
and negative because the class predicted was the negative one (0).
Ex: A person having cancer and the model classifying his case as No-cancer comes under False
Negatives.
The ideal scenario we all want is a model that gives 0 False Positives and 0 False Negatives. But
that is rarely the case in real life, as no model is 100% accurate most of the time.
Ex: 60% of the images in our fruit data are apples and 40% are oranges. A model that correctly
predicts whether a new image is an apple or an orange 97% of the time is a very good result in
this example, because the classes are reasonably balanced.
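A minimal sketch of building a confusion matrix with scikit-learn (an assumed library), using 1 for cancer and 0 for no cancer as in the examples above:

from sklearn.metrics import confusion_matrix

# 1 = has cancer, 0 = does not have cancer (toy labels for illustration).
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 table into TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)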
Classification Report
This report consists of the Precision, Recall, F1 and Support scores. They are explained as
follows –
Precision
Precision, used in document retrieval, may be defined as the fraction of documents returned by
our ML model that are actually correct. We can easily calculate it from the confusion matrix with
the help of the following formula −
Precision = TP / (TP + FP)
When to use?
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.
For
example: If we are building a system to predict if we should decrease the credit limit on a
particular
account, we want to be very sure about our prediction or it may result in customer dissatisfaction.
Recall or Sensitivity
Recall may be defined as the fraction of actual positives that are correctly returned by our ML
model. We can easily calculate it from the confusion matrix with the help of the following formula.
Recall = TP / (TP + FN)
When to use?
Recall is a valid choice of evaluation metric when we want to capture as many positives as
possible.
For example: If we are building a system to predict if a person has cancer or not, we want to
capture
the disease even if we are not very sure.
Specificity
Specificity, in contrast to recall, may be defined as the fraction of actual negatives that are
correctly identified by our ML model. We can easily calculate it from the confusion matrix with
the help of the following formula −
Specificity = TN / (TN + FP)
Support
Support may be defined as the number of samples of the true response that lie in each class of
target values.
F1 Score
This score gives us the harmonic mean of precision and recall. Mathematically, the F1 score is a
weighted average of precision and recall in which both receive equal weight. The best value of
F1 is 1 and the worst is 0. We can calculate the F1 score with the help of the following formula −
F1 = 2 * (precision * recall) / (precision + recall)
The F1 score gives equal relative contribution to precision and recall.
When to use?
We want a model with both good precision and recall. Simply stated, the F1 score maintains a
balance between precision and recall for your classifier. If your precision is low, the F1 is low,
and if the recall is low, your F1 score is again low.
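All of these scores can be obtained at once with scikit-learn's classification_report (an assumed helper), shown here on the same toy labels as before:

from sklearn.metrics import classification_report

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Prints precision, recall, F1-score and support for each class.
print(classification_report(y_actual, y_predicted))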
Note: The following topics are as given by sir; the same topics are covered again below.
Inferential Statistics
Inferential statistics allows you to make inferences about the population from
the sample data.
The mean is the numerical average of all values, the median is the middle value of the sorted
data set, and the mode is the most frequent value in the data set. Spread (dispersion or
variability) is the extent to which a distribution is stretched or squeezed. Common measures of
statistical dispersion are the variance and the standard deviation.
Variance is the average of the squared differences from the mean while standard deviation is
the square root of the variance.
Standard Score or Z score: For an observed value x, the Z score finds the number of standard
deviations x is away from the mean.
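In symbols: Z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the distribution.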
Hypothesis Testing
Regression
Regression analysis is a set of statistical processes for estimating the
relationships among variables.
Simple Regression
Multiple Regression
Of the many methods that we use for statistical learning, some are less
flexible, or more restrictive. When inference is the goal, there are clear
advantages to using simple and relatively inflexible statistical learning
methods. When we are only interested in prediction, we can use the most flexible models
available.
Bias refers to the simplifying assumptions made by a model to make the target
function easier to learn. Parametric models have a high bias, making them
fast to learn and easier to understand but generally less flexible. Decision
Trees, k-Nearest Neighbors and Support Vector Machines are low-bias
machine learning algorithms. Linear Regression, Linear Discriminant
Analysis and Logistic Regression are high-bias machine learning
algorithms.
Variance is the amount that the estimate of the target function will change
if different training data was used. Non-parametric models that have a lot
of flexibility have a high variance. Linear Regression, Linear Discriminant
Analysis and Logistic Regression are low-variance machine learning
algorithms. Decision Trees, k-Nearest Neighbors and Support Vector
Machines are high-variance machine learning algorithms.
Bias-Variance Trade-Off
According to the internet
Introduction
The goals of learning are understanding and prediction. Learning falls into many
categories, including supervised learning, unsupervised learning, online
learning, and reinforcement learning. From the perspective of statistical learning
theory, supervised learning is best understood. Supervised learning involves
learning from a training set of data. Every point in the training set is an input-output pair, where
the input maps to an output. The learning problem consists
of inferring the function that maps between the input and the output, such that
the learned function can be used to predict the output from future input.
Classification problems are those for which the output will be an element from a
discrete set of labels. Classification is very common for machine learning
applications. In facial recognition, for instance, a picture of a person's face would
be the input, and the output label would be that person's name. The input would
be represented by a large multidimensional vector whose elements represent
pixels in the picture.
After learning a function based on the training set data, that function is validated
on a test set of data, data that did not appear in the training set.
Regularization
In machine learning problems, a major problem that arises is that of overfitting.
Because learning is a prediction problem, the goal is not to find a function that
most closely fits the (previously observed) data, but to find one that will most
accurately predict output from future input. Empirical risk minimization runs
this risk of overfitting: finding a function that matches the data exactly but does
not predict future output well.
A low bias and a low variance, although they most often vary in opposite
directions, are the two most fundamental features expected for a model. Indeed,
to be able to “solve” a problem, we want our model to have enough degrees of
freedom to resolve the underlying complexity of the data we are working with,
but we also do not want it to have too many degrees of freedom, in order to avoid high
variance and be more robust. This is the well-known bias-variance tradeoff.
In ensemble learning theory, we call weak learners (or base models) models that
can be used as building blocks for designing more complex models by combining
several of them. Most of the time, these basic models do not perform very well by
themselves, either because they have a high bias (low degree of freedom models,
for example) or because they have too much variance to be robust (high degree
of freedom models, for example). Then, the idea of ensemble methods is to try
reducing bias and/or variance of such weak learners by combining several of
them together in order to create a strong learner (or ensemble model) that
achieves better performances.
In order to set up an ensemble learning method, we first need to select our base
models to be aggregated. Most of the time (including in the well known bagging
and boosting methods) a single base learning algorithm is used so that we have
homogeneous weak learners that are trained in different ways. The ensemble
model we obtain is then said to be “homogeneous”. However, there also exist
some methods that use different types of base learning algorithms: heterogeneous
weak learners are then combined into a “heterogeneous ensemble model”.
One important point is that our choice of weak learners should be coherent with
the way we aggregate these models. If we choose base models with low bias
but high variance, it should be with an aggregating method that tends to reduce
variance whereas if we choose base models with low variance but high bias, it
should be with an aggregating method that tends to reduce bias.
Simple Ensemble techniques
MODE: The mode is a statistical term that refers to the most frequently occurring
number found in a set of numbers.
Taking the mode of the results (max voting)
In this technique, multiple models are used to make predictions for each data
point. The prediction from each model is considered as a separate vote. The
prediction we get from the majority of the models is used as the final prediction.
Taking the average of the results
In this technique, we take an average of the predictions from all the models and use
it to make the final prediction.
AVERAGE = sum(Rating * Number of people) / Total number of people
= [(1*5) + (2*13) + (3*45) + (4*7) + (5*2)] / 72 = 2.833, which rounded to the nearest integer is 3.
Taking weighted average of the results
This is an extension of the averaging method. All models are assigned different
weights defining the importance of each model for prediction. For instance, if
about 25 of your respondents are professional app developers, while the others have
no prior experience in this field, then the answers from these 25 people are given
more importance compared to the other people.
For example: For simplicity, I am trimming the example down to 5 people.
WEIGHTED AVERAGE = (0.3*3) + (0.3*2) + (0.3*2) + (0.15*4) + (0.15*3) = 3.15, which
rounded to the nearest integer gives us 3.
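A rough sketch of these three simple techniques in Python (the ratings and weights are the illustrative numbers used above; the model votes are made up):

import numpy as np
from statistics import mode

# Max voting: the most frequent prediction wins.
votes = [3, 2, 3, 3, 4]            # hypothetical predictions from five models
print(mode(votes))                 # 3

# Averaging: mean of all the predictions (the 72-person rating example above).
ratings = [1] * 5 + [2] * 13 + [3] * 45 + [4] * 7 + [5] * 2
print(round(np.mean(ratings)))     # 2.833... rounds to 3

# Weighted averaging: each prediction is weighted by the model's importance.
preds   = np.array([3, 2, 2, 4, 3])
weights = np.array([0.3, 0.3, 0.3, 0.15, 0.15])
print(np.sum(preds * weights))     # 3.15, which rounds to 3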
Ensemble methods can be divided into two groups:
• sequential ensemble methods where the base learners are generated
sequentially (e.g. AdaBoost).
The basic motivation of sequential methods is to exploit the dependence
between the base learners. The overall performance can be boosted by weighing
previously mislabeled examples with higher weight.
• parallel ensemble methods where the base learners are generated in parallel
(e.g. Random Forest).
The basic motivation of parallel methods is to exploit independence between
the base learners since the error can be reduced dramatically by averaging.
Most ensemble methods use a single base learning algorithm to produce
homogeneous base learners, i.e. learners of the same type, leading to
homogeneous ensembles.
There are also some methods that use heterogeneous learners, i.e. learners of
different types, leading to heterogeneous ensembles. In order for ensemble
methods to be more accurate than any of its individual members, the base
learners have to be as accurate as possible and as diverse as possible.
Learning algorithms that output only a single hypothesis tend to suffer from three basic issues:
the statistical problem, the computational problem and the representation problem, which can be
partly overcome by applying ensemble methods.
Bagging Disadvantages:
• Since the final prediction is based on the mean of the predictions from the subset trees, it
won't give precise values for the classification and regression models.
Boosting Advantages:
• Supports different loss functions (we have used 'binary:logistic' for this
example).
• Works well with interactions.
Disadvantages:
• Prone to over-fitting.
• Requires careful tuning of different hyper-parameters.
Summary of differences between Bagging and Boosting
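The difference can be illustrated in practice with a hedged sketch (assuming scikit-learn; both classes use decision trees as their default base learners): the bagging ensemble trains its trees independently on bootstrap samples, while the boosting ensemble trains them sequentially, reweighting misclassified examples.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Parallel ensemble (bagging): independent base learners on bootstrap samples,
# combined by majority vote.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Sequential ensemble (boosting): each new base learner focuses on the examples
# the previous ones misclassified.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())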
Stacking
Stacking is an ensemble learning technique that combines multiple classification
or regression models via a meta-classifier or a meta-regressor. The base level
models are trained based on a complete training set, then the meta-model is
trained on the outputs of the base level model as features. This technique
consists of basically two phases, in the first phase, a set of base-level classifiers
is generated and in the second phase, a meta-level classifier is learned which
combines the outputs of the base-level classifiers.
The base level often consists of different learning algorithms, and therefore
stacking ensembles are often heterogeneous. The sketch below summarizes
stacking.
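Here is a rough sketch of the two phases with scikit-learn's StackingClassifier (the choice of library, base learners and meta-learner is assumed, not prescribed by the text): heterogeneous base-level classifiers are fitted first, and a meta-level classifier then learns to combine their outputs.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Phase 1: heterogeneous base-level classifiers.
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Phase 2: a meta-level classifier (logistic regression) learns to combine
# the outputs of the base-level classifiers.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))

print(cross_val_score(stack, X, y, cv=5).mean())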
This model is used for making predictions on the test set. Below is a step-wise
explanation for a simple stacked ensemble:
3. With the help of the ensemble method, the selection process could be better
captured and the probability of membership in each treatment group estimated
with less bias.
Random Forest
Random Forest is one of the most popular and powerful ensemble methods used
today in Machine Learning.
Random Forests are trained via the bagging method. Bagging or Bootstrap
Aggregating, consists of randomly sampling subsets of the training data, fitting
a model to these smaller data sets, and aggregating the predictions. This method
allows several instances to be used repeatedly for the training stage given that
we are sampling with replacement. Tree bagging consists of sampling subsets of
the training set, fitting a Decision Tree to each, and aggregating their result.
The Random Forest method introduces more randomness and diversity by
applying the bagging method to the feature space. That is, instead of searching
greedily for the best predictors to create branches, it randomly samples elements
of the predictor space, thus adding more diversity and reducing the variance of
the trees at the cost of equal or higher bias. This process is also known as
“feature bagging”, and it is this powerful method that leads to a more robust
model.
Let’s see now how to make predictions with Random Forests. Remember that in
a Decision Tree a new instance goes from the root node to the bottom until it is
classified in a leaf node. In the Random Forests algorithm, each new data point
goes through the same process, but now it visits all the different trees in the
ensemble, which were grown using random samples of both training data
and features. Depending on the task at hand, the functions used for aggregation
will differ. For Classification problems, it uses the mode or most frequent class
predicted by the individual trees (also known as a majority vote), whereas for
Regression tasks, it uses the average prediction of each tree.
Although this is a powerful and accurate method used in Machine Learning, you
should always cross-validate your model as there may be overfitting. Also,
despite its robustness, the Random Forest algorithm is slow, as it has to grow
many trees during the training stage and, as we already know, this is a greedy
process.
Variations
As we already specified, a Random Forest uses sampled subsets of both the
training data and the feature space, which result in high diversity and
randomness as well as low variance. Now, we can go a step further and introduce
a bit more variety by not only looking for a random predictor but also considering
random thresholds for each of these variables. Therefore, instead of looking for
the optimal pair of feature and threshold for the splitting, it uses random
samples of both to create the different branches and nodes, thus further trading
variance for bias. This ensemble is also known as Extremely Randomized Trees
or Extra-Trees. This model trades more bias for a lower variance, but it is
faster to train, as it does not search for an optimum the way Random Forests do.
Additional Properties
One other important attribute of Random Forests is that they are very useful
when trying to determine feature or variable importance, because important
features tend to appear near the top of each tree while unimportant variables are
located near the bottom.
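As a closing sketch (assuming scikit-learn; the dataset and hyperparameters are illustrative), a random forest is fitted and its feature_importances_ attribute is used to rank the variables:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Each tree is grown on a bootstrap sample of the rows and considers a random
# subset of the features at every split ("feature bagging").
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Feature importance: variables that appear near the top of many trees get higher scores.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(name, round(score, 3))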