Cross Validation in Machine Learning
In machine learning, simply fitting a model on training data doesn’t guarantee
its accuracy on real-world data. To ensure that your machine learning model
generalizes well and isn’t overfitting, it’s crucial to use effective evaluation
techniques. One such technique is cross-validation, which helps in
assessing the model’s performance on unseen data. This article explores the
process of cross-validation and its various methods.
Getting Started with Cross-Validation
Cross-validation is a statistical method used in machine learning to evaluate
how well a model performs on an independent data set. It involves dividing
the available data into multiple folds or subsets, using one of these folds as a
validation set and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set.
Finally, the results from each validation step are averaged to produce a more
robust estimate of the model’s performance.
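The averaging step can be written compactly. The symbols here are introduced only for illustration: k is the number of folds and s_i is the score (for example, accuracy) measured on fold i.

```latex
\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} s_i
```

As a made-up example, fold accuracies of 0.80, 0.85, 0.90, 0.75 and 0.80 would give a cross-validation estimate of 4.10 / 5 = 0.82.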
The main purpose of cross-validation is to detect overfitting, which occurs
when a model fits the training data too closely and performs poorly on
new, unseen data. By evaluating the model on multiple validation sets,
cross-validation provides a more realistic estimate of the model’s generalization
performance, i.e. its ability to perform well on new, unseen data. If you want
to make sure your machine learning model is not just memorizing the training
data but is capable of adapting to real-world data, cross-validation is a
commonly used technique.
Types of Cross-Validation
There are several types of cross-validation techniques, including holdout
validation, leave-one-out cross-validation (LOOCV), stratified
cross-validation and k-fold cross-validation. The choice of technique
depends on the size and nature of the data, as well as the specific
requirements of the modeling problem.
1. Holdout Validation
In holdout validation, we train the model on one part of the dataset, 50% in
this example, and use the remaining 50% for testing. Other ratios, such as
80/20, are also common. It’s a simple and quick way to evaluate a model. The
major drawback is that the model never sees the held-out portion during
training, so it may miss important information contained in that data, which
leads to higher bias.
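A holdout split can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name holdout_split, its test_fraction and seed parameters, and the toy data are all assumptions made for this example.

```python
import random

def holdout_split(data, test_fraction=0.5, seed=0):
    """Shuffle a copy of `data` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(data)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy dataset of 10 samples, split 50/50 as in the text.
samples = list(range(10))
train, test = holdout_split(samples, test_fraction=0.5, seed=42)
print("train:", train)
print("test: ", test)
```

Shuffling before splitting matters when the data is ordered (for example, sorted by class), since otherwise the test set may not resemble the training set.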
2. LOOCV (Leave One Out Cross Validation)
In this method, we train the model on the whole dataset except for a single
data point, test on that point, and repeat the process for each data point.
For a dataset of n samples, the model is trained on n−1 samples and tested on
the one omitted sample, n times in total. This approach has both advantages
and disadvantages.
An advantage of this method is that it makes use of all data points and
hence has low bias.
The major drawback is that it leads to higher variance in the performance
estimate, since each test is against a single data point. If that data point
is an outlier, the variation is even higher. Another drawback is the long
execution time, as the model is trained once per data point.
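The LOOCV loop described above can be sketched as follows. To keep the example self-contained, the "model" is a toy predictor that returns the mean of the training values. The predictor, the function names and the data are assumptions chosen for illustration only.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index) once for each of the n_samples points."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, i

def loocv_mean_squared_error(values):
    """LOOCV error of a toy model that predicts the mean of the training values."""
    errors = []
    for train_idx, test_idx in leave_one_out_splits(len(values)):
        prediction = sum(values[j] for j in train_idx) / len(train_idx)
        errors.append((values[test_idx] - prediction) ** 2)
    return sum(errors) / len(errors)

# Hypothetical data; the last value is a deliberate outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
splits = list(leave_one_out_splits(len(values)))
print(len(splits), "model fits for", len(values), "data points")
print("LOOCV MSE:", loocv_mean_squared_error(values))
```

The number of model fits equals the number of data points, which is where the execution-time cost comes from. The outlier dominates the error on the iteration in which it is the held-out point.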
3. Stratified Cross-Validation
Stratified cross-validation ensures that each fold of the cross-validation
process maintains the same class distribution as the entire dataset. This is
particularly important when dealing with imbalanced datasets
where certain classes may be underrepresented. In this method:
1. The dataset is divided into k folds while maintaining the proportion of
classes in each fold.
2. During each iteration, one fold is used for testing, and the remaining
folds are used for training.
3. The process is repeated k times, with each fold serving as the test set
exactly once.
Stratified Cross-Validation is essential when dealing with classification
problems where maintaining the balance of class distribution is crucial for the
model to generalize well to unseen data.
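One simple way to build stratified folds is to group the sample indices by class and deal each group out to the folds in turn. The sketch below takes that approach. The function name stratified_k_fold and the toy labels are assumptions for this example, and it does not shuffle, which a practical implementation usually would.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Return k lists of indices whose class proportions mirror `labels`."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin so every fold gets its share.
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced labels: class "a" is twice as common as class "b".
labels = ["a"] * 8 + ["b"] * 4
folds = stratified_k_fold(labels, k=4)
for i, test_idx in enumerate(folds):
    # Every fold other than the current one is used for training.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    counts = {c: sum(labels[j] == c for j in test_idx) for c in ("a", "b")}
    print(f"fold {i}: test={sorted(test_idx)} train size={len(train_idx)} counts={counts}")
```

With these labels, each test fold keeps the 2:1 ratio of the full dataset. When class sizes are not divisible by k, the proportions are only approximately preserved.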
4. K-Fold Cross Validation
In k-fold cross-validation, we split the dataset into k subsets (known as
folds). We then train the model on k−1 of the folds and hold out the
remaining fold for evaluating the trained model. We iterate k times, with a
different fold reserved for testing each time.
Note: A value of k = 10 is commonly suggested. A lower value of k moves the
method towards simple holdout validation, while a higher value of k moves it
towards LOOCV.
Example of K Fold Cross Validation
The table below shows an example of the training and evaluation subsets
generated in 5-fold cross-validation. Here we have a total of 25 instances,
indexed 0 to 24. In the first iteration we use the first 20 percent of the
data for evaluation and the remaining 80 percent for training ([0-4] testing
and [5-24] training), while in the second iteration we use the second subset
of 20 percent for evaluation and the remaining four subsets for training
([5-9] testing and [0-4] and [10-24] training), and so on.
Iteration   Training Set Observations   Testing Set Observations
1           [5-24]                      [0-4]
2           [0-4, 10-24]                [5-9]
3           [0-9, 15-24]                [10-14]
4           [0-14, 20-24]               [15-19]
5           [0-19]                      [20-24]
Each iteration uses different subsets for testing and training, ensuring that all
data points are used for both training and testing.
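The splits in this example can be generated programmatically. The following is a minimal sketch using contiguous, unshuffled folds. The function name k_fold_indices is an assumption for illustration, and real projects would typically shuffle the data first or rely on a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample when k does not divide n.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 25 instances and 5 folds, as in the example above.
for iteration, (train, test) in enumerate(k_fold_indices(25, 5), start=1):
    print(iteration, f"test=[{test[0]}-{test[-1]}]", f"train size={len(train)}")
```

Each index appears in exactly one test fold and in the training set of the other four iterations.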
Comparison between K-Fold Cross-Validation and Hold Out Method
K-fold cross-validation and the hold-out method are quite similar and are
sometimes confused, so here is a quick comparison.
Advantages of K-Fold Cross-Validation:
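1. It is much faster than leave-one-out cross-validation when K is smaller
than the number of data points, because K-fold cross-validation repeats the
train/test split only K times instead of once per data point.
2. Per-fold results can be examined individually, which shows how much
performance varies between splits.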
Advantages of Hold-Out Validation:
1. Faster and simpler for quick model checks.
2. Easy to implement for small datasets with minimal computational
resources.
Advantages and Disadvantages of Cross Validation
Advantages:
1. Detecting Overfitting: Cross-validation helps to detect overfitting
by providing a more robust estimate of the model’s performance on
unseen data.
2. Model Selection: Cross validation can be used to compare different
models and select the one that performs the best on average.
3. Hyperparameter Tuning: Cross-validation can be used to optimize the
hyperparameters of a model, such as the regularization parameter, by
selecting the values that result in the best average performance across the
validation folds.
4. Data Efficient: Cross validation allows the use of all the available data
for both training and validation, making it a more data-efficient method
compared to traditional validation techniques.
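The model selection and hyperparameter tuning uses listed above can be sketched with a toy example. Here the hyperparameter is the number of neighbors in a simple one-dimensional nearest-neighbor regressor, and the value with the lowest average validation error is selected. The model, the function names, the interleaved fold assignment and the synthetic data are all assumptions made for illustration.

```python
import random

def knn_predict(train_x, train_y, x, n_neighbors):
    """Predict y at x as the mean target of the nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_mse(xs, ys, n_neighbors, k=5):
    """Average mean squared error over k interleaved folds."""
    fold_scores = []
    for fold in range(k):
        test_idx = list(range(fold, len(xs), k))   # every k-th point is held out
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        errors = [(ys[i] - knn_predict(train_x, train_y, xs[i], n_neighbors)) ** 2
                  for i in test_idx]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k

# Synthetic data: a noisy straight line (seeded so the run is repeatable).
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + rng.gauss(0, 1.0) for x in xs]

candidates = [1, 3, 5, 9]                 # hyperparameter values to compare
scores = {n: cv_mse(xs, ys, n) for n in candidates}
best = min(scores, key=scores.get)        # lowest average validation error wins
for n in candidates:
    print(f"n_neighbors={n}: CV MSE={scores[n]:.3f}")
print("selected n_neighbors:", best)
```

Because every candidate is scored on the same folds, the comparison is fair. A final held-out test set is still advisable for reporting performance, since the selected value has been tuned against these validation folds.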
Disadvantages:
1. Computationally Expensive: Cross validation can be computationally
expensive, especially when the number of folds is large or when the
model is complex and requires a long time to train.
2. Time-Consuming: Cross validation can be time-consuming, especially
when there are many hyperparameters to tune or when multiple models
need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross
validation can impact the bias-variance tradeoff, i.e., too few folds may
result in high bias, while too many folds may result in high variance.