3/14/22, 4:09 PM Train_Test_Splitting
Why do we need train and test samples?
A very common issue when training a model is overfitting.
This phenomenon occurs when a model performs really well on the data that we used to train it
but it fails to generalise well to new, unseen data points.
There are numerous reasons why this can happen: it could be due to noise in the data, or the
model may have memorised specific training inputs rather than learning the underlying patterns
that would help it make correct predictions on new data.
Typically, the higher the complexity of a model, the higher the chance that it will overfit.
On the other hand, underfitting occurs when the model has poor performance even on the data
that was used to train it.
In most cases, underfitting occurs because the model is not suitable for the problem you are
trying to solve.
Usually, this means the model is less complex than required to capture the patterns that are
actually predictive.
Creating separate data samples for training and testing the model is the most common
approach for identifying these sorts of issues.
In this way, we can use the training set for training our model and then treat the testing set as a
collection of data points that will help us evaluate whether the model can generalise well to new,
unseen data.
The simplest way to split the modelling dataset into training and testing sets is to assign
two-thirds of the data points to the former and the remaining one-third to the latter.
Therefore, we train the model using the training set and then apply the model to the test set. In
this way, we can evaluate the performance of our model.
For instance, if the training accuracy is extremely high while the testing accuracy is poor,
this is a good indicator that the model is probably overfitting.
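As a minimal sketch of this diagnostic, the snippet below fits an unconstrained decision tree (chosen purely for illustration, since a fully grown tree tends to memorise its training data) to the iris dataset and compares training and testing accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=25
)

# An unconstrained tree is grown until it fits every training point
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers is the warning sign described above: the training accuracy reaches 1.0 here, so any noticeably lower test accuracy points towards overfitting.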
Note that splitting the dataset into training and testing sets is not the only measure that may
be required in order to avoid phenomena such as overfitting.
For instance, if both the training and testing sets contain patterns that do not exist in real world
data then the model would still have poor performance even though we wouldn’t be able to
observe it from the performance evaluation.
On a second note, you should be aware that there are certain situations in which you should
consider creating an extra set called the validation set.
The validation set is usually required when, apart from evaluating model performance, we also
need to choose among many candidate models and decide which one performs best.
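As a sketch of how such a three-way split could be produced, two consecutive calls to scikit-learn's train_test_split() carve out a test set first and then split the remainder into training and validation sets (the 60/20/20 proportions below are just a common choice, not a rule):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# First carve out 20% for the test set...
train_val, test = train_test_split(df, test_size=0.2, random_state=25)
# ...then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2)
train, val = train_test_split(train_val, test_size=0.25, random_state=25)

print(len(train), len(val), len(test))  # → 90 30 30
```

The validation set is then used for model selection and hyperparameter choices, while the test set is touched only once, for the final evaluation.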
How to split our dataset into train and test sets
In this section, we are going to explore two different ways to create training and
testing sets.
Before jumping into these approaches, let’s load a small example dataset that we will use for
demonstration purposes.
In the examples below, we will assume that we have a dataset stored in memory as a pandas
DataFrame.
The iris dataset contains 150 data points, each of which has four features.
In [1]:
import pandas as pd
from sklearn.datasets import load_iris
In [2]:
iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
print(df)
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns]
In [11]:
df.describe()
Out[11]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Using pandas
The first option is to use the pandas DataFrame method sample():
Return a random sample of items from an axis of object. You can use random_state for
reproducibility
We initially create the training set by sampling a fraction of 0.8 of the rows in the pandas
DataFrame.
Note that we also set random_state, which corresponds to the seed, so that the results are
reproducible.
Subsequently, we create the testing set by simply dropping from the original DataFrame the
indices that are now included in the training set.
In [3]:
training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)
print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")
No. of training examples: 120
No. of testing examples: 30
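One property worth verifying with this approach is that the two sets are disjoint and together cover the whole DataFrame; a quick sanity check might look like this:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)

# drop() removes exactly the sampled indices, so the two sets cannot overlap
overlap = training_data.index.intersection(testing_data.index)
print(f"overlapping rows: {len(overlap)}")            # → 0
print(len(training_data) + len(testing_data))          # → 150
```

Because drop() removes precisely the rows that sample() selected, no data point can leak from the training set into the test set.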
Using scikit-learn
The second option — and probably the most commonly used — is scikit-learn’s
train_test_split() function:
Split arrays or matrices into random train and test subsets
We can create both the training and testing sets in a one-liner by passing to train_test_split()
the modelling DataFrame along with the fraction of the examples that should be included in the
testing set.
As before, we also set a random_state so that the results are reproducible; that is, every time we
run the code, the same instances will be included in the training and testing sets respectively.
The function returns the training and testing DataFrames, which we unpack into two variables.
In [4]:
from sklearn.model_selection import train_test_split
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)
print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")
No. of training examples: 120
No. of testing examples: 30
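To illustrate the reproducibility point, here is a small sketch showing that two calls with the same random_state produce identical partitions:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Running the split twice with the same seed yields the same row assignment
a_train, a_test = train_test_split(df, test_size=0.2, random_state=25)
b_train, b_test = train_test_split(df, test_size=0.2, random_state=25)

print(a_train.index.equals(b_train.index))  # → True
print(a_test.index.equals(b_test.index))    # → True
```

Omitting random_state would make each run shuffle the rows differently, so recorded evaluation results would not be repeatable.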