3/14/22, 4:09 PM Train_Test_Splitting
Why do we need train and test samples?
A very common issue when training a model is overfitting.
This phenomenon occurs when a model performs really well on the data that we used to train it
but it fails to generalise well to new, unseen data points.
There are numerous reasons why this can happen: it could be due to noise in the data, or the
model may have memorised specific training inputs rather than learning the underlying patterns
that would help it make correct predictions on new data.
Typically, the higher the complexity of a model, the higher the chance that it will overfit.
On the other hand, underfitting occurs when the model has poor performance even on the data
that was used to train it.
In most cases, underfitting occurs because the model is not suitable for the problem you are
trying to solve.
Usually, this means the model is less complex than required to capture the patterns that are
actually predictive.
Creating separate data samples for training and testing the model is the most common
approach for identifying these sorts of issues.
In this way, we can use the training set for training our model and then treat the testing set as a
collection of data points that will help us evaluate whether the model can generalise well to new,
unseen data.
The simplest way to split the modelling dataset into training and testing sets is to assign
two-thirds of the data points to the former and the remaining one-third to the latter.
Therefore, we train the model using the training set and then apply the model to the test set. In
this way, we can evaluate the performance of our model.
For instance, if the training accuracy is extremely high while the testing accuracy is poor,
this is a good indicator that the model is probably overfitting.
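As a minimal sketch of this diagnostic, the snippet below fits an unconstrained decision tree (chosen purely for illustration, since a fully grown tree tends to memorise its training data) to the iris dataset and compares training and testing accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=25
)

# An unconstrained tree is grown until it fits every training point
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers is the warning sign described above: the training accuracy reaches 1.0 here, so any noticeably lower test accuracy points towards overfitting.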
Note that splitting the dataset into training and testing sets is not the only measure that may
be required in order to avoid phenomena such as overfitting.
For instance, if both the training and testing sets contain patterns that do not exist in real world
data then the model would still have poor performance even though we wouldn’t be able to
observe it from the performance evaluation.
On a second note, you should be aware that there are certain situations in which you should
consider creating an extra set called the validation set.
The validation set is usually required when, apart from evaluating model performance, we also
need to choose among many candidate models and decide which one performs best.
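As a sketch of how such a three-way split could be produced, two consecutive calls to scikit-learn's train_test_split() carve out a test set first and then split the remainder into training and validation sets (the 60/20/20 proportions below are just a common choice, not a rule):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# First carve out 20% for the test set...
train_val, test = train_test_split(df, test_size=0.2, random_state=25)
# ...then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2)
train, val = train_test_split(train_val, test_size=0.25, random_state=25)

print(len(train), len(val), len(test))  # → 90 30 30
```

The validation set is then used for model selection and hyperparameter choices, while the test set is touched only once, for the final evaluation.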
How to split our dataset into train and test sets
In this section, we are going to explore two different ways to create training and
testing sets.
Before jumping into these approaches, let’s load a small example dataset that we will use for
demonstration purposes.
In the examples below, we will assume that we have a dataset stored in memory as a pandas
DataFrame.
The iris dataset contains 150 data points, each of which has four features.
In [1]:
import pandas as pd
from sklearn.datasets import load_iris
In [2]:
iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
print(df)
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns]
In [11]:
df.describe()
Out[11]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Using pandas
The first option is to use the pandas DataFrame method sample():
Return a random sample of items from an axis of object. You can use random_state for
reproducibility
We initially create the training set by sampling a fraction of 0.8 of the rows in the pandas
DataFrame.
Note that we also set random_state, which corresponds to the seed, so that the results are
reproducible.
Subsequently, we create the testing set by simply dropping from the original DataFrame the
indices that are now included in the training set.
In [3]:
training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)
print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")
No. of training examples: 120
No. of testing examples: 30
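One property worth verifying with this approach is that the two sets are disjoint and together cover the whole DataFrame; a quick sanity check might look like this:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)

# drop() removes exactly the sampled indices, so the two sets cannot overlap
overlap = training_data.index.intersection(testing_data.index)
print(f"overlapping rows: {len(overlap)}")            # → 0
print(len(training_data) + len(testing_data))          # → 150
```

Because drop() removes precisely the rows that sample() selected, no data point can leak from the training set into the test set.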
Using scikit-learn
The second option — and probably the most commonly used — is scikit-learn’s
train_test_split() function:
Split arrays or matrices into random train and test subsets
We can create both the training and testing sets in a one-liner by passing to train_test_split()
the modelling DataFrame along with the fraction of the examples that should be included in the
testing set.
As before, we also set a random_state so that the results are reproducible; that is, every time we
run the code, the same instances will be included in the training and testing sets respectively.
The function returns the training and testing DataFrames, which we unpack into two variables.
In [4]:
from sklearn.model_selection import train_test_split
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)
print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")
No. of training examples: 120
No. of testing examples: 30
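To illustrate the reproducibility point, here is a small sketch showing that two calls with the same random_state produce identical partitions:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Running the split twice with the same seed yields the same row assignment
a_train, a_test = train_test_split(df, test_size=0.2, random_state=25)
b_train, b_test = train_test_split(df, test_size=0.2, random_state=25)

print(a_train.index.equals(b_train.index))  # → True
print(a_test.index.equals(b_test.index))    # → True
```

Omitting random_state would make each run shuffle the rows differently, so recorded evaluation results would not be repeatable.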