“Torture the data, and it will confess to
anything”
But raw, unprocessed data isn't of much use…
EXPLORATORY DATA ANALYSIS - EDA
Exploratory Data Analysis (EDA) is the process of visualizing and
analysing data to extract insights from it.
In other words, EDA is the process of summarizing the important
characteristics of the data in order to gain a better understanding of
the dataset.
After the data has been collected, it undergoes some processing and
cleaning, and EDA is then performed. Note that after EDA we may go
back to processing and cleaning the data, i.e., this can be an
iterative process.
It mainly has the following goals:
To gain an understanding of the data and find clues in it
o Analyse the data
o Extract insights from it
To prepare a proper input dataset, compatible with the requirements of the machine learning algorithm
To improve the performance of machine learning models
Data scientists are often said to spend 80% of their time on data preparation.
The best way to gain expertise in feature engineering is to practise
different techniques on various datasets and observe their effect on
model performance.
Basic Python scripts rely on the Pandas and NumPy libraries for basic
operations such as building data tables and performing arithmetic and
logical operations.
import pandas as pd
import numpy as np
Convert to a data frame:
df = pd.DataFrame(data)  # 'data' can be a dict, a list of records, a NumPy array, etc.
Missing values:
Missing values are one of the most common problems you will encounter
when preparing your data for machine learning. The reasons for missing
values might be human error, interruptions in the data flow, privacy
concerns, and so on. Missing values affect the performance of machine
learning models, and most algorithms do not accept datasets with
missing values and raise an error.
df.isnull()        # boolean mask of missing values
df.isnull().sum()  # count of missing values per column
We can handle missing values in many ways:
Delete: You can delete the rows that contain missing values, or drop an
entire column that has missing values.
df = df.dropna()  # drops rows with any missing value; see the axis and inplace parameters
Impute: Deleting data might cause a huge amount of information loss, so
replacing missing values can be a better option than deleting. However,
it is important to choose carefully what you impute in place of the
missing values.
Numerical Imputation:
Consider a sensible default value for the missing values in the column.
df = df.fillna(0)  # fill all missing values with 0
One standard replacement technique is to replace missing values with
the mean of the column.
df = df.fillna(df.mean())
However, a better option is usually the median of the column, because
the mean is sensitive to outliers, especially when the distribution is
skewed.
df = df.fillna(df.median())
Categorical Imputation:
Replacing the missing values with the most frequent value (the mode) of
a column is a good option for handling categorical columns.
df['column_name'].fillna(df['column_name'].value_counts().idxmax(), inplace=True)
Predictive filling:
Alternatively, you can choose to fill missing values through
predictive filling.
df = df.interpolate(method='linear')  # fill missing values with linearly interpolated data
Predictive model:
Another option is to build a predictive model to fill in the missing data (see the sketch below):
Split the dataset into two parts: one without missing values and one with missing values.
Train a model on the part without missing values, using the column to be imputed as the target.
Run the model on the part with missing values to predict the missing entries.
The missing values are then filled in. Keep in mind that the estimated
values tend to be more well behaved than real observations, since they
come from a model.
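A minimal sketch of this idea, assuming a numeric column col_to_fill and complete predictor columns feat1 and feat2 (hypothetical names), using scikit-learn's LinearRegression:
from sklearn.linear_model import LinearRegression

# Split rows by whether the column to impute is present or missing
known = df[df['col_to_fill'].notnull()]
unknown = df[df['col_to_fill'].isnull()]

# Train on the rows where the value is known
model = LinearRegression()
model.fit(known[['feat1', 'feat2']], known['col_to_fill'])

# Predict the missing entries and write them back
df.loc[df['col_to_fill'].isnull(), 'col_to_fill'] = model.predict(unknown[['feat1', 'feat2']])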
Handling Outliers:
The best way to detect outliers is to visualize the data.
A box-and-whisker plot is the easiest way to spot outliers.
import seaborn as sns
sns.boxplot(x='Variable', y='Target variable', data=df)
An Outlier Dilemma: Drop or Cap:
Dropping the rows that contain outliers is one option for handling them.
Dropping the outlier rows with percentiles:
upper_lim = df['column'].quantile(.95)
lower_lim = df['column'].quantile(.05)
df = df[(df['column'] < upper_lim) & (df['column'] > lower_lim)]
Another option for handling outliers is to cap them instead of
dropping.
Capping the outlier rows with Percentiles
upper_lim = df['column'].quantile(.95)
lower_lim = df['column'].quantile(.05)
df.loc[df['column'] > upper_lim, 'column'] = upper_lim
df.loc[df['column'] < lower_lim, 'column'] = lower_lim
Binning:
Binning can be applied on both categorical and numerical data
Numerical Binning Example
Value      Bin
0-30   ->  Low
31-70  ->  Mid
71-100 ->  High
Categorical Binning Example
Value   Bin
Spain  ->  Europe
Italy  ->  Europe
Chile  ->  South America
Brazil ->  South America
Numerical Binning Example
df['bin'] = pd.cut(df['value'], bins=[0,30,70,100],
labels=["Low", "Mid", "High"])
value bin
0 2 Low
1 45 Mid
2 7 Low
3 85 High
4 28 Low
Categorical Binning Example
conditions = [
df['Country'].str.contains('Spain'),
df['Country'].str.contains('Italy'),
df['Country'].str.contains('Chile'),
df['Country'].str.contains('Brazil')]
choices = ['Europe', 'Europe', 'South America', 'South America']
df['Continent'] = np.select(conditions, choices,
default='Other')
Country Continent
0 Spain Europe
1 Chile South America
2 Australia Other
3 Italy Europe
4 Brazil South America
Log Transform
Logarithm transformation (or log transform) is one of the most
commonly used mathematical transformations in feature
engineering. What are the benefits of log transform?
It helps to handle skewed data; after the transformation, the
distribution becomes closer to normal.
In most cases, the order of magnitude of the data varies within its
range, and the log transform normalizes these magnitude differences.
It also decreases the effect of outliers, thanks to the normalization
of magnitude differences, and the model becomes more robust.
Log Transform Example
df['log+1'] = (df['value'] + 1).transform(np.log)
Handling negative values (note that the resulting values differ from log(x+1)):
df['log'] = (df['value'] - df['value'].min() + 1).transform(np.log)
value log(x+1) log(x-min(x)+1)
0 2 1.09861 3.25810
1 45 3.82864 4.23411
2 -23 nan 0.00000
3 85 4.45435 4.69135
4 28 3.36730 3.95124
5 2 1.09861 3.25810
6 35 3.58352 4.07754
7 -12 nan 2.48491
One-hot encoding (or) Dummy coding:
One-hot encoding is one of the most common encoding methods in machine
learning. This method spreads the values in a column across multiple
flag columns and assigns 0 or 1 to them. These binary values express
the relationship between the grouped and the encoded column. The method
converts categorical data, which algorithms struggle to understand,
into a numerical format and lets you group your categorical data
without losing any information. If a column has N distinct values, it
is enough to map them to N-1 binary columns, because the missing value
can be deduced from the other columns.
encoded_columns = pd.get_dummies(df['column'])
df = df.join(encoded_columns).drop('column', axis=1)
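The snippet above creates one flag column per distinct value. To get the N-1 encoding described above, pandas' get_dummies accepts a drop_first argument; a brief sketch:
encoded_columns = pd.get_dummies(df['column'], drop_first=True)  # drop one level: N values -> N-1 columns
df = df.join(encoded_columns).drop('column', axis=1)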
New variable creation:
Sometimes it is possible to create a new variable by combining
information from two or more columns.
This can be used to surface a hidden relationship between the variables
and the target.
E.g., in a ticket-reservation dataset:
the number-of-passengers and relationship columns can be combined to
tell whether a passenger is travelling with family, with friends, or
alone (see the sketch below).
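A minimal sketch of this idea on Titanic-style data, assuming hypothetical columns SibSp (siblings/spouse aboard) and Parch (parents/children aboard):
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1     # +1 counts the passenger themselves
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)  # 1 = travelling alone, 0 = with others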
Feature Split:
Splitting features is a good way to make them useful for machine
learning. By extracting the usable parts of a column into new features:
We enable machine learning algorithms to comprehend them.
We make it possible to bin and group them.
We improve model performance by uncovering potentially useful
information.
String extraction example
data.title.head()
0 Toy Story (1995)
1 Jumanji (1995)
2 Grumpier Old Men (1995)
3 Waiting to Exhale (1995)
4 Father of the Bride Part II (1995)
data.title.str.split("(", n=1, expand=True)[1].str.split(")", n=1,
expand=True)[0]
0 1995
1 1995
2 1995
3 1995
4 1995
Scaling:
In most cases, the numerical features of a dataset do not share a
common range; they differ from each other. In real life, it is nonsense
to expect the age and income columns to have the same range. But from a
machine learning point of view, how can these two columns be compared?
Scaling solves this problem. After scaling, the continuous features
become comparable in terms of range. Scaling is not mandatory for many
algorithms, but it can still be worthwhile. However, algorithms based
on distance calculations, such as k-NN or k-Means, need scaled
continuous features as model input.
Basically, there are two common ways of scaling:
Normalization
Normalization (or min-max normalization) scales all values into a fixed
range between 0 and 1. This transformation does not change the shape of
the feature's distribution, but because the standard deviation shrinks,
the effect of outliers increases. Therefore, it is recommended to
handle outliers before normalization.
data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})
data['normalized'] = (data['value'] - data['value'].min()) / (data['value'].max() - data['value'].min())
value normalized
0 2 0.23
1 45 0.63
2 -23 0.00
3 85 1.00
4 28 0.47
5 2 0.23
6 35 0.54
7 -12 0.10
Standardization
Standardization (or z-score normalization) scales the values while
taking the standard deviation into account. If the standard deviations
of two features differ, their scaled ranges also differ. Standardization
reduces the effect of outliers in the features.
The formula for standardization, where the mean is shown as μ and the
standard deviation as σ, is:
z = (x - μ) / σ
data = pd.DataFrame({'value':[2,45, -23, 85, 28, 2, 35, -12]})
data['standardized'] = (data['value'] - data['value'].mean()) /
data['value'].std()
value standardized
0 2 -0.52
1 45 0.70
2 -23 -1.23
3 85 1.84
4 28 0.22
5 2 -0.52
6 35 0.42
7 -12 -0.92
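As an alternative to the manual formulas above, scikit-learn's preprocessing module provides ready-made scalers; a brief sketch, assuming scikit-learn is available:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})

# Min-max normalization to [0, 1]
data['normalized'] = MinMaxScaler().fit_transform(data[['value']]).ravel()

# Z-score standardization (note: StandardScaler uses the population standard
# deviation, so the values differ slightly from the pandas .std() version above)
data['standardized'] = StandardScaler().fit_transform(data[['value']]).ravel()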
“garbage in, garbage out!”
Univariate Analysis:
Check it Out:
The first step in examining your data:
df.head()      # first few rows
df.info()      # column types and non-null counts
df.describe()  # summary statistics of the numerical columns
Explore the target variable:
Find the distribution of the target variable; if the target variable is
categorical, find the success rate (class proportions), using
matplotlib and seaborn plots.
For a continuous variable:
import matplotlib.pyplot as plt
df['Target column'].hist(density=True)    # histogram of the target column
df['Target column'].plot(kind='density')  # overlay a density (KDE) curve
plt.show()
For a categorical variable:
import seaborn as sns
sns.countplot(x='Survived', data=df)
Visualization:
Histogram:
In the univariate analysis, we use histograms for analysing and
visualizing frequency distribution.
sns.distplot(df['column'], kde=False)
distplot combines a histogram with a distribution (KDE) curve; kde=False plots only the histogram.
Box-plot
The second visualization tool used in univariate analysis is the
box-plot; this type of graph is used for detecting outliers in the data.
It shows the distribution of continuous data and facilitates comparison
between variables or across the levels of a categorical variable.
sns.boxplot(x=df['categorical column'], y=df['continuous column'], data=df)
Count Plot:
A histogram-style count of a categorical variable.
sns.countplot(x='categorical column', data=df)
Bar Plot:
Represents the central tendency of a numerical variable with the height
of a solid rectangle and an error bar on top of it to represent the
uncertainty.
sns.barplot(x='categorical variable', y='numerical variable', data=df)
Bivariate Analysis:
Pair plot:
Plots pairwise relationships in a dataset. This function creates a grid
of graphs with combinations of all variables, mostly scatter plots.
sns.pairplot(df, hue='target variable')
Reg plot:
Plots the data and the best-fit line of a linear regression model in
the same graph.
sns.regplot(x='Var1', y='Var2', data=df)
Joint plot:
Plots two variables with bivariate and univariate views in the same
graph.
sns.jointplot(x='Var1', y='Var2', data=df, kind='reg')
Point plot:
Shows point estimates and confidence intervals of a continuous variable
across the levels of categorical variables.
sns.pointplot(x='cat var', y='cont var', hue='cat var2', data=df)
Factor plot:
Used for comparison across multiple groups (renamed catplot in newer
seaborn versions).
sns.factorplot(x='var1', y='var2', hue='cat var', data=df)
Strip plot:
Scatter plot of one continuous and one categorical variable.
sns.stripplot(x='cat variable', y='continuous variable', data=df)
Swarm plot:
Like a strip plot, but the points are adjusted so they do not overlap;
one continuous and one categorical variable.
sns.swarmplot(x='cat variable', y='continuous variable', data=df)
Covariance:
Cov(x, y) = Σ(x - x̄)(y - ȳ) / (n - 1)
Correlation:
r = Cov(x, y) / (√Variance(x) · √Variance(y))
R²:
R² = (correlation)² = [Variance(mean) - Variance(line)] / Variance(mean)
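A quick illustration of these quantities with pandas and NumPy (a sketch on made-up data; np.polyfit is used here just to obtain a fitted line for the variance-ratio form of R²):
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

cov = x.cov(y)      # sample covariance (divides by n - 1)
corr = x.corr(y)    # Pearson correlation
r_squared = corr ** 2

# Equivalent variance-ratio form: compare the spread of y around its mean
# with the spread of y around the fitted regression line
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
ss_mean = ((y - y.mean()) ** 2).sum()
ss_line = (residuals ** 2).sum()
r_squared_alt = (ss_mean - ss_line) / ss_mean  # matches corr ** 2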