
Exploratory Data Analysis (EDA)
Agenda

Introduction to EDA
Describe Data (Descriptive Analytics)
Data Pre-processing
Data Visualization
Data Preparation
Introduction to EDA
EDA is an approach to analyzing data using both non-visual and visual techniques
Generating insights is a "creative" process; however, a structured approach is followed
It involves a thorough analysis of the data to understand the current business situation
The objective of EDA is to extract "gold" from the "data mine" based on domain understanding
Describe Data

Know the problem statement or the business objective
Load and view the given data
Check the relevance of the data against the objective or goal to be achieved:
  Time relevance of the data
  Scope of the data
  Quantum of data
  Features of the data
Understand each feature in the data with the help of the data dictionary
Know the central tendency and data distribution of each feature
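A minimal sketch of these first steps in Python with pandas; the file name employees.csv is a hypothetical placeholder:

import pandas as pd

# Load and view the given data (file name is a placeholder)
df = pd.read_csv("employees.csv")
print(df.head())        # first few rows
print(df.shape)         # quantum of data: rows x columns
print(df.dtypes)        # features and their data types

# Central tendency and distribution of each numeric feature
print(df.describe())    # count, mean, std, min, quartiles, max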
Data Pre-processing

Practical data sets generally contain a lot of "noise" and/or "undesired" data points that might impact the outcome, so pre-processing is an important step
Because these "noise" elements are well amalgamated with the complete data set, the cleaning process is largely governed by the data scientist's ability
These noise elements come in the form of:
Bad values
Anomalies (not valid or not adhering to business rules)
Missing values
Not useful data
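A quick first pass to surface missing values, duplicates, and rule-based anomalies, assuming the DataFrame df loaded above (the 'Age' range rule reuses the example from the next slide):

# Missing values per column
print(df.isnull().sum())

# Exact duplicate rows (often "not useful" data)
print(df.duplicated().sum())

# Simple business-rule anomaly check ('Age' between 0 and 60 is an example rule)
print(df[(df["Age"] < 0) | (df["Age"] > 60)])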
How to detect ‘Bad Values’?

Numeric Fields:
Check that the data type of every numeric feature/column is valid
e.g. a 'Salary Amount' field is expected to be numeric with data type float; if the data type appears as 'object' instead, there is bad data that has to be cleaned
Check that values fall in a valid range, e.g. an 'Age' field with a minimum value of 0 and a maximum of 60

Categorical Fields:
Check the categorical levels of each feature/column with 'object' data type
A level may contain special characters like "?", "-", "!" or invalid categories that do not represent the feature
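A sketch of both checks with pandas, assuming the df from earlier ('Salary Amount' is an assumed column; a numeric column that loads as object usually contains stray strings):

import pandas as pd

# Numeric field stored as object signals bad values
if df["Salary Amount"].dtype == "object":
    # Coerce to numeric: anything non-numeric becomes NaN and can be inspected
    bad_rows = df[pd.to_numeric(df["Salary Amount"], errors="coerce").isna()]
    print(bad_rows["Salary Amount"].unique())

# Categorical fields: list the levels and look for "?", "-", "!" or invalid categories
for col in df.select_dtypes(include="object").columns:
    print(col, df[col].unique())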
Univariate Analysis
Data summary and data visualization (univariate analysis) by data type:

Data Types            | Central Tendency   | Distribution                                     | Graphs
Numeric Variables     | Mean, Median, Mode | Standard Deviation, Range, IQR, 5 Number Summary | Histogram, Boxplot
Categorical Variables | Mode               | Frequency of the levels                          | Countplot
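A sketch of the graphs in this table with seaborn, continuing with the df from earlier; 'Salary Amount' and 'Department' are assumed column names:

import seaborn as sns
import matplotlib.pyplot as plt

# Numeric variable: histogram and boxplot
sns.histplot(df["Salary Amount"])
plt.show()
sns.boxplot(x=df["Salary Amount"])
plt.show()

# Categorical variable: countplot of the levels
sns.countplot(x="Department", data=df)
plt.show()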
Bivariate Analysis
Data Types                                           | Data Visualization (Bivariate Analysis)
Numeric & Numeric Variables OR All Numeric Variables | Scatterplot, Pairplot / Correlation Plot / Heatmap
Categorical & Categorical Variables                  | Countplot with hue (different colours)
Categorical & Numeric Variables                      | Boxplot (x as Categorical & y as Numeric)
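A sketch of each pairing with seaborn, again with assumed column names ('Age', 'Salary Amount', 'Department', 'Gender'):

import seaborn as sns
import matplotlib.pyplot as plt

# Numeric & numeric: scatterplot; all numeric: correlation heatmap
sns.scatterplot(x="Age", y="Salary Amount", data=df)
plt.show()
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True)
plt.show()

# Categorical & categorical: countplot with hue (different colours)
sns.countplot(x="Department", hue="Gender", data=df)
plt.show()

# Categorical & numeric: boxplot (x as categorical, y as numeric)
sns.boxplot(x="Department", y="Salary Amount", data=df)
plt.show()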


Data Preparation

Scaling
Transformation
Outlier Detection & Treatment
Data Encoding
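A sketch of scaling, transformation, and encoding with pandas and scikit-learn (column names are assumed; outlier detection and treatment is covered on the next slides):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Transformation: log to reduce right skew (applied to the raw values)
df["log_salary"] = np.log1p(df["Salary Amount"])

# Scaling: standardise numeric columns to mean 0, std 1
num_cols = ["Age", "Salary Amount"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Data encoding: one-hot encode a categorical column
df = pd.get_dummies(df, columns=["Department"])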
Outliers

Outliers are data points whose values are significantly different from the rest of the values in the feature
An outlier might be a valid data point or may have been caused by an error
Valid: if we consider the heights of students in class 7, most may be in the range 4.8 to 5.4 feet; however, there may be 1 or 2 students who are around 4 feet or around 6 feet
Error: during data entry, extra zeros may have been added to an amount field, making it different from the others
Most data provided for fraud detection will have very few records where fraud has occurred, and there is a high chance that these records get identified as outliers
Hence, it is important to analyze outliers before deciding on treatment
Outliers
Outlier treatment is not mandatory
Some machine learning algorithms are not very sensitive to outliers, so we can choose the relevant algorithms to work on the data
When essential, outlier treatment is done with the following considerations:
Treatment of outliers should not change the meaning of the data to a great extent, since the data reflects the current business situation
Business or domain knowledge should be taken into account when deciding on the treatment
Basic techniques to detect outliers:
Z-Score
Boxplot
Outlier Detection
Z-Score
First scale the variable by applying the Z-score
All records with a score greater than 3 or less than -3 are considered outliers
For a feature, if we assume a normal distribution, 99.7% of data points lie within ±3σ; anything beyond that is an outlier, and such points are very few
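A minimal sketch of the Z-score rule, computing the score directly from its definition ('Salary Amount' remains an assumed column):

# Z-score each value: (x - mean) / standard deviation
vals = df["Salary Amount"].dropna()
z = (vals - vals.mean()) / vals.std()

# Records with a score greater than 3 or less than -3 are outliers
print(vals[(z > 3) | (z < -3)])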
Boxplot
Any data point greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR is taken as an outlier
50% of data points lie within ±0.5 IQR of the median
In a normal distribution, 68% lie within ±1σ
So the IQR (50%) is slightly narrower than ±1σ (68%)
To roughly correspond to the ±3σ range, ±1.5 IQR (i.e. 3 * ±0.5 IQR) beyond the quartiles is taken as the range to identify outliers
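The same detection using the boxplot (IQR) fences, as a sketch:

# IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
col = df["Salary Amount"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(col < lower) | (col > upper)])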
