Exploratory Data Analysis Guide

Uploaded by

arbimbb


Standard Structure of Exploratory Data Analysis

Libraries used: Pandas, NumPy, Seaborn

1. Identification of variables and data types:

a. df.dtypes
b. df.shape
c. df.info()

2. Analysing the basic metrics:

a. df.describe(include='all')
b. Including only numeric columns in a DataFrame description:
df.describe(include=[np.number])
c. Including only string columns in a DataFrame description:
df.describe(include=['object'])
d. Including only categorical columns in a DataFrame description:
df.describe(include=['category'])
e. Analysing null values (percentage missing per column): df.isnull().sum()/len(df)*100
f. Conversions:
i. To datetime: df['date'] = pd.to_datetime(df['date'])
ii. To another datatype: df['num'] = df['num'].astype('int64')
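The inspection and conversion calls from steps 1 and 2 can be sketched end to end as follows. The DataFrame and its column names (date, num, city, price) are hypothetical and exist only to make the snippet runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "num": ["1", "2", "3", "4"],
    "city": ["Pune", "Delhi", None, "Pune"],
    "price": [10.5, np.nan, 12.0, 11.0],
})

# Step 1: variables and data types.
print(df.dtypes)   # note the plural attribute name
print(df.shape)    # (rows, columns)
df.info()          # dtypes plus non-null counts in one report

# Step 2: basic metrics, restricted to numeric columns.
numeric_summary = df.describe(include=[np.number])

# Percentage of missing values per column.
null_pct = df.isnull().sum() / len(df) * 100

# Conversions: strings to datetime and to integers.
df["date"] = pd.to_datetime(df["date"])
df["num"] = df["num"].astype("int64")
print(null_pct)
```

Running describe() before the conversions shows which columns pandas initially treats as numeric, which is often the first hint that a column was read in as text.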

3. Brainstorming:

Domain understanding is a must-have when solving a business use case. It helps in identifying
the important variables among those available, which may include dropping
redundant information or noise from the dataset.
A few things to keep in mind:

a. While doing EDA, it helps to understand how the data is
going to be used.

b. An outcome of EDA could be the insight that the data is insufficient for the problem
statement.

c. EDA can also help in brainstorming other use cases which the business might not
have thought of.

4. Non-Graphical Analysis:

a. df.column.value_counts()
b. df.column.value_counts(normalize=True)*100
c. The nunique() function in pandas returns the number of distinct values
in a column: df['column'].nunique()
d. df.drop_duplicates()
e. Similarly, the unique() function in pandas returns an array of the distinct values in
a column: df['column'].unique()
f. For categorical variables, check df.groupby(var_x).agg({var_y: ['count', 'mean']})
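The non-graphical checks above can be tried on a small example. This is a sketch on hypothetical data; the column names segment and revenue are assumptions, not part of the original guide.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({
    "segment": ["A", "B", "A", "C", "A", "B"],
    "revenue": [100, 200, 150, 50, 100, 250],
})

counts = df["segment"].value_counts()                       # absolute frequencies
shares = df["segment"].value_counts(normalize=True) * 100   # frequencies in percent
n_distinct = df["segment"].nunique()                        # a single integer
distinct_values = df["segment"].unique()                    # values in order of appearance
deduplicated = df.drop_duplicates()                         # drops fully identical rows

# Count and mean of the numeric column within each category.
by_segment = df.groupby("segment").agg({"revenue": ["count", "mean"]})
print(by_segment)
```

Note that drop_duplicates() returns a new DataFrame and leaves df unchanged unless the result is assigned back.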

5. Visual Analysis:

a. Histogram/KDE plot

i. Distribution of the data: whether the data is normally distributed or
skewed (to the left or right)
ii. To identify outliers: extremely low or high values that do not fall near any
other data points
Visualization examples:
1. df['col'].hist(bins=25)
2. seaborn.kdeplot(data=df, x="col")
3. seaborn.histplot(data=df, x="col", kde=True)

b. Box plot: a visual representation of the statistical summary of a dataset. The
summary includes:
i. Minimum
ii. First Quartile
iii. Median (Second Quartile)
iv. Third Quartile
v. Maximum

It is also used to identify outliers in the dataset.

seaborn.boxplot(x='Col1', y='Col2', data=df)
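The five values that a box plot summarizes can also be computed directly, which is handy for checking what a plot shows. A minimal sketch on made-up numbers, using pandas' default (linear interpolation) quantiles; plotting libraries may place whiskers differently when outliers are present.

```python
import pandas as pd

# Made-up sample with one unusually large value.
values = pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 30])

five_number = {
    "min": values.min(),
    "q1": values.quantile(0.25),
    "median": values.median(),
    "q3": values.quantile(0.75),
    "max": values.max(),
}

# A positive skewness indicates a longer right tail.
skewness = values.skew()

print(five_number)
print(skewness)
```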

6. Bivariate Analysis:

a. Correlation: seaborn.heatmap(df.corr(numeric_only=True), annot=True)
b. Scatterplot: seaborn.scatterplot(x='col1', y='col2', data=df)
c. Pairplot: seaborn.pairplot(df.select_dtypes(include='float64'))
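The matrix that the heatmap displays can be inspected on its own before plotting. The sketch below assumes a hypothetical DataFrame with two numeric columns and one text column, and selects the numeric columns explicitly so that the text column does not interfere with the correlation.

```python
import pandas as pd

# Hypothetical data: col2 is roughly twice col1; label is non-numeric.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col2": [2.0, 4.1, 5.9, 8.2, 9.8],
    "label": ["a", "b", "a", "b", "a"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

print(corr)
```

The resulting corr DataFrame is what would be passed to seaborn.heatmap.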

7. Correlation and Covariance:

a. For sample: Covariance_S= numpy.cov(X,Y)[0][1]
b. For population: Covariance_P= numpy.cov(X,Y, bias=True)[0][1]
c. Correlation Coefficient: numpy.corrcoef(X,Y)[0][1]
d. Visualization: seaborn.heatmap(df.corr(numeric_only=True), annot=True)
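The difference between the sample and population versions is only the divisor (n - 1 versus n). A small sketch on made-up arrays x and y shows this, and also how the correlation coefficient relates to the covariance.

```python
import numpy as np

# Made-up data with an exact linear relationship (y = 2x).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

cov_sample = np.cov(x, y)[0, 1]                  # divides by n - 1
cov_population = np.cov(x, y, bias=True)[0, 1]   # divides by n
corr = np.corrcoef(x, y)[0, 1]

# Correlation is covariance rescaled by both standard deviations.
corr_manual = cov_population / (x.std() * y.std())

print(cov_sample, cov_population, corr, corr_manual)
```

np.cov returns a 2x2 matrix here; the off-diagonal entry [0, 1] is the covariance between x and y, while the diagonal holds the two variances.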

8. Variable transformations:
Feature Transformation and Scaling Techniques to Boost Your Model Performance
a. MinMaxScaler: scales each feature to a fixed range, 0 to 1 by default.
b. StandardScaler: scales the values such that the mean is 0 and the standard deviation
(and therefore the variance) is 1.
c. RobustScaler: as the name suggests, it is far less sensitive to outliers than the scalers above.
This scaler:
i. removes the median from the data
ii. scales the data by the interquartile range (IQR)
d. Log transform: primarily used to convert a right-skewed distribution to a normal or
less-skewed distribution; it requires strictly positive values: np.log(df['col'])
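To make the four transformations concrete, the sketch below applies their formulas by hand to a made-up column with one large value. It is written with plain pandas and NumPy rather than the scikit-learn scaler classes, so treat it as an illustration of the arithmetic under default settings, not as the library's implementation.

```python
import numpy as np
import pandas as pd

# Made-up column with one extreme value.
col = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: maps the minimum to 0 and the maximum to 1.
min_max = (col - col.min()) / (col.max() - col.min())

# Standardization: mean 0, standard deviation 1 (population std, ddof=0).
standard = (col - col.mean()) / col.std(ddof=0)

# Robust scaling: subtract the median, divide by the interquartile range.
iqr = col.quantile(0.75) - col.quantile(0.25)
robust = (col - col.median()) / iqr

# Log transform: compresses large values, reducing right skew.
logged = np.log(col)

print(min_max.tolist())
print(robust.tolist())
print(col.skew(), logged.skew())
```

In the min-max result, the four ordinary values are squeezed close to 0 because the extreme value defines the range, whereas the robust result keeps them evenly spread around 0.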

Note: You can also check the kurtosis, or the excess kurtosis (kurtosis minus 3), which measures how
heavy or light the tails of a given probability distribution are compared to the normal distribution.
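As a rough illustration of the note above, pandas' Series.kurt() reports excess kurtosis (a normal distribution scores about 0). The two series below are made up: one is evenly spread with no tails, the other is concentrated at zero with two extreme values.

```python
import numpy as np
import pandas as pd

# Evenly spread values: lighter tails than a normal distribution.
flat = pd.Series(np.arange(100.0))

# Mostly zeros with two extreme values: much heavier tails.
peaked = pd.Series([0.0] * 98 + [-10.0, 10.0])

excess_flat = flat.kurt()      # negative: thin tails
excess_peaked = peaked.kurt()  # positive: heavy tails

print(excess_flat, excess_peaked)
```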

9. Missing value treatment:

a. Deletion of rows with missing values (when fewer than about 3% of rows are affected)

b. Mean/mode/median imputation (when roughly 3-30% of rows are affected)
c. KNN imputation
d. Drop the column (when more than about 50% of rows are missing and business knowledge confirms the column can go)
e. For time series data: forward fill and backward fill
i. df.ffill()
ii. df.bfill()
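The options above can be compared side by side on a toy example. The DataFrame and its columns (age, city, reading) are hypothetical, and KNN imputation is left out because it needs an additional library such as scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in every column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, 28.0],
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
    "reading": [1.0, np.nan, np.nan, 4.0, 5.0],
})

# Share of missing values per column, used to pick a treatment.
missing_pct = df.isnull().mean() * 100

# Deletion: drop rows where a given column is missing.
dropped = df.dropna(subset=["age"])

# Imputation: median for a numeric column, mode for a categorical one.
age_filled = df["age"].fillna(df["age"].median())
city_filled = df["city"].fillna(df["city"].mode()[0])

# Ordered data: carry the last value forward or the next value backward.
forward = df["reading"].ffill()
backward = df["reading"].bfill()

print(missing_pct)
print(forward.tolist(), backward.tolist())
```

Forward fill only uses past values, which usually makes it the safer choice for time series where future values must not leak into earlier rows.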

10. Outlier treatment:

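a. Z-score: flag values whose absolute z-score exceeds a chosen threshold (3 is a common choice)
outliers = []
for i in data:
    z_score = (i - mean) / std
    if np.abs(z_score) > threshold:
        outliers.append(i)
b. Interquartile range (IQR): with IQR = Q3 - Q1, compute lower bound = Q1 - 1.5*IQR and upper bound =
Q3 + 1.5*IQR; values outside these bounds are treated as outliers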
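Both rules can be written without an explicit loop. The sketch below uses made-up values and assumes a z-score threshold of 3; the threshold and the 1.5 multiplier are conventions, not fixed requirements.

```python
import numpy as np
import pandas as pd

# Made-up values with one obvious outlier.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0,
                    10.0, 12.0, 13.0, 11.0, 95.0])
threshold = 3.0  # assumed cut-off for the z-score rule

# Z-score rule: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[np.abs(z_scores) > threshold]

# IQR rule: anything beyond 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers.tolist(), iqr_outliers.tolist())
print(lower, upper)
```

Because the mean and standard deviation are themselves pulled by extreme values, the z-score rule can miss outliers in small samples, while the IQR rule relies only on quartiles.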

11. Business Insights and Recommendations:

a. Recommendations should be one sentence, succinct, and start with an action verb
(create, establish, fund, facilitate, coordinate, etc.).
They should use the SMART format: Specific, Measurable, Attainable, Realistic, Timely.

b. PRO TIP ► Recommendations stem from the findings. Link each of your
recommendations to the finding that supports it, to highlight the direct connection
between assessment and action.

i. You can show this link visually by using a two-columned table: the first
column lists the finding, and the respective recommendation is listed
adjacently in the second column.
ii. You can also use the following framework for writing recommendations:
“Our insights show that X, so we can do Y in order to achieve Z”

Mnemonic to remember all of this?

BRAG TO Business using SMART recommendations
BR- Brainstorming

A- Analysis of variables within dataset

G- Graphical Analysis/ Visualization

T- Transformation (for skewed variables)

O- Outliers & Missing Values Treatment

B- Business - Smart recommendations

Policy for sharing projects?

Many of the case studies in the Scaler DSML course are built in partnership with
other companies. These partner companies have shared their datasets and
problem statements with Scaler and, in doing so, they have given us a huge
responsibility to keep their assets protected. Hence, we ask all students to sign
an NDA at the beginning of the course.

At Scaler, we also understand that our students who work hard to solve these
business cases would like to showcase their solutions on their Github, Kaggle,
and Resume.

As such, here are the guidelines:

1. You may not post the Business Case Study problem statement anywhere.
2. You may not share any datasets anywhere, except for the publicly
available datasets from Kaggle or public domain.
3. You may upload your solutions to your Github or Kaggle profiles as long
as the solution:
a. does not contain any confidential information or intellectual property
owned by the partner company or Scaler.
b. does not compromise Scaler’s confidentiality agreement with the partner
company.
c. does not compromise the NDA signed by the students.
d. does not leak the shared datasets.
4. It is acceptable for small snippets or snapshots of the datasets to appear
in your solution, as long as this does not compromise the overall privacy of
the dataset, the partner company, or the stakeholders of the partner
company.
5. You may list the project titles, a short description of the problem, and a
short summary of your insights on your resume and on relevant platforms.
6. We recommend that instead of sharing the names of partner companies
you use a generic description that makes you look professional. For example,
instead of saying “worked on a project from Uber”, you can say “worked
on a project from a top ride-sharing company”.

IMPORTANT: Solutions will be provided in the review class for
selected questions only. At this stage, we will not provide full
solutions, since these business cases are visible to employers. It would
also be unfair to people who submitted within the deadline: if we
provided the full solution during the review, a person could simply
copy and paste it, submit it, and still get 90+ marks.
Therefore, we provide solutions at the end of the track.
