
Exploratory Data Analysis (EDA)
Agenda

Introduction to EDA
Describe Data (Descriptive Analytics)
Data Pre-processing
Data Visualization
Data Preparation
Introduction to EDA
EDA is an approach to analyzing data using both non-visual and visual techniques
Generating insights is a "creative" process; however, a structured approach is followed
It involves a thorough analysis of the data to understand the current business situation
The objective of EDA is to extract "gold" from the "data mine" based on domain understanding
Describe Data

Know the problem statement or the business objective
Load and view the given data
Check the relevance of the data against the objective or goal to be achieved:
  Time relevance of the data
  Scope of the data
  Quantum of data
  Features of the data
Understand each feature in the data with the help of the data dictionary
Know the central tendency and data distribution of each feature
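A minimal sketch of these first steps in Python with pandas; the file name employees.csv is a hypothetical placeholder:

import pandas as pd

# Load and view the given data (file name is a placeholder)
df = pd.read_csv("employees.csv")
print(df.head())        # first few rows
print(df.shape)         # quantum of data: rows x columns
print(df.dtypes)        # features and their data types

# Central tendency and distribution of each numeric feature
print(df.describe())    # count, mean, std, min, quartiles, max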
Data Pre-processing

Practical data sets generally contain a lot of "noise" and/or "undesired" data points that might impact the outcome, so pre-processing is an important step
Because these "noise" elements are well amalgamated with the complete data set, the cleaning process is largely governed by the data scientist's ability
These noise elements come in the form of:
Bad values
Anomalies (not valid or not adhering to business rules)
Missing values
Not useful data
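A quick first pass to surface missing values, duplicates, and rule-based anomalies, assuming the DataFrame df loaded above (the 'Age' range rule reuses the example from the next slide):

# Missing values per column
print(df.isnull().sum())

# Exact duplicate rows (often "not useful" data)
print(df.duplicated().sum())

# Simple business-rule anomaly check ('Age' between 0 and 60 is an example rule)
print(df[(df["Age"] < 0) | (df["Age"] > 60)])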
How to detect ‘Bad Values’?

Numeric Fields:
Check that the data type of every numeric feature/column is valid
e.g. a 'Salary Amount' field is expected to be numeric with data type float; if the data type appears as 'object' instead, there is bad data that has to be cleaned
Check that values fall in a valid range, e.g. an 'Age' field with a minimum value of 0 and a maximum of 60

Categorical Fields:
Check the categorical levels of each feature/column with 'object' data type
A level may contain special characters like "?", "-", "!" or invalid categories that do not represent the feature
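A sketch of both checks with pandas, assuming the df from earlier ('Salary Amount' is an assumed column; a numeric column that loads as object usually contains stray strings):

import pandas as pd

# Numeric field stored as object signals bad values
if df["Salary Amount"].dtype == "object":
    # Coerce to numeric: anything non-numeric becomes NaN and can be inspected
    bad_rows = df[pd.to_numeric(df["Salary Amount"], errors="coerce").isna()]
    print(bad_rows["Salary Amount"].unique())

# Categorical fields: list the levels and look for "?", "-", "!" or invalid categories
for col in df.select_dtypes(include="object").columns:
    print(col, df[col].unique())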
Univariate Analysis
Data summary and data visualization (univariate analysis) by data type:

Data Types            | Central Tendency   | Distribution                                     | Graphs
Numeric Variables     | Mean, Median, Mode | Standard Deviation, Range, IQR, 5 Number Summary | Histogram, Boxplot
Categorical Variables | Mode               | Frequency of the levels                          | Countplot
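A sketch of the graphs in this table with seaborn, continuing with the df from earlier; 'Salary Amount' and 'Department' are assumed column names:

import seaborn as sns
import matplotlib.pyplot as plt

# Numeric variable: histogram and boxplot
sns.histplot(df["Salary Amount"])
plt.show()
sns.boxplot(x=df["Salary Amount"])
plt.show()

# Categorical variable: countplot of the levels
sns.countplot(x="Department", data=df)
plt.show()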
Bivariate Analysis
Data Types                                           | Data Visualization (Bivariate Analysis)
Numeric & Numeric Variables OR All Numeric Variables | Scatterplot, Pairplot / Correlation Plot / Heatmap
Categorical & Categorical Variables                  | Countplot with hue (different colours)
Categorical & Numeric Variables                      | Boxplot (x as Categorical & y as Numeric)
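A sketch of each pairing with seaborn, again with assumed column names ('Age', 'Salary Amount', 'Department', 'Gender'):

import seaborn as sns
import matplotlib.pyplot as plt

# Numeric & numeric: scatterplot; all numeric: correlation heatmap
sns.scatterplot(x="Age", y="Salary Amount", data=df)
plt.show()
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True)
plt.show()

# Categorical & categorical: countplot with hue (different colours)
sns.countplot(x="Department", hue="Gender", data=df)
plt.show()

# Categorical & numeric: boxplot (x as categorical, y as numeric)
sns.boxplot(x="Department", y="Salary Amount", data=df)
plt.show()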


Data Preparation

Scaling
Transformation
Outlier Detection & Treatment
Data Encoding
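A sketch of scaling, transformation, and encoding with pandas and scikit-learn (column names are assumed; outlier detection and treatment is covered on the next slides):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Transformation: log to reduce right skew (applied to the raw values)
df["log_salary"] = np.log1p(df["Salary Amount"])

# Scaling: standardise numeric columns to mean 0, std 1
num_cols = ["Age", "Salary Amount"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Data encoding: one-hot encode a categorical column
df = pd.get_dummies(df, columns=["Department"])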
Outliers

Outliers are data points whose values are significantly different from the rest of the values in the feature
An outlier might be a valid data point or may have been caused by an error
Valid: if we consider the heights of students in class 7, most may be in the range 4.8 to 5.4 feet; however, there may be 1 or 2 students who are around 4 feet or around 6 feet
Error: during data entry, extra zeros may have been added to an amount field, making it different from the others
Most data provided for fraud detection will have very few records where fraud has occurred, and there is a high chance that these records get identified as outliers
Hence, it is important to analyze outliers before deciding on treatment
Outliers
Outlier treatment is not mandatory
Some machine learning algorithms are not very sensitive to outliers, so we can choose the relevant algorithms to work on the data
When essential, outlier treatment is done with the following considerations:
Treatment of outliers should not change the meaning of the data to a great extent, since the data reflects the current business situation
Business or domain knowledge should be taken into account when deciding on the treatment
Basic techniques to detect outliers:
Z-Score
Boxplot
Outlier Detection
Z-Score
First scale the variable by applying the Z-score
All records with a score greater than 3 or less than -3 are considered outliers
For a feature, if we assume a normal distribution, 99.7% of data points lie within ±3σ; anything beyond that is an outlier, and such points are very few
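A minimal sketch of the Z-score rule, computing the score directly from its definition ('Salary Amount' remains an assumed column):

# Z-score each value: (x - mean) / standard deviation
vals = df["Salary Amount"].dropna()
z = (vals - vals.mean()) / vals.std()

# Records with a score greater than 3 or less than -3 are outliers
print(vals[(z > 3) | (z < -3)])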
Boxplot
Any data point greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR is taken as an outlier
50% of data points lie within ±0.5 IQR of the median
In a normal distribution, 68% lie within ±1σ
So the IQR (50%) is slightly narrower than ±1σ (68%)
To roughly correspond to the ±3σ range, ±1.5 IQR (i.e. 3 * ±0.5 IQR) beyond the quartiles is taken as the range to identify outliers
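The same detection using the boxplot (IQR) fences, as a sketch:

# IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
col = df["Salary Amount"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(col < lower) | (col > upper)])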
