Lecture 22
Data Analytics and Visualization
Course Code: CS2205
Dr. Rahul Mishra
IIT Patna
Agenda
1. What is Exploratory Data Analysis?
2. Why is EDA important?
3. Visualization
- Important charts for visualization.
4. Steps involved in EDA:
- Data Sourcing
- Data Cleaning
- Univariate analysis with visualization
- Bivariate analysis with visualization
- Derived Metrics
5. Use Cases
What is Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is an approach to analyzing datasets that summarizes their main characteristics, often using visual methods.
• EDA is a data exploration technique for understanding various aspects of the data.
• The main aim of EDA is to gain enough confidence in the data that we are ready to apply a machine learning model.
• EDA is the first step in the data analysis process.
https://github.com/pik1989/EDA/blob/main/Feature_Scaling.ipynb
Introduction
• Outliers are extreme values in a dataset that deviate significantly from the norm. They do
not fit within the normal behavior of data and can impact statistical analysis and machine
learning models.
Detecting Outliers
1. Boxplot – Identifies outliers as points beyond the whiskers.
2. Histogram – Visualizes extreme values in the frequency distribution.
3. Scatter Plot – Outliers appear as points distant from the rest.
4. Z-score – Values more than 3 standard deviations from the mean (|z| > 3) indicate outliers.
5. Interquartile Range (IQR) – Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are outliers, where IQR = Q3 − Q1.
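The two numeric rules above (Z-score and IQR) can be sketched in a few lines. This is a minimal illustration using only Python's standard library; the function names, the sample data, and the choice of the population standard deviation and the "inclusive" quartile method are assumptions made here, not something prescribed by the lecture.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / std) > threshold]


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return values lying outside the IQR fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical sample: twenty values between 10 and 13 plus one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 11, 13, 100]

print("Z-score outliers:", zscore_outliers(sample))
print("IQR fences:", iqr_fences(sample))
print("IQR outliers:", iqr_outliers(sample))
```

Note that the Z-score rule needs a reasonably large sample: in a very small dataset a single extreme value inflates the standard deviation so much that no point can reach |z| > 3, whereas the IQR rule is based on quartiles and is less affected.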
Handling Outliers
1. Remove the outliers if they result from data errors or significantly skew analysis.
2. Replace outliers with:
- Quantile Method: Cap outliers at chosen percentile values (for example, the 5th and 95th percentiles).
- Interquartile Range: Clip extreme values to the IQR fences, Q1 − 1.5 × IQR and Q3 + 1.5 × IQR.
3. Use ML models less sensitive to outliers:
- K-Nearest Neighbors (KNN)
- Decision Trees
- Support Vector Machines (SVM)
- Naïve Bayes
- Ensemble Methods
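The removal and replacement options listed under Handling Outliers can be sketched as follows. This is a hedged illustration using only the standard library; the helper names, the 5th/95th percentile defaults, and the sample data are assumptions chosen for the example, and other cut-offs are equally valid depending on the dataset.

```python
import statistics


def _clip(values, low, high):
    """Limit every value to the closed range [low, high]."""
    return [min(max(v, low), high) for v in values]


def cap_to_percentiles(values, lower_pct=5, upper_pct=95):
    """Quantile method: replace extremes with the chosen percentile values."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    # cuts[i - 1] is the i-th percentile (i = 1..99)
    return _clip(values, cuts[lower_pct - 1], cuts[upper_pct - 1])


def clip_to_iqr_fences(values, k=1.5):
    """IQR method: pull extremes back to Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return _clip(values, q1 - k * iqr, q3 + k * iqr)


def drop_iqr_outliers(values, k=1.5):
    """Removal: keep only the values that lie inside the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical data: twenty ordinary values plus one extreme value (100).
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 13, 100]

print("Max after percentile capping:", max(cap_to_percentiles(data)))
print("Max after IQR clipping:", max(clip_to_iqr_fences(data)))
print("Rows kept after removal:", len(drop_iqr_outliers(data)))
```

Capping and clipping keep the dataset the same size, which matters when rows carry other useful columns; removal shrinks it and is usually reserved for values known to be data errors.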