Class4 DataPreprocessing DiscriptiveAnalytics 19aug2021

1) The document discusses data preprocessing techniques which are important for cleaning and preparing raw data for analysis. 2) It describes common data preprocessing steps like data cleaning, integration, transformation and reduction which are used to handle issues like missing values, noise, inconsistencies and reduce data size. 3) Descriptive analytics techniques are also covered, including measuring the central tendency of data using the mean, median and mode to understand characteristics of numeric attributes.

Uploaded by

siddharth0208yadav

19-08-2021

Data Preprocessing

Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge

[Figure: data science pipeline with the stages Data Collection, Data Preprocessing (Database, Data Cleaning and Cleansing, Feature Representation), Data Modeling (Machine Learning) and Inference]

Need for Data Preprocessing

• Real-world data tend to be incomplete, noisy and
inconsistent due to their huge size and their likely origin
from multiple heterogeneous sources
• Preprocessing is important to clean the data
• Low-quality data lead to low-quality analysis results
• If users believe the data are of low quality (dirty), they
are unlikely to trust the results of any data analytics
applied to them
• Low-quality data can confuse analytic procedures that
use machine learning techniques, resulting in unreliable
output
• Data could be
– incomplete,
– noisy and
– inconsistent
– These are common properties of large real-world databases

Data Preprocessing Techniques


• Data cleaning:
– Applied to
• identify the missing values,
• fill in missing values,
• remove noise and
• correct inconsistency in the data
• Data integration:
– It merges data from multiple sources into a coherent
data source
• Data transformation:
– Transforming the entries of data to a common format
– Techniques like normalization and standardization are
applied to transform the data to another form to
improve the accuracy and efficiency of machine learning
(ML) algorithms involving distance measures
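
The two transformations named above can be sketched as follows. This is a minimal illustration, assuming min-max scaling to [0, 1] for "normalization" and z-scores with the population standard deviation for "standardization"; the slides do not say which variants are meant. The function names are hypothetical, and the values are the "Years of experience" and "Salary" columns used in the later slides.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

# After scaling, both attributes cover the same range, so neither one
# dominates a distance measure simply because of its units.
years_scaled = min_max_normalize(years)
salary_scaled = min_max_normalize(salary)
years_z = standardize(years)
```

Without such scaling, a Euclidean distance between two records would be driven mostly by the salary difference, because salaries span a wider numeric range than years of experience.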

Data Preprocessing Techniques

• Data reduction:
– Applied to obtain a reduced representation that is much
smaller in volume, yet produces almost the same analytical
results
– It can reduce the data size by
• Aggregation
• Eliminating irrelevant and redundant features (attributes)
through correlation analysis
• Reducing dimension
• These techniques are not mutually exclusive; they
may work together
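
As a rough sketch of the correlation analysis mentioned above, the Pearson correlation coefficient between two attributes can flag a pair as potentially redundant. The function name and the 0.9 threshold are assumptions made for illustration, not values from the slides; the data are the "Years of experience" and "Salary" columns used in the later slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

REDUNDANCY_THRESHOLD = 0.9  # hypothetical cut-off, chosen only for this sketch

r = pearson(years, salary)
is_redundant = abs(r) > REDUNDANCY_THRESHOLD
print(is_redundant)  # True: the two attributes are strongly correlated
```

Whether a strongly correlated attribute should actually be dropped depends on the task, for example on whether one of the two is the quantity to be predicted.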


Descriptive Data Summarization

(Descriptive Analytics)
• It serves as a foundation for data preprocessing
• It helps us to study the general characteristics of data
and identify the presence of noise or outliers
• Data characteristics:
– Central tendency of data
• Centre of the data
• Measuring mean, median and mode
– Dispersion of data
• The degree to which numerical data tend to spread
• Measuring range, quartiles, interquartile range (IQR), the
five-number summary and standard deviation

Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Let x1, x2, …, xN be a set of N values in an attribute. Mean of
this set of values is given by
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

Number of records (tuples), N = 10

Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83

Sum: 91 (Years of experience), 554 (Salary)
Mean (Sum/10): 9.1 (Years of experience), 55.4 (Salary)
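
The mean of both attributes can be checked with a few lines of Python. This is a minimal sketch; the function name `mean` is an illustrative choice, and the lists are the ten records from the table.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

print(sum(years), mean(years))    # 91 9.1
print(sum(salary), mean(salary))  # 554 55.4
```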

Descriptive Analytics:
Measuring Central Tendency
• Median:
– Let x1, x2, …, xN be a set of N values in an attribute. The median is the
"middle" number (value) when those numbers are listed in order
from smallest to greatest.
– Median is the value separating the higher half from the lower half of
a data sample
– For given data of N values in sorted order
• If N is odd, then the median is the middle value of the ordered list
• If N is even, then the median is the average of the middle two values

Illustration: Median of attribute “Years of experience” (N = 10)
– Values: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– N is even, so the median is the average of the 5th and 6th sorted values

Median: (8 + 9) / 2 = 8.5
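
The odd/even rule above can be written out directly. This is a small sketch with an illustrative function name, applied to the two attributes of the ten-record table.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

print(sorted(years))   # [1, 3, 3, 6, 8, 9, 11, 13, 16, 21]
print(median(years))   # 8.5
print(median(salary))  # 58.0
```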


Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data

Illustration: Mode of attribute “Years of experience” (N = 10)
– Assume that values are discrete numerical
– Values: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
– The value 3 occurs twice; every other value occurs once

Mode: 3
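
For discrete values, the mode can be found by counting occurrences. The sketch below uses `collections.Counter` from the Python standard library; the function name `modes` is illustrative, and it returns every value that ties for the highest count.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
print(modes(years))  # [3]
```

When every value occurs exactly once, as in the "Salary" column, all values tie and the mode is not informative.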


Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
• The mode of a continuous variable is the value at which
the probability density function, f(x), is at a maximum.
• It is a value that is most likely to lie within the same interval as
the outcome

Number of samples, N = 61

Date       Temperature
Sept 1     25.47
Sept 2     26.19
Sept 3     25.17
Sept 4     24.30
Sept 5     24.07
Sept 6     21.21
Sept 7     23.49
Sept 8     21.79
Sept 9     25.09
Sept 10    25.39
---        ---
Oct 29     23.06
Oct 30     23.72
Oct 31     23.02

Mean: 22.85
Mode: (22.32 – 23.62]
Median: 22.89
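
One common way to approximate the mode of a continuous attribute is to group the values into equal-width intervals and report the fullest interval. The sketch below assumes that approach; the slides do not state how the interval above was obtained, and the function name and bin count are hypothetical. Only the 13 temperatures listed in the table are used, so the result is not expected to reproduce the interval computed from all 61 samples.

```python
import math

def binned_mode(values, num_bins):
    """Return (low, high, count) for the fullest of num_bins equal-width bins.

    Bins are of the form (low, high]; the first bin also includes the minimum.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo, hi, len(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Index of the bin whose upper edge is the first one >= v.
        k = math.ceil((v - lo) / width) - 1
        k = min(max(k, 0), num_bins - 1)
        counts[k] += 1
    best = max(range(num_bins), key=counts.__getitem__)
    return lo + best * width, lo + (best + 1) * width, counts[best]

# Subset of the temperature column (the elided rows are not available here).
temperatures = [25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49,
                21.79, 25.09, 25.39, 23.06, 23.72, 23.02]

low, high, count = binned_mode(temperatures, num_bins=4)  # 4 bins is an arbitrary choice
print(f"modal interval: ({low:.2f}, {high:.2f}] with {count} values")
```

The interval found this way depends on the number of bins, so it is worth trying a few bin counts before reading much into the result.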

Descriptive Analytics:
Measuring Central Tendency

[Figure: distribution curves for symmetric data, positively skewed data and negatively skewed data]


Descriptive Analytics:
Measuring Dispersion of Data
• The degree to which numerical data tend to spread
• It is often quantified by the variance (in symmetrically
distributed data)
• Common measures of data dispersion:
– Range
– The five-number summary (based on quartiles)
– The interquartile range (IQR)
– Standard deviation
• Range: The range of a finite set of values is the
difference between the maximum and minimum
values
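
The range definition translates into a one-line computation. This is a minimal sketch with an illustrative function name, using the two attributes of the ten-record table.

```python
def value_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

print(value_range(years))   # 20  (21 - 1)
print(value_range(salary))  # 70  (90 - 20)
```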

Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the value of
xn having the property that k percent of data entries lie at or
below xn
– Example: 50th percentile
• The value (number) below which 50% of the data entries (values) lie
– Those 50% of entries have values less than or equal to the 50th
percentile

Illustration: 50th percentile of attribute “Years of experience” (N = 10)
– Values: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21

50th Percentile: (8 + 9) / 2 = 8.5

Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– Example: 25th percentile
• The value (number) below which 25% of the data entries (values) lie
– Those 25% of entries have values less than or equal to the 25th
percentile
• Middle element between minimum and 50th percentile

Illustration: 25th percentile of attribute “Years of experience”
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– Lower half (values below the 50th percentile): 1, 3, 3, 6, 8

25th Percentile: 3


Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– Example: 75th percentile
• The value (number) below which 75% of the data entries (values) lie
– Those 75% of entries have values less than or equal to the 75th
percentile
• Middle element between maximum and 50th percentile

Illustration: 75th percentile of attribute “Years of experience”
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– Upper half (values above the 50th percentile): 9, 11, 13, 16, 21

75th Percentile: 13

Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the
value of xn having the property that k percent of data
entries lie at or below xn
• Median is the 50th percentile (the second quartile (Q2))
• The first quartile (Q1): It is the 25th percentile
• The third quartile (Q3): It is the 75th percentile
– The quartiles including median give some indication of
centre, spread and shape of distribution
• The distance between the Q1 and Q3 is a simple
measure of spread
• Interquartile range (IQR): Distance between the first
quartile (Q1) and third quartile (Q3)
IQR = Q3 – Q1
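
The quartiles and the IQR can be computed as sketched below, applied here to the "Salary" column of the ten-record table. The sketch assumes the "middle element of each half" rule from the slides: Q1 is the median of the values below the median position and Q3 the median of the values above it. Other percentile conventions, such as interpolation-based ones, can give slightly different quartiles. The function name is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule (needs at least 2 values)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2:]   # values above the median position
    return median(lower), median(ordered), median(upper)

salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

q1, q2, q3 = quartiles(salary)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 36 58.0 72 36
```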


Descriptive Analytics:
Measuring Dispersion of Data
• The five-number summary of a distribution:
– It consists of the minimum value, Q1, median, Q3 and maximum
value
• Box plots are a popular way of visualising a distribution

[Figure: box plot. The box spans Q1 to Q3 (the IQR) with the median (Q2) marked inside; the top whisker ends at the largest observation (max) and the bottom whisker at the smallest observation (min)]

• The whiskers terminate at
– the smallest (minimum) or largest (maximum) observations, or
– the most extreme observations occurring within 1.5 x IQR of the
respective quartiles (Q1 and Q3)

• For a normal distribution, the limits Q1 – 1.5 x IQR and Q3 + 1.5 x IQR
lie about 2.7σ from the mean
– This is close to 3σ from the mean, which is a standard limit for the normal distribution
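
The 2.7σ figure can be recovered with a short calculation. It assumes a normal distribution with mean μ and standard deviation σ, for which the 75th percentile of the standard normal is approximately 0.6745.

```latex
\begin{align*}
Q_1 &\approx \mu - 0.6745\,\sigma, \qquad Q_3 \approx \mu + 0.6745\,\sigma \\
\mathrm{IQR} &= Q_3 - Q_1 \approx 1.349\,\sigma \\
Q_3 + 1.5 \times \mathrm{IQR} &\approx \mu + 0.6745\,\sigma + 2.0235\,\sigma \approx \mu + 2.698\,\sigma \\
Q_1 - 1.5 \times \mathrm{IQR} &\approx \mu - 2.698\,\sigma
\end{align*}
```

So it is the whisker limits, not the quantity 1.5 x IQR itself, that sit roughly 2.7σ from the mean.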

• Outliers above the top whisker: values more than 1.5 x IQR above Q3
• Outliers below the bottom whisker: values more than 1.5 x IQR below Q1

• Lower bound: Q1 – (1.5 x IQR) Upper bound: Q3 + (1.5 x IQR)

• Outliers: Any data point less than the lower bound or
larger than the upper bound
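
The bounds above can be turned into a simple outlier check. The sketch assumes the median-of-halves rule for Q1 and Q3; the function names are illustrative, and the extra value 250 is a hypothetical record added only to show a detected outlier.

```python
from statistics import median

def iqr_bounds(values):
    """Return (lower bound, upper bound) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Values lying below the lower bound or above the upper bound."""
    lower, upper = iqr_bounds(values)
    return [v for v in values if v < lower or v > upper]

salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

print(iqr_bounds(salary))             # (-18.0, 126.0)
print(find_outliers(salary))          # []
print(find_outliers(salary + [250]))  # [250]  (250 is a hypothetical extra salary)
```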

Descriptive Analytics:
Measuring Dispersion of Data
• Variance (σ2):
– Let x1, x2, …, xN be a set of N values in an attribute.
The variance (σ2) of this set of values is given by
$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2$, where μ = mean
• Standard deviation (σ):
– The square root of variance: $\sigma = \sqrt{\text{Variance}}$
• Standard deviation measures the spread about the
mean
– It is used when the mean is chosen as the measure of
centre, especially in symmetric distribution
• The quartiles Q1 and Q3 measure the spread about
median
– Q1 and Q3 are used when the median is chosen as the
measure of centre, especially in skewed distribution
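
The variance and standard deviation formulas can be evaluated as sketched below for the "Years of experience" column of the ten-record table. The sketch uses the N – 1 denominator from the variance formula; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

print(round(sample_variance(years), 2))  # 39.88
print(round(sample_std(years), 2))       # 6.31
```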