19-08-2021
Data Preprocessing
Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
[Figure: data science pipeline with the stages Data Collection, Database, Data Preprocessing (Data Cleaning and Cleansing, Feature Representation), Data Modeling (Machine Learning) and Inference]
Need for Data Preprocessing
• Real-world data tend to be incomplete, noisy and
inconsistent because of their huge size and their likely origin
from multiple heterogeneous sources
• Preprocessing is needed to clean the data
• Low-quality data lead to low-quality analysis results
• If users believe the data are of low quality (dirty), they
are unlikely to trust the results of any data analytics
applied to them
• Low-quality data can confuse analytic procedures that use
machine learning techniques, resulting in unreliable
output
• Data could be
– incomplete,
– noisy and
– inconsistent
• These are common properties of large real-world databases
Data Preprocessing Techniques
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Preprocessing Techniques
• Data cleaning:
– Applied to
• identify the missing values,
• fill in missing values,
• remove noise and
• correct inconsistency in the data
• Data integration:
– Merges data from multiple sources into a coherent
data source
• Data transformation:
– Transforms the data entries to a common format
– Techniques such as normalization and standardization
transform the data to another form to improve the
accuracy and efficiency of machine learning (ML)
algorithms that involve distance measures
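The two transformations named above can be sketched as follows. This is a minimal illustration, assuming min-max scaling to [0, 1] for normalization and z-scores with the population standard deviation for standardization; the function names and the feature values are hypothetical and not taken from the slides.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical features on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20000, 35000, 50000, 80000, 120000]

ages_scaled = min_max_normalize(ages)
incomes_scaled = min_max_normalize(incomes)
ages_z = standardize(ages)

print(ages_scaled)
print(incomes_scaled)
print(ages_z)
```

After scaling, both features lie in [0, 1], so neither dominates a Euclidean distance merely because of its unit.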
Data Preprocessing Techniques
• Data reduction:
– Applied to obtain a reduced representation that is much
smaller in volume, yet produces almost the same analytical
results
– It can reduce the data size by
• Aggregation
• Eliminating irrelevant and redundant features (attributes)
through correlation analysis
• Reducing dimension
• These techniques are not mutually exclusive; they
may work together
Descriptive Data Summarization
(Descriptive Analytics)
• It serves as a foundation for data preprocessing
• It helps us to study the general characteristics of data
and identify the presence of noise or outliers
• Data characteristics:
– Central tendency of data
• Centre of the data
• Measuring mean, median and mode
– Dispersion of data
• The degree to which numerical data tend to spread
• Measuring range, quartiles, interquartile range (IQR), the
five-number summary and standard deviation
Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Let x1, x2, …, xN be a set of N values in an attribute. The mean of
this set of values is given by
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Number of records (tuples), N = 10

Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83

Sum of Years of experience: 91
Descriptive Analytics:
Measuring Central Tendency
• Mean of the example data (N = 10):
– Years of experience: Sum/N = 91/10 = 9.1
– Salary (in Rs 1000): Sum/N = 554/10 = 55.4
– The mean is a better measure of central tendency for
symmetric (symmetrically distributed) data
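As a quick check of the arithmetic, the mean of each attribute can be computed directly from the ten records of the example table. This is a small sketch; the variable and function names are illustrative only.

```python
# The ten records from the example table.
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # Years of experience
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]   # Salary (in Rs 1000)


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count N."""
    return sum(values) / len(values)


mean_years = mean(years)
mean_salary = mean(salary)

print(sum(years), mean_years)
print(sum(salary), mean_salary)
```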
Descriptive Analytics:
Measuring Central Tendency
• Median:
– Let x1, x2, …, xN be a set of N values in an attribute. The median is the
"middle" value when those values are listed in order
from smallest to greatest.
– The median is the value separating the higher half from the lower half of
a data sample
– For given data of N values in sorted order
• If N is odd, the median is the middle value of the ordered list
• If N is even, the median is the average of the middle two values
Illustration: Median of attribute “Years of experience” (N = 10)
Years of experience: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
Descriptive Analytics:
Measuring Central Tendency
• Median: sort the values in “Years of experience”
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– N = 10 is even, so the median is the average of the middle two
values (the 5th and 6th): (8 + 9) / 2 = 8.5
Median: 8.5
– For asymmetrically distributed (skewed) data, the median is a
better measure of the centre of the data
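The odd/even rule for the median can be sketched in a few lines. The example below applies it to the ten "Years of experience" and "Salary" values from the example table; the function name is illustrative.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

sorted_years = sorted(years)
median_years = median(years)
median_salary = median(salary)

print(sorted_years)
print(median_years, median_salary)
```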
Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
Illustration: Mode of attribute “Years of experience” (N = 10)
Assume that the values are discrete numerical
Years of experience: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
Mode: 3
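For discrete values, the mode can be found by counting occurrences. The sketch below returns every value tied for the highest count, which is an assumption about how ties are handled rather than something the slide specifies.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

mode_years = modes(years)
mode_salary = modes(salary)

print(mode_years)
print(len(mode_salary))
```

Every salary value occurs exactly once, so all ten values tie; for such data the mode is not an informative summary.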
Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
• The mode of a continuous variable is the value at which
the probability density function, f(x), is at a maximum
• For sampled data, it is reported as the interval in which
an outcome is most likely to lie
Illustration: Number of samples, N = 61

Date       Temperature
Sept 1     25.47
Sept 2     26.19
Sept 3     25.17
Sept 4     24.30
Sept 5     24.07
Sept 6     21.21
Sept 7     23.49
Sept 8     21.79
Sept 9     25.09
Sept 10    25.39
---        ---
Oct 29     23.06
Oct 30     23.72
Oct 31     23.02

Mean: 22.85
Median: 22.89
Mode: (22.32 – 23.62]
Descriptive Analytics:
Measuring Central Tendency
[Figure: three distribution shapes: Symmetric Data, Positively Skewed Data and Negatively Skewed Data]
Descriptive Analytics:
Measuring Dispersion of Data
• The degree to which numerical data tend to spread
• For symmetrically distributed data, it is commonly quantified
by the variance
• Common measures of data dispersion:
– Range
– The five-number summary (based on quartiles)
– The interquartile range (IQR)
– Standard deviation
• Range: The range of a finite set of values is the
difference between the maximum and minimum
values
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the value
xn having the property that k percent of the data entries lie at or
below xn
– Example: 50th percentile
• The value at or below which 50% of the data entries (values) lie
– Those 50% of entries have values equal to or less than the 50th
percentile
Illustration: 50th percentile of attribute “Years of experience” (N = 10)
Years of experience: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles: sort the values in “Years of experience”
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– Example: 50th percentile
• The value at or below which 50% of the data entries (values) lie
• With N = 10, it is the average of the 5th and 6th sorted values:
(8 + 9) / 2 = 8.5
50th Percentile: 8.5
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles: sort the values in “Years of experience”
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– Example: 25th percentile
• The value at or below which 25% of the data entries (values) lie
• Middle element between the minimum and the 50th percentile, that is,
the middle of the lower half 1, 3, 3, 6, 8
25th Percentile: 3
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles: sort the values in “Years of experience”
– Sorted values: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– Example: 75th percentile
• The value at or below which 75% of the data entries (values) lie
• Middle element between the maximum and the 50th percentile, that is,
the middle of the upper half 9, 11, 13, 16, 21
75th Percentile: 13
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the
value of xn having the property that k percent of data
entries lie at or below xn
• Median is the 50th percentile (the second quartile (Q2))
• The first quartile (Q1): It is the 25th percentile
• The third quartile (Q3): It is the 75th percentile
– The quartiles including median give some indication of
centre, spread and shape of distribution
• The distance between the Q1 and Q3 is a simple
measure of spread
• Interquartile range (IQR): Distance between the first
quartile (Q1) and the third quartile (Q3)
IQR = Q3 – Q1
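The quartiles and the IQR can be sketched with the "middle element of each half" convention used in the illustrations. This is one of several percentile conventions; library routines that interpolate between values may return slightly different quartiles. The function names are illustrative, and the sketch assumes at least two values.

```python
def median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # lower half (excludes the middle value when N is odd)
    upper = ordered[n - half:]    # upper half (excludes the middle value when N is odd)
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

q1_years, q2_years, q3_years = quartiles(years)
iqr_years = q3_years - q1_years

q1_salary, q2_salary, q3_salary = quartiles(salary)
iqr_salary = q3_salary - q1_salary

print(q1_years, q2_years, q3_years, iqr_years)
print(q1_salary, q2_salary, q3_salary, iqr_salary)
```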
Descriptive Analytics:
Measuring Dispersion of Data
• The five-number summary of a distribution:
– It consists of the minimum value, Q1, the median, Q3 and the maximum
value
• Box plots are a popular way of visualising a distribution
[Figure: box plot. The box spans Q1 to Q3 (the IQR) with the median (Q2) marked inside; the top whisker extends to the largest observation (max) and the bottom whisker to the smallest observation (min)]
• The whiskers terminate at
– Smallest (minimum) or largest (maximum) observations or
– the most extreme observations occurring within 1.5 x IQR of
respective quartiles (Q1 and Q3)
• For a normal distribution, the bounds Q1 – (1.5 x IQR) and
Q3 + (1.5 x IQR) lie about 2.7σ from the mean
– This is close to the 3σ limit that is a standard in the normal distribution
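The 2.7σ figure can be derived from the quartiles of a normal distribution. The short derivation below assumes the standard result that the quartiles of a normal distribution lie about 0.6745σ on either side of the mean.

```latex
\begin{align*}
Q_1 &\approx \mu - 0.6745\,\sigma, \qquad Q_3 \approx \mu + 0.6745\,\sigma \\
\mathrm{IQR} &= Q_3 - Q_1 \approx 1.349\,\sigma \\
Q_3 + 1.5 \times \mathrm{IQR} &\approx \mu + 0.6745\,\sigma + 2.0235\,\sigma = \mu + 2.698\,\sigma \\
Q_1 - 1.5 \times \mathrm{IQR} &\approx \mu - 2.698\,\sigma
\end{align*}
```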
Descriptive Analytics:
Measuring Dispersion of Data
[Figure: box plot with outliers. Values larger than Q3 + (1.5 x IQR) or less than Q1 – (1.5 x IQR) lie beyond the whiskers and are marked as outliers]
• Lower bound: Q1 – (1.5 x IQR)
• Upper bound: Q3 + (1.5 x IQR)
• Outliers: Any data point less than the lower bound or
larger than the upper bound
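The five-number summary and the outlier bounds can be sketched together. The quartiles here use the "middle element of each half" convention, which is one of several in use, and the extra value 60 is a hypothetical record added only to show an outlier being flagged; it is not part of the example table.

```python
def median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median_sorted(ordered[:half])
    q3 = median_sorted(ordered[n - half:])
    return ordered[0], q1, median_sorted(ordered), q3, ordered[-1]


def outlier_bounds(values):
    """Return (lower bound, upper bound) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Values below the lower bound or above the upper bound."""
    lower, upper = outlier_bounds(values)
    return [v for v in values if v < lower or v > upper]


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

summary = five_number_summary(years)
bounds = outlier_bounds(years)
outliers = find_outliers(years)

# Hypothetical extra record, added only to demonstrate an outlier.
years_with_extreme = years + [60]
extreme_bounds = outlier_bounds(years_with_extreme)
extreme_outliers = find_outliers(years_with_extreme)

print(summary, bounds, outliers)
print(extreme_bounds, extreme_outliers)
```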
Descriptive Analytics:
Measuring Dispersion of Data
• Variance (σ²):
– Let x1, x2, …, xN be a set of N values in an attribute. The
variance (σ²) of this set of values is given by
$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2$, where μ = mean
• Standard deviation (σ):
– The square root of the variance: $\sigma = \sqrt{\text{Variance}}$
• Standard deviation measures the spread about the
mean
– It is used when the mean is chosen as the measure of
centre, especially in symmetric distribution
• The quartiles Q1 and Q3 measure the spread about
median
– Q1 and Q3 are used when the median is chosen as the
measure of centre, especially in skewed distribution
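The variance and standard deviation can be checked numerically on the "Years of experience" values. The sketch below uses the N – 1 divisor from the formula and compares the result with Python's `statistics.variance` and `statistics.stdev`, which use the same divisor; the function name `sample_variance` is illustrative.

```python
from math import sqrt
from statistics import stdev, variance


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

var_years = sample_variance(years)   # 358.9 / 9, roughly 39.88
std_years = sqrt(var_years)          # roughly 6.31

print(round(var_years, 2), round(std_years, 2))
print(round(variance(years), 2), round(stdev(years), 2))
```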