— Chapter 3 —
Data Preprocessing
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, …
Timeliness: timely update?
Interpretability: how easily the data can be
understood?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Forms of data preprocessing
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
Incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
Noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
Inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
Equipment malfunction
Inconsistent with other recorded data and thus
deleted
Data not entered due to misunderstanding
Certain data may not be considered important at the
time of entry
How to Handle Missing Data?
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
First sort data and partition into (equal-frequency)
bins
Then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Example:
Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
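The slide lists only the sorted prices, so the following is a minimal sketch of how the binning steps might be carried out on them. The choice of three bins, the function names, and the tie-breaking rule (a value equidistant from both boundaries goes to the lower one) are assumptions made for illustration.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max.

    Ties go to the lower boundary (an assumption, not stated on the slide).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)     # assumed: 3 bins
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```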
How to Handle Noisy Data?
Regression
smooth by fitting the data to regression functions,
e.g., finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
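As a rough illustration of smoothing by regression, the sketch below fits a least-squares line to two attributes and replaces the observed values with the fitted ones. The data points and the name `fit_line` are hypothetical and not taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements of two related attributes.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
# Smoothed values: each y is replaced by the value the fitted line predicts.
smoothed = [slope * x + intercept for x in xs]
```

Points whose observed value lies far from the fitted value are candidates for the human inspection step mentioned above.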
Three data clusters: outliers may be detected as values
that fall outside of the cluster sets.
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency,
distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools:
Data scrubbing: use simple domain knowledge (e.g.,
postal code, spell-check) to detect errors and make
corrections
Data auditing: analyze the data to discover rules
and relationships and to detect violators (e.g., correlation
and clustering to find outliers)
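Discrepancy checks of this kind can be sketched in a few lines. The example below is only an illustration: the record layout, the field names, the null markers, and the salary range are all hypothetical, and commercial tools would do considerably more.

```python
def check_unique(records, field):
    """Uniqueness rule: return the values of `field` that occur more than once."""
    seen, duplicates = set(), set()
    for record in records:
        value = record.get(field)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


def check_null(records, field, null_markers=(None, "", "?")):
    """Null rule: return indices of records whose `field` holds a null marker."""
    return [i for i, r in enumerate(records) if r.get(field) in null_markers]


def check_range(records, field, low, high):
    """Range metadata: return indices of records whose `field` is outside [low, high]."""
    return [
        i for i, r in enumerate(records)
        if r.get(field) is not None and not (low <= r[field] <= high)
    ]


# Hypothetical customer records with one problem of each kind.
records = [
    {"cust_id": 1, "occupation": "engineer", "salary": 52000},
    {"cust_id": 2, "occupation": "", "salary": 48000},
    {"cust_id": 2, "occupation": "teacher", "salary": -10},
]

duplicate_ids = check_unique(records, "cust_id")          # {2}
missing_occupation = check_null(records, "occupation")    # [1]
bad_salary = check_range(records, "salary", 0, 1_000_000)  # [2]
```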
Data Cleaning as a Process
Data migration and integration
Data migration tools: allow transformations to be
specified. Example: moving data from one location to
another, one format to another, or one application to
another.
ETL (Extraction/Transformation/Loading) tools: allow
users to specify transformations through a graphical
user interface (GUI).
Integration of the two processes
Discrepancy detection and data transformation (to
correct discrepancies) are applied iteratively.
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
Handling Redundancy in Data Integration
Redundant data often occur when integrating
multiple databases
Object identification: The same attribute or object may have
different names in different databases
Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Redundant attributes can often be detected by
correlation analysis and covariance analysis
Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Some redundancies can be detected by correlation
analysis. Given two attributes, such analysis can
measure how strongly one attribute implies the other,
based on the available data.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation
coefficient and covariance, both of which assess how
one attribute’s values vary from those of another.
Correlation Analysis (Nominal Data)
χ² (chi-square) test

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

The larger the χ² value, the more likely the variables are
related
The cells that contribute the most to the χ² value are
those whose actual count is very different from the
expected count
Chi-Square Calculation: An Example
A group of 1500 people was surveyed. The gender of
each person was noted. Each person was polled as to
whether his or her preferred type of reading material
was fiction or nonfiction. The observed frequency of
each possible joint event is summarized in the
contingency table:
              Male        Female        Total
Fiction       250 (90)    200 (360)     450
Nonfiction    50 (210)    1000 (840)    1050
Total         300         1200          1500
Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data
distribution in the two categories)
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
For this 2×2 table, the degrees of freedom are (2−1)(2−1) = 1.
For 1 degree of freedom, the value needed to reject the
hypothesis at the 0.001 significance level is 10.828.
Since our computed value is above this, we can reject
the hypothesis that gender and preferred reading are
independent and conclude that the two attributes are
(strongly) correlated for the given group of people.
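The calculation above can be reproduced with a short script. This is a sketch for illustration: the function name `chi_square` and the row/column layout of `observed` are assumptions. Expected counts are derived from the row and column totals, as in the contingency table.

```python
def chi_square(observed):
    """Return (chi2, expected, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, count in enumerate(row):
            # Expected count under independence: row total * column total / N
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            chi2 += (count - e) ** 2 / e
        expected.append(expected_row)

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, expected, dof


# Rows: fiction, nonfiction. Columns: male, female.
observed = [[250, 200], [50, 1000]]
chi2, expected, dof = chi_square(observed)

CRITICAL_0_001 = 10.828  # threshold quoted above for 1 degree of freedom
reject_independence = chi2 > CRITICAL_0_001
```

Here `expected` reproduces the parenthesized counts (90, 360, 210, 840), `chi2` comes out at about 507.9, and `reject_independence` is true.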
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson’s product
moment coefficient)
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective
means of A and B, σA and σB are the respective standard deviations
of A and B, and Σ(aibi) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s do). The higher the value, the stronger the correlation.
rA,B = 0: no linear correlation; rA,B < 0: negatively correlated
Correlation (viewed as linear relationship)
Correlation measures the linear relationship between
objects
To compute correlation, we standardize data objects, A
and B, and then take their dot product
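$$a'_k = \frac{a_k - \operatorname{mean}(A)}{\operatorname{std}(A)}, \qquad b'_k = \frac{b_k - \operatorname{mean}(B)}{\operatorname{std}(B)}$$

$$\operatorname{correlation}(A, B) = \frac{A' \cdot B'}{n-1}$$

The division by n − 1 (with std taken as the sample standard deviation) makes the dot product agree with rA,B.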
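To check that the two views of correlation agree, the sketch below computes Pearson's coefficient once from the cross-product form and once by standardizing and taking a dot product. The data values and function names are hypothetical. The sample standard deviation (n − 1 denominator) is assumed, which is why the dot product is divided by n − 1.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_std(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def pearson(a, b):
    """Cross-product form: (sum(a_i * b_i) - n * mean(A) * mean(B)) / ((n - 1) * sA * sB)."""
    n = len(a)
    numerator = sum(x * y for x, y in zip(a, b)) - n * mean(a) * mean(b)
    return numerator / ((n - 1) * sample_std(a) * sample_std(b))


def pearson_via_standardization(a, b):
    """Standardize both attributes, take the dot product, divide by n - 1."""
    n = len(a)
    mean_a, std_a = mean(a), sample_std(a)
    mean_b, std_b = mean(b), sample_std(b)
    a_std = [(x - mean_a) / std_a for x in a]
    b_std = [(y - mean_b) / std_b for y in b]
    return sum(x * y for x, y in zip(a_std, b_std)) / (n - 1)


# Hypothetical attribute values.
a = [1, 2, 3, 4, 5]
b = [1, 3, 2, 5, 4]

r1 = pearson(a, b)
r2 = pearson_via_standardization(a, b)
```

For these values both functions give r = 0.8 up to floating-point rounding.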
Covariance (Numeric Data)
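Covariance is similar to correlation:

$$\operatorname{Cov}(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{\operatorname{Cov}(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective
means or expected values of A and B, and
σA and σB are the respective standard deviations of A and B.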
Positive covariance: If Cov(A,B) > 0, then A and B both tend to be
larger than their expected values.
Negative covariance: If Cov(A,B) < 0, then if A is larger than its
expected value, B is likely to be smaller than its expected value.
Independence: If A and B are independent, then Cov(A,B) = 0, but the
converse is not true:
Some pairs of random variables may have a covariance of 0 but are
not independent. Only under some additional assumptions (e.g., the
data follow multivariate normal distributions) does a covariance of 0
imply independence.
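A small constructed example makes the last point concrete. In the sketch below, B is completely determined by A (B = A²), yet the covariance is zero. The data are hypothetical and chosen to be symmetric around 0. Covariance is computed as E(A·B) − E(A)·E(B), dividing by n.

```python
def covariance(a, b):
    """Cov(A, B) = E(A*B) - E(A)*E(B), using n in the denominator."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


# B depends entirely on A, so the two attributes are not independent.
a = [-2, -1, 0, 1, 2]
b = [x * x for x in a]   # [4, 1, 0, 1, 4]

cov_ab = covariance(a, b)
print(cov_ab)  # 0.0
```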
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 5), (4, 14), (5, 10), (6, 20).
Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = $4
E(B) = (20 + 10 + 14 + 5 + 5) /5 = 54/5 = $10.80
The covariance between A and B is computed as:
Cov(A,B) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5) / 5 − 4 × 10.80
= 50.2 − 43.2 = 7
Thus, A and B rise together since Cov(A, B) > 0.
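The arithmetic can be checked with a few lines of code. This sketch uses the (A, B) pairs that appear in the computation above and evaluates the covariance in two equivalent ways. The variable names are chosen for illustration.

```python
# Stock prices as paired in the covariance computation above.
a = [2, 3, 4, 5, 6]
b = [5, 5, 14, 10, 20]
n = len(a)

mean_a = sum(a) / n   # E(A)
mean_b = sum(b) / n   # E(B)

# Shortcut form: Cov(A, B) = E(A*B) - E(A)*E(B)
cov_shortcut = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

# Definition: Cov(A, B) = E((A - E(A)) * (B - E(B)))
cov_definition = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

rise_together = cov_shortcut > 0
```

Both forms give a covariance of 7 (up to floating-point rounding), with E(A) = 4 and E(B) = 10.8, so `rise_together` is true.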