Data Mining:
Concepts and Techniques
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Reduction
Data Transformation and Data Discretization
Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
Accuracy: is the data correct, or noisy (containing errors, or
values that deviate from the expected)?
Completeness: is anything not recorded (missing attribute values,
or certain attributes of interest …)?
Consistency: are there discrepancies, e.g., in the department codes
used to categorize items?
Timeliness: is the data updated in a timely fashion?
Believability: how much do users trust the data?
Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction (e.g. sampling)
Data transformation and data discretization
Normalization
…
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., from faulty instruments, human or computer error, or transmission errors
incomplete: lacking feature values, lacking certain features of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010” (a consistency check is sketched below)
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
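A minimal pandas sketch of that kind of cross-field consistency check; the column names, reference date, and one-year tolerance are illustrative assumptions, not a standard recipe:
```python
import pandas as pd

# Toy records; row 0 reproduces the slide's Age="42" vs. Birthday="03/07/2010".
df = pd.DataFrame({
    "age":      [42, 30],
    "birthday": pd.to_datetime(["2010-03-07", "1993-05-20"]),
})

# Re-derive age from the birthday as of a fixed (hypothetical) reference date
# and flag rows where it contradicts the recorded age.
ref = pd.Timestamp("2023-01-01")
derived = (ref - df["birthday"]).dt.days // 365
inconsistent = df[(df["age"] - derived).abs() > 1]
print(inconsistent)
```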
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
features, such as customer income in sales data
Missing data may be due to
equipment malfunction
data deleted because it was inconsistent with other recorded data
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
history or changes of the data were not recorded
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing
(in classification); not effective when the percentage of
missing values per feature varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant : e.g., “unknown”, a new class?!
the feature mean
the feature mean for all samples belonging to the same
class: smarter
the most probable value: inference-based such as
Bayesian formula or decision tree
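As a concrete sketch of the automatic fill-in strategies above, assuming a pandas DataFrame with made-up column names and values:
```python
import pandas as pd

# Toy customer table with missing incomes (hypothetical data).
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Global constant: flag missing values with a sentinel
# (a numeric stand-in for "unknown").
df["income_flagged"] = df["income"].fillna(-1)

# Feature mean over all samples.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: feature mean within each class.
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```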
Noisy Data
Noise: random error or variance in a measured variable
Incorrect feature values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
First sort the data and partition it into (equal-frequency) bins
Then smooth by bin means, by bin medians, by bin
boundaries, etc.
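A small NumPy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries (the data values are invented for illustration):
```python
import numpy as np

# Sorted toy prices, partitioned into 3 equal-frequency bins.
data = np.sort([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(data, 3)

# Smooth by bin means: every value in a bin is replaced by the bin's mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smooth by bin boundaries: each value moves to the nearer bin boundary.
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print(by_means)
print(by_bounds)
```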
How to Handle Noisy Data (cont.)
Regression
smooth by fitting the data to regression functions
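For example, noisy measurements of a roughly linear variable can be replaced by their fitted values; a sketch on synthetic data:
```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)  # noisy linear data

# Fit y ~ a*x + b, then smooth by replacing y with the fitted values.
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b
```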
How to Handle Noisy Data (cont.)
Clustering
detect and remove outliers
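A sketch using scikit-learn's KMeans (assumed available); the 3x-median-distance cutoff is an arbitrary illustration, not a standard rule:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2)),
               [[20.0, 20.0]]])          # one obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Treat points far from every centroid as outliers and drop them.
X_clean = X[dist < 3 * np.median(dist)]
```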
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness, consecutive, and null rules
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., use correlation and
clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter’s Wheel)
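A pandas sketch of checking a uniqueness rule, a null rule, and a crude domain rule; the column names and the postal-code pattern are hypothetical placeholders for real metadata:
```python
import pandas as pd

df = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "postal":  ["V5A1S6", "90210", None, "ABC"],
})

# Uniqueness rule: each cust_id must occur at most once.
dup_violations = df[df["cust_id"].duplicated(keep=False)]

# Null rule: the postal code must not be missing.
null_violations = df[df["postal"].isna()]

# Crude domain rule (illustrative only): postal codes are 5-6 alphanumerics.
mask = df["postal"].notna() & ~df["postal"].str.match(r"^[A-Za-z0-9]{5,6}$", na=False)
bad_format = df[mask]
```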
Feature Engineering
Feature Extraction / Construction aims to reduce the number
of features in a dataset by creating new features from the existing
ones (and then discarding the original features).
e.g. PCA
Feature Selection: Instead of creating new features, Feature
Selection focuses on choosing a subset of the existing features
that contribute most significantly to the problem.
This process eliminates irrelevant or redundant features while
preserving the important ones.
e.g. Feature Subset Selection
Feature Creation / Generation: Create new features that can
capture the important information in a data set more effectively
than the original ones.
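A tiny pandas example of feature creation; the columns and the derived ratio are invented for illustration:
```python
import pandas as pd

df = pd.DataFrame({
    "purchase_total": [120.0, 80.0, 200.0],
    "n_visits":       [4, 2, 10],
})

# Feature creation: a derived "spend per visit" ratio can capture buying
# behaviour more directly than either raw column alone.
df["spend_per_visit"] = df["purchase_total"] / df["n_visits"]
```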
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant features
Principal Component Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it data reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and the distance between points, which are critical to
clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
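A scikit-learn sketch of PCA, the first technique listed above, keeping enough components to explain 90% of the variance (synthetic data with a 3-dimensional latent structure):
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))                  # hidden low-dimensional structure
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(100, 10))

# Keep the fewest principal components explaining >= 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)               # e.g., (100, 10) -> (100, 3)
```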
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
Duplicate much or all of the information contained in
one or more other features
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant features
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
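A sketch of simple redundancy and irrelevance filters using the slide's own examples; the column names and correlation threshold are illustrative:
```python
import pandas as pd

df = pd.DataFrame({
    "price":      [10.0, 20.0, 30.0, 40.0],
    "sales_tax":  [0.7, 1.4, 2.1, 2.8],    # redundant: a fixed multiple of price
    "student_id": [1, 2, 3, 4],            # irrelevant to the target
    "gpa":        [3.1, 3.5, 2.9, 3.8],
})

# Redundancy filter: drop one of any pair of near-perfectly correlated features.
corr = df[["price", "sales_tax"]].corr().iloc[0, 1]
if abs(corr) > 0.95:
    df = df.drop(columns=["sales_tax"])

# Irrelevance filter: drop identifiers that carry no predictive information.
df = df.drop(columns=["student_id"])
```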
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
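A sketch of storing only a centroid and diameter per cluster instead of the raw points (scikit-learn assumed; the diameter here is a simple radius-based estimate, twice the maximum distance from the centroid):
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (1000, 2)) for c in (0.0, 3.0, 6.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Keep only each cluster's centroid and diameter instead of 3000 raw points.
for k in range(3):
    members = X[km.labels_ == k]
    centroid = members.mean(axis=0)
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    print(k, centroid, diameter)
```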
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allows a mining algorithm to run with complexity that is
potentially sub-linear in the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Adaptive sampling methods can help, e.g., stratified
sampling
Note: Sampling may not reduce database I/Os (page at a
time)
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population, so it
may be drawn more than once
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
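A pandas sketch of the three sampling types above; the "stratum" column and sampling fractions are invented for illustration:
```python
import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "stratum": ["rare"] * 50 + ["common"] * 950})

# Simple random sampling without replacement (SRSWOR).
srswor = df.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR): rows can repeat.
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw ~10% of each stratum, so the rare group keeps
# roughly its proportional representation despite the skew.
strat = df.groupby("stratum").sample(frac=0.10, random_state=0)
```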
Sampling: With or without Replacement
[Figure: the same raw data sampled with and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data (left) vs. a cluster/stratified sample (right)]
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
min-max normalization: to [new_min_A, new_max_A]
    v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73600 − 12000) / (98000 − 12000)) × (1.0 − 0) + 0 = 0.716
z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
    v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to (73600 − 54000) / 16000 = 1.225
Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
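A NumPy sketch of all three normalizations, reproducing the worked examples above on a toy income array:
```python
import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0].
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0

# Z-score normalization with the slide's parameters (mu=54000, sigma=16000).
zscore = (income - 54000) / 16000

# Decimal scaling: divide by 10^j, the smallest j with max(|v'|) < 1
# (the +1 guards against values that are exact powers of ten).
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10**j

print(minmax[2])   # ~0.716, matching the worked example
print(zscore[2])   # ~1.225
```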