Content
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Why preprocess the data?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected
and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
Data warehouse needs consistent integration of quality
data
Data extraction, cleaning, and transformation comprise
the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Broad categories:
Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or
similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
Forms of Data Preprocessing
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data
warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing values
per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class:
smarter
the most probable value: inference-based such as Bayesian formula
or decision tree
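A minimal sketch of the automatic fill-in strategies listed above, assuming pandas as the tool; the toy table and its column names (income, class) are illustrative, not from the slides:

```python
# A minimal sketch of the automatic fill-in strategies above, using pandas
# (library choice and column names are illustrative).
import pandas as pd

df = pd.DataFrame({
    "income": [30000, None, 52000, None, 47000, 61000],
    "class":  ["low", "low", "high", "high", "high", "high"],
})

# 1) Global constant
filled_const = df["income"].fillna(-1)        # or "unknown" for nominal data

# 2) Attribute mean over all tuples
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Attribute mean per class (smarter: uses the class label)
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_class_mean.tolist())
```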
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal
with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately same
number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
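The example above can be reproduced with a short sketch in plain Python; the bin count and the rounding of bin means are chosen to match the slide:

```python
# Sketch of equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means (rounded, as on the slide)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```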
Regression
Data can be smoothed by fitting it to a function, for example with linear regression.
[Figure: scattered data points smoothed onto the fitted line y = x + 1]
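A minimal sketch of regression-based smoothing, assuming numpy and illustrative data points; only the idea of replacing noisy values by their fitted values is taken from the slide:

```python
# Minimal sketch: smooth noisy y-values by projecting them onto a fitted line
# (numpy is an assumed choice; the slide only shows the idea of y = x + 1).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.8, 3.3, 3.9, 5.1])      # noisy observations

w, b = np.polyfit(x, y, deg=1)               # fit y ~ w*x + b
y_smooth = w * x + b                         # replace each y by its fitted value
print(w, b)
```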
Cluster Analysis
detect and remove outliers
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple
databases
Object identification: The same attribute or object may have
different names in different databases
Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Redundant attributes may be detected by correlation
analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson’s product moment
coefficient)
$$ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$
where n is the number of tuples, \bar{A} and \bar{B} are the respective
means of A and B, σ_A and σ_B are the respective standard deviations
of A and B, and Σ a_i b_i is the sum of the AB cross-products.
If r_{A,B} > 0, A and B are positively correlated (A's values increase as
B's do); the higher the value, the stronger the correlation
r_{A,B} = 0: no linear correlation between A and B
r_{A,B} < 0: A and B are negatively correlated
A high correlation value indicates that A (or B) may be removed as a redundancy
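A small sketch of the correlation coefficient as defined above, assuming numpy and illustrative attribute values:

```python
# Sketch of the correlation coefficient r_{A,B} defined above (numpy assumed).
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                        # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value via numpy's built-in
```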
Positively and Negatively Correlated Data
Not Correlated Data
Correlation Analysis (Categorical Data)
Χ2 (chi-square) test
$$ \chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}} $$
where the expected frequency is Expected = count(A = a_i) × count(B = b_j) / n
Chi-Square Calculation: An Example
E11=count(male)*count(fiction)/n=300*450/1500=90
$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$
Χ2 (chi-square) calculation (the numbers in parentheses in the contingency
table are the expected counts, calculated from the data distribution in the two categories)
The result shows that gender and preferred reading are strongly correlated in the given group
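The χ2 value above can be reproduced with a short sketch; the 2×2 contingency table (gender vs. preferred reading) is reconstructed from the observed counts shown on the slide:

```python
# Sketch reproducing the chi-square value above; the 2x2 contingency table
# (gender vs. preferred reading) is taken from the observed counts on the slide.
observed = {("male", "fiction"): 250, ("male", "non_fiction"): 50,
            ("female", "fiction"): 200, ("female", "non_fiction"): 1000}

n = sum(observed.values())                                   # 1500
row_tot = {g: sum(v for (gg, _), v in observed.items() if gg == g) for g in ("male", "female")}
col_tot = {r: sum(v for (_, rr), v in observed.items() if rr == r) for r in ("fiction", "non_fiction")}

chi2 = sum((obs - row_tot[g] * col_tot[r] / n) ** 2 / (row_tot[g] * col_tot[r] / n)
           for (g, r), obs in observed.items())
print(chi2)                                                  # about 507.93, as on the slide
```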
Covariance of Numeric Data
Correlation and covariance are two similar measures for assessing how much
two attributes change together. The mean values of A and B, respectively, are
also known as the expected values of A and B, that is, E(A) = \bar{A} and E(B) = \bar{B}, and
$$ \mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B}) $$
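A small sketch of the covariance definition above, assuming numpy and illustrative values:

```python
# Sketch of covariance as defined above (numpy assumed; data is illustrative).
import numpy as np

A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

cov = ((A - A.mean()) * (B - B.mean())).mean()   # E[(A - E[A])(B - E[B])]
print(cov)
print(np.cov(A, B, bias=True)[0, 1])             # same value (population covariance)
```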
Data Transformation
Smoothing: remove noise from the data; techniques include binning, regression,
and clustering
Aggregation: summarization, e.g., annual sales amounts can be aggregated
from monthly sales; data cube construction supports analysis at multiple
abstraction levels
Concept hierarchy generation for nominal data: where attributes such
as street can be generalized to higher-level concepts, like city or country
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]
$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
$$ v' = \frac{v - \bar{A}}{\sigma_A} $$
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
(73,600 − 54,000)/16,000 = 1.225
Normalization (contd.)
Normalization by decimal scaling
$$ v' = \frac{v}{10^{\,j}} $$
where j is the smallest integer such that max(|v'|) < 1
Example: recorded values range from −722 to 821, so j = 3 and each value is divided by 10^3 = 1,000
−28 normalizes to −0.028
444 normalizes to 0.444
Practice example: normalize 200, 300, 400, 600, 1000 to the range [min = 0, max = 1] (see the sketch below)
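A combined sketch of the three normalization methods in plain Python, reusing the income example above as a check:

```python
# Sketch of the three normalization methods above; the income example from the
# min-max and z-score slides is reused as a check.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1                                   # smallest j with max(|v'|) < 1
    return [v / (10 ** j) for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-722, 821, -28, 444]))   # [-0.722, 0.821, -0.028, 0.444]
```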
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining on huge amounts of data may take a very
long time to run on the complete data set, making such analysis
impractical or infeasible
Data reduction
Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical
results
Data reduction strategies
Dimensionality reduction — e.g., remove unimportant attributes
Data Compression
Numerosity reduction — e.g., fit data into models
Discretization and concept hierarchy generation
Data cube aggregation
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the
task
Queries regarding aggregated information should be answered
using data cube, when possible
Dimensionality reduction
Dimensionality reduction is the process of reducing the number of random
variables or attributes under consideration. Dimensionality reduction methods
include:
Attribute subset selection, in which irrelevant, weakly relevant, or
redundant attributes or dimensions are detected and removed
Wavelet transforms and principal components analysis, which
transform or project the original data onto a smaller space.
Attribute subset selection
Reduces the data set size by removing irrelevant or redundant attributes (or
dimensions).
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of
different classes given the values for those features is as close as possible to
the original distribution given the values of all features
reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Heuristic methods (the best and worst attributes are determined using various
measures, e.g., greedily):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
Attribute subset selection
These methods are typically greedy in that, while searching through attribute
space, they always make what looks to be the best choice at the time. Their
strategy is to make a locally optimal choice in the hope that this will lead to a
globally optimal solution
Attribute subset selection
The stopping criteria for the methods may vary. The procedure
may employ a threshold on the measure used to determine when
to stop the attribute selection process.
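A minimal sketch of stepwise forward selection with a threshold-based stopping criterion; the scoring function is a caller-supplied placeholder (e.g., a validation accuracy or an information-gain measure), and the toy usage is purely illustrative:

```python
# Minimal sketch of stepwise forward selection with a threshold stopping
# criterion (the scoring function `score` is a caller-supplied placeholder;
# nothing here is a fixed API).
def forward_selection(attributes, score, min_improvement=1e-3):
    selected, best = [], score([])
    remaining = list(attributes)
    while remaining:
        # greedily pick the attribute that improves the score the most
        cand, cand_score = max(((a, score(selected + [a])) for a in remaining),
                               key=lambda t: t[1])
        if cand_score - best < min_improvement:   # stopping criterion: no real gain
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected

# Toy usage: the "score" is simply how many useful attributes are included
useful = {"age", "income"}
print(forward_selection(["age", "income", "shoe_size"],
                        lambda attrs: len(useful & set(attrs))))
# -> ['age', 'income']
```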
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Time sequences are not like audio
Typically short and varying slowly with time
Data Compression
[Figure: lossless compression reconstructs the original data exactly; lossy compression reconstructs only an approximation of the original data]
Dimensionality Reduction: Wavelet Transformation
The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically different
vector, X’, of wavelet coefficients
we consider each tuple as an n-dimensional data vector, that is, X =(x1,x2,…,xn),
depicting n measurements made on the tuple from n database attributes.
A compressed approximation of the data can be retained by storing only a
small fraction of the strongest of the wavelet coefficients. For example, all
wavelet coefficients larger than some user-specified threshold can be retained.
Similar to discrete Fourier transform (DFT), but better lossy compression,
localized in space
Wavelet Transformation (contd.)
Method:
Length, L, of the input data vector must be an integer power of 2. (padding
with 0’s, when necessary)
Each transform applies two functions: a smoothing function and a difference
function, which acts to bring out the detailed features of the data.
The functions are applied to pairs of data points in X, resulting in two sets of
data of length L/2. In general, these represent a smoothed (low-frequency) version of
the input data and its high-frequency content, respectively.
The two functions are applied recursively until the desired length is reached
Selected values from the data sets obtained in the previous iterations are
designated the wavelet coefficients of the transformed data.
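A sketch of the pairwise smoothing/difference procedure described above, written as an unnormalized Haar-style transform; real DWT implementations differ in how the coefficients are scaled:

```python
# Sketch of the pairwise smoothing/difference procedure described above,
# i.e. an (unnormalized) Haar-style DWT; library DWTs differ in scaling.
def haar_dwt(x):
    assert (len(x) & (len(x) - 1)) == 0, "length must be a power of 2 (pad with 0s)"
    coeffs = []
    while len(x) > 1:
        smooth = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]   # low frequency
        detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]   # high frequency
        coeffs = detail + coeffs       # keep the detail coefficients of this level
        x = smooth                     # recurse on the smoothed half-length signal
    return x + coeffs                  # overall average followed by all details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```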
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors in n dimensions, find k ≤ n orthogonal vectors
(principal components) that can best be used to represent the data
Steps
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (using the strongest principal components, it is possible
to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
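A sketch of the PCA steps above using numpy's SVD; the data, the choice of k, and the use of SVD (rather than an eigendecomposition of the covariance matrix) are illustrative assumptions:

```python
# Sketch of PCA via numpy's SVD, following the steps above (k and the data are
# illustrative; real pipelines typically use a library such as scikit-learn).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)                           # normalize: center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False) # rows of Vt = principal components
k = 1                                             # keep only the strongest component
X_reduced = Xc @ Vt[:k].T                         # project onto the top-k components
X_approx = X_reduced @ Vt[:k] + X.mean(axis=0)    # reconstruct an approximation
print(X_reduced.shape, X_approx.shape)            # (10, 1) (10, 2)
```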
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data
representation
These techniques may be:
Parametric -- a model is used to estimate the data, so that typically
only the model parameters need to be stored, instead of the actual
data
Regression and log-linear models
Nonparametric
Histograms
Clustering
Sampling and
Data cube aggregation
Regression and log-linear models
Regression and log-linear models can be used to approximate the given
data.
In linear regression, the data are modeled to fit a straight line. For
example, a random variable, y (called a response variable), can be
modeled as a linear function of another random variable, x (called a
predictor variable), with the equation
y = w x + b
In the context of data mining, x and y are numeric database attributes.
The coefficients w and b are called the regression coefficients.
Multiple linear regression is an extension of (simple) linear
regression, which allows a response variable, y, to be modeled as a
linear function of two or more predictor variables.
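A small sketch of multiple linear regression as a parametric reduction: the raw (x, y) tuples can be replaced by the fitted coefficients (numpy and the toy data are assumptions):

```python
# Sketch of multiple linear regression via least squares: y is modeled from two
# predictors, and only the three coefficients need to be stored (numpy assumed).
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 15.0])

A = np.column_stack([X, np.ones(len(X))])      # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # [w1, w2, b]
print(coef)
y_hat = A @ coef                               # data approximated from the model alone
```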
Regression and log-linear models
Log-linear models approximate discrete multidimensional probability
distributions
Given a set of tuples in n dimensions (e.g., described by n attributes),
we can consider each tuple as a point in an n-dimensional space.
Log-linear models can be used to estimate the probability of each
point in a multidimensional space for a set of discretized attributes,
based on a smaller subset of dimensional combinations.
Log-linear models are therefore also useful for dimensionality
reduction (since the lower-dimensional points together typically
occupy less space than the original data points)
Histograms
Histograms use binning to approximate data distributions and are a
popular form of data reduction
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
Equal-width: width of each bucket range is uniform
Equal-frequency (or equal-depth) :each bucket contains roughly the
same number of contiguous samples
Histograms are highly effective at approximating both sparse and dense
data, as well as highly skewed and uniform data. The histograms for
single attributes can be extended for multiple attributes.
Multidimensional histograms can capture dependencies between
attributes.
The following data are a list of AllElectronics prices for
commonly sold items (rounded to the nearest dollar).
The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
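A sketch of an equal-width histogram over the price list above; the choice of three buckets is illustrative:

```python
# Sketch of an equal-width histogram over the price list above: store only the
# bucket boundaries and counts (and optionally sums) instead of the raw values.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
          20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

n_buckets, lo, hi = 3, min(prices), max(prices)
width = (hi - lo) / n_buckets
buckets = [[] for _ in range(n_buckets)]
for p in prices:
    i = min(int((p - lo) / width), n_buckets - 1)   # last bucket is right-closed
    buckets[i].append(p)

for i, b in enumerate(buckets):
    print(f"[{lo + i * width:.2f}, {lo + (i + 1) * width:.2f}]: count={len(b)}, sum={sum(b)}")
```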
Clustering
Clustering techniques consider data tuples as objects.
They partition the objects into groups, or clusters, so that objects
within a cluster are “similar” to one another and “dissimilar” to objects
in other clusters.
Similarity is commonly defined in terms of how “close” the objects are
in space, based on a distance function.
The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and is
defined as the average distance of each cluster object from the cluster
centroid
Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random data sample
Suppose that a large data set, D, contains N tuples. Let’s look at the
most common ways that we could sample D for data reduction:
Simple random sample without replacement
Simple random sample with replacement
Cluster sample
Stratified sampling:
Approximate the percentage of each class (or subpopulation of
interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (page at a time).
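A sketch of the three sampling schemes above using Python's random module; the data set and strata are illustrative:

```python
# Sketch of the sampling schemes above using Python's random module
# (data and strata are illustrative).
import random
from collections import defaultdict

D = list(range(100))                          # N = 100 "tuples"
s = 10

srswor = random.sample(D, s)                  # simple random sample without replacement
srswr = [random.choice(D) for _ in range(s)]  # simple random sample with replacement

# Stratified sampling: sample each class in proportion to its size
labels = ["A"] * 90 + ["B"] * 10              # skewed class distribution
strata = defaultdict(list)
for t, c in zip(D, labels):
    strata[c].append(t)
stratified = []
for c, tuples in strata.items():
    k = max(1, round(s * len(tuples) / len(D)))
    stratified.extend(random.sample(tuples, k))
print(len(srswor), len(srswr), len(stratified))
```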
Sampling (continued)
[Figure: simple random sampling from the raw data, without replacement (once selected, a tuple cannot be selected again) and with replacement (a selected tuple can be selected again)]
Sampling (continued)
If the data are clustered or stratified, perform a simple random sample
(with or without replacement) within each cluster or stratum
[Figure: raw data vs. cluster/stratified sample]
Discretization
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic
rank
Continuous — numeric values, e.g., integers or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level
concepts (such as numeric values for age) by higher level concepts
(such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods: All the methods can be applied recursively
Binning (covered above)
Top-down split, unsupervised,
Histogram analysis (covered above)
Top-down split, unsupervised
Clustering analysis (covered above)
Either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by χ2 analysis: unsupervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2
using boundary T, the information gain after partitioning is
$$ I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2) $$
Entropy is calculated based on class distribution of the samples in the set.
Given m classes, the entropy of S1 is
$$ \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) $$
where pi is the probability of class i in S1
The boundary that minimizes the entropy function over all possible
boundaries is selected as a binary discretization
The process is recursively applied to partitions obtained until some
stopping criterion is met
Such a boundary may reduce data size and improve classification accuracy
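A minimal sketch of one entropy-based split: each candidate boundary T is evaluated with I(S, T) as defined above and the minimizing boundary is returned (values and class labels are illustrative; the recursion and stopping criterion are omitted):

```python
# Sketch of a single entropy-based split: try each candidate boundary T and
# keep the one minimizing I(S, T) as defined above.
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_t, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate boundary
        s1 = [c for v, c in pairs if v <= t]
        s2 = [c for v, c in pairs if v > t]
        info = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

print(best_split([1, 2, 3, 10, 11, 12], ["no", "no", "no", "yes", "yes", "yes"]))
# -> boundary 6.5 with I(S, T) = 0.0
```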
Interval Merge by χ2 Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to form larger
intervals recursively
ChiMerge
Initially, each distinct value of a numerical attr. A is considered to be one
interval
χ2 tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ2 values are merged together, since low
χ2 values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping
criterion is met (such as significance level, max-interval, max
inconsistency, etc.)
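A compact ChiMerge sketch following the description above: each interval keeps per-class counts, and the adjacent pair with the lowest χ2 value is merged until a maximum number of intervals remains (the stopping criterion and the data are illustrative):

```python
# Compact ChiMerge sketch: each interval keeps per-class counts; repeatedly
# merge the adjacent pair with the lowest chi-square value.
from collections import Counter

def chi2_pair(c1, c2):
    classes = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    total = n1 + n2
    chi2 = 0.0
    for c in classes:
        col = c1.get(c, 0) + c2.get(c, 0)
        for obs, n in ((c1.get(c, 0), n1), (c2.get(c, 0), n2)):
            exp = n * col / total
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return chi2

def chimerge(values, labels, max_intervals=3):
    data = list(zip(values, labels))
    # start with one interval per distinct value
    intervals = [[v, v, Counter(c for vv, c in data if vv == v)]
                 for v in sorted(set(values))]
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][2], intervals[i + 1][2])
                for i in range(len(intervals) - 1)]
        i = chis.index(min(chis))                   # most similar neighbours
        lo, _, c1 = intervals[i]
        _, hi, c2 = intervals[i + 1]
        intervals[i:i + 2] = [[lo, hi, c1 + c2]]    # merge them
    return [(lo, hi) for lo, hi, _ in intervals]

print(chimerge([1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59],
               ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C"]))
# -> [(1, 9), (11, 39), (45, 59)]
```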
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data
into relatively uniform, “natural” intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3 equi-
width intervals
If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Step 1: from the sorted profit data, Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
Step 2: the most significant digit is msd = 1,000, so Low is rounded down to -$1,000 and High is rounded up to $2,000, giving the range (-$1,000, $2,000)
Step 3: the range covers 3 distinct values at the msd, so it is partitioned into 3 equi-width intervals: (-$1,000, 0], (0, $1,000], ($1,000, $2,000]
Step 4: adjusting for Min and Max, the first interval shrinks to (-$400, 0] and a new interval ($2,000, $5,000] is added, giving the top level (-$400, 0], (0, $1,000], ($1,000, $2,000], ($2,000, $5,000]
Finally, each top-level interval is partitioned recursively: (-$400, 0] into four $100 intervals, (0, $1,000] into five $200 intervals, ($1,000, $2,000] into five $200 intervals, and ($2,000, $5,000] into three $1,000 intervals
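A minimal sketch of a single 3-4-5 partitioning step, assuming low and high have already been rounded at the most significant digit as in Steps 2 and 3 above; the recursive subdivision is omitted:

```python
# Minimal sketch of one step of the 3-4-5 rule: given low and high already
# rounded at the most significant digit (msd), choose 3, 4 or 5 equi-width
# intervals according to the rule above.
def three_four_five(low, high, msd):
    distinct = round((high - low) / msd)           # distinct values at the msd
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        raise ValueError("rule does not apply directly; adjust the msd")
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(three_four_five(-1000, 2000, 1000))
# -> [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)], as in Step 3 above
```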
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based
on the analysis of the number of distinct values per
attribute in the data set
The attribute with the most distinct values is placed at
the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country 15 distinct values
Province or state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
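A small sketch of this heuristic: attributes are ordered by their number of distinct values, using the counts from the slide:

```python
# Sketch of automatic hierarchy generation: order attributes by their number of
# distinct values, most distinct at the bottom (counts taken from the slide).
distinct_counts = {"country": 15, "province_or_state": 365,
                   "city": 3567, "street": 674339}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # top level first
print(" < ".join(reversed(hierarchy)))
# -> street < city < province_or_state < country
```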
Why Data Mining Primitives and Languages?
Finding all the patterns autonomously in a database? —
unrealistic because the patterns could be too many but
uninteresting
Data mining should be an interactive process
The user directs what is to be mined
Users must be provided with a set of primitives to communicate with the data mining system
Incorporating these primitives in a data mining query language
More flexible user interaction
Foundation for design of graphical user interface
Standardization of data mining industry and practice
What Defines a Data Mining Task?
Task-relevant data
Type of knowledge to be mined
Background knowledge
Pattern interestingness measurements
Visualization of discovered patterns
Data Mining Primitives
Task-relevant data: This is the database portion to be
investigated. For example, suppose that you are a manager of All
Electronics in charge of sales in the United States and Canada. In
particular, you would like to study the buying trends of
customers in Canada. Rather than mining on the entire
database, you specify only the task-relevant portion of the data; the attributes involved are referred to as relevant attributes
The kinds of knowledge to be mined: This specifies the data
mining functions to be performed, such as characterization,
discrimination, association, classification, clustering, or
evolution analysis. For instance, if studying the buying habits of
customers in Canada, you may choose to mine associations
between customer profiles and the items that these customers
like to buy
Data Mining Primitives
Background knowledge: Users can specify background
knowledge, or knowledge about the domain to be mined.
This knowledge is useful for guiding the knowledge
discovery process, and for evaluating the patterns found.
Interestingness measures: These functions are used to
separate uninteresting patterns from knowledge.
Simplicity: e.g., (association) rule length
Certainty: e.g., confidence, P(A|B)
Utility: potential usefulness, e.g., support (association)
Novelty: not previously known, surprising
Data Mining Primitives
Presentation and visualization of discovered
patterns:
This refers to the form in which discovered patterns are
to be displayed. Users can choose from different forms
for knowledge presentation, such as rules, tables, charts,
graphs, decision trees, and cubes.
A Data Mining Query Language (DMQL)
Motivation
A DMQL can provide the ability to support ad-hoc and
interactive data mining
By providing a standardized language like SQL
Design
DMQL is designed with the primitives described earlier
The DMQL can work with databases and data
warehouses as well. DMQL can be used to define data
mining tasks.
A Data Mining Query Language (DMQL)
Syntax for Task-Relevant Data Specification
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Syntax for Specifying the Kind of Knowledge
Characterization
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum,
or count%.
A Data Mining Query Language (DMQL)
Association
mine associations [ as {pattern_name} ]
{matching {metapattern} }
Eg. mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is key of customer relation; P and Q are predicate variables;
and W, Y, and Z are object variables.
Classification
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
Eg. analyze credit_rating
Syntax for Concept Hierarchy Specification
use hierarchy <hierarchy> for <attribute_or_dimension>
Eg. define hierarchy time_hierarchy on date as [date, month,
quarter, year]
A Data Mining Query Language (DMQL)
Syntax for Interestingness Measures Specification
with support threshold = 0.05
Syntax for Pattern Presentation and Visualization Specification
display as table
Full Specification of DMQL
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID
and P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table