Data Transformation
■ A function that maps the entire set of values of a
given attribute to a new set of replacement
values s.t. each old value can be identified with
one of the new values
■ Methods
■ Smoothing: Remove noise from data
■ Attribute/feature construction
■ New attributes constructed from the given
ones
[Figure: Forms of data preprocessing]
Data Transformation
■ A function that maps the entire set of values of a given
attribute to a new set of replacement values s.t. each old
value can be identified with one of the new values
■ Aggregation: Summarization, data cube construction
■ Normalization: Scaled to fall within a smaller, specified
range
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
■ Discretization: Concept hierarchy climbing
Normalization
■ Min-max normalization: to [new_minA, new_maxA]
    v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
■ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
■ Z-score normalization (μ: mean, σ: standard deviation):
    v' = (v − μ) / σ
■ Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
    (73,600 − 54,000) / 16,000 = 1.225
■ Normalization by decimal scaling
    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
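The three normalization methods can be sketched in a few lines of pure Python. This is a minimal illustration of the formulas above; the function names are ours, not from any library.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide every value by 10^j, with j the smallest integer making all |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]
```

For the income example, `min_max(73600, 12000, 98000)` gives about 0.716 and `z_score(73600, 54000, 16000)` gives 1.225, matching the slide.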
Discretization
■ Three types of attributes
■ Nominal—values from an unordered set, e.g.,
color, profession
■ Ordinal—values from an ordered set, e.g.,
military or academic rank
■ Numeric—numeric values, e.g., integers or real
numbers
Discretization
■ Discretization: Divide the range of a continuous attribute
into intervals
■ Interval labels can then be used to replace actual data
values
■ Reduce data size by discretization
■ Supervised vs. unsupervised
■ Split (top-down) vs. merge (bottom-up)
■ Discretization can be performed recursively on an
attribute
■ Prepare for further analysis, e.g., classification
Data Discretization Methods
■ Typical methods: All the methods can be applied
recursively
■ Binning
■ Top-down split, unsupervised
■ Histogram analysis
■ Top-down split, unsupervised
■ Clustering analysis (unsupervised, top-down split or
bottom-up merge)
■ Decision-tree analysis (supervised, top-down split)
■ Correlation (e.g., χ2) analysis (supervised, bottom-up
merge)
Simple Discretization: Binning
■ Equal-width (distance) partitioning
■ Divides the range into N intervals of equal size: uniform grid
■ If A and B are the lowest and highest values of the attribute, the
width of the intervals will be: W = (B − A) / N
■ The most straightforward, but outliers may dominate presentation
■ Skewed data is not handled well
■ Equal-depth (frequency) partitioning
■ Divides the range into N intervals, each containing approximately
same number of samples
■ Good data scaling
■ Managing categorical attributes can be tricky
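The two partitioning schemes can be sketched as follows; this is a toy illustration (our own function names), with equal-width returning bin indices and equal-depth returning the bins themselves.

```python
def equal_width_bins(values, n):
    """Assign each value a bin index 0..n-1 over intervals of width (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    # Clamp so the maximum value falls into the last bin rather than bin n.
    return [min(int((v - a) / w), n - 1) for v in values]

def equal_depth_bins(sorted_values, n):
    """Split sorted data into n bins holding (approximately) the same count.

    Assumes len(sorted_values) is divisible by n; a real implementation
    would distribute any remainder.
    """
    size = len(sorted_values) // n
    return [sorted_values[i * size:(i + 1) * size] for i in range(n)]
```

On the price data used on the next slide, `equal_depth_bins([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], 3)` reproduces the three equi-depth bins, while `equal_width_bins` with W = (34 − 4)/3 = 10 puts many more values into the last interval.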
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
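The two smoothing rules above can be reproduced with a short sketch (our own helper names, not a library API):

```python
def smooth_by_means(bins):
    """Replace each value in a bin by the bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by whichever bin boundary (min or max) is nearer."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out
```

Applied to the equi-depth bins `[[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]`, these produce exactly the smoothed bins shown above (e.g., Bin 2's mean 91/4 = 22.75 rounds to 23).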
Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal interval width (binning), by
equal frequency (binning), and by K-means clustering; K-means clustering
leads to better results]
Cluster Analysis
■ Allows detection and removal of outliers
Regression
[Figure: data points plotted against a fitted line y = x + 1, with Y1'
the smoothed value of Y1 on the line]
■ Linear regression: find the best line to fit two variables, and use the
regression function to smooth data
Data Integration
■ Data integration:
■ combines data from multiple sources into a coherent store
■ Schema integration
■ integrate metadata from different sources
■ Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
■ Detecting and resolving data value conflicts
■ for the same real world entity, attribute values from different
sources are different
■ possible reasons: different representations, different scales, e.g.,
metric vs. British units
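A toy illustration of the entity identification and schema integration steps: source A keys customers by cust-id, source B by cust-#, and both key names are mapped onto one canonical key before joining into the coherent store. All field names and records here are made up.

```python
# Two sources describing the same real-world entities under different schemas.
a = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
b = [{"cust_num": 1, "revenue": 500}, {"cust_num": 2, "revenue": 900}]

# Schema integration: treat cust_id and cust_num as the same key,
# then join on it to build one coherent record per entity.
store = {}
for rec in a:
    store[rec["cust_id"]] = {"name": rec["name"]}
for rec in b:
    store.setdefault(rec["cust_num"], {})["revenue"] = rec["revenue"]
```

After the loop, `store[1]` holds both the name from A and the revenue from B for the same customer.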
Handling Redundant Data in
Data Integration
■ Redundant data often occur when integrating multiple
databases
■ The same attribute may have different names in different
databases
■ One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
■ Redundant data may be able to be detected by
correlational analysis
■ Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlational Analysis
■ r_A,B = Σ (a_i − Ā)(b_i − B̄) / ((n − 1) · sd_A · sd_B)
    where Ā = mean value of A = Σ a_i / n
    sd_A = standard deviation of A = sqrt( Σ (a_i − Ā)² / (n − 1) )
■ r_A,B < 0: negatively correlated; = 0: no correlation; > 0: positively
correlated – consider removal of A or B
Correlational Analysis Example
❖ A – 2, 5, 6, 8, 22, 33, 44, 55
❖ B – 6, 7, 22, 33, 44, 66, 67, 70
❖ Ā ≈ 21.9, B̄ ≈ 39.4
❖ Σ (a_i − Ā)(b_i − B̄) ≈ 3496.4
❖ sd_A ≈ 20.1, sd_B ≈ 26.6
❖ r_A,B = 3496.4 / (7 × 20.1 × 26.6) ≈ 0.93
r_A,B > 0 – correlated – consider removal of A or B
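As a check on the arithmetic, the r_A,B formula (with the sample standard deviation, i.e., dividing by n − 1) can be computed directly; this is a plain-Python sketch, not a library routine.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient using sample (n - 1) standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sdx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sdy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / ((n - 1) * sdx * sdy)

A = [2, 5, 6, 8, 22, 33, 44, 55]
B = [6, 7, 22, 33, 44, 66, 67, 70]
```

`pearson_r(A, B)` evaluates to about 0.93: strongly positively correlated, so one of A or B is a candidate for removal.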
Data Transformation
■ Smoothing: remove noise from data
■ Aggregation: summarization, data cube construction
■ Generalization: concept hierarchy climbing
■ Normalization: scaled to fall within a small, specified
range
■ min-max normalization
■ z-score (zero mean) normalization
■ normalization by decimal scaling
■ Attribute/feature construction
■ New attributes constructed from the given ones to help in the
data mining process
Data Transformation:
Normalization
■ min-max normalization
■ Example – income, min $55,000, max $150,000 – map
to [0.0, 1.0]
■ $73,600 is transformed to:
    (73,600 − 55,000) / (150,000 − 55,000) × (1.0 − 0) + 0 = 0.196
Discretization by Classification &
Correlation Analysis
■ Classification (e.g., decision tree analysis)
■ Supervised: Given class labels, e.g., cancerous vs. benign
■ Using entropy to determine split point (discretization point)
■ Top-down, recursive split
■ Correlation analysis (e.g., Chi-merge: χ2-based discretization)
■ Supervised: use class information
■ Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
■ Merge performed recursively, until a predefined stopping condition
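The entropy-based split selection described above can be sketched as follows: try each candidate midpoint and keep the one minimizing the weighted entropy of the two resulting intervals. The data and labels are illustrative, not from the slides.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a (non-empty) list of class labels."""
    n = len(labels)
    counts = {lbl: labels.count(lbl) for lbl in set(labels)}
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Pick the midpoint that minimizes the weighted entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lbl for v, lbl in pairs if v <= split]
        right = [lbl for v, lbl in pairs if v > split]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (split, score)
    return best[0]
```

For values [1, 2, 3, 10, 11, 12] with class labels b, b, b, c, c, c, the chosen split is 6.5: both resulting intervals are pure, so the weighted entropy is zero. Applied recursively to each interval, this yields a top-down, supervised discretization.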
Concept Hierarchy Generation
■ Concept hierarchy organizes concepts (i.e., attribute
values) hierarchically and is usually associated with each
dimension in a data warehouse
■ Concept hierarchies facilitate drilling and rolling in data
warehouses to view data at multiple levels of granularity
■ Concept hierarchy formation: Recursively reduce the data
by collecting and replacing low level concepts (such as
numeric values for age) by higher level concepts (such as
youth, adult, or senior)
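One hierarchy-climbing step for the age example can be sketched as a simple mapping; the cutoff ages here are illustrative assumptions, not standard values.

```python
def age_concept(age):
    """Replace a numeric age with a higher-level concept (illustrative cutoffs)."""
    if age < 20:
        return "youth"
    elif age < 60:
        return "adult"
    return "senior"

ages = [13, 25, 47, 64]
concepts = [age_concept(a) for a in ages]
```

Each pass of such a mapping reduces the data to fewer, coarser values, which is exactly the recursive reduction described above.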
Concept Hierarchy Generation
■ Concept hierarchies can be explicitly specified by
domain experts and/or data warehouse designers
■ Concept hierarchies can be automatically formed for
both numeric and nominal data. For numeric data,
use the discretization methods shown earlier.
Concept Hierarchy Generation
for Nominal Data
■ Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
■ street < city < state < country
■ Specification of a hierarchy for a set of values by explicit
data grouping
■ {Urbana, Champaign, Chicago} < Illinois
■ Specification of only a partial set of attributes
■ E.g., only street < city, not others
■ Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
■ E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
■ The attribute with the most distinct values is placed at
the lowest level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year
country            15 distinct values
province_or_state  365 distinct values
city               3,567 distinct values
street             674,339 distinct values
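The heuristic above amounts to sorting attributes by their distinct-value counts, with the fewest distinct values at the top of the hierarchy. A minimal sketch, using the counts from the slide (the column data itself is omitted):

```python
# Distinct-value counts per attribute, as given on the slide.
counts = {"country": 15, "province_or_state": 365,
          "city": 3567, "street": 674339}

# Fewest distinct values -> highest level of the hierarchy.
hierarchy = sorted(counts, key=counts.get)
```

This yields country < province_or_state < city < street, reading top level first. As the slide notes, the heuristic has exceptions (e.g., there are fewer weekdays than months, yet weekday is not a higher-level concept than month).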
Summary
■ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
■ Data cleaning: e.g., missing/noisy values, outliers
■ Data integration from multiple sources:
■ Entity identification problem
■ Remove redundancies
■ Detect inconsistencies
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation
Next Session: Data Cube Technology