Data Preprocessing
- Data Reduction
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same)
analytical results
Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Data Reduction: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Visualization Problem
Not easy to visualize multivariate data
- 1D: dot
- 2D: Bivariate plot (i.e. X-Y plane)
- 3D: X-Y-Z plot
- 4D: ternary plot with a color code / tetrahedron
- 5D, 6D, etc.: ???
Motivation
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information
Basics of PCA
PCA is useful when we need to extract meaningful information
from multivariate data sets.
The technique works by re-expressing the data in a space of reduced dimensionality.
What is a Principal Component?
A principal component can be defined as a linear
combination of optimally-weighted observed variables.
What are the new axes?
[Figure: data cloud plotted over Original Variable A (x-axis) and Original Variable B (y-axis), with the new axes PC 1 and PC 2]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
Principal Component Analysis
PCA:
Orthogonal projection of the data onto a lower-dimensional
linear space that...
• maximizes the variance of the projected data (purple line)
• minimizes the mean squared distance between each data point and its
projection (sum of blue lines)
The Principal Components
• Vectors originating from the center of mass
• Principal component #1 points in the direction of the largest variance.
• Each subsequent principal component…
• is orthogonal to the previous ones, and
• points in the direction of the largest variance of the residual subspace
[Figure: 2D Gaussian dataset]
[Figure: 1st PCA axis]
[Figure: 2nd PCA axis]
Principal component analysis
• Principal component analysis (PCA) is a procedure that
uses the correlations between the variables to identify
which combinations of variables capture the most information
about the dataset
• Mathematically, it determines the eigenvectors of the
covariance matrix and sorts them in importance according
to their corresponding eigenvalues
Basics for Principal Component Analysis
• Orthogonal/Orthonormal
• Standard deviation, Variance, Covariance
• The Covariance matrix
• Eigenvalues and Eigenvectors
Covariance
• Standard Deviation and Variance are 1-dimensional
• How much do the dimensions vary from the mean with respect to each other?
• Covariance measures this between two dimensions
• It is easy to see that if X = Y, the covariance reduces to the variance
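A minimal NumPy sketch of this idea (not part of the original slides; the 1/n divisor is chosen to match the 1/M convention used later in these notes, while many texts use 1/(n−1)):

```python
import numpy as np

def cov(x, y):
    """Covariance of two 1-D arrays (1/n convention)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

print(cov(x, y))            # how x and y vary together
print(cov(x, x), x.var())   # cov(X, X) equals the variance of X
```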
Covariance Matrix
• Let X be a random vector.
• Then the covariance matrix of X, denoted by Cov(X), is the matrix whose (i, j) entry is cov(X_i, X_j).
• The diagonal entries of Cov(X) are the variances v(x_i) = σ_i².
• In matrix notation, Cov(X) = E[(X − μ)(X − μ)^T], where μ = E[X].
• The covariance matrix is symmetric.
Orthogonality/Orthonormality
Example: <v1, v2> = <(1, 0), (0, 1)> = 0
• Two vectors v1 and v2 for which <v1,v2>=0 holds are said to be orthogonal
• Unit vectors which are orthogonal are said to be orthonormal.
Eigenvalues/Eigenvectors
• Let A be an n×n square matrix and x an n×1 column vector. Then a (right)
eigenvector of A is a nonzero vector x such that:
  A x = λ x   (λ: eigenvalue, x: eigenvector)
Procedure:
  1. Find the eigenvalues: solve det(A − λI) = 0 for the λ's
  2. Find the corresponding eigenvectors: for each λ, solve (A − λI) x = 0
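A small NumPy illustration of this procedure (the matrix A and the names below are illustrative, not from the slides; np.linalg.eig returns the eigenvalues and unit-length eigenvectors):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # example 2x2 matrix

vals, vecs = np.linalg.eig(A)         # eigenvalues, eigenvectors (as columns)

# Verify A x = lambda x for each eigenpair
for lam, x in zip(vals, vecs.T):
    print(lam, np.allclose(A @ x, lam * x))
```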
Transformation
• Looking for a transformation of the data matrix X (p×n) such that
  Y = a^T X = a_1 X_1 + a_2 X_2 + ... + a_p X_p
Transformation
What is a reasonable choice for the weights a?
Remember: we wanted a transformation that maximizes information
That means: captures variance in the data
Maximize the variance of the projection of the observations on the Y variables!
Find a such that Var(a^T X) is maximal
The matrix C = Var(X) is the covariance matrix of the X_i variables
Transformation
Can we intuitively see that in a picture?
[Figure: two candidate projection directions, labeled "Good" and "Better"]
Cov(X) =
  | v(x_1)       c(x_1, x_2)   ...   c(x_1, x_p) |
  | c(x_1, x_2)  v(x_2)        ...   c(x_2, x_p) |
  | ...                                          |
  | c(x_1, x_p)  c(x_2, x_p)   ...   v(x_p)      |
PCA algorithm
(based on sample covariance matrix)
• Given data {x1, …, xm}, compute covariance matrix
  Σ = (1/m) Σ_{i=1}^{m} (x_i − x̄)(x_i − x̄)^T   where   x̄ = (1/m) Σ_{i=1}^{m} x_i
• PCA basis vectors = the eigenvectors of Σ
• Larger eigenvalue ⇒ more important eigenvector
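A compact NumPy sketch of this algorithm (function and variable names are mine; np.linalg.eigh is used because the covariance matrix is symmetric):

```python
import numpy as np

def pca(X):
    """PCA from the sample covariance matrix.

    X: (m, d) array with one data point per row.
    Returns eigenvalues in decreasing order and matching eigenvectors (columns).
    """
    x_bar = X.mean(axis=0)               # sample mean
    Xc = X - x_bar                        # centered data
    Sigma = (Xc.T @ Xc) / len(X)          # (1/m) * sum_i (x_i - x_bar)(x_i - x_bar)^T
    vals, vecs = np.linalg.eigh(Sigma)    # eigendecomposition of a symmetric matrix
    order = np.argsort(vals)[::-1]        # larger eigenvalue => more important eigenvector
    return vals[order], vecs[:, order]
```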
PCA – zero mean
• Suppose we are given x_1, x_2, ..., x_M (N x 1) vectors
  (N: # of features, M: # of data points)
Step 1: compute the sample mean
  x̄ = (1/M) Σ_{i=1}^{M} x_i
Step 2: subtract the sample mean (i.e., center the data at zero)
  Φ_i = x_i − x̄
Step 3: compute the sample covariance matrix Σ_x
  Σ_x = (1/M) Σ_{i=1}^{M} (x_i − x̄)(x_i − x̄)^T = (1/M) Σ_{i=1}^{M} Φ_i Φ_i^T = (1/M) A A^T
  where A = [Φ_1 Φ_2 ... Φ_M], i.e., the columns of A are the Φ_i (N x M matrix)
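The same three steps sketched in NumPy, with the data stored column-wise as in these slides (all names are mine):

```python
import numpy as np

# X: (N, M) matrix whose columns are the data vectors x_1 ... x_M
X = np.random.default_rng(0).normal(size=(3, 10))

x_bar = X.mean(axis=1, keepdims=True)   # Step 1: sample mean (N x 1)
A = X - x_bar                           # Step 2: columns of A are Phi_i = x_i - x_bar
Sigma_x = (A @ A.T) / X.shape[1]        # Step 3: (1/M) * A A^T, an N x N matrix
```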
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σx
  Σ_x u_i = λ_i u_i
  where we assume λ_1 ≥ λ_2 ≥ ... ≥ λ_N
  (Note: most software packages return the eigenvalues (and corresponding eigenvectors)
  in decreasing order; if not, you can explicitly put them in this order.)
Since Σ_x is symmetric, <u_1, u_2, ..., u_N> form an orthogonal basis in R^N,
and we can represent any x ∈ R^N as:
  x − x̄ = Σ_{i=1}^{N} y_i u_i = y_1 u_1 + y_2 u_2 + ... + y_N u_N
  where y_i = (x − x̄)^T u_i / (u_i^T u_i) = (x − x̄)^T u_i if ||u_i|| = 1
  i.e., this is just a "change" of basis!
  (Note: most software packages normalize u_i to unit length to simplify calculations; if
  not, you can explicitly normalize them.)
PCA - Steps
Step 5: dimensionality reduction step – approximate x using
only the first K eigenvectors (K << N), i.e., those corresponding to
the K largest eigenvalues (K is a parameter):
  x̂ − x̄ = Σ_{i=1}^{K} y_i u_i
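A sketch of Steps 4–5 in NumPy, assuming U holds unit-length eigenvectors as columns, sorted by decreasing eigenvalue (function names are mine):

```python
import numpy as np

def project(x, x_bar, U, K):
    """Coefficients y_i = (x - x_bar)^T u_i for the first K eigenvectors."""
    return U[:, :K].T @ (x - x_bar)

def reconstruct(y, x_bar, U, K):
    """Step 5: approximate x as x_hat = x_bar + sum_{i=1}^{K} y_i u_i."""
    return x_bar + U[:, :K] @ y
```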
Example
• Compute the PCA of the following dataset:
(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)
• The sample covariance matrix is:
• The eigenvalues can be computed by finding the roots of the
characteristic polynomial:
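As a numerical cross-check, a short NumPy script for this example (the exact values printed depend on whether the covariance uses the 1/M or 1/(M−1) divisor; this sketch uses the 1/M convention from the earlier slides):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4),
              (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

Xc = X - X.mean(axis=0)               # center the data
Sigma = (Xc.T @ Xc) / len(X)          # sample covariance matrix (1/M convention)
vals, vecs = np.linalg.eigh(Sigma)    # ascending eigenvalues, unit eigenvectors

print(Sigma)
print(vals[::-1])                     # eigenvalues in decreasing order
print(vecs[:, ::-1])                  # corresponding eigenvectors (columns)
```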
Example (cont’d)
• The eigenvectors are the solutions of the systems:
  Σ_x u_i = λ_i u_i
Note: if u_i is a solution, then c·u_i is also a solution for any c ≠ 0.
Eigenvectors can be normalized to unit length using:
  v̂_i = v_i / ||v_i||
Choosing the projection dimension K ?
• K is typically chosen based on how much information
(variance) we want to preserve:
  Choose the smallest K that satisfies the following inequality:
  ( Σ_{i=1}^{K} λ_i ) / ( Σ_{i=1}^{N} λ_i ) > T,   where T is a threshold (e.g., 0.9)
• If T = 0.9, for example, we "preserve" 90% of the information
(variance) in the data.
• If K = N, then we "preserve" 100% of the information in the
data (i.e., it is just a "change" of basis and x̂ = x).
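The same rule as a small helper, assuming vals holds the eigenvalues in decreasing order (names are mine):

```python
import numpy as np

def choose_k(vals, T=0.9):
    """Smallest K whose leading eigenvalues preserve at least a fraction T of the variance."""
    ratio = np.cumsum(vals) / np.sum(vals)   # cumulative variance ratio
    return int(np.argmax(ratio >= T)) + 1    # first index meeting the threshold (0-based) + 1
```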
Data Normalization
• The principal components are dependent on the units used
to measure the original variables as well as on the range of
values they assume.
• Data should always be normalized prior to using PCA.
• A common normalization method is to transform all the data
to have zero mean and unit standard deviation:
  x_i → (x_i − μ) / σ,   where μ and σ are the mean and standard
deviation of the i-th feature x_i
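A minimal sketch of this normalization in NumPy, applied to the rows-as-samples layout used in the pca() sketch above (names are mine):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)        # per-feature mean
    sigma = X.std(axis=0)      # per-feature standard deviation
    return (X - mu) / sigma
```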