Data Preprocessing
- Data Reduction
Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same)
analytical results
Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Data Reduction: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Visualization Problem
Not easy to visualize multivariate data
- 1D: dot
- 2D: Bivariate plot (i.e. X-Y plane)
- 3D: X-Y-Z plot
- 4D: ternary plot with a color code / tetrahedron
- 5D, 6D, etc.: ???
Motivation
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information
Basics of PCA
PCA is useful when we need to extract meaningful information
from multivariate data sets.
The technique works by re-expressing the data in a space of reduced dimensionality.
What is a Principal Component?
A principal component can be defined as a linear
combination of optimally-weighted observed variables.
What are the new axes?
[Figure: data cloud plotted over Original Variable A (x-axis) and Original Variable B (y-axis), with the new axes PC 1 and PC 2]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
Principal Component Analysis
PCA:
Orthogonal projection of the data onto a lower-dimensional
linear space that...
• maximizes the variance of the projected data (purple line)
• minimizes the mean squared distance between each data point and its
projection (sum of blue lines)
The Principal Components
• Vectors originating from the center of mass
• Principal component #1 points in the direction of the largest variance.
• Each subsequent principal component…
• is orthogonal to the previous ones, and
• points in the direction of the largest variance of the residual subspace
[Figure: 2D Gaussian dataset]
[Figure: 1st PCA axis]
[Figure: 2nd PCA axis]
Principal component analysis
• Principal component analysis (PCA) is a procedure that
uses the correlations between the variables to identify
which combinations of variables capture the most information
about the dataset
• Mathematically, it determines the eigenvectors of the
covariance matrix and sorts them in importance according
to their corresponding eigenvalues
Basics for Principal Component Analysis
• Orthogonal/Orthonormal
• Standard deviation, Variance, Covariance
• The Covariance matrix
• Eigenvalues and Eigenvectors
Covariance
• Standard Deviation and Variance are 1-dimensional
• How much do the dimensions vary from the mean with respect to each other?
• Covariance measures this between two dimensions
• It is easy to see that if X = Y, the covariance reduces to the variance
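A minimal NumPy sketch of this idea (not part of the original slides; the 1/n divisor is chosen to match the 1/M convention used later in these notes, while many texts use 1/(n−1)):

```python
import numpy as np

def cov(x, y):
    """Covariance of two 1-D arrays (1/n convention)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

print(cov(x, y))            # how x and y vary together
print(cov(x, x), x.var())   # cov(X, X) equals the variance of X
```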
Covariance Matrix
• Let X be a random vector.
• Then the covariance matrix of X, denoted by Cov(X), is the matrix whose (i, j) entry is cov(X_i, X_j).
• The diagonal entries of Cov(X) are the variances v(x_i) = σ_i².
• In matrix notation, Cov(X) = E[(X − μ)(X − μ)^T], where μ = E[X].
• The covariance matrix is symmetric.
Orthogonality/Orthonormality
Example: <v1, v2> = <(1, 0), (0, 1)> = 0
• Two vectors v1 and v2 for which <v1,v2>=0 holds are said to be orthogonal
• Unit vectors which are orthogonal are said to be orthonormal.
Eigenvalues/Eigenvectors
• Let A be an n×n square matrix and x an n×1 column vector. Then a (right)
eigenvector of A is a nonzero vector x such that:
  A x = λ x   (λ: eigenvalue, x: eigenvector)
Procedure:
  1. Find the eigenvalues: solve det(A − λI) = 0 for the λ's
  2. Find the corresponding eigenvectors: for each λ, solve (A − λI) x = 0
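A small NumPy illustration of this procedure (the matrix A and the names below are illustrative, not from the slides; np.linalg.eig returns the eigenvalues and unit-length eigenvectors):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # example 2x2 matrix

vals, vecs = np.linalg.eig(A)         # eigenvalues, eigenvectors (as columns)

# Verify A x = lambda x for each eigenpair
for lam, x in zip(vals, vecs.T):
    print(lam, np.allclose(A @ x, lam * x))
```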
Transformation
• Looking for a transformation of the data matrix X (p×n) such that
  Y = a^T X = a_1 X_1 + a_2 X_2 + ... + a_p X_p
Transformation
What is a reasonable choice for the weights a?
Remember: we wanted a transformation that maximizes information
That means: captures variance in the data
Maximize the variance of the projection of the observations on the Y variables!
Find a such that Var(a^T X) is maximal
The matrix C = Var(X) is the covariance matrix of the X_i variables
Transformation
Can we intuitively see that in a picture?
[Figure: two candidate projection directions, labeled "Good" and "Better"]
Cov(X) =
  | v(x_1)       c(x_1, x_2)   ...   c(x_1, x_p) |
  | c(x_1, x_2)  v(x_2)        ...   c(x_2, x_p) |
  | ...                                          |
  | c(x_1, x_p)  c(x_2, x_p)   ...   v(x_p)      |
PCA algorithm
(based on sample covariance matrix)
• Given data {x1, …, xm}, compute covariance matrix
  Σ = (1/m) Σ_{i=1}^{m} (x_i − x̄)(x_i − x̄)^T   where   x̄ = (1/m) Σ_{i=1}^{m} x_i
• PCA basis vectors = the eigenvectors of Σ
• Larger eigenvalue ⇒ more important eigenvector
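A compact NumPy sketch of this algorithm (function and variable names are mine; np.linalg.eigh is used because the covariance matrix is symmetric):

```python
import numpy as np

def pca(X):
    """PCA from the sample covariance matrix.

    X: (m, d) array with one data point per row.
    Returns eigenvalues in decreasing order and matching eigenvectors (columns).
    """
    x_bar = X.mean(axis=0)               # sample mean
    Xc = X - x_bar                        # centered data
    Sigma = (Xc.T @ Xc) / len(X)          # (1/m) * sum_i (x_i - x_bar)(x_i - x_bar)^T
    vals, vecs = np.linalg.eigh(Sigma)    # eigendecomposition of a symmetric matrix
    order = np.argsort(vals)[::-1]        # larger eigenvalue => more important eigenvector
    return vals[order], vecs[:, order]
```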
PCA – zero mean
• Suppose we are given x_1, x_2, ..., x_M (N x 1) vectors
  (N: # of features, M: # of data points)
Step 1: compute the sample mean
  x̄ = (1/M) Σ_{i=1}^{M} x_i
Step 2: subtract the sample mean (i.e., center the data at zero)
  Φ_i = x_i − x̄
Step 3: compute the sample covariance matrix Σ_x
  Σ_x = (1/M) Σ_{i=1}^{M} (x_i − x̄)(x_i − x̄)^T = (1/M) Σ_{i=1}^{M} Φ_i Φ_i^T = (1/M) A A^T
  where A = [Φ_1 Φ_2 ... Φ_M], i.e., the columns of A are the Φ_i (N x M matrix)
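The same three steps sketched in NumPy, with the data stored column-wise as in these slides (all names are mine):

```python
import numpy as np

# X: (N, M) matrix whose columns are the data vectors x_1 ... x_M
X = np.random.default_rng(0).normal(size=(3, 10))

x_bar = X.mean(axis=1, keepdims=True)   # Step 1: sample mean (N x 1)
A = X - x_bar                           # Step 2: columns of A are Phi_i = x_i - x_bar
Sigma_x = (A @ A.T) / X.shape[1]        # Step 3: (1/M) * A A^T, an N x N matrix
```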
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σx
  Σ_x u_i = λ_i u_i
  where we assume λ_1 ≥ λ_2 ≥ ... ≥ λ_N
  (Note: most software packages return the eigenvalues (and corresponding eigenvectors)
  in decreasing order; if not, you can explicitly put them in this order.)
Since Σ_x is symmetric, <u_1, u_2, ..., u_N> form an orthogonal basis in R^N,
and we can represent any x ∈ R^N as:
  x − x̄ = Σ_{i=1}^{N} y_i u_i = y_1 u_1 + y_2 u_2 + ... + y_N u_N
  where y_i = (x − x̄)^T u_i / (u_i^T u_i) = (x − x̄)^T u_i if ||u_i|| = 1
  i.e., this is just a "change" of basis!
  (Note: most software packages normalize u_i to unit length to simplify calculations; if
  not, you can explicitly normalize them.)
PCA - Steps
Step 5: dimensionality reduction step – approximate x using
only the first K eigenvectors (K << N), i.e., those corresponding to
the K largest eigenvalues (K is a parameter):
  x̂ − x̄ = Σ_{i=1}^{K} y_i u_i
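A sketch of Steps 4–5 in NumPy, assuming U holds unit-length eigenvectors as columns, sorted by decreasing eigenvalue (function names are mine):

```python
import numpy as np

def project(x, x_bar, U, K):
    """Coefficients y_i = (x - x_bar)^T u_i for the first K eigenvectors."""
    return U[:, :K].T @ (x - x_bar)

def reconstruct(y, x_bar, U, K):
    """Step 5: approximate x as x_hat = x_bar + sum_{i=1}^{K} y_i u_i."""
    return x_bar + U[:, :K] @ y
```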
Example
• Compute the PCA of the following dataset:
(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)
• The sample covariance matrix is:
• The eigenvalues can be computed by finding the roots of the
characteristic polynomial:
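As a numerical cross-check, a short NumPy script for this example (the exact values printed depend on whether the covariance uses the 1/M or 1/(M−1) divisor; this sketch uses the 1/M convention from the earlier slides):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4),
              (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

Xc = X - X.mean(axis=0)               # center the data
Sigma = (Xc.T @ Xc) / len(X)          # sample covariance matrix (1/M convention)
vals, vecs = np.linalg.eigh(Sigma)    # ascending eigenvalues, unit eigenvectors

print(Sigma)
print(vals[::-1])                     # eigenvalues in decreasing order
print(vecs[:, ::-1])                  # corresponding eigenvectors (columns)
```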
Example (cont’d)
• The eigenvectors are the solutions of the systems:
  Σ_x u_i = λ_i u_i
Note: if u_i is a solution, then c·u_i is also a solution for any c ≠ 0.
Eigenvectors can be normalized to unit length using:
  v̂_i = v_i / ||v_i||
Choosing the projection dimension K ?
• K is typically chosen based on how much information
(variance) we want to preserve:
  Choose the smallest K that satisfies the following inequality:
  ( Σ_{i=1}^{K} λ_i ) / ( Σ_{i=1}^{N} λ_i ) > T,   where T is a threshold (e.g., 0.9)
• If T = 0.9, for example, we "preserve" 90% of the information
(variance) in the data.
• If K = N, then we "preserve" 100% of the information in the
data (i.e., it is just a "change" of basis and x̂ = x).
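The same rule as a small helper, assuming vals holds the eigenvalues in decreasing order (names are mine):

```python
import numpy as np

def choose_k(vals, T=0.9):
    """Smallest K whose leading eigenvalues preserve at least a fraction T of the variance."""
    ratio = np.cumsum(vals) / np.sum(vals)   # cumulative variance ratio
    return int(np.argmax(ratio >= T)) + 1    # first index meeting the threshold (0-based) + 1
```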
Data Normalization
• The principal components are dependent on the units used
to measure the original variables as well as on the range of
values they assume.
• Data should always be normalized prior to using PCA.
• A common normalization method is to transform all the data
to have zero mean and unit standard deviation:
  x_i → (x_i − μ) / σ,   where μ and σ are the mean and standard
deviation of the i-th feature x_i
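A minimal sketch of this normalization in NumPy, applied to the rows-as-samples layout used in the pca() sketch above (names are mine):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)        # per-feature mean
    sigma = X.std(axis=0)      # per-feature standard deviation
    return (X - mu) / sigma
```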