FEM 2063 - Data Analytics
CHAPTER 5: Dimensionality Reduction Methods
At the end of this chapter, students should be able to understand dimensionality reduction methods.
OVERVIEW
➢Singular Value Decomposition (SVD)
➢Principal Components Analysis (PCA)
5.0 Unsupervised Learning – Dimension Reduction
Datasets in the form of matrices
We are given n objects and p features describing
the objects.
Dataset
An n-by-p matrix A
n rows representing n objects
Each object has p numeric values describing it.
Goal
1. Understand the structure of the data, e.g., the
underlying process generating the data.
2. Reduce the number of features representing
the data
5.0 Unsupervised Learning
Example – Market basket matrices
The matrix A has n rows (one per customer) and p columns (one per product, e.g., milk, bread, rice), where Aij = quantity of the j-th product purchased by the i-th customer.
Aim: find a subset of the products that characterize customer behavior.
5.0 Unsupervised Learning
Dimensionality reduction methods:
• Singular Value Decomposition (SVD)
• Principal Components Analysis (PCA)
• Canonical Correlation Analysis (CCA)
• Multi-dimensional scaling (MDS)
• Independent component analysis (ICA)
Overview
➢Singular Value Decomposition (SVD)
➢Principal Components Analysis (PCA)
SVD – general overview
The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD is widely used both in the calculation of other matrix operations, such as the matrix inverse, and as a data reduction method in machine learning. Data matrices have n rows (one for each object) and p columns (one for each feature).
SVD – general overview
Singular Value Decomposition (SVD) is a widely used technique to decompose a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix.
For an m x n data matrix A, the decomposition is A = U S V^T, where:
• U is an m x m matrix; its columns are the left singular vectors, the orthonormal eigenvectors of A A^T.
• S is an m x n diagonal matrix; its nonzero diagonal entries are the singular values, the square roots of the eigenvalues of A^T A (equivalently of A A^T), arranged in descending order.
• V^T is an n x n matrix; its rows are the right singular vectors, and the columns of V are the orthonormal eigenvectors of A^T A.
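As a concrete illustration (not taken from the slides), a minimal NumPy sketch with a made-up 4 x 3 data matrix shows the shapes and the reconstruction A = U S V^T described above:

import numpy as np

# A hypothetical 4 x 3 data matrix (4 objects, 3 features)
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [2.0, 2.0, 2.0]])

# full_matrices=True returns U as m x m and Vt as n x n, matching the description above
U, s, Vt = np.linalg.svd(A, full_matrices=True)

print(U.shape)   # (4, 4)  columns of U = left singular vectors (eigenvectors of A A^T)
print(s)         # singular values in descending order
print(Vt.shape)  # (3, 3)  rows of Vt = right singular vectors (eigenvectors of A^T A)

# Rebuild A to confirm A = U S V^T
S = np.zeros_like(A)
np.fill_diagonal(S, s)
print(np.allclose(A, U @ S @ Vt))  # True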
SVD – Singular values
What are the singular values of a matrix?
The singular values are the diagonal entries of the S matrix and are arranged in descending order. The singular values are always non-negative real numbers. If the matrix A is real, then U and V are also real. The singular values are the square roots of the eigenvalues of A^T A.
σ1: measures how much of the data variance is explained by the first singular vector.
1st (right) singular vector: direction of maximal variance.
σ2: measures how much of the data variance is explained by the second singular vector.
2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.
Why SVD
Developing SVD – step by step
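The worked derivation on the slide is not reproduced here, but a minimal NumPy sketch of the usual construction (eigen-decompose A^T A to get V and the singular values, then recover U from AV = US) might look like this; the 3 x 2 matrix is made up for illustration and assumed to have full column rank:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])   # example 3 x 2 matrix (made up for illustration)

# Step 1: eigen-decompose A^T A (symmetric, so eigh is appropriate)
eigvals, V = np.linalg.eigh(A.T @ A)

# Step 2: sort eigenvalues (and eigenvectors) in descending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Step 3: singular values are the square roots of the eigenvalues of A^T A
s = np.sqrt(np.clip(eigvals, 0, None))

# Step 4: columns of U follow from A v_i = sigma_i u_i (assumes all sigma_i > 0)
U = (A @ V) / s

print(np.allclose(A, U @ np.diag(s) @ V.T))  # True: A = U S V^T (thin form)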
SVD – Example
Note:
5 – most preferred
0 – not preferred
SVD – Example
(Figure: ratings matrix for Users 1–7 and its decomposition; the 1st component represents 53.44% of the dataset, the 2nd 40.95%, and the 3rd 5.61%.)
SVD – Example (Users-to-Movies)
(Figure: the reduced representation of the ratings matrix for Users 1–7, which represents about 90% of the dataset.)
SVD – Example (Users-to-Movies)
Conclusion: the percentage of variance explained (explained variance ratio) by each component differs: 53.44% of the variance is explained by the 1st component, 40.95% by the 2nd component, and 5.61% by the 3rd component.
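Those percentages are each component's share of the total "energy" of the singular values. A hedged sketch of that calculation (the actual ratings matrix from the slide is not reproduced, so a placeholder users-to-movies matrix is used; one common convention squares the singular values):

import numpy as np

# Placeholder users-to-movies ratings matrix (not the slide's data)
ratings = np.array([[5, 5, 0, 0],
                    [4, 5, 0, 0],
                    [5, 4, 0, 1],
                    [0, 0, 5, 4],
                    [0, 1, 4, 5],
                    [0, 0, 4, 4],
                    [1, 0, 5, 5]], dtype=float)

s = np.linalg.svd(ratings, compute_uv=False)

# Fraction of variance ("energy") captured by each component: sigma_i^2 / sum(sigma_j^2)
explained = s**2 / np.sum(s**2)
print(np.round(explained, 4))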
SVD – Example (Users-to-Movies)
What can you observe between the original matrix and the matrix after SVD?
(Figure: original matrix vs. matrix reconstructed after SVD.)
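What the comparison illustrates is a low-rank reconstruction: keep only the top k singular values and vectors, then rebuild the matrix. A minimal sketch (placeholder data, not the slide's matrix):

import numpy as np

def rank_k_approximation(A, k):
    """Reconstruct A keeping only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.default_rng(0).integers(0, 6, size=(7, 5)).astype(float)  # placeholder ratings
A2 = rank_k_approximation(A, k=2)

# The approximation is close to A, but every entry is now generated
# from just two singular vectors per side
print(np.round(A2, 2))
print("approximation error:", np.linalg.norm(A - A2))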
SVD Applications
➢ Using the SVD in computations, rather than the matrix A itself, has the advantage of being more robust to numerical error
➢ The SVD is usually found by iterative methods
➢ Other applications:
➢ Inverse of a matrix A
➢ Condition number of a matrix
➢ Image compression
➢ Solving Ax = b in all cases (unique, many, or no solutions)
➢ Rank determination, matrix approximation
Note: the rank of a matrix is defined as
(a) the maximum number of linearly independent column vectors in the matrix, or
(b) the maximum number of linearly independent row vectors in the matrix. Both definitions are equivalent.
For an r x c matrix, if r is less than c, then the maximum rank of the matrix is r.
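One of the applications listed above, solving Ax = b in the unique, many and no-solution cases, is usually handled through the SVD-based pseudo-inverse. A hedged NumPy sketch with a made-up overdetermined system:

import numpy as np

# x = V S^+ U^T b, where S^+ inverts only the nonzero singular values
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # overdetermined example (no exact solution in general)
b = np.array([1.0, 2.0, 2.0])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)      # invert only the nonzero singular values
x = Vt.T @ (s_inv * (U.T @ b))                 # least-squares solution

print(x)
print(np.allclose(x, np.linalg.pinv(A) @ b))   # same as using the pseudo-inverse directly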
SVD Applications and Example
How is SVD used in image compression?
In this method, the digital image is given to the SVD, which refactors the image into three matrices. The singular values are used to refactor the image, and at the end of this process the image is represented with a smaller set of values, reducing the storage space required by the image.
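A minimal sketch of that idea, assuming the image is already available as a 2-D grayscale array (a synthetic placeholder pattern is used here instead of a real photo):

import numpy as np

def compress_image(img, k):
    """Keep only the k largest singular values of a grayscale image (2-D array)."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    # Storage drops from m*n values to roughly k*(m + n + 1)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
img = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64)) + 0.05 * rng.normal(size=(64, 64))
approx = compress_image(img, k=5)
print("reconstruction error:", np.linalg.norm(img - approx))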
SVD – Example computation
5.2 SVD - Advantages
➢SVD is stable: a small change in the input results in only a small change in the singular values
➢Compression speed with SVD is also high
➢The decomposition provides a low-rank approximation to A
➢There exist efficient, stable algorithms to compute the SVD
Overview
➢Singular Value Decomposition (SVD)
➢Principal Components Analysis (PCA)
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a "dimensionality reduction" method. It reduces a set of variables that are correlated with each other into fewer independent variables without losing the essence of these variables. It provides an overview of the linear relationships between the input variables.
Principal Component Analysis (PCA)
What is PCA?
PCA is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is often used to simplify data, reduce noise, and find unmeasured "latent variables". It finds directions of maximal variance in the data, directions that are mutually orthogonal. The relationship between variance and information is: the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries.
What are PC1 and PC2 in PCA?
Principal components are created in order of the amount of variation they cover: PC1 captures the most variation, PC2 the second most, and so on. Each of them contributes some information about the data, and in a PCA there are as many principal components as there are characteristics.
Principal Component Analysis (PCA)
Why PCA?
PCA is a popular technique for analyzing large
datasets containing a high number of
dimensions/features per observation, increasing the
interpretability of data while preserving the
maximum amount of information, and enabling
the visualization of multidimensional data.
SVD vs PCA
What are the differences/similarities between SVD and PCA?
• SVD and PCA are two eigenvalue methods used to reduce a high-dimensional data set into fewer dimensions while retaining important information. As PCA uses the SVD in its calculation, there is clearly some 'extra' analysis done.
• SVD gives you the whole nine yards of diagonalizing a matrix into special matrices that are easy to manipulate and to analyze. It lays down the foundation for untangling data into independent components. PCA skips the less significant components.
SVD vs PCA
Now you may compare the scores. Any idea?
PCA scores are:
• mean centred
• uncorrelated
5.3 How to construct PCA
Steps (illustrated in the sketch below):
1. Scale / normalize the data (A).
2. Calculate the covariance matrix from the dataset.
3. Find the eigenvalues and eigenvectors of the covariance matrix.
4. Find the PCs using the SVD: A = USV', so AV = US = PC, the principal components.
5. Compare the variance of each PC in the dataset.
6. The PCs with the highest variance are the ones to use in the modeling.
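A compact NumPy sketch of these six steps on made-up data (a real analysis would more likely use a library such as scikit-learn, but the steps are the same):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))            # made-up data: 20 objects, 3 features

# Step 1: scale / normalize each feature
A_std = (A - A.mean(axis=0)) / A.std(axis=0)

# Step 2: covariance matrix of the scaled data
C = np.cov(A_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: the same components via SVD: A = U S V', scores = A V = U S
U, s, Vt = np.linalg.svd(A_std, full_matrices=False)
scores = A_std @ Vt.T
print(np.allclose(scores, U * s))       # AV = US

# Steps 5-6: compare the variance explained and keep the largest components
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))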
PCA – Assumptions
Assumptions of PCA
1. Independent variables are highly correlated with each other. The correlation coefficient, r, tells us about the strength and direction of the linear relationships between variables. In the case of more than two variables, use the correlation matrix.
2. Variables included are at the metric level or nominal level.
3. Features are low dimensional in nature.
4. Independent variables are numeric in nature.
When to use PCA?
• Whenever we want to ensure that the variables in the data are independent of each other.
• When we want to reduce the number of variables in a data set with many variables.
• When we want to interpret the data and perform variable selection on it.
How to construct PCA? STEP 1: Scale / Normalize the Data (A)
Do I need to scale before PCA?
Yes, it is necessary to normalize the data before performing PCA, because PCA calculates a new projection of the data set:
• PCA is sensitive to scale.
• PCA should be applied to data that have approximately the same scale in each variable.
After normalizing the data, all variables have the same standard deviation, so all variables have the same weight and PCA calculates the relevant axes.
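A minimal standardization sketch on made-up data (scikit-learn's StandardScaler performs the same z-scoring):

import numpy as np

X = np.array([[170.0, 65.0],
              [160.0, 72.0],
              [180.0, 80.0]])          # made-up data on very different scales

# Centre each column and divide by its standard deviation (z-scores)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))   # ~0 for every variable
print(X_scaled.std(axis=0))    # 1 for every variable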
How to construct PCA? STEP 2: Covariance Matrix Computation
The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other. Variables can be highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
• Can variance alone tell us the orientation of the data? No.
• The orientation of the data is captured by the covariance.
• The covariance can show positive and negative correlation: "covariance" indicates the direction of the linear relationship between variables.
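A small sketch of the covariance computation on made-up data:

import numpy as np

X = np.array([[2.0, 4.0, 1.0],
              [3.0, 6.0, 0.0],
              [5.0, 9.0, -1.0],
              [6.0, 12.0, -2.0]])      # made-up data: 4 observations, 3 variables

C = np.cov(X, rowvar=False)            # 3 x 3 covariance matrix
print(np.round(C, 2))
# Positive off-diagonal entries = variables move together,
# negative entries = they move in opposite directions.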
How to construct PCA?
STEP 3: Compute the Eigenvectors and Eigenvalues of the Covariance Matrix to Identify the Principal Components
Eigenvectors and eigenvalues are computed from the covariance matrix in order to determine the principal components of the data. The principal components allow us to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as the new variables.
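Continuing the sketch, the eigenvalue and eigenvector step on a (made-up) covariance matrix might look like this:

import numpy as np

C = np.array([[ 1.00,  0.85, -0.60],
              [ 0.85,  1.00, -0.50],
              [-0.60, -0.50,  1.00]])       # made-up covariance (correlation) matrix

eigvals, eigvecs = np.linalg.eigh(C)        # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]           # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.round(eigvals, 3))                 # variance along each component
print(np.round(eigvals / eigvals.sum(), 3)) # proportion of variance explained
# Discard the trailing columns of eigvecs to reduce dimensionality.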
PCA - Example
Understanding USArrests data using PCA by Hemang Goswami
This data set contains arrests per 100,000 residents for assault,
murder, and rape in each of the 50 US states. Also given is the
percent of the population living in urban areas.
A data frame with 50 observations on 4 variables.
•Murder numeric Murder arrests (per 100,000)
•Assault numeric Assault arrests (per 100,000)
•UrbanPop numeric Percent urban population
•Rape numeric Rape arrests (per 100,000)
The data matrix is 50 states x 4 variables; PCA will be used to summarize it with fewer components.
Link: https://rstudio-pubs-static.s3.amazonaws.com/377338_75ed92a8463d482a80045abcae0e395d.html
PCA - Example
Data Structure and Data Summary: the data values are not on the same scales, and the means, medians and variances are not in the same range.
Link: https://rstudio-pubs-static.s3.amazonaws.com/377338_75ed92a8463d482a80045abcae0e395d.html
PCA - Example
(Figure: correlation matrix before scaling vs. correlation matrix after scaling.)
PCA - Example
Principal components
(Figure: means and variances of the scaled variables, and the rotation matrix.)
The rotation matrix provides the principal component loading vectors. We create the principal components for the four variables (PC1, PC2, PC3 and PC4) to explain the variance in the dataset without including the correlation between variables.
The amount of variance explained by each principal component:
PCA - Choosing the number of required Principal components
The percentage of variance explained by each principal component:
62% of the variance is explained by the first principal component, 25% by the second, 9% by the third, and the remaining 4% by the last principal component.
Hence a large proportion of the variance is explained by the first 2 principal components.
The point after which the explained variation starts to drop off is called the elbow point. A fair amount of variance is explained by the first two principal components, and there is an elbow after the second component. The third principal component explains less than 10% of the variance and the last is almost negligible. Hence, we decide to go with two principal components.
(Figure: scree plot; the elbow point occurs after the second component.)
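A short sketch of how these proportions and their cumulative total might be inspected in code, using the percentages quoted above rather than re-deriving them from the USArrests data:

import numpy as np

explained = np.array([0.62, 0.25, 0.09, 0.04])   # proportions quoted above
cumulative = np.cumsum(explained)

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.0%} explained, {c:.0%} cumulative")
# PC1 and PC2 together cover ~87% of the variance, which is the "elbow"
# argument for keeping only the first two components.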
PCA - Example
Creating 2 principal components
The PC values here represent the loading vectors for the components.
Checking the weights of 2 principal components, we see that:
•The first loading vector places approximately equal weight on Assault, Murder, and
Rape, with much less weight on UrbanPop. Hence this component roughly
corresponds to a measure of overall rates of serious crimes.
•The second loading vector places most of its weight on UrbanPop and much less
weight on the other three features. Hence, this component roughly corresponds to
the level of urbanization of the state.
PCA - Example
The biplot shows the 50 states mapped onto the 2 principal components. The PCA vectors for the 4 variables are also plotted.
•States with large positive scores on the first component, such as California, Nevada and Florida, have high crime rates, while states like North Dakota, with negative scores on the first component, have low crime rates.
•California also has a high score on the second
component, indicating a high level of urbanization,
while the opposite is true for states like Mississippi.
•States close to zero on both components, such as
Indiana, have approximately average levels of both
crime and urbanization.
PCA - Example
Checking the principal component scores for all 50 states: the scores are obtained by multiplying the original (scaled) data by the basis (loading) vectors.
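That multiplication is just a matrix product. A hedged sketch (the scaled data here are random stand-ins and the loading values are illustrative, not the exact USArrests loadings):

import numpy as np

X_scaled = np.random.default_rng(2).normal(size=(50, 4))   # stand-in for the scaled USArrests data

# Illustrative loading matrix (4 variables x 2 components); real values come from the PCA fit
loadings = np.array([[ 0.54, -0.42],
                     [ 0.58, -0.19],
                     [ 0.28,  0.87],
                     [ 0.54,  0.17]])

scores = X_scaled @ loadings        # one row of scores per state, one column per component
print(scores.shape)                 # (50, 2)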
PCA - Example
Modeling a Data Set
A data set of foods commonly consumed in different European countries. The figure displays the score plot of the first two principal components, called t1 and t2. The score plot is a map of 16 countries. Countries close to each other have similar food consumption profiles, whereas those far from each other are dissimilar. The Nordic countries (Finland, Norway, Denmark and Sweden) are located together in the upper right-hand corner, thus representing a group of nations with some similarity in food consumption. Belgium and Germany are close to the center (origin) of the plot, which indicates they have average properties.
(Figure caption: The PCA score plot of the first two PCs of a data set about food consumption profiles. This provides a map of how the countries relate to each other. The first component explains 32% of the variation, and the second component 19%. Colored by geographic location (latitude) of the respective capital city.)
PCA - Example
The distance to the origin also conveys information. The further away from the plot origin a variable
lies, the stronger the impact that variable has on the model. This means, for instance, that the
variables crisp bread (Crisp_br), frozen fish (Fro_Fish), frozen vegetables (Fro_Veg) and garlic
(Garlic) separate the four Nordic countries from the others. The four Nordic countries are
characterized as having high values (high consumption) of the former three provisions, and low
consumption of garlic. Moreover, the model interpretation suggests that countries like Italy, Portugal,
Spain and to some extent, Austria have high consumption of garlic, and low consumption of
sweetener, tinned soup (Ti_soup) and tinned fruit (Ti_Fruit).
PCA - Applications
1. Image compression: images can be resized as per the requirement and patterns can be determined.
2. Customer profiling based on demographics as well as their interest in the purchase.
3. Widely used by researchers in the food science field.
4. The banking field, in many areas such as applicants applying for loans, credit cards, etc.
5. Customer perception towards brands.
6. The finance field, to analyze stocks quantitatively and forecast portfolio returns, and also in interest rate implementation.
7. Healthcare industries, in multiple areas like patient insurance data, where there are multiple sources of data with a huge number of variables.
PCA - Summary
Principal component analysis, or PCA, is a statistical procedure that allows you
to summarize the information content in large data tables by means of a smaller
set of “summary indices” that can be more easily visualized and analyzed.
Strengths
•Easy to compute.
•Speeds up other machine learning algorithms.
•Counteracts the issues of high-dimensional
data.
Limitations
PCA is sensitive to outliers: such data points can produce results that are far off the correct projection of the data. PCA also has limitations when it comes to interpretability: since we are transforming the data, features lose their original meaning.
Limitations of PCA
➢ If the data does not follow a multidimensional normal (Gaussian) distribution, PCA may not give the best principal components
➢ Cannot fit data that is not linear
➢ The direction of maximum variance is not always good for classification
➢ If the data is a set of indicator strings, e.g. (1,0,0,0,...), (0,1,0,0,...), ..., (0,0,0,...,1), then the eigenvalues do not fall off as PCA requires