Topic 2: DIMENSIONALITY REDUCTION

STAT 37710/CAAM 37710/CMSC 35400 Machine Learning


Risi Kondor, The University of Chicago

Dimensionality reduction

In ML, data points are often represented as high dimensional real valued vectors

    x = (x1, x2, x3, . . . , xd)⊤ ∈ Rd.

The individual dimensions are called features (attributes).

Example: pixels of an image, a music file, etc.

But is the problem intrinsically high dimensional? Often we can convert high dimensional problems to lower dimensional ones without losing too much information.

Dimensionality reduction

• Real world data often lie on or near lower dimensional structures (manifolds). (Really?)
  ◦ Variables (features) may be correlated or dependent.
  ◦ Physical systems have a small number of degrees of freedom (e.g., pose and lighting in vision).
• IDEA: find the manifold and restrict the learning algorithm to it.

Differentiable manifolds

In mathematics, a d-dimensional manifold is a topological space such that each point has a neighborhood that is homeomorphic to Rd. A differentiable manifold has additional structure, and a Riemannian manifold has a metric too → geodesics.

Dimensionality reduction

Advantages:
• Visualization: humans can only imagine things in 2D or 3D.
• Computational efficiency: learning algorithms work faster in low dimensions.
• Better performance: the projection might eliminate noise.
• Interpretability: the vectors spanning the subspace might have interesting interpretations.

Dimensionality reduction

Dimensionality reduction is a typical unsupervised learning task. Two types:
• Linear:
  ◦ Principal Component Analysis (PCA)
• Nonlinear (“manifold learning”):
  ◦ Multidimensional scaling
  ◦ Locally linear embedding
  ◦ Isomap
  ◦ Laplacian Eigenmaps
  ◦ Stochastic neighbor embedding
  ◦ etc.

Fact 1

If a matrix A ∈ Rd×d is symmetric, then its (normalized) eigenvectors v1, . . . , vd form an orthonormal basis for Rd.

Note: If the eigenvalues are not distinct, then the eigenvectors are not unique. However, there is always some choice of eigenvectors which forms an orthonormal basis.

Fact 2 (Rayleigh quotient)

Let v1, . . . , vd be the normalized eigenvectors of a symmetric matrix A ∈ Rd×d and let λ1 < λ2 < . . . < λd be the corresponding eigenvalues. Then

    argmin_{w ∈ Rd\{0}}  w⊤Aw / ‖w‖²  =  v1.

Similarly,

    argmax_{w ∈ Rd\{0}}  w⊤Aw / ‖w‖²  =  vd.

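A quick numerical sanity check of Fact 2 (a Python/NumPy sketch, not part of the original slides): for a random symmetric matrix, the top eigenvector attains the maximal Rayleigh quotient, and random directions never beat it.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    B = rng.standard_normal((d, d))
    A = (B + B.T) / 2                      # random symmetric matrix

    # eigh returns eigenvalues in ascending order, with orthonormal eigenvectors
    eigvals, eigvecs = np.linalg.eigh(A)
    v_top = eigvecs[:, -1]                 # eigenvector with the largest eigenvalue

    def rayleigh(w):
        return (w @ A @ w) / (w @ w)

    # the top eigenvector attains the maximal Rayleigh quotient (= largest eigenvalue)
    print(rayleigh(v_top), eigvals[-1])
    # random directions give values that never exceed it
    print(max(rayleigh(rng.standard_normal(d)) for _ in range(1000)) <= eigvals[-1] + 1e-12)
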
Principal Component Analysis
The principal directions in data

Finding the principal subspace

How can we find the most relevant subspace for the data? By finding a basis for it. The individual basis vectors are called the principal components.

The first principal component

Given a data set {x1, x2, . . . , xn} of n vectors in Rd, what is the direction that is most informative for this data?

1. First center the data: xi ← xi − µ, where µ = (1/n) ∑_{i=1}^n xi.
2. Find the unit vector p1 that is the solution to

       p1 = argmax_{‖v‖=1} (1/n) ∑_{i=1}^n (xi · v)².        (1)

This vector is called the first principal component of the data.

The first principal component

Theorem. The first principal component, p1, is the eigenvector vd of the sample covariance matrix

    Σ̂ = (1/n) ∑_{i=1}^n xi xi⊤

with largest eigenvalue.

Proof.

    (1/n) ∑_{i=1}^n (xi · v)² = (1/n) ∑_{i=1}^n (v⊤xi)(xi⊤v) = v⊤ ((1/n) ∑_{i=1}^n xi xi⊤) v = v⊤ Σ̂ v.

Since ‖v‖ = 1, (1) is equivalent to the Rayleigh quotient optimization problem

    p1 = argmax_{v ∈ Rd\{0}}  v⊤ Σ̂ v / ‖v‖²,

so p1 is indeed the eigenvector vd of Σ̂ with largest eigenvalue.

Further principal components

Recall that Σ̂ can be written as

    Σ̂ = ∑_{i=1}^d λi vi vi⊤.

After we’ve found the first principal component p1 = vd, project the data to span{v1, . . . , vd−1}. This just removes λd vd vd⊤ from the sum. So the second principal component is p2 = vd−1, and so on.

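A minimal NumPy sketch of the procedure above (the function name and the toy data are mine): center the data, form the sample covariance Σ̂, and take its leading eigenvectors as principal components.

    import numpy as np

    def pca(X, p):
        """X: (n, d) data matrix, p: number of principal components to keep."""
        mu = X.mean(axis=0)
        Xc = X - mu                               # 1. center the data
        Sigma = (Xc.T @ Xc) / X.shape[0]          # sample covariance (1/n) sum_i x_i x_i^T
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]         # sort descending
        P = eigvecs[:, order[:p]]                 # principal components p_1, ..., p_p
        return P, Xc @ P                          # components and p-dimensional projections

    # toy usage: 200 points in R^3 that are nearly one-dimensional
    rng = np.random.default_rng(0)
    X = np.outer(rng.standard_normal(200), [3.0, 2.0, 1.0]) + 0.1 * rng.standard_normal((200, 3))
    P, Y = pca(X, p=1)
    print(P.shape, Y.shape)   # (3, 1) (200, 1)
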
DNA data

[Figure: Matthew Stephens, John Novembre]

Eigenfaces

[Figure: Christopher de Coro]

Reconstruction from eigenfaces

[Figure: Christopher de Coro]

Example: digits

• Often the eigenvalues drop off rapidly (e.g., exponentially).
• Sometimes there is a sharp drop somewhere, called the spectral gap → a natural place to put the cut-off.

[Figure; source: Peter Orbanz]

Summary of PCA

Advantages:
• Finds the best projection
• Rotationally invariant

Disadvantages:
• Full PCA is expensive to compute
• Components are not sparse
• Sensitive to outliers
• Linear

NONLINEAR DIMENSIONALITY REDUCTION

• If the data lies close to a linear subspace of Rd, PCA can find it.
• But what if the data lies on a nonlinear manifold? Data which at first looks very high dimensional often really has low dimensional structure.

General principle

Find a map ϕ : Rd → Rp that maps the manifold to a lower dimensional Euclidean space in a way that preserves local distances as much as possible (some methods can only map the individual data points, not the whole of Rd).

Question: Can this always be done? Depends on the topology.

Methods

• Multidimensional Scaling
• Isomap
• Locally Linear Embedding
• Laplacian Eigenmaps
• SNE, etc.

Multidimensional scaling (MDS)

Classical MDS

• Input: n data points x1, . . . , xn ∈ Rd.
• Output: n corresponding lower dimensional points y1, . . . , yn ∈ Rp (with p ≪ d) that minimize the so-called strain

      E_CMDS = ‖D − D*‖²_Frob = ∑_{i,j} (Di,j − D*i,j)²,

  where Di,j = ‖xi − xj‖² and D*i,j = ‖yi − yj‖².

The Gram matrix

The Gram matrix of {x1, . . . , xn} is the n × n positive semidefinite matrix

    Gi,j = xi · xj.

(Again, we assume that the data has been centered, i.e., ∑i xi = 0.)

[Portrait: Jørgen Pedersen Gram, 1850–1916]

Exercise: Prove that if x1, . . . , xn ∈ Rd, then rank(G) ≤ d.

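A small numerical illustration of the exercise (a sketch, not a proof): for n points in Rd, the n × n Gram matrix G = XX⊤ has rank at most d.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 3
    X = rng.standard_normal((n, d))
    X -= X.mean(axis=0)              # center the data
    G = X @ X.T                      # Gram matrix, G[i, j] = x_i . x_j
    print(np.linalg.matrix_rank(G))  # at most d = 3
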
Classical MDS

Proposition 1. The CMDS problem can equivalently be written as minimizing

    E = ‖G − G*‖²_Frob,

where G is the centered Gram matrix of {x1, . . . , xn} and G* is the Gram matrix of {y1, . . . , yn}.

Approach:
1. Compute the centered Gram matrix G.
2. Solve G* = argmin_{G̃⪰0, rank(G̃)≤p} ‖G̃ − G‖²_Frob.
3. Find y1, y2, . . . , yn ∈ Rp with Gram matrix G*.

Classical MDS

Proposition 2. Let G = QΛQ⊤ be the eigendecomposition of the Gram matrix with Λ = diag(λ1, . . . , λd) and λ1 ≥ . . . ≥ λd. Then

    argmin_{G̃⪰0, rank(G̃)≤p} ‖G̃ − G‖²_Frob = QΛ*Q⊤,

where Λ* = diag(λ1, . . . , λp, 0, 0, . . .).

Exercise: Prove this proposition.

Gram → Data

Proposition 3. Let G ∈ Rn×n be a p.s.d. matrix of rank d with eigendecomposition G = QΛQ⊤. Let xi = [QΛ^{1/2}]i,∗⊤. Then the Gram matrix of {x1, . . . , xn} is G.

Notation:
• Mi,∗ denotes the i’th row of M.
• Given D = diag(d1, . . . , dm), D^p := diag(d1^p, . . . , dm^p).

Exercise: Prove this proposition.

Summary of Classical MDS

1. Compute the centered Gram matrix G (see homework for how).
2. Compute the eigendecomposition QΛQ⊤ of G.
3. Assuming Λ = diag(λ1, . . . , λd) and λ1 ≥ . . . ≥ λd, set Λ* = diag(λ1, . . . , λp, 0, 0, . . .) and G* = QΛ*Q⊤.
4. Let yi = [Q(Λ*)^{1/2}]i,∗⊤.

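A NumPy sketch of steps 1–4 (the double-centering construction of the centered Gram matrix in step 1 is the standard one; on the slides it is deferred to the homework):

    import numpy as np

    def classical_mds(X, p):
        """Classical MDS following steps 1-4 above. X: (n, d) data, p: target dimension."""
        n = X.shape[0]
        D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # squared distances ||x_i - x_j||^2
        J = np.eye(n) - np.ones((n, n)) / n
        G = -0.5 * J @ D2 @ J                 # 1. centered Gram matrix (double centering)
        eigvals, Q = np.linalg.eigh(G)        # 2. eigendecomposition, ascending order
        idx = np.argsort(eigvals)[::-1][:p]   # 3. keep the p largest eigenvalues
        lam = np.clip(eigvals[idx], 0, None)  #    (clip tiny negatives from round-off)
        return Q[:, idx] * np.sqrt(lam)       # 4. rows are the y_i = [Q (Lambda*)^(1/2)]_{i,*}

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 10))
    Y = classical_mds(X, p=2)
    print(Y.shape)                            # (100, 2)
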
Isomap
Tenenbaum, de Silva & Langford, 2000

Isomap

1. Convert the data into a graph (e.g., a symmetrized k-nn graph).
2. Compute all pairs shortest path distances.
3. Use MDS to compute ϕ : Rd → Rp that tries to preserve these distances.

Underlying assumptions:
1. The data lies on a manifold.
2. Geodesic distance on the manifold is approximated by distance in the graph.
3. The optimal embedding preserves these distances as much as possible.

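A self-contained NumPy sketch of the whole Isomap pipeline (the shortest-path step is spelled out on the following slides; the function and parameter names here are mine):

    import numpy as np

    def isomap(X, p=2, k=10):
        """Isomap sketch: k-nn graph -> all-pairs shortest paths -> classical MDS."""
        n = X.shape[0]
        dist = np.sqrt(np.square(X[:, None, :] - X[None, :, :]).sum(-1))

        # 1. symmetrized k-nn graph: keep an edge if i is among j's neighbors or vice versa
        A = np.full((n, n), np.inf)
        nn = np.argsort(dist, axis=1)[:, 1:k + 1]
        for i in range(n):
            A[i, nn[i]] = dist[i, nn[i]]
        A = np.minimum(A, A.T)
        np.fill_diagonal(A, 0.0)

        # 2. all-pairs shortest paths (Floyd-Warshall, vectorized over i and j)
        for m in range(n):
            A = np.minimum(A, A[:, m:m + 1] + A[m:m + 1, :])

        # 3. classical MDS on the squared graph distances
        J = np.eye(n) - np.ones((n, n)) / n
        G = -0.5 * J @ (A ** 2) @ J
        eigvals, Q = np.linalg.eigh(G)
        idx = np.argsort(eigvals)[::-1][:p]
        return Q[:, idx] * np.sqrt(np.clip(eigvals[idx], 0, None))

    # toy usage: noisy circle in R^3, embedded into R^2
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 200)
    X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.standard_normal(200)]
    print(isomap(X, p=2, k=8).shape)          # (200, 2)
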
Shortest path distances

Let G be a weighted graph with vertex set {1, 2, . . . , n} and distances (δi,j)_{i,j=1}^n on the edges. If i and j are not neighbors, then set δi,j = ∞. If i = j, then set δi,j = 0.

The shortest path distance in G from i to j is

    d(i, j) = min_{(v1, v2, . . . , vℓ) ∈ P(i,j)} ∑_{k=1}^{ℓ−1} δ_{vk, vk+1},

where P(i, j) is the set of paths that start at i and end at j (i.e., v1 = i and vℓ = j).

Shortest path distances

Proposition. The matrix D of all pairwise distances (Di,j = d(i, j)) can be computed in O(n³) time.

Proposition. Let D(k) be the matrix of shortest path distances along the restricted set of paths where each intermediate vertex comes from {1, 2, . . . , k}. Then D(k) can be computed from D(k−1) in O(n²) time.

Floyd–Warshall algorithm

INPUT: matrix A with Ai,j = δi,j as on the previous slide

for k = 1 to n {
  for i = 1 to n {
    for j = 1 to n {
      if (Ai,j > Ai,k + Ak,j) then Ai,j ← Ai,k + Ak,j;
    }
  }
}

OUTPUT: matrix A, in which Ai,j is the shortest path distance from vertex i to j

Overall complexity: O(n³).

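The same algorithm as a runnable Python sketch, a direct translation of the pseudocode above:

    import numpy as np

    def floyd_warshall(delta):
        """All-pairs shortest path distances.
        delta: (n, n) array of edge lengths, np.inf for non-neighbors, 0 on the diagonal."""
        A = np.array(delta, dtype=float)
        n = A.shape[0]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    if A[i, j] > A[i, k] + A[k, j]:
                        A[i, j] = A[i, k] + A[k, j]
        return A

    # toy usage: a weighted path graph 0 - 1 - 2
    inf = np.inf
    delta = np.array([[0.0, 1.0, inf],
                      [1.0, 0.0, 2.0],
                      [inf, 2.0, 0.0]])
    print(floyd_warshall(delta))        # the distance from 0 to 2 is 3
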
Isomap example

[Figures]

Properties of Isomap

• One of the first algorithms that can deal with manifolds.
• The topology must still be that of (a patch of) Rp.
• Relatively efficient computation, O(n³).
• Fragile: a single mistake in the k-nn graph can mess up the embedding.
• Not obvious how to set k.

Locally Linear Embedding (LLE)
Roweis & Saul, 2000

LLE

Again trying to find an embedding Rd → Rp, mapping xi ↦ yi. Again start with a k-nn graph based on distances in Rd.

IDEA: Each point should be approximately reconstructable as a linear combination of its neighbors (locally linear property of manifolds):

    xi ≈ ∑_{j ∈ knn(i)} wi,j xj,

where (wi,j)i,j is a matrix of weights. We also have the constraints ∑j wi,j = 1.

Now find an embedding that preserves these weights, i.e., n vectors y1, . . . , yn ∈ Rp such that

    yi ≈ ∑j wi,j yj

for the same matrix of weights.

Phase 1: find the weights

Do this separately for each i. Formulate it as minimizing

    Φ = ‖xi − ∑_{j ∈ knn(i)} wi,j xj‖²   s.t.   ∑j wi,j = 1.

Solution. Thanks to the constraint,

    Φ = ‖∑_{j ∈ knn(i)} wi,j (xi − xj)‖² = w⊤K^(i)w,

where K^(i) is the local Gram matrix, K^(i)_{j,j′} = (xi − xj)⊤(xi − xj′), and w = (wi,j)_{j ∈ knn(i)}.

Phase 1: find the weights

The local optimization problem is

    minimize_w  w⊤K^(i)w   s.t.   w⊤1 = 1.

Introduce the Lagrangian

    L(w, λ) = w⊤K^(i)w − λ(w⊤1 − 1)

and solve

    ∂L/∂wj = [2K^(i)w − λ1]j = 0,   j ∈ knn(i).

Hence w ∝ (K^(i))^{−1}1, and enforcing the constraint gives

    w = (K^(i))^{−1}1 / ‖(K^(i))^{−1}1‖₁.

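A NumPy sketch of the weight computation for a single point i (the small ridge term added to K^(i) is a standard regularization, needed because the local Gram matrix is singular whenever k > d; it is not discussed on the slide):

    import numpy as np

    def lle_weights(X, i, neighbors, reg=1e-3):
        """Reconstruction weights w for point x_i from its neighbors (index array `neighbors`)."""
        Z = X[i] - X[neighbors]                  # rows are x_i - x_j, j in knn(i)
        K = Z @ Z.T                              # local Gram matrix K^(i)
        K += reg * np.trace(K) * np.eye(len(neighbors))   # regularize (K is often singular)
        w = np.linalg.solve(K, np.ones(len(neighbors)))   # proportional to K^(-1) 1
        return w / w.sum()                       # enforce the constraint sum_j w_ij = 1

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 3))
    print(lle_weights(X, i=0, neighbors=np.array([1, 2, 3, 4])).sum())   # 1.0
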
Phase 2: find the yi’s

Now minimize (w.r.t. y1, . . . , yn)

    Ψ = ∑i ‖yi − ∑j wi,j yj‖²   s.t.   ∑i yi = 0,   (1/n) ∑i yi yi⊤ = I.

Solution.

    Ψ = ∑_{i,j} Mi,j yi⊤yj . . .

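The remaining algebra is not carried out here; a sketch of the standard solution from Roweis & Saul: with M = (I − W)⊤(I − W), the constrained minimizer of Ψ is given by the eigenvectors of M with the smallest non-zero eigenvalues.

    import numpy as np

    def lle_embedding(W, p):
        """Given the (n, n) weight matrix W (rows sum to 1, zero outside knn(i)),
        return the p-dimensional LLE embedding; rows of the result are the y_i."""
        n = W.shape[0]
        M = (np.eye(n) - W).T @ (np.eye(n) - W)
        eigvals, eigvecs = np.linalg.eigh(M)      # ascending eigenvalues
        # discard the constant eigenvector (eigenvalue ~ 0), keep the next p;
        # scaling by sqrt(n) makes (1/n) sum_i y_i y_i^T = I hold exactly
        return eigvecs[:, 1:p + 1] * np.sqrt(n)
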
Laplacian Eigenmaps
Belkin and Niyogi, 2002

Spectral Graph Theory

Spectral graph theory is about relating functions on graphs (i.e., f : V → R, where V is the vertex set of the graph) to the structure of the graph.

Unweighted graphs

Let G be an unweighted, undirected graph with vertex set V = {1, 2, . . . , n} and edge set E ⊆ V × V.

• The adjacency matrix of G is the matrix A ∈ {0, 1}^{n×n} with

      Ai,j = 1 if i ∼ j, and Ai,j = 0 otherwise,

  where i ∼ j means that vertices i and j are adjacent.
• The degree matrix of G is D = diag(d(1), d(2), . . . , d(n)), where d(i) is the degree (number of neighbors) of vertex i.
• The Laplacian matrix of G is

      L = D − A.

Laplacian as a quadratic form

The Laplacian can be written as

    L = ∑_{i∼j} Ei,j,   where   [Ei,j]p,q =  1 if p = q = i or p = q = j,
                                            −1 if (p, q) = (i, j) or (p, q) = (j, i),
                                             0 otherwise.

Therefore we have the fundamental identity that for any f ∈ Rn,

    f⊤Lf = ∑_{i∼j} (f(i) − f(j))².

Equivalently (and confusingly),

    f⊤Lf = (1/2) ∑_{(i,j)∈E} (f(i) − f(j))².

Exercise: Prove that L is a psd matrix.

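A quick numerical check of the identity (a sketch on a small random graph):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    A = rng.integers(0, 2, size=(n, n))
    A = np.triu(A, 1)
    A = A + A.T                               # random symmetric 0/1 adjacency, no self-loops
    D = np.diag(A.sum(axis=1))                # degree matrix
    L = D - A                                 # graph Laplacian

    f = rng.standard_normal(n)
    lhs = f @ L @ f
    # sum over unordered pairs i ~ j of (f(i) - f(j))^2
    rhs = 0.5 * np.sum(A * (f[:, None] - f[None, :]) ** 2)
    print(np.isclose(lhs, rhs))               # True; L is psd since the rhs is >= 0
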
Weighted graphs

Let G be a weighted, undirected graph with edge weights (wi,j)i,j. Note that wi,j = wj,i, and if i ≁ j, then wi,j = 0.

• The adjacency matrix of G is the matrix A ∈ (R+)^{n×n} with

      Ai,j = wi,j if i ≠ j, and Ai,j = 0 if i = j.

• The degree matrix of G is D = diag(d(1), d(2), . . . , d(n)), where d(i) = ∑_{j≠i} wi,j.
• The Laplacian matrix of G is

      L = D − A.

The normalized Laplacian

When the degree distribution is uneven, it is often much better to work with the normalized Laplacian

    L̃ = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2}.

Example: cycle graph

The Laplacian eigenvectors of the n-cycle are

    fk(vi) = sin(2πki/n),   k = 1, 2, . . . , ⌊n/2⌋,
    gk(vi) = cos(2πki/n),   k = 0, 1, 2, . . . , ⌊(n − 1)/2⌋.

Example: path graph

[Figure]

Connectivity

Theorem. The multiplicity of 0 in the spectrum of L (i.e., the number of zero eigenvalues) is the number of connected components of G.

Fiedler vector

Let λ1 ≤ λ2 ≤ . . . ≤ λn be the eigenvalues of L, and v1, v2, . . . , vn the corresponding normalized eigenvectors.

• By the above, λ1 = 0 and v1 = (1/√n) 1 for any graph.
• The second eigenvector, v2, is called the Fiedler vector, and is particularly informative about how to cluster the graph.

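A sketch illustrating both statements on a toy graph of my own (two triangles joined by a bridge): the multiplicity of the zero eigenvalue counts the connected components, and the sign pattern of the Fiedler vector splits the graph across the bridge.

    import numpy as np

    def laplacian(A):
        return np.diag(A.sum(axis=1)) - A

    # two triangles {0,1,2} and {3,4,5} joined by the single edge 2-3
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
        A[i, j] = A[j, i] = 1.0

    eigvals, eigvecs = np.linalg.eigh(laplacian(A))
    print(np.sum(np.isclose(eigvals, 0)))     # 1 zero eigenvalue: the graph is connected

    # remove the bridge: two components, hence two zero eigenvalues
    A2 = A.copy()
    A2[2, 3] = A2[3, 2] = 0.0
    print(np.sum(np.isclose(np.linalg.eigvalsh(laplacian(A2)), 0)))   # 2

    # the Fiedler vector of the connected graph separates the two triangles by sign
    fiedler = eigvecs[:, 1]
    print(np.sign(fiedler))                   # e.g. one triangle negative, the other positive
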
Cheeger’s inequality

Let S ⊂ V, S̄ = V \ S, and E(S, S̄) = ∑_{i∈S} ∑_{j∈S̄} wi,j. Further, for any W ⊆ V, let d(W) = ∑_{i∈W} d(i).

• The conductance of S is defined as

      ϕ(S) = d(V) E(S, S̄) / (d(S) d(S̄)).

• The conductance of the whole graph is ϕG = min_{S⊂V} ϕ(S).
• Cheeger’s inequality states that

      ϕG² / (2 dmax) ≤ λ2 ≤ ϕG,

  where dmax is the maximum degree of any vertex in G.

Example

The first few eigenvectors can be used for clustering → spectral graph partitioning.

The Laplace–Beltrami operator

The graph Laplacian is the discrete analog of the Laplace–Beltrami operator.

• The Laplacian operator on Rd is

      ∆ = ∇² = ∂²/∂x1² + ∂²/∂x2² + . . . + ∂²/∂xd².

• More generally, the Laplace–Beltrami operator on a d-dimensional Riemannian manifold with metric tensor g is

      ∆ = (1/√(det g)) ∑_{i,j=1}^d ∂i (√(det g) g^{i,j} ∂j).

The graph Laplacian can be regarded as a discretization of these operators.

Discretization of the Laplacian

• In R, the (finite difference) discretization of ∇ = ∂/∂x is derived from

      (∇f)(x) = ∂f(x)/∂x = (f(x + h/2) − f(x − h/2)) / h.

• The discretization of ∆ is derived from

      (∆f)(x) = (∇(∇f))(x) = ((∇f)(x + h/2) − (∇f)(x − h/2)) / h
              = (f(x − h) − 2f(x) + f(x + h)) / h².

If we regard f as a vector, f = (. . . , f(x − h), f(x), f(x + h), . . .)⊤, then the latter is just −Lf/h², where L is the Laplacian of the line graph. Similarly for grids on Rd. ⟨f, ∆f⟩ is a natural measure of the roughness of f → sheds new light on L as a quadratic form.

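A small sketch checking this correspondence on a uniform grid: −Lf/h² at interior vertices of the path graph approximates the second derivative of a smooth f.

    import numpy as np

    n, h = 200, 0.01
    x = np.arange(n) * h
    f = np.sin(2 * np.pi * x)

    # Laplacian of the path (line) graph on n vertices
    A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    L = np.diag(A.sum(axis=1)) - A

    second_derivative = -(L @ f) / h ** 2
    exact = -(2 * np.pi) ** 2 * np.sin(2 * np.pi * x)
    # compare away from the boundary vertices, where the path graph has degree 1
    print(np.max(np.abs(second_derivative[1:-1] - exact[1:-1])))   # small, O(h^2)
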
The heat equation

The flow of heat in a homogeneous medium is governed by the equation

    ∂f(x, t)/∂t = κ ∆f(x, t).

∆ is a negative definite self-adjoint operator. The solution is

    f(x, t) = e^{κt∆} f(x, 0),   where   e^T := I + T + (1/2)T² + (1/6)T³ + . . . .

In particular, if our domain M is compact, then the eigenfunctions of ∆, i.e., ∆gi = λi gi, form a basis for functions on M, and

    f(x, 0) = ∑i αi gi   ⟹   f(x, t) = ∑i e^{λi κt} αi gi.

The long time behavior of the system is determined by the low |λi| modes!

Laplacian Eigenmaps
[Belkin & Niyogi]

• Turn dimensionality reduction into a graph problem by forming a k-nn mesh, possibly weighted by

      wi,j = exp(−‖xi − xj‖² / (2σ²)).

• Embed according to the eigenvectors of the first p non-zero eigenvalues:

      ϕ : V → Rp,   i ↦ (v2(i), . . . , vp+1(i))⊤.

• Intuition: these are the smoothest functions on the graph, and they give global coordinates.

Laplacian Eigenmaps: detail

Formulate the problem as minimizing the strain

    E = ∑_{i,j} wi,j ‖yi − yj‖² = 2 tr(Y⊤LY).

Adding the additional constraint Y⊤DY = I, after some algebra this leads to the generalized eigenvalue problem Lv = λDv.

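A sketch of the full pipeline (Gaussian-weighted k-nn graph, then the generalized eigenvalue problem Lv = λDv via scipy.linalg.eigh; the parameters k and σ are tuning choices of mine):

    import numpy as np
    from scipy.linalg import eigh

    def laplacian_eigenmaps(X, p=2, k=10, sigma=1.0):
        n = X.shape[0]
        d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)    # squared distances

        # symmetrized k-nn graph with Gaussian weights
        nn = np.argsort(d2, axis=1)[:, 1:k + 1]
        W = np.zeros((n, n))
        for i in range(n):
            W[i, nn[i]] = np.exp(-d2[i, nn[i]] / (2 * sigma ** 2))
        W = np.maximum(W, W.T)

        D = np.diag(W.sum(axis=1))
        L = D - W
        # generalized eigenvalue problem L v = lambda D v, eigenvalues ascending
        eigvals, eigvecs = eigh(L, D)
        # skip the constant eigenvector (eigenvalue 0), keep the next p
        return eigvecs[:, 1:p + 1]

    # toy usage: noisy circle in R^3, embedded into R^2
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 300)
    X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.standard_normal(300)]
    print(laplacian_eigenmaps(X, p=2, k=8).shape)    # (300, 2)
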
Three different metrics

Laplacian eigenmaps corresponds to PCA w.r.t. the diffusion metric on the manifold, because the diffusion (heat) kernel is exactly e^{−βL} [Kondor and Lafferty, 2001].