The Kernel Trick, Gram Matrices, and Feature Extraction


CS6787 Lecture 4 — Fall 2017
Momentum for Principal Component Analysis
CS6787 Lecture 3.1 — Fall 2017
Principal Component Analysis
• Setting: find the dominant eigenvalue-eigenvector pair of a positive
semidefinite symmetric matrix A.
$$u_1 = \arg\max_x \frac{x^T A x}{x^T x}$$

• Many ways to write this problem, e.g. (where $\|B\|_F$ is the Frobenius norm, $\|B\|_F^2 = \sum_i \sum_j B_{i,j}^2$)

$$\sqrt{\lambda_1}\, u_1 = \arg\min_x \left\| x x^T - A \right\|_F^2$$
PCA: A Non-Convex Problem
• PCA is not convex in any of these formulations

• Why? Think about the solutions to the problem: u and –u


• Two distinct solutions → can’t be convex

• Can we still use momentum to run PCA more quickly?


Power Iteration
• Before we apply momentum, we need to choose what base algorithm
we’re using.

• Simplest algorithm: power iteration


• Repeatedly multiply by the matrix A to get an answer

$$x_{t+1} = A x_t$$
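To make the update concrete, here is a minimal NumPy sketch of power iteration (my own illustration, not code from the course; the example matrix and iteration count are arbitrary). In practice the iterate is normalized each step, which only rescales it and does not change its direction.

```python
import numpy as np

def power_iteration(A, num_iters=100, seed=0):
    """Estimate the dominant eigenvector of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)   # rescale to avoid overflow/underflow; direction is unchanged
    return x

# Example usage on a random PSD matrix
B = np.random.default_rng(1).standard_normal((5, 5))
A = B @ B.T                       # symmetric positive semidefinite
u1 = power_iteration(A)
lambda1 = u1 @ A @ u1             # Rayleigh quotient estimate of the top eigenvalue
```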
Why does Power Iteration Work?
• Let the eigendecomposition of A be $A = \sum_{i=1}^n \lambda_i u_i u_i^T$

• For $\lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n \ge 0$

• Power iteration converges in direction because the cosine-squared of the angle to $u_1$ is

$$\cos^2(\theta) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{(u_1^T A^t x_0)^2}{\|A^t x_0\|^2} = \frac{\lambda_1^{2t} (u_1^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2} = 1 - \frac{\sum_{i=2}^n \lambda_i^{2t} (u_i^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2} = 1 - \Omega\!\left(\left(\frac{\lambda_2}{\lambda_1}\right)^{2t}\right)$$
What about a more general algorithm?
• Use both current iterate, and history of past iterations

$$x_{t+1} = \alpha_t A x_t + \beta_{t,1} x_{t-1} + \beta_{t,2} x_{t-2} + \cdots + \beta_{t,t} x_0$$

• for fixed parameters α and β

• What class of functions can we express in this form?

• Notice: xt is always a degree-t polynomial in A times x0


• Can prove by induction that we can express ANY polynomial
Power Iteration and Polynomials
• Can also think of power iteration as a degree-t polynomial of A

$$x_t = A^t x_0$$

• Is there a better degree-t polynomial to use than $f_t(x) = x^t$?

• If we use a different polynomial, then we get

$$x_t = f_t(A) x_0 = \sum_{i=1}^n f_t(\lambda_i)\, u_i u_i^T x_0$$

• Ideal solution: choose a polynomial with zeros at all non-dominant eigenvalues

• Practically, make $f_t(\lambda_1)$ as large as possible while keeping $|f_t(\lambda)| < 1$ for all $|\lambda| < \lambda_2$
Chebyshev Polynomials Again
• It turns out that Chebyshev polynomials solve this problem.

• Recall: $T_0(x) = 1$, $T_1(x) = x$, and

$$T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x)$$

• Nice properties:

$$|x| \le 1 \;\Rightarrow\; |T_n(x)| \le 1$$


Chebyshev Polynomials

[Plots of the first several Chebyshev polynomials on the interval $[-1.5, 1.5]$: $T_0(u) = 1$, $T_1(u) = u$, $T_2(u) = 2u^2 - 1$, and higher degrees. Each stays within $[-1, 1]$ on the interval $[-1, 1]$ and grows rapidly outside it.]
Chebyshev Polynomials Again
• It turns out that Chebyshev polynomials solve this problem.

• Recall: $T_0(x) = 1$, $T_1(x) = x$, and

$$T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x)$$

• Nice properties:

$$|x| \le 1 \;\Rightarrow\; |T_n(x)| \le 1, \qquad T_n(1 + \epsilon) \approx \Theta\!\left(\left(1 + \sqrt{2\epsilon}\right)^n\right)$$
Using Chebyshev Polynomials
• So we can choose our polynomial f in terms of T
• Want: $f_t(\lambda_1)$ to be as large as possible, subject to $|f_t(\lambda)| < 1$ for all $|\lambda| < \lambda_2$

• To make this work, set

$$f_n(x) = T_n\!\left(\frac{x}{\lambda_2}\right)$$

• Can do this by running the update (sketched in code below)

$$x_{t+1} = \frac{2A}{\lambda_2}\, x_t - x_{t-1} \;\;\Rightarrow\;\; x_t = T_t\!\left(\frac{A}{\lambda_2}\right) x_0$$
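Here is a minimal sketch of this accelerated update in NumPy, assuming an estimate of $\lambda_2$ is available to pass in as lambda2 (my own illustration; in practice $\lambda_2$ must itself be estimated):

```python
import numpy as np

def momentum_pca(A, lambda2, num_iters=50, seed=0):
    """Accelerated power iteration: x_{t+1} = (2A / lambda2) x_t - x_{t-1},
    which gives x_t = T_t(A / lambda2) x_0 up to an overall scale."""
    rng = np.random.default_rng(seed)
    x_prev = rng.standard_normal(A.shape[0])
    x_curr = (A @ x_prev) / lambda2          # x_1 = T_1(A / lambda2) x_0
    for _ in range(num_iters - 1):
        x_prev, x_curr = x_curr, (2.0 / lambda2) * (A @ x_curr) - x_prev
        # Rescale both iterates by the same factor to avoid overflow; the recurrence
        # is linear, so this changes only the overall scale, not the direction.
        scale = np.linalg.norm(x_curr)
        x_prev, x_curr = x_prev / scale, x_curr / scale
    return x_curr / np.linalg.norm(x_curr)
```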
Convergence of Momentum PCA
$$\frac{x_t}{\|x_t\|} = \frac{\sum_{i=1}^n T_t\!\left(\frac{\lambda_i}{\lambda_2}\right) u_i u_i^T x_0}{\sqrt{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right) (u_i^T x_0)^2}}$$

• Cosine-squared of angle to dominant component:

$$\cos^2(\theta) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{T_t^2\!\left(\frac{\lambda_1}{\lambda_2}\right)(u_1^T x_0)^2}{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2}$$
Convergence of Momentum PCA (continued)
$$\cos^2(\theta) = \frac{T_t^2\!\left(\frac{\lambda_1}{\lambda_2}\right)(u_1^T x_0)^2}{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2} = 1 - \frac{\sum_{i=2}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2}{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2} \ge 1 - \frac{\sum_{i=2}^n (u_i^T x_0)^2}{T_t^2\!\left(\frac{\lambda_1}{\lambda_2}\right)(u_1^T x_0)^2} = 1 - \Omega\!\left(T_t^{-2}\!\left(\frac{\lambda_1}{\lambda_2}\right)\right)$$
Convergence of Momentum PCA (continued)
$$\cos^2(\theta) \ge 1 - \Omega\!\left(T_t^{-2}\!\left(\frac{\lambda_1}{\lambda_2}\right)\right) = 1 - \Omega\!\left(T_t^{-2}\!\left(1 + \frac{\lambda_1 - \lambda_2}{\lambda_2}\right)\right) = 1 - \Omega\!\left(\left(1 + \sqrt{\frac{2(\lambda_1 - \lambda_2)}{\lambda_2}}\right)^{-2t}\right)$$

• Recall that standard power iteration had:

$$\cos^2(\theta) = 1 - \Omega\!\left(\left(\frac{\lambda_2}{\lambda_1}\right)^{2t}\right) = 1 - \Omega\!\left(\left(1 + \frac{\lambda_1 - \lambda_2}{\lambda_2}\right)^{-2t}\right)$$

• So the momentum rate is asymptotically faster than power iteration: with relative eigengap $\epsilon = (\lambda_1 - \lambda_2)/\lambda_2$, the error decays like $(1 + \sqrt{2\epsilon})^{-2t}$ instead of $(1 + \epsilon)^{-2t}$


Questions?
The Kernel Trick, Gram Matrices, and Feature Extraction
CS6787 Lecture 4 — Fall 2017
Basic Linear Models
• For classification using model vector w
$$\text{output} = \operatorname{sign}(w^T x)$$

• Optimization methods vary; here's logistic regression ($y_i \in \{-1, 1\}$)

$$\text{minimize}_w \;\; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-w^T x_i y_i)\right)$$
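As a concrete and entirely illustrative sketch (not course code), this objective can be minimized with plain full-batch gradient descent; the step size and iteration count below are arbitrary placeholders:

```python
import numpy as np

def train_logistic(X, y, step=0.1, num_iters=1000):
    """Gradient descent on (1/n) sum_i log(1 + exp(-w^T x_i y_i)).
    X: (n, d) data matrix, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        margins = (X @ w) * y                        # w^T x_i y_i for each example
        # gradient of log(1 + exp(-m_i)) wrt w is  -y_i x_i / (1 + exp(m_i))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= step * grad
    return w

def predict(w, X):
    return np.sign(X @ w)
```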
Benefits of Linear Models
• Fast classification: just one dot product

• Fast training/learning: just a few basic linear algebra operations

• Drawback: limited expressivity


• Can only capture linear classification boundaries → bad for many problems

• How do we let linear models represent a broader class of decision


boundaries, while retaining the systems benefits?
The Kernel Method
• Idea: in a linear model we can think about the similarity between two
training examples x and y as being
$$x^T y$$
• This is related to the rate at which a random classifier will separate x and y

• Kernel methods replace this dot-product similarity with an arbitrary


Kernel function that computes the similarity between x and y
$$K(x, y) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$
Kernel Properties
• What properties do kernels need to have to be useful for learning?

• Key property: kernel must be symmetric K(x, y) = K(y, x)

• Key property: kernel must be positive semi-definite


$$\forall c_i \in \mathbb{R},\; x_i \in \mathcal{X}: \quad \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \ge 0$$
• Can check that the dot product has this property
Facts about Positive Semidefinite Kernels
• Sum of two PSD kernels is a PSD kernel

K(x, y) = K1 (x, y) + K2 (x, y) is a PSD kernel


• Product of two PSD kernels is a PSD kernel
K(x, y) = K1 (x, y)K2 (x, y) is a PSD kernel
• Scaling by any function on both sides is a kernel

K(x, y) = f (x)K1 (x, y)f (y) is a PSD kernel


Other Kernel Properties
• Useful property: kernels are often non-negative

$$K(x, y) \ge 0$$

• Useful property: kernels are often scaled such that

K(x, y)  1, and K(x, y) = 1 , x = y


• These properties capture the idea that the kernel is expressing the similarity
between x and y
Common Kernels
• Gaussian kernel/RBF kernel: de-facto kernel in machine learning

$$K(x, y) = \exp\left(-\|x - y\|^2\right)$$

• We can validate that this is a kernel


• Symmetric? ✅
• Positive semi-definite? ✅ WHY?
• Non-negative? ✅
• Scaled so that K(x,x) = 1? ✅
Common Kernels (continued)
• Linear kernel: just the inner product $K(x, y) = x^T y$

• Polynomial kernel: $K(x, y) = (1 + x^T y)^p$

• Laplacian kernel: $K(x, y) = \exp\left(-\|x - y\|_1\right)$

• Last layer of a neural network: if the last layer outputs $\phi(x)$, then the kernel is $K(x, y) = \phi(x)^T \phi(y)$ (see the code sketch below)
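For concreteness, these kernels are easy to write as small NumPy functions. The sketch below is my own, with the polynomial degree p as a placeholder and all bandwidths fixed at 1 to match the formulas above:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, p=3):
    return (1.0 + x @ y) ** p

def gaussian_kernel(x, y):
    return np.exp(-np.sum((x - y) ** 2))

def laplacian_kernel(x, y):
    return np.exp(-np.sum(np.abs(x - y)))
```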
Classifying with Kernels
• An equivalent way of writing a linear model on a training set is
$$\text{output}(x) = \operatorname{sign}\left(\left(\sum_{i=1}^n w_i x_i\right)^T x\right)$$

• We can kernel-ize this by replacing the dot products with kernel evals

$$\text{output}(x) = \operatorname{sign}\left(\sum_{i=1}^n w_i K(x_i, x)\right)$$
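A direct transcription of this kernelized classifier as a sketch, assuming a learned weight vector w over the training points and a kernel function like the illustrative ones above:

```python
import numpy as np

def kernel_predict(w, X_train, x, kernel):
    """output(x) = sign( sum_i w_i K(x_i, x) ) for a single query point x."""
    scores = np.array([kernel(x_i, x) for x_i in X_train])
    return np.sign(w @ scores)
```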
Learning with Kernels
• An equivalent way of writing linear-model logistic regression is
$$\text{minimize}_w \;\; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-\left(\sum_{j=1}^n w_j x_j\right)^T x_i y_i\right)\right)$$

• We can kernel-ize this by replacing the dot products with kernel evals

$$\text{minimize}_w \;\; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-\sum_{j=1}^n w_j y_i K(x_j, x_i)\right)\right)$$
The Computational Cost of Kernels
• Recall: benefit of learning with kernels is that we can express a wider
class of classification functions

• Recall: another benefit is linear classifier learning problems are


“easy” to solve because they are convex, and gradients easy to compute

• Major cost of learning naively with Kernels: have to evaluate K(x, y)


• For SGD, need to do this effectively n times per update
• Computationally intractable unless K is very simple
The Gram Matrix
• Address this computational problem by pre-computing the kernel
function for all pairs of training examples in the dataset.
$$G_{i,j} = K(x_i, x_j)$$

• Transforms the learning problem into

$$\text{minimize}_w \;\; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-y_i\, e_i^T G w\right)\right)$$
• This is much easier than recomputing the kernel at each iteration
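A sketch of this precomputation (my own illustration): build G once, then note that $e_i^T G w$ is just the i-th entry of $Gw$, so training reduces to ordinary logistic regression with G playing the role of the data matrix. The call to train_logistic below refers to the earlier illustrative sketch, not to any course-provided code.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Precompute G[i, j] = K(x_i, x_j) for all pairs of training points."""
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            G[i, j] = G[j, i] = kernel(X[i], X[j])   # kernels are symmetric
    return G

# Training then reuses the earlier logistic-regression sketch with G as the data matrix:
#   G = gram_matrix(X_train, gaussian_kernel)
#   w = train_logistic(G, y)   # w now has one weight per training example
```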
Problems with the Gram Matrix
• Suppose we have n examples in our training set.

• How much memory is required to store the Gram matrix G?

• What is the cost of taking the product Gi w to compute a gradient?

• What happens if we have one hundred million training examples?
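A rough back-of-the-envelope answer to the memory question, assuming the Gram matrix is stored in 8-byte floats (my own illustration):

$$n^2 \times 8 \text{ bytes} = (10^8)^2 \times 8 \text{ bytes} = 8 \times 10^{16} \text{ bytes} = 80 \text{ petabytes},$$

and each product $Gw$ takes on the order of $n^2 = 10^{16}$ multiply-adds, which is why the feature-extraction approach that follows matters.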


Feature Extraction
• Simple case: let’s imagine that X is a finite set {1, 2, …, k}

• We can define our kernel as a matrix $M \in \mathbb{R}^{k \times k}$

$$M_{i,j} = K(i, j)$$

• Since M is positive semidefinite, it has a square root $U^T U = M$

$$\sum_{l=1}^{k} U_{l,i}\, U_{l,j} = M_{i,j} = K(i, j)$$
Feature Extraction (continued)
• So if we define a feature mapping $\phi(i) = U e_i$ then

$$\phi(i)^T \phi(j) = \sum_{l=1}^{k} U_{l,i}\, U_{l,j} = M_{i,j} = K(i, j)$$

• The kernel is equivalent to a dot product in some space

• In fact, this is true for all kernels, not just finite ones
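For a finite set, this square root can be computed explicitly. Here is a sketch using an eigendecomposition (my own illustration; a Cholesky factorization also works when M is strictly positive definite):

```python
import numpy as np

def feature_map_from_kernel_matrix(M):
    """Given a PSD kernel matrix M with M[i, j] = K(i, j), return U such that
    U.T @ U = M, so the feature map phi(i) = U[:, i] satisfies
    phi(i)^T phi(j) = K(i, j)."""
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)      # round-off can make tiny eigenvalues negative
    return np.diag(np.sqrt(eigvals)) @ eigvecs.T

# Sanity check: Phi = feature_map_from_kernel_matrix(M); np.allclose(Phi.T @ Phi, M)
```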
Classifying with feature maps
• Suppose that we can find a finite-dimensional feature map that satisfies
$$\phi(i)^T \phi(j) = K(i, j)$$

• Then we can simplify our classifier to

$$\text{output}(x) = \operatorname{sign}\left(\sum_{i=1}^n w_i K(x_i, x)\right) = \operatorname{sign}\left(\sum_{i=1}^n w_i\, \phi(x_i)^T \phi(x)\right) = \operatorname{sign}\left(u^T \phi(x)\right), \quad \text{where } u = \sum_{i=1}^n w_i\, \phi(x_i)$$
Learning with feature maps
• Similarly we can simplify our learning objective to
$$\text{minimize}_u \;\; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-u^T \phi(x_i)\, y_i\right)\right)$$

• Take-away: this is just transforming the input data, then running a


linear classifier in the transformed space!

• Computationally: super efficient


• As long as we can transform and store the input data in an efficient way
Problems with Feature Maps
• The dimension of the transformed data may be much larger than the
dimension of the original data.

• Suppose that the feature map is $\phi : \mathbb{R}^d \to \mathbb{R}^D$ and there are n examples

• How much memory is needed to store the transformed features?

• What is the cost of taking the product $u^T \phi(x_i)$ to compute a gradient?


Feature Maps vs. Gram Matrices
• Systems trade-offs exist here.

• When number of examples gets very large, feature maps are better.

• When transformed feature vectors have high dimensionality, Gram


matrices are better.
Another Problem with Feature Maps
• Recall: I said there was always a feature map for any kernel such that
$$\phi(i)^T \phi(j) = K(i, j)$$
• But this feature map is not always finite-dimensional
• For example, the Gaussian/RBF kernel has an infinite-dimensional feature map
• Many kernels we care about in ML have this property

• What do we do if ɸ has infinite dimensions?


• We can’t just compute with it normally!
Solution: Approximate Feature Maps
• Find a finite-dimensional feature map so that
$$K(x, y) \approx \phi(x)^T \phi(y)$$

• Typically, we want to find a family of feature maps $\phi_D$ such that

$$\phi_D : \mathbb{R}^d \to \mathbb{R}^D, \qquad \lim_{D \to \infty} \phi_D(x)^T \phi_D(y) = K(x, y)$$
Types of Approximate Feature Maps
• Deterministic feature maps
• Choose a fixed-a-priori method of approximating the kernel
• Generally not very popular because of the way they scale with dimensions

• Random feature maps


• Choose a feature map at random (typically each feature is independent) such that
$$\mathbb{E}\left[\phi(x)^T \phi(y)\right] = K(x, y)$$

• Then prove with high probability that over some region of interest

$$\left|\phi(x)^T \phi(y) - K(x, y)\right| \le \epsilon$$
Types of Approximate Features (continued)
• Orthogonal randomized feature maps
• Intuition behind this: if we have a feature map where for some i and j

$$e_i^T \phi(x) \approx e_j^T \phi(x)$$


then we can’t actually learn much from having both features.
• Strategy: choose the feature map at random, but subject to the constraint that the
features be “orthogonal” in some way.

• Quasi-random feature maps


• Generate features using a low-discrepancy sequence rather than true randomness
Adaptive Feature Maps
• Everything before this didn’t take the data into account

• Adaptive feature maps look at the actual training set and try to minimize
the kernel approximation error using the training set as a guide
• For example: we can do a random feature map, and then fine-tune the
randomness to minimize the empirical error over the training set
• Gaining in popularity

• Also, neural networks can be thought of as adaptive feature maps.


Systems Tradeoffs
• Lots of tradeoffs here

• Do we spend more work up-front constructing a more sophisticated


approximation, to save work on learning algorithms?

• Would we rather scale with the data, or scale to more complicated


problems?

• Another task for metaparameter optimization


Questions
• Upcoming things:
• Paper 2 review due tonight
• Paper 3 in class on Wednesday
• Start thinking about the class project — it will come faster than you think!
