The Kernel Trick, Gram
Matrices, and Feature Extraction
CS6787 Lecture 4 — Fall 2017
Momentum for Principal
Component Analysis
CS6787 Lecture 3.1 — Fall 2017
Principal Component Analysis
• Setting: find the dominant eigenvalue-eigenvector pair of a positive
semidefinite symmetric matrix A.
$$u_1 = \arg\max_x \frac{x^T A x}{x^T x}$$
• Many ways to write this problem, e.g.
$$\sqrt{\lambda_1}\, u_1 = \arg\min_x \left\| x x^T - A \right\|_F^2$$
where $\|B\|_F$ is the Frobenius norm: $\|B\|_F^2 = \sum_i \sum_j B_{i,j}^2$
PCA: A Non-Convex Problem
• PCA is not convex in any of these formulations
• Why? Think about the solutions to the problem: u and –u
• Two distinct solutions → can’t be convex
• Can we still use momentum to run PCA more quickly?
Power Iteration
• Before we apply momentum, we need to choose what base algorithm
we’re using.
• Simplest algorithm: power iteration
• Repeatedly multiply by the matrix A to get an answer
$$x_{t+1} = A x_t$$
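• As a concrete reference, here is a minimal NumPy sketch of power iteration (the per-step normalization is not part of the bare update above; it is assumed here just to keep the iterate’s magnitude under control):
```python
import numpy as np

def power_iteration(A, num_iters=100, seed=0):
    """Estimate the dominant eigenvalue/eigenvector pair of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        x = A @ x                      # the power iteration update x_{t+1} = A x_t
        x /= np.linalg.norm(x)         # rescale so the iterate doesn't overflow/underflow
    return x @ A @ x, x                # Rayleigh quotient estimate of lambda_1, and u_1
```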
Why does Power Iteration Work?
• Let the eigendecomposition of A be $A = \sum_{i=1}^n \lambda_i u_i u_i^T$
• For $\lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n$
• Power iteration converges in direction because the cosine-squared of the angle to $u_1$ is
$$\cos^2(\theta) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{(u_1^T A^t x_0)^2}{\|A^t x_0\|^2} = \frac{\lambda_1^{2t} (u_1^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2}$$
$$= 1 - \frac{\sum_{i=2}^n \lambda_i^{2t} (u_i^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2} = 1 - \Omega\!\left(\left(\frac{\lambda_2}{\lambda_1}\right)^{2t}\right)$$
What about a more general algorithm?
• Use both current iterate, and history of past iterations
$$x_{t+1} = \alpha_t A x_t + \beta_{t,1} x_{t-1} + \beta_{t,2} x_{t-2} + \cdots + \beta_{t,t} x_0$$
• for fixed parameters α and β
• What class of functions can we express in this form?
• Notice: xt is always a degree-t polynomial in A times x0
• Can prove by induction that we can express ANY polynomial
Power Iteration and Polynomials
• Can also think of power iteration as a degree-t polynomial of A
$$x_t = A^t x_0$$
• Is there a better degree-t polynomial to use than $f_t(x) = x^t$?
• If we use a different polynomial, then we get
$$x_t = f_t(A) x_0 = \sum_{i=1}^n f_t(\lambda_i)\, u_i u_i^T x_0$$
• Ideal solution: choose polynomial with zeros at all non-dominant eigenvalues
• Practically, make $f_t(\lambda_1)$ as large as possible while keeping $|f_t(\lambda)| \le 1$ for all $|\lambda| \le \lambda_2$
Chebyshev Polynomials Again
• It turns out that Chebyshev polynomials solve this problem.
• Recall: $T_0(x) = 1$, $T_1(x) = x$, and
$$T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x)$$
• Nice properties:
$$|x| \le 1 \;\Rightarrow\; |T_n(x)| \le 1$$
Chebyshev Polynomials
[Figure: plots of the Chebyshev polynomials $T_0(u) = 1$, $T_1(u) = u$, $T_2(u) = 2u^2 - 1$, and higher-degree polynomials on the interval $[-1.5, 1.5]$.]
Chebyshev Polynomials Again
• It turns out that Chebyshev polynomials solve this problem.
• Recall: $T_0(x) = 1$, $T_1(x) = x$, and
$$T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x)$$
• Nice properties:
$$|x| \le 1 \;\Rightarrow\; |T_n(x)| \le 1 \qquad\qquad T_n(1 + \epsilon) \approx \Theta\!\left(\left(1 + \sqrt{2\epsilon}\right)^n\right)$$
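• A small sketch of evaluating $T_n$ directly from this recurrence; the print statements just illustrate the two properties (the growth constant is only approximate):
```python
import numpy as np

def chebyshev_T(n, x):
    """Evaluate the Chebyshev polynomial T_n(x) via the recurrence
    T_0(x) = 1, T_1(x) = x, T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)."""
    x = np.asarray(x, dtype=float)
    if n == 0:
        return np.ones_like(x)
    t_prev, t_curr = np.ones_like(x), x
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

# Bounded by 1 on [-1, 1], but grows roughly like (1 + sqrt(2*eps))^n just outside it.
print(np.max(np.abs(chebyshev_T(10, np.linspace(-1, 1, 101)))))  # <= 1 (up to round-off)
print(chebyshev_T(10, 1.1))                                      # large, order of (1 + sqrt(0.2))**10
```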
Using Chebyshev Polynomials
• So we can choose our polynomial f in terms of T
• Want: $f_t(\lambda_1)$ to be as large as possible, subject to $|f_t(\lambda)| \le 1$ for all $|\lambda| \le \lambda_2$
• To make this work, set
$$f_n(x) = T_n\!\left(\frac{x}{\lambda_2}\right)$$
• Can do this by running the update
$$x_{t+1} = \frac{2A}{\lambda_2} x_t - x_{t-1} \quad\Rightarrow\quad x_t = T_t\!\left(\frac{A}{\lambda_2}\right) x_0$$
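• A minimal sketch of this update in NumPy, assuming an estimate of $\lambda_2$ is available (in practice it has to be guessed or treated as a hyperparameter); the periodic rescaling is a numerical safeguard and does not change the direction, because the recurrence is linear:
```python
import numpy as np

def momentum_pca(A, lambda2, num_iters=100, seed=0):
    """Power iteration with momentum: x_{t+1} = (2A/lambda2) x_t - x_{t-1},
    so that x_t is proportional to T_t(A / lambda2) x_0."""
    rng = np.random.default_rng(seed)
    x_prev = rng.standard_normal(A.shape[0])        # plays the role of x_0
    x_curr = (A @ x_prev) / lambda2                 # x_1 = (A/lambda2) x_0 starts the recurrence
    for _ in range(num_iters - 1):
        x_prev, x_curr = x_curr, (2.0 / lambda2) * (A @ x_curr) - x_prev
        scale = np.linalg.norm(x_curr)              # rescale both iterates by the same factor;
        x_prev, x_curr = x_prev / scale, x_curr / scale  # the linear recurrence keeps its direction
    u = x_curr / np.linalg.norm(x_curr)
    return u @ A @ u, u                             # Rayleigh quotient estimate of lambda_1, and u_1
```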
Convergence of Momentum PCA
$$\frac{x_t}{\|x_t\|} = \frac{\sum_{i=1}^n T_t\!\left(\frac{\lambda_i}{\lambda_2}\right) u_i u_i^T x_0}{\sqrt{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right) (u_i^T x_0)^2}}$$
• Cosine-squared of angle to dominant component:
$$\cos^2(\theta) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{T_t^2\!\left(\frac{\lambda_1}{\lambda_2}\right)(u_1^T x_0)^2}{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2}$$
Convergence of Momentum PCA (continued)
$$\cos^2(\theta) = \frac{T_t^2\!\left(\frac{\lambda_1}{\lambda_2}\right)(u_1^T x_0)^2}{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2} = 1 - \frac{\sum_{i=2}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2}{\sum_{i=1}^n T_t^2\!\left(\frac{\lambda_i}{\lambda_2}\right)(u_i^T x_0)^2}$$
$$\ge 1 - \frac{\sum_{i=2}^n (u_i^T x_0)^2}{T_t^2\!\left(\frac{\lambda_1}{\lambda_2}\right)(u_1^T x_0)^2} = 1 - \Omega\!\left(T_t^{-2}\!\left(\frac{\lambda_1}{\lambda_2}\right)\right)$$
Convergence of Momentum PCA (continued)
$$\cos^2(\theta) \ge 1 - \Omega\!\left(T_t^{-2}\!\left(\frac{\lambda_1}{\lambda_2}\right)\right) = 1 - \Omega\!\left(T_t^{-2}\!\left(1 + \frac{\lambda_1 - \lambda_2}{\lambda_2}\right)\right)$$
$$= 1 - \Omega\!\left(\left(1 + \sqrt{\frac{2(\lambda_1 - \lambda_2)}{\lambda_2}}\right)^{-2t}\right)$$
• Recall that standard power iteration had:
$$\cos^2(\theta) = 1 - \Omega\!\left(\left(\frac{\lambda_2}{\lambda_1}\right)^{2t}\right) = 1 - \Omega\!\left(\left(1 + \frac{\lambda_1 - \lambda_2}{\lambda_2}\right)^{-2t}\right)$$
• So the momentum rate is asymptotically faster than power iteration
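• To make that comparison concrete, here is one hedged way to read off iteration counts from the two rates, writing $\Delta = (\lambda_1 - \lambda_2)/\lambda_2$ and treating the gap as small (constants dropped):
\begin{align*}
\text{Power iteration:} \quad (1+\Delta)^{-2t} \le \epsilon
  &\iff t \ge \frac{\log(1/\epsilon)}{2\log(1+\Delta)}
   \approx \frac{\log(1/\epsilon)}{2\Delta}, \\
\text{Momentum:} \quad \bigl(1+\sqrt{2\Delta}\bigr)^{-2t} \le \epsilon
  &\iff t \ge \frac{\log(1/\epsilon)}{2\log\bigl(1+\sqrt{2\Delta}\bigr)}
   \approx \frac{\log(1/\epsilon)}{2\sqrt{2\Delta}}.
\end{align*}
• That is, the dependence on the eigengap improves from roughly $1/\Delta$ to $1/\sqrt{\Delta}$.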
Questions?
The Kernel Trick, Gram
Matrices, and Feature Extraction
CS6787 Lecture 4 — Fall 2017
Basic Linear Models
• For classification using model vector w
$$\text{output} = \operatorname{sign}(w^T x)$$
• Optimization methods vary; here’s logistic regression ($y_i \in \{-1, 1\}$)
$$\text{minimize}_w \quad \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-w^T x_i y_i)\right)$$
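• For reference, a minimal NumPy sketch of this objective and its gradient (the names and the use of `logaddexp` for numerical stability are my own choices, not from the slides):
```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss (1/n) sum_i log(1 + exp(-y_i w^T x_i)).
    X: (n, d) data matrix; y: (n,) labels in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))    # log(1 + exp(-m)), numerically stable

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss with respect to w."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))          # d/dm log(1 + exp(-m)) = -1/(1 + exp(m))
    return (X.T @ coeffs) / X.shape[0]
```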
Benefits of Linear Models
• Fast classification: just one dot product
• Fast training/learning: just a few basic linear algebra operations
• Drawback: limited expressivity
• Can only capture linear classification boundaries → bad for many problems
• How do we let linear models represent a broader class of decision
boundaries, while retaining the systems benefits?
The Kernel Method
• Idea: in a linear model we can think about the similarity between two
training examples x and y as being
$$x^T y$$
• This is related to the rate at which a random classifier will separate x and y
• Kernel methods replace this dot-product similarity with an arbitrary
Kernel function that computes the similarity between x and y
$$K(x, y) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$
Kernel Properties
• What properties do kernels need to have to be useful for learning?
• Key property: kernel must be symmetric K(x, y) = K(y, x)
• Key property: kernel must be positive semi-definite
$$\forall c_i \in \mathbb{R},\; x_i \in \mathcal{X}: \quad \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \ge 0$$
• Can check that the dot product has this property
Facts about Positive Semidefinite Kernels
• Sum of two PSD kernels is a PSD kernel
K(x, y) = K1 (x, y) + K2 (x, y) is a PSD kernel
• Product of two PSD kernels is a PSD kernel
K(x, y) = K1 (x, y)K2 (x, y) is a PSD kernel
• Scaling by any function on both sides is a kernel
K(x, y) = f (x)K1 (x, y)f (y) is a PSD kernel
Other Kernel Properties
• Useful property: kernels are often non-negative
$$K(x, y) \ge 0$$
• Useful property: kernels are often scaled such that
$$K(x, y) \le 1, \quad\text{and}\quad K(x, y) = 1 \Leftrightarrow x = y$$
• These properties capture the idea that the kernel is expressing the similarity
between x and y
Common Kernels
• Gaussian kernel/RBF kernel: de-facto kernel in machine learning
$$K(x, y) = \exp\left(-\|x - y\|^2\right)$$
• We can validate that this is a kernel
• Symmetric? ✅
• Positive semi-definite? ✅ WHY?
• Non-negative? ✅
• Scaled so that K(x,x) = 1? ✅
Common Kernels (continued)
• Linear kernel: just the inner product $K(x, y) = x^T y$
• Polynomial kernel: $K(x, y) = (1 + x^T y)^p$
• Laplacian kernel: $K(x, y) = \exp\left(-\|x - y\|_1\right)$
• Last layer of a neural network:
if the last layer outputs $\phi(x)$, then the kernel is $K(x, y) = \phi(x)^T \phi(y)$
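• Hedged NumPy sketches of these kernels for single vectors (the bandwidth parameter `gamma` is an added hyperparameter; the forms on the slides correspond to `gamma = 1`):
```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                     # plain inner product

def polynomial_kernel(x, y, p=3):
    return (1.0 + x @ y) ** p                        # (1 + x^T y)^p

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))     # exp(-gamma ||x - y||^2)

def laplacian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum(np.abs(x - y)))    # exp(-gamma ||x - y||_1)
```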
Classifying with Kernels
• An equivalent way of writing a linear model on a training set is
$$\text{output}(x) = \operatorname{sign}\left(\left(\sum_{i=1}^n w_i x_i\right)^T x\right)$$
• We can kernel-ize this by replacing the dot products with kernel evaluations
$$\text{output}(x) = \operatorname{sign}\left(\sum_{i=1}^n w_i K(x_i, x)\right)$$
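• A direct (and deliberately naive) NumPy sketch of this kernelized classifier; note that it touches every training example for each prediction, which is the computational cost issue discussed below:
```python
import numpy as np

def kernel_predict(x, X_train, w, kernel):
    """Kernelized classifier: sign(sum_i w_i K(x_i, x)).
    X_train: (n, d) training examples; w: (n,) per-example weights."""
    scores = np.array([kernel(x_i, x) for x_i in X_train])
    return np.sign(w @ scores)
```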
Learning with Kernels
• An equivalent way of writing linear-model logistic regression is
$$\text{minimize}_w \quad \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-\left(\sum_{j=1}^n w_j x_j\right)^T x_i y_i\right)\right)$$
• We can kernel-ize this by replacing the dot products with kernel evaluations
$$\text{minimize}_w \quad \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-\sum_{j=1}^n w_j y_i K(x_j, x_i)\right)\right)$$
The Computational Cost of Kernels
• Recall: benefit of learning with kernels is that we can express a wider
class of classification functions
• Recall: another benefit is linear classifier learning problems are
“easy” to solve because they are convex, and gradients easy to compute
• Major cost of learning naively with Kernels: have to evaluate K(x, y)
• For SGD, need to do this effectively n times per update
• Computationally intractable unless K is very simple
The Gram Matrix
• Address this computational problem by pre-computing the kernel
function for all pairs of training examples in the dataset.
$$G_{i,j} = K(x_i, x_j)$$
• Transforms the learning problem into
$$\text{minimize}_w \quad \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-y_i e_i^T G w\right)\right)$$
• This is much easier than recomputing the kernel at each iteration
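• A minimal sketch of precomputing G and evaluating the objective above (loop-based on purpose, to mirror the definition; a vectorized construction would be preferable in practice):
```python
import numpy as np

def gram_matrix(X, kernel):
    """Precompute G[i, j] = K(x_i, x_j) once, before any training iterations."""
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kernel(X[i], X[j])
    return G

def gram_logistic_loss(w, G, y):
    """(1/n) sum_i log(1 + exp(-y_i e_i^T G w)); the score for example i is (G w)_i."""
    margins = y * (G @ w)
    return np.mean(np.logaddexp(0.0, -margins))
```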
Problems with the Gram Matrix
• Suppose we have n examples in our training set.
• How much memory is required to store the Gram matrix G?
• What is the cost of taking the product Gi w to compute a gradient?
• What happens if we have one hundred million training examples?
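• One back-of-the-envelope way to answer these questions, assuming 4-byte floats and $n = 10^8$ examples (the exact numbers are only illustrative):
\begin{align*}
\text{memory for } G &: \; n^2 \times 4 \text{ bytes} = (10^8)^2 \times 4 \text{ bytes} = 4 \times 10^{16} \text{ bytes} \approx 40 \text{ petabytes}, \\
\text{one row product } G_i w &: \; O(n) \approx 10^8 \text{ multiply-adds per stochastic gradient}, \\
\text{full product } G w &: \; O(n^2) \approx 10^{16} \text{ multiply-adds}.
\end{align*}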
Feature Extraction
• Simple case: let’s imagine that X is a finite set {1, 2, …, k}
• We can define our kernel as a matrix $M \in \mathbb{R}^{k \times k}$ with
$$M_{i,j} = K(i, j)$$
• Since M is positive semidefinite, it has a square root U with $U^T U = M$, so
$$\sum_{l=1}^{k} U_{l,i} U_{l,j} = M_{i,j} = K(i, j)$$
Feature Extraction (continued)
• So if we define a feature mapping $\phi(i) = U e_i$ then
$$\phi(i)^T \phi(j) = \sum_{l=1}^{k} U_{l,i} U_{l,j} = M_{i,j} = K(i, j)$$
• The kernel is equivalent to a dot product in some space
• In fact, this is true for all kernels, not just finite ones
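• A sketch of recovering such a feature map from a finite kernel matrix via an eigendecomposition-based square root (the clipping of tiny negative eigenvalues is a numerical safeguard, not part of the math above):
```python
import numpy as np

def features_from_kernel_matrix(M):
    """Given a PSD kernel matrix M over a finite set X, return U with U^T U = M,
    so that phi(i) = U e_i (column i of U) satisfies phi(i)^T phi(j) = M[i, j]."""
    eigvals, eigvecs = np.linalg.eigh(M)             # M = V diag(eigvals) V^T
    eigvals = np.clip(eigvals, 0.0, None)            # guard against round-off negatives
    U = np.sqrt(eigvals)[:, None] * eigvecs.T        # U = diag(sqrt(eigvals)) V^T
    return U

# Sanity check (up to floating point): np.allclose(U.T @ U, M)
```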
Classifying with feature maps
• Suppose that we can find a finite-dimensional feature map that satisfies
$$\phi(i)^T \phi(j) = K(i, j)$$
• Then we can simplify our classifier to
$$\text{output}(x) = \operatorname{sign}\left(\sum_{i=1}^n w_i K(x_i, x)\right) = \operatorname{sign}\left(\sum_{i=1}^n w_i \phi(x_i)^T \phi(x)\right) = \operatorname{sign}\left(u^T \phi(x)\right)$$
Learning with feature maps
• Similarly we can simplify our learning objective to
$$\text{minimize}_u \quad \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-u^T \phi(x_i) y_i\right)\right)$$
• Take-away: this is just transforming the input data, then running a
linear classifier in the transformed space!
• Computationally: super efficient
• As long as we can transform and store the input data in an efficient way
Problems with Feature Maps
• The dimension of the transformed data may be much larger than the
dimension of the original data.
• Suppose that the feature map is $\phi : \mathbb{R}^d \to \mathbb{R}^D$ and there are n examples
• How much memory is needed to store the transformed features?
• What is the cost of taking the product $u^T \phi(x_i)$ to compute a gradient?
Feature Maps vs. Gram Matrices
• Systems trade-offs exist here.
• When number of examples gets very large, feature maps are better.
• When transformed feature vectors have high dimensionality, Gram
matrices are better.
Another Problem with Feature Maps
• Recall: I said there was always a feature map for any kernel such that
$$\phi(i)^T \phi(j) = K(i, j)$$
• But this feature map is not always finite-dimensional
• For example, the Gaussian/RBF kernel has an infinite-dimensional feature map
• Many kernels we care about in ML have this property
• What do we do if ɸ has infinite dimensions?
• We can’t just compute with it normally!
Solution: Approximate Feature Maps
• Find a finite-dimensional feature map $\phi$ so that
$$K(x, y) \approx \phi(x)^T \phi(y)$$
• Typically, we want to find a family of feature maps $\phi_D : \mathbb{R}^d \to \mathbb{R}^D$ such that
$$\lim_{D \to \infty} \phi_D(x)^T \phi_D(y) = K(x, y)$$
Types of Approximate Feature Maps
• Deterministic feature maps
• Choose a fixed-a-priori method of approximating the kernel
• Generally not very popular because of the way they scale with dimensions
• Random feature maps
• Choose a feature map at random (typically each feature is independent) such that
$$\mathbb{E}\left[\phi(x)^T \phi(y)\right] = K(x, y)$$
• Then prove that, with high probability over some region of interest,
$$\left|\phi(x)^T \phi(y) - K(x, y)\right| \le \epsilon$$
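• As one well-known instance of this recipe, here is a sketch of random Fourier features for the Gaussian/RBF kernel $K(x, y) = \exp(-\gamma \|x - y\|^2)$ (the scaling $\sqrt{2\gamma}$ and the cosine-plus-random-phase form follow the standard construction; `D` and `gamma` are hyperparameters):
```python
import numpy as np

def random_fourier_features(X, D, gamma=1.0, seed=0):
    """Random feature map phi_D with E[phi_D(x)^T phi_D(y)] = exp(-gamma ||x - y||^2),
    with approximation error shrinking as D grows.  X: (n, d); returns (n, D)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))   # frequencies from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                 # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Usage sketch: transform the data once, then train an ordinary linear model on it, e.g.
# Phi_train = random_fourier_features(X_train, D=2000, gamma=0.5)
```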
Types of Approximate Features (continued)
• Orthogonal randomized feature maps
• Intuition behind this: if we have a feature map where for some i and j
$$e_i^T \phi(x) \approx e_j^T \phi(x)$$
then we can’t actually learn much from having both features.
• Strategy: choose the feature map at random, but subject to the constraint that the
features be “orthogonal” in some way.
• Quasi-random feature maps
• Generate features using a low-discrepancy sequence rather than true randomness
Adaptive Feature Maps
• Everything before this didn’t take the data into account
• Adaptive feature maps look at the actual training set and try to minimize
the kernel approximation error using the training set as a guide
• For example: we can do a random feature map, and then fine-tune the
randomness to minimize the empirical error over the training set
• Gaining in popularity
• Also, neural networks can be thought of as adaptive feature maps.
Systems Tradeoffs
• Lots of tradeoffs here
• Do we spend more work up-front constructing a more sophisticated
approximation, to save work on learning algorithms?
• Would we rather scale with the data, or scale to more complicated
problems?
• Another task for metaparameter optimization
Questions
• Upcoming things:
• Paper 2 review due tonight
• Paper 3 in class on Wednesday
• Start thinking about the class project — it will come faster than you think!