M03 Clustering
Motivating Problems
The Lloyd-Max algorithm
First of all, we need to save bits
What if there aren't enough bits, say 8 bits or only 4 bits?
• Some values are misrepresented
Assume that the raw audio samples follow a distribution like this:
[Figure: pdf of the sample distribution on [-1, 1], overlaid with two candidate sets of 2-bit quantization boundaries (red and blue)]
Which do you prefer, the red boundaries or the blue boundaries? (Both are for 2-bit encoding.)
Motivating Problems
The Lloyd-Max algorithm
You prefer the blue boundaries because they follow the underlying structure of the sample distribution.
Underlying structure?
[Figure: the pdf is a mixture of four Gaussian components, (μ=-0.6, σ=0.1), (μ=-0.3, σ=0.1), (μ=0.25, σ=0.1), (μ=0.5, σ=0.1)]
k-Means Clustering
A scalar case
Where does the distortion come from?
Let's start from the red boundaries we don't like.
The distortion is the discrepancy between the representative and the actual samples.
We want to find the representative that creates the least discrepancy
• For each quantization level
[Figure: the mixture pdf with one quantization range highlighted and its representative value θ₁ marked]
k-Means Clustering
A scalar case
So, the objective for the j-th range is to find the representative value θ_j that minimizes the error:
$$\arg\min_{\theta_j} \sum_{i \in C_j} (x_i - \theta_j)^2$$
[Figure: the mixture pdf with the j-th quantization range C_j highlighted]
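As a quick sanity check (a standard derivation, not spelled out on the slide), setting the derivative of this per-range error to zero shows that the best representative is simply the mean of the samples in the range:
$$\frac{\partial}{\partial \theta_j} \sum_{i \in C_j} (x_i - \theta_j)^2 = -2 \sum_{i \in C_j} (x_i - \theta_j) = 0 \quad\Longrightarrow\quad \theta_j = \frac{1}{|C_j|}\sum_{i \in C_j} x_i$$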
k-Means Clustering
A scalar case
Are we done?
No. We assumed that the boundaries were correct, but they aren't.
[Figure: the mixture pdf with the current boundaries and the per-range representatives marked on the axis]
k-Means Clustering
A scalar case
We need to optimize w.r.t. the membership matrix as well
$$\arg\min_{\theta,\,U} \; \sum_{j=1}^{J} \sum_{i=1}^{N} u_{ij}\, \|x_i - \theta_j\|^2$$
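As a concrete illustration (a minimal sketch, assuming NumPy; the function name `lloyd_max_1d`, the uniform initialization, and the example data are my own choices, with the mixture means taken from the earlier slide), this is the alternating optimization in code:

```python
import numpy as np

def lloyd_max_1d(x, n_levels=4, n_iter=50):
    """Alternate between assigning samples to the nearest representative
    (which implicitly sets the boundaries) and updating each representative
    to the mean of its samples: Lloyd-Max, i.e. 1-D k-means."""
    # Initialize representatives uniformly over the data range
    theta = np.linspace(x.min(), x.max(), n_levels)
    for _ in range(n_iter):
        # Assignment step: nearest representative for each sample
        labels = np.argmin(np.abs(x[:, None] - theta[None, :]), axis=1)
        # Update step: each representative becomes the mean of its samples
        for j in range(n_levels):
            if np.any(labels == j):
                theta[j] = x[labels == j].mean()
    labels = np.argmin(np.abs(x[:, None] - theta[None, :]), axis=1)
    return theta, labels

# Example: samples from a mixture-like distribution on [-1, 1]
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(m, 0.1, 500) for m in (-0.6, -0.3, 0.25, 0.5)])
theta, labels = lloyd_max_1d(x, n_levels=4)
x_quantized = theta[labels]   # each sample replaced by its representative
```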
k-Means Clustering
A scalar case
Let's get back to the CD encoding problem (the Lloyd-Max algorithm).
Now, instead of keeping all the possible real values between -1 and +1,
we replace the values within each range with their corresponding representatives, i.e., the means.
[Figure/audio demo: the mixture pdf with its quantization levels, comparing the original signal against 4-bit and 8-bit encodings using Lloyd-Max vs. uniform quantizers]
Motivating Problems
Black cat, red wall, gray ground
Now let’s move on to the multi-dimensional case
In general, how do we quantize a vector?
First off, can you (verbally) describe this picture?
Motivating Problems
Black cat, red wall, gray ground
k-means with three clusters
Motivating Problems
Black cat, red wall, gray ground
k-means with 8 clusters and 16 clusters
Algorithm-wise, everything is the same, except that the input samples are now 3-D (RGB) vectors:
$$\arg\min_{\theta,\,U} \sum_{j=1}^{J}\sum_{i=1}^{N} u_{ij}\,\|x_i - \theta_j\|^2 \qquad\longrightarrow\qquad \arg\min_{\Theta,\,U} \sum_{j=1}^{J}\sum_{i=1}^{N} u_{ij}\,\|\mathbf{x}_i - \boldsymbol{\theta}_j\|^2$$
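As a sketch of how the cat-picture result could be produced (assuming scikit-learn is available and `image` is an H×W×3 RGB array; `quantize_colors` and the variable names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, n_clusters=8):
    """Vector-quantize an RGB image: cluster the pixels in 3-D color
    space and replace each pixel with its cluster mean."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)        # N x 3 samples
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(pixels)
    codebook = km.cluster_centers_                      # J means (codewords)
    indices = km.labels_                                # per-pixel membership
    quantized = codebook[indices].reshape(h, w, 3)      # reconstruction
    return quantized, codebook, indices
```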
Motivating Problems
Black cat, red wall, gray ground
[Figure: pixel samples x_i in 3-D RGB space with the cluster means θ₁, θ₂, θ₃]
Vector Quantization
Clustering on multi-dimensional samples
What we did is something called Vector Quantization (VQ):
Do clustering
Replace vector samples with the mean of the cluster they belong to
What we need:
A good clustering
• A small number of means that are representative enough
A dictionary (codebook)
• Each codeword corresponds to one of the means
An index into the codebook
• The index of each pixel's cluster membership
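As a rough storage example (the image size and numbers here are my own, not from the slides): a 640 × 480 RGB image stored raw takes 640 × 480 × 24 ≈ 7.37 Mbits. With a 16-entry codebook we store one 4-bit index per pixel plus the codebook itself: 640 × 480 × 4 + 16 × 24 ≈ 1.23 Mbits, roughly a 6× reduction.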
Gaussian Mixture Model
What’s wrong with k-means?
What I don't like about k-means:
Euclidean distance, hard decisions, equiprobable clusters, diagonal covariances…
[Figure: 2-D example datasets where these k-means assumptions break down]
Gaussian Mixture Model
An alternative: Mahalanobis distance
Cluster 1 or 2?
Let’s tweak k-means
First, let’s take variance into account for the distance metric
Mahalanobis distance:
$$D_M(x_i \,\|\, \mu_j) = \sqrt{\frac{(x_i - \mu_j)^2}{\sigma_j^2}}$$
where μ_j is the mean and σ_j the standard deviation of the j-th cluster.
Multi-dimensional cases, with covariance Σ_j:
$$D_M(\mathbf{x}_i \,\|\, \boldsymbol{\mu}_j) = \sqrt{(\mathbf{x}_i - \boldsymbol{\mu}_j)^\top \Sigma_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j)}$$
For a 2-D zero-mean Gaussian with
$$\Sigma = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 1 \end{bmatrix}$$
• Point (1, 1): Euclidean distance √2, Mahalanobis distance 1.0847
• Point (1, -1): Euclidean distance √2, Mahalanobis distance 2.5820
[Figure: contour plot of this 2-D Gaussian with the two points marked]
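A quick NumPy check of these two numbers (a sketch; the assumption that the cluster mean sits at the origin is implied by the Euclidean distances of √2):

```python
import numpy as np

sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])
sigma_inv = np.linalg.inv(sigma)
mu = np.zeros(2)  # cluster mean at the origin

def mahalanobis(x, mu, sigma_inv):
    d = x - mu
    return np.sqrt(d @ sigma_inv @ d)

for point in (np.array([1.0, 1.0]), np.array([1.0, -1.0])):
    print(point,
          "Euclidean:", np.linalg.norm(point - mu),           # ~1.4142 for both
          "Mahalanobis:", mahalanobis(point, mu, sigma_inv))  # ~1.0847 vs ~2.5820
```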
Gaussian Mixture Model
Maximum Likelihood
Mixture of Gaussians (MoG) or Gaussian Mixture Model (GMM)
A maximum likelihood problem
• Given the data, find the best fit among a family of probability distributions with a certain parametric form
[Figure: 2-D example datasets to be modeled with a mixture of Gaussians]
Gaussian Mixture Model
Maximum Likelihood
We know how to solve a maximum likelihood problem:
$$\arg\max_{\Theta}\;\prod_{i=1}^{N} p(\mathbf{x}_i; \Theta)$$
For the GMM case, we can break down the likelihood as follows:
$$\mathcal{L} = \prod_{i=1}^{N}\sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)$$
because
$$p(\mathbf{x}_i;\Theta) = \sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)$$
Note that:
$$\Theta = \{P_1, \boldsymbol{\mu}_1, \Sigma_1,\; P_2, \boldsymbol{\mu}_2, \Sigma_2,\; \cdots,\; P_J, \boldsymbol{\mu}_J, \Sigma_J\}$$
[Figure: a 2-D dataset with two fitted Gaussian components, j = 1 and j = 2]
Gaussian Mixture Model
Maximum Likelihood
We also know the p.d.f. of a Gaussian:
$$\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_j, \Sigma_j) = \frac{1}{(2\pi)^{D/2}|\Sigma_j|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big)$$
Therefore, the likelihood is
$$\mathcal{L} = \prod_{i=1}^{N}\sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)
= \prod_{i=1}^{N}\sum_{j=1}^{J} P_j\,\frac{1}{(2\pi)^{D/2}|\Sigma_j|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big)$$
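As a sketch of how this likelihood is typically evaluated in practice (in the log domain, to avoid numerical underflow; SciPy is assumed and `gmm_log_likelihood` is my own helper name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, priors, means, covs):
    """log L = sum_i log sum_j P_j N(x_i; mu_j, Sigma_j)."""
    N, J = X.shape[0], len(priors)
    log_pxj = np.zeros((N, J))
    for j in range(J):
        # log of P_j * N(x_i; mu_j, Sigma_j) for every sample i
        log_pxj[:, j] = np.log(priors[j]) + \
            multivariate_normal.logpdf(X, mean=means[j], cov=covs[j])
    # log-sum-exp over the components, for numerical stability
    m = log_pxj.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze() + np.log(np.exp(log_pxj - m).sum(axis=1))))
```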
Gaussian Mixture Model
Maximum Likelihood
The objective function, with a Lagrange multiplier λ enforcing that the priors sum to one:
$$\arg\max_{\Theta}\; \mathrm{LL} + \lambda\Big(\sum_{j=1}^{J} P_j - 1\Big)$$
$$\arg\max_{\Theta}\; \sum_{i=1}^{N}\log\!\Bigg(\sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)\Bigg) + \lambda\Big(\sum_{j=1}^{J} P_j - 1\Big)$$
Differentiation is difficult. Why?
• Because of the summation inside the logarithm
Gaussian Mixture Model
Jensen’s Inequality
For example, for a concave function f:
$$f\Big(\tfrac{1}{3}x_1 + \tfrac{2}{3}x_2\Big) \;\geq\; \tfrac{1}{3}f(x_1) + \tfrac{2}{3}f(x_2)$$
[Figure: a concave curve with f(x₁), f(x₂), and f(⅓x₁ + ⅔x₂) marked; the chord lies below the curve]
For a concave function f:
$$f\Bigg(\frac{\sum_i a_i x_i}{\sum_i a_i}\Bigg) \;\geq\; \frac{\sum_i a_i f(x_i)}{\sum_i a_i}$$
Or:
$$f\Big(\sum_i a_i x_i\Big) \;\geq\; \sum_i a_i f(x_i) \qquad \text{if } \sum_i a_i = 1 \text{ and } a_i \geq 0$$
Logarithmic functions are concave. Why?
$$\log'(x) = \frac{1}{x}, \qquad \log''(x) = -\frac{1}{x^2} < 0$$
[Figure: plot of ln p_k]
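For a concrete check (the numbers are my own): with a₁ = a₂ = ½, x₁ = 1, x₂ = 9, we get ln(½·1 + ½·9) = ln 5 ≈ 1.609, while ½ ln 1 + ½ ln 9 ≈ 1.099, so the concave-function inequality holds with room to spare.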
Gaussian Mixture Model
Expectation Maximization (EM)
Let's get back to the ML problem for the GMM:
$$\mathrm{LL} = \sum_{i=1}^{N} \log\!\Bigg(\sum_{j=1}^{J} U_{ij}\,\frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{U_{ij}}\Bigg), \qquad \sum_j U_{ij} = 1, \;\; U_{ij} \geq 0$$
$$\geq \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{U_{ij}}\Bigg) \qquad \text{(Jensen's inequality)}$$
$$= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{p(j|\mathbf{x}_i)\,p(\mathbf{x}_i)}{U_{ij}}\Bigg)
\qquad \because\; p(j|\mathbf{x}_i) = \frac{p(\mathbf{x}_i|j)\,p(j)}{\sum_j p(\mathbf{x}_i|j)\,p(j)} = \frac{\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)\,P_j}{p(\mathbf{x}_i)}$$
$$= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{p(j|\mathbf{x}_i)}{U_{ij}}\Bigg) + \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log p(\mathbf{x}_i)$$
$$= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{p(j|\mathbf{x}_i)}{U_{ij}}\Bigg) + \sum_{i=1}^{N}\log p(\mathbf{x}_i)
= -\sum_{i=1}^{N} D_{\mathrm{KL}}\big(U_{i,:}\,\|\,p(\cdot\,|\mathbf{x}_i)\big) + \sum_{i=1}^{N}\log p(\mathbf{x}_i)$$
Gaussian Mixture Model
Expectation Maximization (EM)
Recall the Gaussian p.d.f.:
$$\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_j, \Sigma_j) = \frac{1}{(2\pi)^{D/2}|\Sigma_j|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big)$$
M-step
We find the Θ that maximizes $\mathrm{LL} + \lambda\big(\sum_{j=1}^{J} P_j - 1\big)$, working with the lower bound:
$$\mathrm{LL} \;\geq\; \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{U_{ij}}\Bigg)
= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\big(P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)\big)
\;-\; \underbrace{\sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log U_{ij}}_{\text{constant w.r.t. }\Theta}$$
• For the covariances:
$$\arg\max_{\Sigma_j} J_{\Sigma_j}, \qquad J_{\Sigma_j} = \sum_{i=1}^{N} U_{ij}\Big(-\tfrac{1}{2}\log|\Sigma_j| - \tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big) + \text{const.}$$
Gaussian Mixture Model
Expectation Maximization (EM)
M-step
Take partial derivatives w.r.t. the parameters and find the local maxima.
• For the means:
$$\arg\max_{\boldsymbol{\mu}_j} J_{\boldsymbol{\mu}_j}, \qquad J_{\boldsymbol{\mu}_j} = \sum_{i=1}^{N} U_{ij}\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big) + \text{const.}$$
$$\frac{\partial J_{\boldsymbol{\mu}_j}}{\partial \boldsymbol{\mu}_j} = \sum_{i=1}^{N} U_{ij}\,\Sigma_j^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_j) = 0
\quad\Longrightarrow\quad
\boldsymbol{\mu}_j = \frac{\sum_{i=1}^{N} U_{ij}\,\mathbf{x}_i}{\sum_{i=1}^{N} U_{ij}}$$
• For the priors:
$$\arg\max_{P_j} J_{P_j}, \qquad J_{P_j} = \sum_{i=1}^{N} U_{ij}\log P_j + \lambda\Big(\sum_{j=1}^{J} P_j - 1\Big) + \text{const.}$$
$$\frac{\partial J_{P_j}}{\partial P_j} = \frac{\sum_{i=1}^{N} U_{ij}}{P_j} + \lambda = 0, \qquad
\frac{\partial J_{P_j}}{\partial \lambda} = \sum_{j=1}^{J} P_j - 1 = 0$$
$$\Longrightarrow\; \sum_{j=1}^{J}\sum_{i=1}^{N} U_{ij} = -\lambda\sum_{j=1}^{J} P_j
\;\Longrightarrow\; -\lambda = \sum_{j=1}^{J}\sum_{i=1}^{N} U_{ij} = N
\;\Longrightarrow\; P_j = \frac{\sum_{i=1}^{N} U_{ij}}{N}$$
• For the covariances (see the Matrix Cookbook, Sections 2.1.2 and 2.2):
$$\Sigma_j = \frac{\sum_i U_{ij}\,(\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^\top}{\sum_i U_{ij}}$$
Gaussian Mixture Model
Expectation Maximization (EM)
E-step: calculate the posterior probabilities
$$U_{ij} = p(j|\mathbf{x}_i) = \frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{\sum_{j'} P_{j'}\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_{j'},\Sigma_{j'})}$$
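Putting the E-step and M-step together, here is a compact NumPy/SciPy sketch of EM for a GMM (the initialization, the fixed iteration count, and the small regularizer added to the covariances are my own choices, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, J=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Simple initialization: random samples as means, shared covariance, uniform priors
    means = X[rng.choice(N, J, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(J)])
    priors = np.full(J, 1.0 / J)

    for _ in range(n_iter):
        # E-step: responsibilities U_ij = p(j | x_i)
        U = np.column_stack([
            priors[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
            for j in range(J)
        ])
        U /= U.sum(axis=1, keepdims=True)

        # M-step: re-estimate priors, means, covariances
        Nj = U.sum(axis=0)                      # effective cluster sizes
        priors = Nj / N
        means = (U.T @ X) / Nj[:, None]
        for j in range(J):
            diff = X - means[j]
            covs[j] = (U[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return priors, means, covs, U
```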
Gaussian Mixture Model
Too much math?
Mixture of Multinomial Distributions
Clustering Musical Notes
How many clusters? (The samples are magnitude spectra.)
I'm curious what kinds of notes are in the signal.
[Figure: magnitude spectrogram of the signal, 0–8000 Hz, 0.3–4.5 s]
Mixture of Multinomial Distributions
Clustering Musical Notes
For the i-th spectrum x_i:
Although the magnitudes are real numbers, we can scale them to convert them into integers:
$$[1.2,\; 1.35,\; 0.1,\; 5.525] = 0.025 \times [48,\; 54,\; 4,\; 221]$$
Then, we can think of each spectrum as an observation from a multinomial distribution:
$$\mathcal{M}(\mathbf{x}_i; \boldsymbol{\theta}) = \frac{N!}{\prod_d x_{id}!}\prod_d \theta_d^{x_{id}}$$
EM for a mixture of multinomial distributions:
Initialize two mean spectra (random numbers): $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \mathbb{R}^D_+$
Initialize two prior probabilities (random numbers that sum to one): $P_1 + P_2 = 1$
Calculate the posterior probabilities (E-step):
$$p(j=1|\mathbf{x}_i) = \frac{P_1\,\mathcal{M}(\mathbf{x}_i;\boldsymbol{\theta}_1)}{P_1\,\mathcal{M}(\mathbf{x}_i;\boldsymbol{\theta}_1) + P_2\,\mathcal{M}(\mathbf{x}_i;\boldsymbol{\theta}_2)}
= \frac{P_1\,\frac{N!}{\prod_d x_{id}!}\prod_d \theta_{1d}^{x_{id}}}{P_1\,\frac{N!}{\prod_d x_{id}!}\prod_d \theta_{1d}^{x_{id}} + P_2\,\frac{N!}{\prod_d x_{id}!}\prod_d \theta_{2d}^{x_{id}}}
= \frac{P_1\prod_d \theta_{1d}^{x_{id}}}{P_1\prod_d \theta_{1d}^{x_{id}} + P_2\prod_d \theta_{2d}^{x_{id}}}$$
(Note: it's usually fine not to convert the spectra into integers, although it's not strictly correct. In this example I just normalized each spectrum.)
Update the means (M-step)
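A sketch of this E-step in code, assuming the spectra are stored as columns of `X` and each mean spectrum sums to one; computing in the log domain (my addition, not on the slide) avoids underflow for long spectra:

```python
import numpy as np

def multinomial_posteriors(X, thetas, priors):
    """E-step for a mixture of multinomials.
    X:      D x N matrix of (normalized) magnitude spectra
    thetas: D x J matrix of mean spectra, each column sums to one
    priors: length-J vector of prior probabilities
    Returns an N x J matrix of posteriors p(j | x_i).
    The multinomial coefficient N!/prod(x_id!) cancels between the
    numerator and the denominator, so it is never computed."""
    # log p(x_i | j) up to the shared coefficient: sum_d x_id * log theta_jd
    log_lik = X.T @ np.log(thetas + 1e-12)            # N x J
    log_post = np.log(priors)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

The M-step (not spelled out on the slide) would then re-estimate each mean spectrum θ_j as the posterior-weighted average of the spectra, renormalized to sum to one, and each prior P_j as the average posterior.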
Mixture of Multinomial Distributions
Clustering Musical Notes
It’s actually a difficult clustering task with a lot of spurious local minima
[Figure: spectrograms and clustering results for two STFT settings, nfft=1024, hop=256 (left) and nfft=4096, hop=512 (right)]
Locality Sensitive Hashing
From clustering to hashing
Hashing is a popular concept in databases
A query comes in to the database
Instead of comparing the original representations, a hash function maps the query down to an integer (binary) address
The address is associated with a bucket, which can contain a few different database records
• We say that those records collide
Then, we refine the search inside the bucket
This is cheaper than scanning the entire database
Traditional challenges:
Records are better off evenly distributed (for speed)
Overflow
Locality Sensitive Hashing
From clustering to hashing
There's another hashing concept in machine learning:
Locality sensitive hashing, or semantic hashing
For data points x_i and x_j in a D-dimensional space:
If they are close enough, $D(\mathbf{x}_i \,\|\, \mathbf{x}_j) < \tau$,
then the Hamming distance between their hash codes is zero, $\mathcal{H}\big(f(\mathbf{x}_i)\,\|\,f(\mathbf{x}_j)\big) = 0$, with probability p
If they are far enough apart, $D(\mathbf{x}_i \,\|\, \mathbf{x}_j) \geq c\tau$,
then the Hamming distance between their hash codes is zero with probability q
p > q!
A hash function f(·) that meets the above conditions is said to belong to a locality sensitive hash function family
In other words, originally similar items:
Collide in the same bucket
Share the same address
Are quantized using the same binary string
Are in the same cluster
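One classic example of such a family is random-hyperplane hashing (SimHash). It is not necessarily the scheme used in the following slides, but it illustrates the idea that nearby points tend to land in the same bucket (a minimal sketch, with names of my own choosing):

```python
import numpy as np

def make_simhash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each bit is the sign of a random projection.
    Points separated by a small angle flip few bits, so they tend to
    share the same binary address."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))
    def hash_fn(x):
        return tuple((planes @ x > 0).astype(int))   # binary address / bucket key
    return hash_fn

# Bucket a toy database and look up a query by its hash code
h = make_simhash(dim=64, n_bits=8)
rng = np.random.default_rng(1)
db = rng.standard_normal((1000, 64))
buckets = {}
for idx, item in enumerate(db):
    buckets.setdefault(h(item), []).append(idx)

query = db[0] + 0.01 * rng.standard_normal(64)   # a slightly perturbed item
candidates = buckets.get(h(query), [])           # refine the search inside this bucket
```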
Locality Sensitive Hashing
From clustering to hashing
How do we find the hash function?
Well, it’s not easy
Let's see how the original spectra are similar to each other.
[Figure: the spectrogram and pairwise similarity matrices computed from the original spectra]
Spectral Hashing
More machine learning involved
Just another hashing technique, but it tries to minimize the difference between the binary codes of similar items, i.e., similar inputs should map to nearby codes y_i, subject to constraints such as
$$\sum_i \mathbf{y}_i = \mathbf{0}, \qquad \mathbf{y}_i^\top \mathbf{y}_j = 0 \;\text{ if } i \neq j$$
[Figure: an example spectrogram and a resulting binary (±1) code sequence]
Weiss, Yair, Antonio Torralba, and Rob Fergus. "Spectral Hashing." Advances in Neural Information Processing Systems, 2009.
Locality Sensitive Hashing
Why is it useful?
For faster detection
Matching hash codes: each database item X_t is hashed to a code x̃_t, and the query Q is hashed to q̃
DB of millions of items
[Figure: detection performance curves (true positives vs. false positive rate in FP/min); hashing: AUC=0.91 vs. HMM: AUC=0.71 on one task, and hashing: AUC=0.69 vs. HMM: AUC=0.21 on another]
Reading
Textbook 6.8 – 6.16
Textbook 2.5.5
Bishop, “Pattern Recognition and Machine Learning” Chapter 9
Thank You!