M03 Clustering
Motivating Problems
The Lloyd-Max algorithm
First of all, we need to save bits
What if there aren't enough bits, say 8 bits or only 4 bits?
• Some values are misrepresented
Assume that the raw audio samples follow a distribution like this:
[Figure: pdf of the sample distribution on [-1, 1], overlaid with two candidate sets of 2-bit quantization boundaries (red and blue)]
Which do you prefer, the red boundaries or the blue boundaries? (Both are for 2-bit encoding.)
Motivating Problems
The Lloyd-Max algorithm
You prefer the blue boundaries because they follow the underlying structure of the sample distribution.
Underlying structure?
[Figure: the pdf is a mixture of four Gaussian components, (μ=-0.6, σ=0.1), (μ=-0.3, σ=0.1), (μ=0.25, σ=0.1), (μ=0.5, σ=0.1)]
k-Means Clustering
A scalar case
Where does the distortion come from?
Let's start from the red boundaries we don't like.
The distortion is the discrepancy between the representative and the actual samples.
We want to find the representative that creates the least discrepancy
• For each quantization level
[Figure: the mixture pdf with one quantization range highlighted and its representative value θ₁ marked]
k-Means Clustering
A scalar case
So, the objective for the j-th range is to find the representative value θ_j that minimizes the error:
$$\arg\min_{\theta_j} \sum_{i \in C_j} (x_i - \theta_j)^2$$
[Figure: the mixture pdf with the j-th quantization range C_j highlighted]
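As a quick sanity check (a standard derivation, not spelled out on the slide), setting the derivative of this per-range error to zero shows that the best representative is simply the mean of the samples in the range:
$$\frac{\partial}{\partial \theta_j} \sum_{i \in C_j} (x_i - \theta_j)^2 = -2 \sum_{i \in C_j} (x_i - \theta_j) = 0 \quad\Longrightarrow\quad \theta_j = \frac{1}{|C_j|}\sum_{i \in C_j} x_i$$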
k-Means Clustering
A scalar case
Are we done?
No. We assumed that the boundaries were correct, but they aren't.
[Figure: the mixture pdf with the current boundaries and the per-range representatives marked on the axis]
k-Means Clustering
A scalar case
We need to optimize w.r.t. the membership matrix as well
$$\arg\min_{\theta,\,U} \; \sum_{j=1}^{J} \sum_{i=1}^{N} u_{ij}\, \|x_i - \theta_j\|^2$$
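As a concrete illustration (a minimal sketch, assuming NumPy; the function name `lloyd_max_1d`, the uniform initialization, and the example data are my own choices, with the mixture means taken from the earlier slide), this is the alternating optimization in code:

```python
import numpy as np

def lloyd_max_1d(x, n_levels=4, n_iter=50):
    """Alternate between assigning samples to the nearest representative
    (which implicitly sets the boundaries) and updating each representative
    to the mean of its samples: Lloyd-Max, i.e. 1-D k-means."""
    # Initialize representatives uniformly over the data range
    theta = np.linspace(x.min(), x.max(), n_levels)
    for _ in range(n_iter):
        # Assignment step: nearest representative for each sample
        labels = np.argmin(np.abs(x[:, None] - theta[None, :]), axis=1)
        # Update step: each representative becomes the mean of its samples
        for j in range(n_levels):
            if np.any(labels == j):
                theta[j] = x[labels == j].mean()
    labels = np.argmin(np.abs(x[:, None] - theta[None, :]), axis=1)
    return theta, labels

# Example: samples from a mixture-like distribution on [-1, 1]
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(m, 0.1, 500) for m in (-0.6, -0.3, 0.25, 0.5)])
theta, labels = lloyd_max_1d(x, n_levels=4)
x_quantized = theta[labels]   # each sample replaced by its representative
```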
k-Means Clustering
A scalar case
Let's get back to the CD encoding problem (the Lloyd-Max algorithm).
Now, instead of keeping all the possible real values between -1 and +1,
we replace the values within each range with their corresponding representatives, i.e., the means.
[Figure/audio demo: the mixture pdf with its quantization levels, comparing the original signal against 4-bit and 8-bit encodings using Lloyd-Max vs. uniform quantizers]
Motivating Problems
Black cat, red wall, gray ground
Now let’s move on to the multi-dimensional case
In general, how do we quantize a vector?
First off, can you (verbally) describe this picture?
Motivating Problems
Black cat, red wall, gray ground
k-means with three clusters
Motivating Problems
Black cat, red wall, gray ground
k-means with 8 clusters and 16 clusters
Algorithm-wise, everything is the same, except that the input samples are now 3-D (RGB) vectors:
$$\arg\min_{\theta,\,U} \sum_{j=1}^{J}\sum_{i=1}^{N} u_{ij}\,\|x_i - \theta_j\|^2 \qquad\longrightarrow\qquad \arg\min_{\Theta,\,U} \sum_{j=1}^{J}\sum_{i=1}^{N} u_{ij}\,\|\mathbf{x}_i - \boldsymbol{\theta}_j\|^2$$
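As a sketch of how the cat-picture result could be produced (assuming scikit-learn is available and `image` is an H×W×3 RGB array; `quantize_colors` and the variable names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, n_clusters=8):
    """Vector-quantize an RGB image: cluster the pixels in 3-D color
    space and replace each pixel with its cluster mean."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)        # N x 3 samples
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(pixels)
    codebook = km.cluster_centers_                      # J means (codewords)
    indices = km.labels_                                # per-pixel membership
    quantized = codebook[indices].reshape(h, w, 3)      # reconstruction
    return quantized, codebook, indices
```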
Motivating Problems
Black cat, red wall, gray ground
[Figure: pixel samples x_i in 3-D RGB space with the cluster means θ₁, θ₂, θ₃]
Vector Quantization
Clustering on multi-dimensional samples
What we did is something called Vector Quantization (VQ):
Do clustering
Replace vector samples with the mean of the cluster they belong to
What we need:
A good clustering
• A small number of means that are representative enough
A dictionary (codebook)
• Each codeword corresponds to one of the means
An index into the codebook
• The index of each pixel's cluster membership
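As a rough storage example (the image size and numbers here are my own, not from the slides): a 640 × 480 RGB image stored raw takes 640 × 480 × 24 ≈ 7.37 Mbits. With a 16-entry codebook we store one 4-bit index per pixel plus the codebook itself: 640 × 480 × 4 + 16 × 24 ≈ 1.23 Mbits, roughly a 6× reduction.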
Gaussian Mixture Model
What’s wrong with k-means?
What I don't like about k-means:
Euclidean distance, hard decisions, equiprobable clusters, diagonal covariances…
[Figure: 2-D example datasets where these k-means assumptions break down]
Gaussian Mixture Model
An alternative: Mahalanobis distance
Cluster 1 or 2?
Let’s tweak k-means
First, let’s take variance into account for the distance metric
Mahalanobis distance:
$$D_M(x_i \,\|\, \mu_j) = \sqrt{\frac{(x_i - \mu_j)^2}{\sigma_j^2}}$$
where μ_j is the mean and σ_j the standard deviation of the j-th cluster.
Multi-dimensional cases, with covariance Σ_j:
$$D_M(\mathbf{x}_i \,\|\, \boldsymbol{\mu}_j) = \sqrt{(\mathbf{x}_i - \boldsymbol{\mu}_j)^\top \Sigma_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j)}$$
For a 2-D zero-mean Gaussian with
$$\Sigma = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 1 \end{bmatrix}$$
• Point (1, 1): Euclidean distance √2, Mahalanobis distance 1.0847
• Point (1, -1): Euclidean distance √2, Mahalanobis distance 2.5820
[Figure: contour plot of this 2-D Gaussian with the two points marked]
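A quick NumPy check of these two numbers (a sketch; the assumption that the cluster mean sits at the origin is implied by the Euclidean distances of √2):

```python
import numpy as np

sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])
sigma_inv = np.linalg.inv(sigma)
mu = np.zeros(2)  # cluster mean at the origin

def mahalanobis(x, mu, sigma_inv):
    d = x - mu
    return np.sqrt(d @ sigma_inv @ d)

for point in (np.array([1.0, 1.0]), np.array([1.0, -1.0])):
    print(point,
          "Euclidean:", np.linalg.norm(point - mu),           # ~1.4142 for both
          "Mahalanobis:", mahalanobis(point, mu, sigma_inv))  # ~1.0847 vs ~2.5820
```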
Gaussian Mixture Model
Maximum Likelihood
Mixture of Gaussians (MoG) or Gaussian Mixture Model (GMM)
A maximum likelihood problem
• Given the data, find the best fit among a family of probability distributions with a certain parametric form
[Figure: 2-D example datasets to be modeled with a mixture of Gaussians]
Gaussian Mixture Model
Maximum Likelihood
We know how to solve a maximum likelihood problem:
$$\arg\max_{\Theta}\;\prod_{i=1}^{N} p(\mathbf{x}_i; \Theta)$$
For the GMM case, we can break down the likelihood as follows:
$$\mathcal{L} = \prod_{i=1}^{N}\sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)$$
because
$$p(\mathbf{x}_i;\Theta) = \sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)$$
Note that:
$$\Theta = \{P_1, \boldsymbol{\mu}_1, \Sigma_1,\; P_2, \boldsymbol{\mu}_2, \Sigma_2,\; \cdots,\; P_J, \boldsymbol{\mu}_J, \Sigma_J\}$$
[Figure: a 2-D dataset with two fitted Gaussian components, j = 1 and j = 2]
Gaussian Mixture Model
Maximum Likelihood
We also know the p.d.f. of a Gaussian:
$$\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_j, \Sigma_j) = \frac{1}{(2\pi)^{D/2}|\Sigma_j|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big)$$
Therefore, the likelihood is
$$\mathcal{L} = \prod_{i=1}^{N}\sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)
= \prod_{i=1}^{N}\sum_{j=1}^{J} P_j\,\frac{1}{(2\pi)^{D/2}|\Sigma_j|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big)$$
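As a sketch of how this likelihood is typically evaluated in practice (in the log domain, to avoid numerical underflow; SciPy is assumed and `gmm_log_likelihood` is my own helper name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, priors, means, covs):
    """log L = sum_i log sum_j P_j N(x_i; mu_j, Sigma_j)."""
    N, J = X.shape[0], len(priors)
    log_pxj = np.zeros((N, J))
    for j in range(J):
        # log of P_j * N(x_i; mu_j, Sigma_j) for every sample i
        log_pxj[:, j] = np.log(priors[j]) + \
            multivariate_normal.logpdf(X, mean=means[j], cov=covs[j])
    # log-sum-exp over the components, for numerical stability
    m = log_pxj.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze() + np.log(np.exp(log_pxj - m).sum(axis=1))))
```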
Gaussian Mixture Model
Maximum Likelihood
The objective function, with a Lagrange multiplier λ enforcing that the priors sum to one:
$$\arg\max_{\Theta}\; \mathrm{LL} + \lambda\Big(\sum_{j=1}^{J} P_j - 1\Big)$$
$$\arg\max_{\Theta}\; \sum_{i=1}^{N}\log\!\Bigg(\sum_{j=1}^{J} P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)\Bigg) + \lambda\Big(\sum_{j=1}^{J} P_j - 1\Big)$$
Differentiation is difficult. Why?
• Because of the summation inside the logarithm
Gaussian Mixture Model
Jensen’s Inequality
For example, for a concave function f:
$$f\Big(\tfrac{1}{3}x_1 + \tfrac{2}{3}x_2\Big) \;\geq\; \tfrac{1}{3}f(x_1) + \tfrac{2}{3}f(x_2)$$
[Figure: a concave curve with f(x₁), f(x₂), and f(⅓x₁ + ⅔x₂) marked; the chord lies below the curve]
For a concave function f:
$$f\Bigg(\frac{\sum_i a_i x_i}{\sum_i a_i}\Bigg) \;\geq\; \frac{\sum_i a_i f(x_i)}{\sum_i a_i}$$
Or:
$$f\Big(\sum_i a_i x_i\Big) \;\geq\; \sum_i a_i f(x_i) \qquad \text{if } \sum_i a_i = 1 \text{ and } a_i \geq 0$$
Logarithmic functions are concave. Why?
$$\log'(x) = \frac{1}{x}, \qquad \log''(x) = -\frac{1}{x^2} < 0$$
[Figure: plot of ln p_k]
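For a concrete check (the numbers are my own): with a₁ = a₂ = ½, x₁ = 1, x₂ = 9, we get ln(½·1 + ½·9) = ln 5 ≈ 1.609, while ½ ln 1 + ½ ln 9 ≈ 1.099, so the concave-function inequality holds with room to spare.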
Gaussian Mixture Model
Expectation Maximization (EM)
Let's get back to the ML problem for the GMM:
$$\mathrm{LL} = \sum_{i=1}^{N} \log\!\Bigg(\sum_{j=1}^{J} U_{ij}\,\frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{U_{ij}}\Bigg), \qquad \sum_j U_{ij} = 1, \;\; U_{ij} \geq 0$$
$$\geq \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{U_{ij}}\Bigg) \qquad \text{(Jensen's inequality)}$$
$$= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{p(j|\mathbf{x}_i)\,p(\mathbf{x}_i)}{U_{ij}}\Bigg)
\qquad \because\; p(j|\mathbf{x}_i) = \frac{p(\mathbf{x}_i|j)\,p(j)}{\sum_j p(\mathbf{x}_i|j)\,p(j)} = \frac{\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)\,P_j}{p(\mathbf{x}_i)}$$
$$= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{p(j|\mathbf{x}_i)}{U_{ij}}\Bigg) + \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log p(\mathbf{x}_i)$$
$$= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{p(j|\mathbf{x}_i)}{U_{ij}}\Bigg) + \sum_{i=1}^{N}\log p(\mathbf{x}_i)
= -\sum_{i=1}^{N} D_{\mathrm{KL}}\big(U_{i,:}\,\|\,p(\cdot\,|\mathbf{x}_i)\big) + \sum_{i=1}^{N}\log p(\mathbf{x}_i)$$
Gaussian Mixture Model
Expectation Maximization (EM)
Recall the Gaussian p.d.f.:
$$\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_j, \Sigma_j) = \frac{1}{(2\pi)^{D/2}|\Sigma_j|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big)$$
M-step
We find the Θ that maximizes $\mathrm{LL} + \lambda\big(\sum_{j=1}^{J} P_j - 1\big)$, working with the lower bound:
$$\mathrm{LL} \;\geq\; \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\!\Bigg(\frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{U_{ij}}\Bigg)
= \sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log\big(P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)\big)
\;-\; \underbrace{\sum_{i=1}^{N}\sum_{j=1}^{J} U_{ij}\log U_{ij}}_{\text{constant w.r.t. }\Theta}$$
• For the covariances:
$$\arg\max_{\Sigma_j} J_{\Sigma_j}, \qquad J_{\Sigma_j} = \sum_{i=1}^{N} U_{ij}\Big(-\tfrac{1}{2}\log|\Sigma_j| - \tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big) + \text{const.}$$
Gaussian Mixture Model
Expectation Maximization (EM)
M-step
Take partial derivatives w.r.t. the parameters and find the local maxima.
• For the means:
$$\arg\max_{\boldsymbol{\mu}_j} J_{\boldsymbol{\mu}_j}, \qquad J_{\boldsymbol{\mu}_j} = \sum_{i=1}^{N} U_{ij}\Big(-\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top \Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_j)\Big) + \text{const.}$$
$$\frac{\partial J_{\boldsymbol{\mu}_j}}{\partial \boldsymbol{\mu}_j} = \sum_{i=1}^{N} U_{ij}\,\Sigma_j^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_j) = 0
\quad\Longrightarrow\quad
\boldsymbol{\mu}_j = \frac{\sum_{i=1}^{N} U_{ij}\,\mathbf{x}_i}{\sum_{i=1}^{N} U_{ij}}$$
• For the priors:
$$\arg\max_{P_j} J_{P_j}, \qquad J_{P_j} = \sum_{i=1}^{N} U_{ij}\log P_j + \lambda\Big(\sum_{j=1}^{J} P_j - 1\Big) + \text{const.}$$
$$\frac{\partial J_{P_j}}{\partial P_j} = \frac{\sum_{i=1}^{N} U_{ij}}{P_j} + \lambda = 0, \qquad
\frac{\partial J_{P_j}}{\partial \lambda} = \sum_{j=1}^{J} P_j - 1 = 0$$
$$\Longrightarrow\; \sum_{j=1}^{J}\sum_{i=1}^{N} U_{ij} = -\lambda\sum_{j=1}^{J} P_j
\;\Longrightarrow\; -\lambda = \sum_{j=1}^{J}\sum_{i=1}^{N} U_{ij} = N
\;\Longrightarrow\; P_j = \frac{\sum_{i=1}^{N} U_{ij}}{N}$$
• For the covariances (see the Matrix Cookbook, Sections 2.1.2 and 2.2):
$$\Sigma_j = \frac{\sum_i U_{ij}\,(\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^\top}{\sum_i U_{ij}}$$
Gaussian Mixture Model
Expectation Maximization (EM)
E-step: calculate the posterior probabilities
$$U_{ij} = p(j|\mathbf{x}_i) = \frac{P_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)}{\sum_{j'} P_{j'}\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_{j'},\Sigma_{j'})}$$
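Putting the E-step and M-step together, here is a compact NumPy/SciPy sketch of EM for a GMM (the initialization, the fixed iteration count, and the small regularizer added to the covariances are my own choices, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, J=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Simple initialization: random samples as means, shared covariance, uniform priors
    means = X[rng.choice(N, J, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(J)])
    priors = np.full(J, 1.0 / J)

    for _ in range(n_iter):
        # E-step: responsibilities U_ij = p(j | x_i)
        U = np.column_stack([
            priors[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
            for j in range(J)
        ])
        U /= U.sum(axis=1, keepdims=True)

        # M-step: re-estimate priors, means, covariances
        Nj = U.sum(axis=0)                      # effective cluster sizes
        priors = Nj / N
        means = (U.T @ X) / Nj[:, None]
        for j in range(J):
            diff = X - means[j]
            covs[j] = (U[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return priors, means, covs, U
```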
Gaussian Mixture Model
Too much math?
Mixture of Multinomial Distributions
Clustering Musical Notes
How many clusters? (The samples are magnitude spectra.)
I'm curious what kinds of notes are in the signal.
[Figure: magnitude spectrogram of the signal, 0–8000 Hz, 0.3–4.5 s]
Mixture of Multinomial Distributions
Clustering Musical Notes
For the i-th spectrum x_i:
Although the magnitudes are real numbers, we can scale them to convert them into integers:
$$[1.2,\; 1.35,\; 0.1,\; 5.525] = 0.025 \times [48,\; 54,\; 4,\; 221]$$
Then, we can think of each spectrum as an observation from a multinomial distribution:
$$\mathcal{M}(\mathbf{x}_i; \boldsymbol{\theta}) = \frac{N!}{\prod_d x_{id}!}\prod_d \theta_d^{x_{id}}$$
EM for a mixture of multinomial distributions:
Initialize two mean spectra (random numbers): $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \mathbb{R}^D_+$
Initialize two prior probabilities (random numbers that sum to one): $P_1 + P_2 = 1$
Calculate the posterior probabilities (E-step):
$$p(j=1|\mathbf{x}_i) = \frac{P_1\,\mathcal{M}(\mathbf{x}_i;\boldsymbol{\theta}_1)}{P_1\,\mathcal{M}(\mathbf{x}_i;\boldsymbol{\theta}_1) + P_2\,\mathcal{M}(\mathbf{x}_i;\boldsymbol{\theta}_2)}
= \frac{P_1\,\frac{N!}{\prod_d x_{id}!}\prod_d \theta_{1d}^{x_{id}}}{P_1\,\frac{N!}{\prod_d x_{id}!}\prod_d \theta_{1d}^{x_{id}} + P_2\,\frac{N!}{\prod_d x_{id}!}\prod_d \theta_{2d}^{x_{id}}}
= \frac{P_1\prod_d \theta_{1d}^{x_{id}}}{P_1\prod_d \theta_{1d}^{x_{id}} + P_2\prod_d \theta_{2d}^{x_{id}}}$$
(Note: it's usually fine not to convert the spectra into integers, although it's not strictly correct. In this example I just normalized each spectrum.)
Update the means (M-step)
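A sketch of this E-step in code, assuming the spectra are stored as columns of `X` and each mean spectrum sums to one; computing in the log domain (my addition, not on the slide) avoids underflow for long spectra:

```python
import numpy as np

def multinomial_posteriors(X, thetas, priors):
    """E-step for a mixture of multinomials.
    X:      D x N matrix of (normalized) magnitude spectra
    thetas: D x J matrix of mean spectra, each column sums to one
    priors: length-J vector of prior probabilities
    Returns an N x J matrix of posteriors p(j | x_i).
    The multinomial coefficient N!/prod(x_id!) cancels between the
    numerator and the denominator, so it is never computed."""
    # log p(x_i | j) up to the shared coefficient: sum_d x_id * log theta_jd
    log_lik = X.T @ np.log(thetas + 1e-12)            # N x J
    log_post = np.log(priors)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

The M-step (not spelled out on the slide) would then re-estimate each mean spectrum θ_j as the posterior-weighted average of the spectra, renormalized to sum to one, and each prior P_j as the average posterior.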
Mixture of Multinomial Distributions
Clustering Musical Notes
It’s actually a difficult clustering task with a lot of spurious local minima
[Figure: spectrograms and clustering results for two STFT settings, nfft=1024, hop=256 (left) and nfft=4096, hop=512 (right)]
Locality Sensitive Hashing
From clustering to hashing
Hashing is a popular concept in databases
A query comes in to the database
Instead of comparing the original representations, a hash function maps the query down to an integer (binary) address
The address is associated with a bucket, which can contain a few different database records
• We say that those records collide
Then, we refine the search inside the bucket
This is cheaper than scanning the entire database
Traditional challenges:
Records are better off evenly distributed (for speed)
Overflow
Locality Sensitive Hashing
From clustering to hashing
There's another hashing concept in machine learning:
Locality sensitive hashing, or semantic hashing
For data points x_i and x_j in a D-dimensional space:
If they are close enough, $D(\mathbf{x}_i \,\|\, \mathbf{x}_j) < \tau$,
then the Hamming distance between their hash codes is zero, $\mathcal{H}\big(f(\mathbf{x}_i)\,\|\,f(\mathbf{x}_j)\big) = 0$, with probability p
If they are far enough apart, $D(\mathbf{x}_i \,\|\, \mathbf{x}_j) \geq c\tau$,
then the Hamming distance between their hash codes is zero with probability q
p > q!
A hash function f(·) that meets the above conditions is said to belong to a locality sensitive hash function family
In other words, originally similar items:
Collide in the same bucket
Share the same address
Are quantized using the same binary string
Are in the same cluster
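One classic example of such a family is random-hyperplane hashing (SimHash). It is not necessarily the scheme used in the following slides, but it illustrates the idea that nearby points tend to land in the same bucket (a minimal sketch, with names of my own choosing):

```python
import numpy as np

def make_simhash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each bit is the sign of a random projection.
    Points separated by a small angle flip few bits, so they tend to
    share the same binary address."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))
    def hash_fn(x):
        return tuple((planes @ x > 0).astype(int))   # binary address / bucket key
    return hash_fn

# Bucket a toy database and look up a query by its hash code
h = make_simhash(dim=64, n_bits=8)
rng = np.random.default_rng(1)
db = rng.standard_normal((1000, 64))
buckets = {}
for idx, item in enumerate(db):
    buckets.setdefault(h(item), []).append(idx)

query = db[0] + 0.01 * rng.standard_normal(64)   # a slightly perturbed item
candidates = buckets.get(h(query), [])           # refine the search inside this bucket
```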
Locality Sensitive Hashing
From clustering to hashing
How do we find the hash function?
Well, it’s not easy
Let's see how the original spectra are similar to each other.
[Figure: the spectrogram and pairwise similarity matrices computed from the original spectra]
Spectral Hashing
More machine learning involved
Just another hashing technique, but it tries to minimize the difference between the binary codes of similar items, i.e., similar inputs should map to nearby codes y_i, subject to constraints such as
$$\sum_i \mathbf{y}_i = \mathbf{0}, \qquad \mathbf{y}_i^\top \mathbf{y}_j = 0 \;\text{ if } i \neq j$$
[Figure: an example spectrogram and a resulting binary (±1) code sequence]
Weiss, Yair, Antonio Torralba, and Rob Fergus. "Spectral Hashing." Advances in Neural Information Processing Systems, 2009.
Locality Sensitive Hashing
Why is it useful?
For faster detection
Matching hash codes: each database item X_t is hashed to a code x̃_t, and the query Q is hashed to q̃
DB of millions of items
[Figure: detection performance curves (true positives vs. false positive rate in FP/min); hashing: AUC=0.91 vs. HMM: AUC=0.71 on one task, and hashing: AUC=0.69 vs. HMM: AUC=0.21 on another]
Reading
Textbook 6.8 – 6.16
Textbook 2.5.5
Bishop, “Pattern Recognition and Machine Learning” Chapter 9
Thank You!