UNSUPERVISED LEARNING
AND CLUSTERING
Jeff Robble, Brian Renzenbrink, Doug Roberts
Unsupervised Procedures
A procedure that uses unlabeled data in its classification process.
Why would we use these?
Collecting and labeling large data sets can be costly
Occasionally, users wish to group data first and label the
groupings second
In some applications, the pattern characteristics can change
over time. Unsupervised procedures can handle these
situations.
Unsupervised procedures can be used to find useful features for
classification
In some situations, unsupervised learning can provide insight
into the structure of the data that helps in designing a classifier
Unsupervised vs. Supervised
Unsupervised learning can be thought of as finding patterns in the
data above and beyond what would be considered pure
unstructured noise. How does it compare to supervised
learning?
With unsupervised learning it is possible to learn larger and more
complex models than with supervised learning. This is because
in supervised learning one is trying to find the connection
between two sets of observations, while unsupervised learning
tries to identify certain latent variables that caused a single set
of observations.
The difference between supervised and unsupervised learning can be thought of
as the difference between discriminant analysis and cluster analysis.
Mixture Densities
We assume that p(x|ωj) can be represented in a functional form that is
determined by the value of parameter vector θj.
For example, if p(x|ωj) ~ N(µj, Σj), where N denotes a normal (Gaussian)
distribution, then θj consists of the components of µj and Σj, which
characterize the mean and covariance of the distribution.
We need to find the probability of x for a given ωj and θ, but we don’t
know the exact values of the θ components that go into making the
decision. We need to solve:

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

but instead of p(x|ωj) we have p(x|ωj, θj). We can solve for the
mixture density:

p(x | θ) = ∑_{j=1}^{c} p(x | ωj, θj) P(ωj)    (1)
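As a concrete illustration of Equation 1, here is a minimal Python sketch that
evaluates a mixture density with two univariate Gaussian components. The
function names, component parameters, and mixing weights are made-up choices
for this example, not anything prescribed by the text.

# Minimal sketch of Equation 1: a mixture density with Gaussian components.
# The component parameters and mixing weights below are made up for illustration.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_density(x, mus, sigmas, priors):
    """p(x | theta) = sum_j p(x | w_j, theta_j) P(w_j)  (Eq. 1)."""
    return sum(p * gaussian_pdf(x, m, s) for m, s, p in zip(mus, sigmas, priors))

# Two components: theta_1 = (0, 1), theta_2 = (4, 2), with P(w_1) = 0.3, P(w_2) = 0.7.
print(mixture_density(1.0, mus=[0.0, 4.0], sigmas=[1.0, 2.0], priors=[0.3, 0.7]))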
Mixture Densities
p(x | θ) = ∑_{j=1}^{c} p(x | ωj, θj) P(ωj)    (1)

The p(x | ωj, θj) are the component densities and the P(ωj) are the mixing
parameters.
We make the following assumptions:
The samples come from a known number of c classes.
The prior probabilities P(ωj) for each class are known, j = 1…c.
The forms of the class-conditional probability densities p(x|ωj,θj)
are known, j = 1…c.
The values of the c parameter vectors θ1 … θc are unknown.
The category labels are unknown; this is what makes the learning unsupervised.
Consider the following mixture density, where x is binary:

P(x | θ) = (1/2) θ1^x (1 − θ1)^(1−x) + (1/2) θ2^x (1 − θ2)^(1−x)
Identifiability: Estimate Unknown Parameter Vector
P(x | θ) = (1/2) θ1^x (1 − θ1)^(1−x) + (1/2) θ2^x (1 − θ2)^(1−x)
         = (1/2)(θ1 + θ2)        if x = 1
         = 1 − (1/2)(θ1 + θ2)    if x = 0

Suppose we had an unlimited number of samples and used nonparametric
methods to determine p(x|θ), finding that P(x=1|θ) = 0.6 and
P(x=0|θ) = 0.4.
Try to solve for θ1 and θ2:

(1/2)(θ1 + θ2) = 0.6
1 − (1/2)(θ1 + θ2) = 0.4
θ1 + θ2 = 1.2

We discover that the mixture distribution is completely unidentifiable: the
data determine only the sum θ1 + θ2, so we cannot infer the individual
components of θ.

A mixture density p(x|θ) is identifiable if distinct parameter vectors yield
distinct distributions, i.e., if θ ≠ θ′ implies there is some x for which
p(x|θ) ≠ p(x|θ′), so that in principle a unique θ can be recovered.
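The unidentifiability can be checked numerically. The short Python sketch
below uses hypothetical parameter pairs chosen so that θ1 + θ2 = 1.2 and shows
that several different θ vectors produce exactly the same distribution over x.

# A quick numerical check of the unidentifiability example: different
# parameter vectors theta = (theta_1, theta_2) with the same sum produce
# exactly the same binary mixture P(x | theta), so theta cannot be recovered.
def binary_mixture(x, theta1, theta2):
    """P(x|theta) = 0.5*theta1^x(1-theta1)^(1-x) + 0.5*theta2^x(1-theta2)^(1-x)."""
    return 0.5 * theta1**x * (1 - theta1)**(1 - x) + 0.5 * theta2**x * (1 - theta2)**(1 - x)

for theta in [(0.8, 0.4), (0.6, 0.6), (0.9, 0.3)]:   # all satisfy theta1 + theta2 = 1.2
    print(theta, binary_mixture(1, *theta), binary_mixture(0, *theta))
# Every pair prints P(x=1) = 0.6 and P(x=0) = 0.4.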
Maximum Likelihood Estimates
The posterior probability becomes:

P(ωi | xk, θ) = p(xk | ωi, θi) P(ωi) / p(xk | θ)    (6)
We make the following assumptions:
The elements of θi and θj are functionally independent if i ≠ j.
p(D|θ) is a differentiable function of θ, where D = {x1, … , xn} is a set of n
independently drawn unlabeled samples.
The search for a maximum of p(D|θ), extending over θ and P(ωj), is
constrained so that:

P(ωi) ≥ 0,  i = 1,…,c,  and  ∑_{i=1}^{c} P(ωi) = 1
Let P̂(ωi) be the maximum likelihood estimate of P(ωi), and let θ̂i be the
maximum likelihood estimate of θi. If P̂(ωi) ≠ 0 for any i, then:

P̂(ωi) = (1/n) ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (11)
Maximum Likelihood Estimates
P̂(ωi) = (1/n) ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (11)

The MLE of the prior probability of a category is the average, over the entire
data set, of the estimates derived from each sample (weighted equally).

P̂(ωi | xk, θ̂) = [p(xk | ωi, θ̂i) P̂(ωi)] / [∑_{j=1}^{c} p(xk | ωj, θ̂j) P̂(ωj)]    (13)

This is Bayes' theorem. When estimating the probability for ωi, the numerator
depends on θ̂i and not on the full θ̂.
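A minimal Python sketch of Equation 13, computing the posterior class
probabilities for one sample under current parameter estimates. The two
univariate Gaussian components, their parameters, and the priors are invented
for illustration.

# Sketch of Equation 13: posterior class probabilities ("responsibilities")
# for a single sample under current parameter estimates.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posteriors(x, mus, sigmas, priors):
    """P_hat(w_i | x, theta_hat): Bayes' theorem with the mixture in the denominator."""
    numerators = np.array([p * gaussian_pdf(x, m, s)
                           for m, s, p in zip(mus, sigmas, priors)])
    return numerators / numerators.sum()

print(posteriors(1.5, mus=[0.0, 4.0], sigmas=[1.0, 1.0], priors=[0.5, 0.5]))
# The outputs are nonnegative and sum to 1, as required.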
Maximum Likelihood Estimates
The gradient must vanish at the value of θi that maximizes the logarithm of the
likelihood, so the MLE θ̂i must satisfy the following conditions:

∑_{k=1}^{n} P̂(ωi | xk, θ̂) ∇_{θi} ln p(xk | ωi, θ̂i) = 0,   i = 1,…,c    (12)

Consider one sample, so n = 1. Since we assumed P̂(ωi) ≠ 0, the probability
is maximized as a function of θi when ∇_{θi} ln p(xk | ωi, θ̂i) = 0. Note that
ln(1) = 0, so we are trying to find a value of θ̂i that maximizes p(·).
Applying MLE to Normal Mixtures
Case 1: The only unknown quantities are the mean vectors µ1, …, µc.
Here θi consists of the components of µi.
The log-likelihood of a particular sample is

ln p(xk | ωi, µi) = ln[(2π)^(−d/2) |Σi|^(−1/2)] − (1/2)(xk − µi)ᵗ Σi⁻¹ (xk − µi)

and its derivative with respect to µi is

∇_{µi} ln p(xk | ωi, µi) = Σi⁻¹ (xk − µi)

Thus, according to Equation 8 in the book, the MLE estimate µ̂i
must satisfy:

∑_{k=1}^{n} P̂(ωi | xk, µ̂) Σi⁻¹ (xk − µ̂i) = 0

where P̂(ωi | xk, µ̂) is the posterior of Equation 13 evaluated at the current
estimate µ̂.
Applying MLE to Normal Mixtures
If we multiply the above equation by the covariance matrix Σi
and rearrange terms, we obtain the equation for the
maximum likelihood estimate of the mean vector:

µ̂i = ∑_{k=1}^{n} P̂(ωi | xk, µ̂) xk / ∑_{k=1}^{n} P̂(ωi | xk, µ̂)

However, P̂(ωi | xk, µ̂) itself depends on µ̂, so we cannot calculate µ̂i
explicitly. If we have a good initial estimate µ̂i(0), we can use a hill-climbing
(iterative) procedure to improve our estimates.
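A minimal sketch of this hill-climbing idea for Case 1, assuming univariate
data, unit-variance components, and equal priors (assumptions made only to
keep the example short; the function name and data are also illustrative). It
repeatedly recomputes the posteriors and then applies the weighted-mean update.

# A minimal sketch of the Case 1 iteration: only the means are unknown;
# the covariances and priors are held fixed (here: unit variance and equal
# priors, which are assumptions of this example, not of the derivation).
import numpy as np

def update_means(X, mus, n_iters=50):
    """Iterate mu_i <- sum_k P(w_i|x_k, mu) x_k / sum_k P(w_i|x_k, mu)."""
    mus = np.array(mus, dtype=float)
    for _ in range(n_iters):
        # Unnormalized posteriors with N(mu_i, 1) components and equal priors.
        dist2 = (X[:, None] - mus[None, :]) ** 2          # shape (n, c)
        post = np.exp(-0.5 * dist2)
        post /= post.sum(axis=1, keepdims=True)           # Eq. 13
        mus = (post * X[:, None]).sum(axis=0) / post.sum(axis=0)
    return mus

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(update_means(X, mus=[0.0, 1.0]))   # should move toward roughly -2 and 3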
Applying MLE to Normal Mixtures
Case 2: The mean vectors µi, the covariance matrices Σi,
and the prior probabilities P(ωi) are all unknown.
In this case the maximum likelihood principle yields singular
solutions. Usually, singular solutions are unusable. However,
if we restrict our attention to the largest of the finite local
maxima of the likelihood function, we can still find
meaningful results.
Using P̂(ωi), µ̂i, and Σ̂i derived from Equations 11-13, we
can find the likelihood of xk using the multivariate normal form:

ln p(xk | ωi, θi) = −(1/2) ln[(2π)^d |Σi|] − (1/2)(xk − µi)ᵗ Σi⁻¹ (xk − µi)
Applying MLE to Normal Mixtures
Differentiating the previous equation with respect to the elements of µi and
Σi gives the partial derivatives used in the maximization, where δpq is the
Kronecker delta, xp(k) is the pth element of xk, µp(i) is the pth element of
µi, σpq(i) is the pqth element of Σi, and σ^pq(i) is the pqth element of Σi⁻¹.
Applying MLE to Normal Mixtures
Using the above differentiation along with Equation 12, we can
find the following equations for the MLE of P(ωi), µi,
and Σi:

P̂(ωi) = (1/n) ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (24)

µ̂i = ∑_{k=1}^{n} P̂(ωi | xk, θ̂) xk / ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (25)

Σ̂i = ∑_{k=1}^{n} P̂(ωi | xk, θ̂) (xk − µ̂i)(xk − µ̂i)ᵗ / ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (26)
Applying MLE to Normal Mixtures
These equations hold where:

P̂(ωi | xk, θ̂) = [p(xk | ωi, θ̂i) P̂(ωi)] / [∑_{j=1}^{c} p(xk | ωj, θ̂j) P̂(ωj)]    (27)

To solve for the MLE, we again start with an initial estimate, use it to
evaluate Equation 27, and then use Equations 24-26 to update the estimates.
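A compact sketch of the resulting iteration for Case 2, assuming univariate
data and a simple initialization (the function names and synthetic data are
illustrative only): evaluate the posteriors of Equation 27, then update the
priors, means, and variances in the spirit of Equations 24-26.

# A compact sketch of the Case 2 iteration (Equations 24-27): start from an
# initial guess, evaluate the posteriors of Eq. 27, then update priors, means,
# and covariances (here 1-D variances) with Eqs. 24-26.
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_mixture(X, c=2, n_iters=100):
    n = len(X)
    priors = np.full(c, 1.0 / c)
    mus = np.linspace(X.min(), X.max(), c)
    vars_ = np.full(c, X.var())
    for _ in range(n_iters):
        # Eq. 27: posteriors P_hat(w_i | x_k, theta_hat).
        post = np.array([p * normal_pdf(X, m, v)
                         for p, m, v in zip(priors, mus, vars_)]).T
        post /= post.sum(axis=1, keepdims=True)
        weight = post.sum(axis=0)                          # sum_k P_hat(w_i | x_k)
        priors = weight / n                                # Eq. 24
        mus = (post * X[:, None]).sum(axis=0) / weight     # Eq. 25
        vars_ = (post * (X[:, None] - mus) ** 2).sum(axis=0) / weight   # Eq. 26 (1-D)
    return priors, mus, vars_

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 300)])
print(fit_mixture(X))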
k-Means Clustering
Clusters numerical data in which each cluster has a center
called the mean
The number of clusters c is assumed to be fixed
The goal of the algorithm is to find the c mean vectors µ1,
µ2, …, µc
The number of clusters c
• May be guessed
• Assigned based on the final application
k-Means Clustering
The following pseudocode shows the basic functionality of the k-Means
algorithm:
begin initialize n, c, µ1, µ2, …, µc
do classify n samples according to nearest µi
recompute µi
until no change in µi
return µ1, µ2, …, µc
end
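A minimal NumPy sketch of the pseudocode above; the synthetic two-dimensional
data, the choice c = 3, and the random-sample initialization are illustrative
assumptions rather than part of the algorithm's specification.

# A minimal NumPy sketch of the k-Means pseudocode above.
import numpy as np

def k_means(X, c, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=c, replace=False)]      # initialize mu_1..mu_c
    for _ in range(n_iters):
        # Classify each sample according to the nearest mean.
        labels = np.argmin(((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Recompute each mean; keep the old mean if a cluster loses all its points.
        new_mus = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mus[i]
                            for i in range(c)])
        if np.allclose(new_mus, mus):                        # no change in mu_i
            break
        mus = new_mus
    return mus, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ((0, 0), (4, 0), (2, 3))])
mus, labels = k_means(X, c=3)
print(mus)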
k-Means Clustering
Two-dimensional example with c = 3 clusters.
Shows the initial cluster centers and their associated Voronoi tessellation.
Each of the three Voronoi cells is used to calculate a new cluster center.
Fuzzy k-Means
The algorithm assumes that each sample xj has a fuzzy membership in one or
more clusters.
The algorithm seeks a minimum of a heuristic global cost function:

J_fuz = ∑_{i=1}^{c} ∑_{j=1}^{n} [P̂(ωi | xj)]^b ||xj − µi||²

Where:
b is a free parameter chosen to adjust the “blending” of clusters
b > 1 allows each pattern to belong to multiple clusters (fuzziness)
Fuzzy k-Means
Probabilities of cluster membership for each point are normalized as:

∑_{i=1}^{c} P̂(ωi | xj) = 1,   j = 1,…,n    (30)

Cluster centers are calculated using Eq. 32:

µi = ∑_{j=1}^{n} [P̂(ωi | xj)]^b xj / ∑_{j=1}^{n} [P̂(ωi | xj)]^b    (32)

and memberships are recomputed using Eq. 33:

P̂(ωi | xj) = (1/dij)^(1/(b−1)) / ∑_{r=1}^{c} (1/drj)^(1/(b−1))    (33)

Where: dij = ||xj − µi||²
Fuzzy k-Means
The following is the pseudocode for the Fuzzy k-Means algorithm:
begin initialize n, c, b, µ1, …, µc, P̂(ωi|xj), i = 1,…,c; j = 1,…,n
  normalize P̂(ωi|xj) by Eq. 30
  do recompute µi by Eq. 32
     recompute P̂(ωi|xj) by Eq. 33
  until small change in µi and P̂(ωi|xj)
  return µ1, µ2, …, µc
end
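A minimal NumPy sketch of the loop above, assuming the membership-weighted
center update (Eq. 32) and the inverse-squared-distance membership update
(Eq. 33) given earlier; the choice b = 2, the synthetic data, and the random
initialization are illustrative assumptions.

# A minimal sketch of the Fuzzy k-Means loop above.
import numpy as np

def fuzzy_k_means(X, c, b=2.0, n_iters=100, seed=0, eps=1e-9):
    rng = np.random.default_rng(seed)
    P = rng.random((len(X), c))
    P /= P.sum(axis=1, keepdims=True)                  # normalize memberships (Eq. 30)
    for _ in range(n_iters):
        W = P ** b
        mus = (W.T @ X) / W.sum(axis=0)[:, None]       # Eq. 32: weighted cluster centers
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2) + eps
        P_new = (1.0 / d2) ** (1.0 / (b - 1.0))
        P_new /= P_new.sum(axis=1, keepdims=True)      # Eq. 33: graded memberships
        if np.allclose(P_new, P, atol=1e-5):           # small change in memberships
            break
        P = P_new
    return mus, P

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc, 0.4, size=(80, 2)) for loc in ((0, 0), (3, 3))])
mus, P = fuzzy_k_means(X, c=2)
print(mus)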
Fuzzy k-means
Illustrates the progress of the algorithm.
Means lie near the center of the data during the first iteration, since each
point has non-negligible “membership” in every cluster.
Points near the cluster boundaries can have membership in more than one
cluster.
x-Means
In k-Means the number of clusters is chosen before the algorithm
is applied
In x-Means the Bayesian information criterion (BIC) is used
globally and locally to find the best number of clusters k
BIC is used globally to choose the best model it encounters and
locally to guide all centroid splits
x-Means
The algorithm is supplied:
A data set D = {x1, x2, …, xn} containing n objects in d-dimensional
space
A set of alternative models Mj = {C1, C2, …, Ck} which correspond
to solutions with different values of k
Posterior probabilities P(Mj | D) are used to score the models
x-Means
The BIC is defined as:

BIC(Mj) = l̂j(D) − (pj/2) · log n

Where:
l̂j(D) is the loglikelihood of D according to the jth model, taken at
the maximum likelihood point
pj is the number of parameters in Mj
Under the identical spherical Gaussian assumption, the maximum likelihood
estimate of the variance is:

σ̂² = (1/(n − k)) ∑_{i=1}^{n} ||xi − µ(i)||²

Where µ(i) is the centroid associated with xi
x-Means
The point probabilities are:

P̂(xi) = (R(i)/n) · (2π σ̂²)^(−d/2) · exp(−||xi − µ(i)||² / (2σ̂²))

where R(i) is the number of points assigned to the centroid associated with xi.
Finally, the loglikelihood of the data is:

l(D) = ∑_{i=1}^{n} log P̂(xi)
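A sketch of this BIC score for a given centroid set and assignment, under the
spherical-Gaussian assumptions sketched above. The parameter count, function
name, and synthetic usage data are assumptions of this example, so treat it as
an illustration of the scoring step rather than a definitive implementation.

# Sketch of BIC scoring for a k-means solution: BIC = loglik - (p/2) * log n,
# with a pooled spherical variance estimate sigma^2 = sum_i ||x_i - mu(i)||^2 / (n - k).
import numpy as np

def bic_score(X, mus, labels):
    n, d = X.shape
    k = len(mus)
    resid = X - mus[labels]                                  # x_i - mu(i)
    sigma2 = (resid ** 2).sum() / (n - k)                    # pooled variance estimate
    counts = np.bincount(labels, minlength=k)                # points per centroid, R(i)
    # log P_hat(x_i): cluster weight times spherical Gaussian density at x_i.
    loglik = (np.log(counts[labels] / n)
              - 0.5 * d * np.log(2.0 * np.pi * sigma2)
              - 0.5 * (resid ** 2).sum(axis=1) / sigma2).sum()
    p = (k - 1) + k * d + 1            # (k-1) weights + k*d centroid coords + 1 variance
    return loglik - 0.5 * p * np.log(n)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.5, size=(60, 2)) for m in ((0, 0), (4, 4))])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = np.argmin(((X[:, None] - mus[None]) ** 2).sum(axis=2), axis=1)
print(bic_score(X, mus, labels))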
x-Means
Basic functionality of the algorithm
Given a range for k, [kmin, kmax]
Start with k = kmin
Continue to add centroids as needed until kmax is reached
Centroids are added by splitting some centroids in two according to
BIC
The centroid set with the best score is used as the final output
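A simplified sketch of the outer search described above: it keeps only the
"score each candidate centroid set with BIC and return the best" idea,
rerunning scikit-learn's KMeans for each k instead of locally splitting
centroids as the real x-Means does. The bic_score helper and the synthetic
data are assumptions of this example.

# Simplified split-and-score loop in the spirit of x-Means (not a faithful
# reimplementation): try increasing k, score each solution with BIC, keep the best.
import numpy as np
from sklearn.cluster import KMeans

def bic_score(X, mus, labels):
    n, d = X.shape
    k = len(mus)
    resid = X - mus[labels]
    sigma2 = (resid ** 2).sum() / max(n - k, 1)
    counts = np.bincount(labels, minlength=k)
    loglik = (np.log(counts[labels] / n)
              - 0.5 * d * np.log(2.0 * np.pi * sigma2)
              - 0.5 * (resid ** 2).sum(axis=1) / sigma2).sum()
    return loglik - 0.5 * ((k - 1) + k * d + 1) * np.log(n)

def x_means_like(X, k_min=1, k_max=10):
    best_k, best_bic, best_model = None, -np.inf, None
    for k in range(k_min, k_max + 1):                 # grow the number of centroids
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        bic = bic_score(X, model.cluster_centers_, model.labels_)
        if bic > best_bic:                            # keep the best-scoring centroid set
            best_k, best_bic, best_model = k, bic, model
    return best_k, best_model

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.4, size=(80, 2)) for m in ((0, 0), (3, 0), (0, 3))])
print(x_means_like(X)[0])      # ideally picks k = 3 for this three-blob data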
References
Duda, R., Hart, P., and Stork, D. Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
Gan, G., Ma, C., and Wu, J. Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2007.
Samet, H. K-Nearest Neighbor Finding Using MaxNearestDist. IEEE Trans. Pattern Anal. Mach. Intell. 30, 2 (Feb. 2008), 243-252.
Qiao, Y.-L., Pan, J.-S., and Sun, S.-H. Improved Partial Distance Search for k Nearest-Neighbor Classification. IEEE International Conference on Multimedia and Expo, June 2004, 1275-1278.
Ghahramani, Z. Unsupervised Learning. Advanced Lectures on Machine Learning, 2003, 72-112.