UNSUPERVISED LEARNING
AND CLUSTERING
Jeff Robble, Brian Renzenbrink, Doug Roberts
Unsupervised Procedures
A procedure that uses unlabeled data in its classification process.
Why would we use these?
Collecting and labeling large data sets can be costly
Occasionally, users wish to group data first and label the
groupings second
In some applications, the pattern characteristics can change
over time. Unsupervised procedures can handle these
situations.
Unsupervised procedures can be used to find useful features for
classification
In some situations, unsupervised learning can provide insight
into the structure of the data that helps in designing a classifier
Unsupervised vs. Supervised
Unsupervised learning can be thought of as finding patterns in the
data above and beyond what would be considered pure
unstructured noise. How does it compare to supervised
learning?
With unsupervised learning it is possible to learn larger and more
complex models than with supervised learning. This is because
in supervised learning one is trying to find the connection
between two sets of observations, while unsupervised learning
tries to identify certain latent variables that caused a single set
of observations.
The difference between supervised and unsupervised learning can be thought of
as the difference between discriminant analysis and cluster analysis.
Mixture Densities
We assume that p(x|ωj) can be represented in a functional form that is
determined by the value of parameter vector θj.
For example, if p(x|ωj) ~ N(µj, Σj), where N denotes a normal (Gaussian)
distribution, then θj consists of the components of µj and Σj, which
characterize the mean and covariance of the distribution.
We need to find the probability of x for a given ωj and θ, but we don’t
know the exact values of the θ components that go into making the
decision. We need to solve:

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

but instead of p(x|ωj) we have p(x|ωj, θj). We can solve for the
mixture density:

p(x | θ) = ∑_{j=1}^{c} p(x | ωj, θj) P(ωj)    (1)
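As a concrete illustration of Equation 1, here is a minimal Python sketch that
evaluates a mixture density with two univariate Gaussian components. The
function names, component parameters, and mixing weights are made-up choices
for this example, not anything prescribed by the text.

# Minimal sketch of Equation 1: a mixture density with Gaussian components.
# The component parameters and mixing weights below are made up for illustration.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_density(x, mus, sigmas, priors):
    """p(x | theta) = sum_j p(x | w_j, theta_j) P(w_j)  (Eq. 1)."""
    return sum(p * gaussian_pdf(x, m, s) for m, s, p in zip(mus, sigmas, priors))

# Two components: theta_1 = (0, 1), theta_2 = (4, 2), with P(w_1) = 0.3, P(w_2) = 0.7.
print(mixture_density(1.0, mus=[0.0, 4.0], sigmas=[1.0, 2.0], priors=[0.3, 0.7]))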
Mixture Densities
p(x | θ) = ∑_{j=1}^{c} p(x | ωj, θj) P(ωj)    (1)

The p(x | ωj, θj) are the component densities and the P(ωj) are the mixing
parameters.
We make the following assumptions:
The samples come from a known number of c classes.
The prior probabilities P(ωj) for each class are known, j = 1…c.
The forms of the class-conditional probability densities p(x|ωj,θj)
are known, j = 1…c.
The values of the c parameter vectors θ1 … θc are unknown.
The category labels are unknown; this is what makes the learning unsupervised.
Consider the following mixture density, where x is binary:

P(x | θ) = (1/2) θ1^x (1 − θ1)^(1−x) + (1/2) θ2^x (1 − θ2)^(1−x)
Identifiability: Estimate Unknown Parameter Vector
P(x | θ) = (1/2) θ1^x (1 − θ1)^(1−x) + (1/2) θ2^x (1 − θ2)^(1−x)
         = (1/2)(θ1 + θ2)        if x = 1
         = 1 − (1/2)(θ1 + θ2)    if x = 0

Suppose we had an unlimited number of samples and used nonparametric
methods to determine p(x|θ), finding that P(x=1|θ) = 0.6 and
P(x=0|θ) = 0.4.
Try to solve for θ1 and θ2:

(1/2)(θ1 + θ2) = 0.6
1 − (1/2)(θ1 + θ2) = 0.4
θ1 + θ2 = 1.2

We discover that the mixture distribution is completely unidentifiable: the
data determine only the sum θ1 + θ2, so we cannot infer the individual
components of θ.

A mixture density p(x|θ) is identifiable if distinct parameter vectors yield
distinct distributions, i.e., if θ ≠ θ′ implies there is some x for which
p(x|θ) ≠ p(x|θ′), so that in principle a unique θ can be recovered.
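The unidentifiability can be checked numerically. The short Python sketch
below uses hypothetical parameter pairs chosen so that θ1 + θ2 = 1.2 and shows
that several different θ vectors produce exactly the same distribution over x.

# A quick numerical check of the unidentifiability example: different
# parameter vectors theta = (theta_1, theta_2) with the same sum produce
# exactly the same binary mixture P(x | theta), so theta cannot be recovered.
def binary_mixture(x, theta1, theta2):
    """P(x|theta) = 0.5*theta1^x(1-theta1)^(1-x) + 0.5*theta2^x(1-theta2)^(1-x)."""
    return 0.5 * theta1**x * (1 - theta1)**(1 - x) + 0.5 * theta2**x * (1 - theta2)**(1 - x)

for theta in [(0.8, 0.4), (0.6, 0.6), (0.9, 0.3)]:   # all satisfy theta1 + theta2 = 1.2
    print(theta, binary_mixture(1, *theta), binary_mixture(0, *theta))
# Every pair prints P(x=1) = 0.6 and P(x=0) = 0.4.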
Maximum Likelihood Estimates
The posterior probability becomes:

P(ωi | xk, θ) = p(xk | ωi, θi) P(ωi) / p(xk | θ)    (6)
We make the following assumptions:
The elements of θi and θj are functionally independent if i ≠ j.
p(D|θ) is a differentiable function of θ, where D = {x1, … , xn} is a set of n
independently drawn unlabeled samples.
The search for a maximum of p(D|θ), extending over θ and P(ωj), is
constrained so that:

P(ωi) ≥ 0,  i = 1,…,c,  and  ∑_{i=1}^{c} P(ωi) = 1
Let P̂(ωi) be the maximum likelihood estimate of P(ωi), and let θ̂i be the
maximum likelihood estimate of θi. If P̂(ωi) ≠ 0 for any i, then:

P̂(ωi) = (1/n) ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (11)
Maximum Likelihood Estimates
P̂(ωi) = (1/n) ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (11)

The MLE of the prior probability of a category is the average, over the entire
data set, of the estimates derived from each sample (weighted equally).

P̂(ωi | xk, θ̂) = [p(xk | ωi, θ̂i) P̂(ωi)] / [∑_{j=1}^{c} p(xk | ωj, θ̂j) P̂(ωj)]    (13)

This is Bayes' theorem. When estimating the probability for ωi, the numerator
depends on θ̂i and not on the full θ̂.
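A minimal Python sketch of Equation 13, computing the posterior class
probabilities for one sample under current parameter estimates. The two
univariate Gaussian components, their parameters, and the priors are invented
for illustration.

# Sketch of Equation 13: posterior class probabilities ("responsibilities")
# for a single sample under current parameter estimates.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posteriors(x, mus, sigmas, priors):
    """P_hat(w_i | x, theta_hat): Bayes' theorem with the mixture in the denominator."""
    numerators = np.array([p * gaussian_pdf(x, m, s)
                           for m, s, p in zip(mus, sigmas, priors)])
    return numerators / numerators.sum()

print(posteriors(1.5, mus=[0.0, 4.0], sigmas=[1.0, 1.0], priors=[0.5, 0.5]))
# The outputs are nonnegative and sum to 1, as required.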
Maximum Likelihood Estimates
The gradient must vanish at the value of θi that maximizes the logarithm of the
likelihood, so the MLE θ̂i must satisfy the following conditions:

∑_{k=1}^{n} P̂(ωi | xk, θ̂) ∇_{θi} ln p(xk | ωi, θ̂i) = 0,   i = 1,…,c    (12)

Consider one sample, so n = 1. Since we assumed P̂(ωi) ≠ 0, the probability
is maximized as a function of θi when ∇_{θi} ln p(xk | ωi, θ̂i) = 0. Note that
ln(1) = 0, so we are trying to find a value of θ̂i that maximizes p(·).
Applying MLE to Normal Mixtures
Case 1: The only unknown quantities are the mean vectors µ1, …, µc.
Here θi consists of the components of µi.
The log-likelihood of a particular sample is

ln p(xk | ωi, µi) = ln[(2π)^(−d/2) |Σi|^(−1/2)] − (1/2)(xk − µi)ᵗ Σi⁻¹ (xk − µi)

and its derivative with respect to µi is

∇_{µi} ln p(xk | ωi, µi) = Σi⁻¹ (xk − µi)

Thus, according to Equation 8 in the book, the MLE estimate µ̂i
must satisfy:

∑_{k=1}^{n} P̂(ωi | xk, µ̂) Σi⁻¹ (xk − µ̂i) = 0

where P̂(ωi | xk, µ̂) is the posterior of Equation 13 evaluated at the current
estimate µ̂.
Applying MLE to Normal Mixtures
If we multiply the above equation by the covariance matrix Σi
and rearrange terms, we obtain the equation for the
maximum likelihood estimate of the mean vector:

µ̂i = ∑_{k=1}^{n} P̂(ωi | xk, µ̂) xk / ∑_{k=1}^{n} P̂(ωi | xk, µ̂)

However, P̂(ωi | xk, µ̂) itself depends on µ̂, so we cannot calculate µ̂i
explicitly. If we have a good initial estimate µ̂i(0), we can use a hill-climbing
(iterative) procedure to improve our estimates.
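A minimal sketch of this hill-climbing idea for Case 1, assuming univariate
data, unit-variance components, and equal priors (assumptions made only to
keep the example short; the function name and data are also illustrative). It
repeatedly recomputes the posteriors and then applies the weighted-mean update.

# A minimal sketch of the Case 1 iteration: only the means are unknown;
# the covariances and priors are held fixed (here: unit variance and equal
# priors, which are assumptions of this example, not of the derivation).
import numpy as np

def update_means(X, mus, n_iters=50):
    """Iterate mu_i <- sum_k P(w_i|x_k, mu) x_k / sum_k P(w_i|x_k, mu)."""
    mus = np.array(mus, dtype=float)
    for _ in range(n_iters):
        # Unnormalized posteriors with N(mu_i, 1) components and equal priors.
        dist2 = (X[:, None] - mus[None, :]) ** 2          # shape (n, c)
        post = np.exp(-0.5 * dist2)
        post /= post.sum(axis=1, keepdims=True)           # Eq. 13
        mus = (post * X[:, None]).sum(axis=0) / post.sum(axis=0)
    return mus

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(update_means(X, mus=[0.0, 1.0]))   # should move toward roughly -2 and 3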
Applying MLE to Normal Mixtures
Case 2: The mean vectors µi, the covariance matrices Σi,
and the prior probabilities P(ωi) are all unknown.
In this case the maximum likelihood principle yields singular
solutions. Usually, singular solutions are unusable. However,
if we restrict our attention to the largest of the finite local
maxima of the likelihood function, we can still find
meaningful results.
Using P̂(ωi), µ̂i, and Σ̂i derived from Equations 11-13, we
can find the likelihood of xk using the multivariate normal form:

ln p(xk | ωi, θi) = −(1/2) ln[(2π)^d |Σi|] − (1/2)(xk − µi)ᵗ Σi⁻¹ (xk − µi)
Applying MLE to Normal Mixtures
Differentiating the previous equation with respect to the elements of µi and
Σi gives the partial derivatives used in the maximization, where δpq is the
Kronecker delta, xp(k) is the pth element of xk, µp(i) is the pth element of
µi, σpq(i) is the pqth element of Σi, and σ^pq(i) is the pqth element of Σi⁻¹.
Applying MLE to Normal Mixtures
Using the above differentiation along with Equation 12, we can
find the following equations for the MLE of P(ωi), µi,
and Σi:

P̂(ωi) = (1/n) ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (24)

µ̂i = ∑_{k=1}^{n} P̂(ωi | xk, θ̂) xk / ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (25)

Σ̂i = ∑_{k=1}^{n} P̂(ωi | xk, θ̂) (xk − µ̂i)(xk − µ̂i)ᵗ / ∑_{k=1}^{n} P̂(ωi | xk, θ̂)    (26)
Applying MLE to Normal Mixtures
These equations hold where:

P̂(ωi | xk, θ̂) = [p(xk | ωi, θ̂i) P̂(ωi)] / [∑_{j=1}^{c} p(xk | ωj, θ̂j) P̂(ωj)]    (27)

To solve for the MLE, we again start with an initial estimate, use it to
evaluate Equation 27, and then use Equations 24-26 to update the estimates.
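A compact sketch of the resulting iteration for Case 2, assuming univariate
data and a simple initialization (the function names and synthetic data are
illustrative only): evaluate the posteriors of Equation 27, then update the
priors, means, and variances in the spirit of Equations 24-26.

# A compact sketch of the Case 2 iteration (Equations 24-27): start from an
# initial guess, evaluate the posteriors of Eq. 27, then update priors, means,
# and covariances (here 1-D variances) with Eqs. 24-26.
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_mixture(X, c=2, n_iters=100):
    n = len(X)
    priors = np.full(c, 1.0 / c)
    mus = np.linspace(X.min(), X.max(), c)
    vars_ = np.full(c, X.var())
    for _ in range(n_iters):
        # Eq. 27: posteriors P_hat(w_i | x_k, theta_hat).
        post = np.array([p * normal_pdf(X, m, v)
                         for p, m, v in zip(priors, mus, vars_)]).T
        post /= post.sum(axis=1, keepdims=True)
        weight = post.sum(axis=0)                          # sum_k P_hat(w_i | x_k)
        priors = weight / n                                # Eq. 24
        mus = (post * X[:, None]).sum(axis=0) / weight     # Eq. 25
        vars_ = (post * (X[:, None] - mus) ** 2).sum(axis=0) / weight   # Eq. 26 (1-D)
    return priors, mus, vars_

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 300)])
print(fit_mixture(X))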
k-Means Clustering
Clusters numerical data in which each cluster has a center
called the mean
The number of clusters c is assumed to be fixed
The goal of the algorithm is to find the c mean vectors µ1,
µ2, …, µc
The number of clusters c
• May be guessed
• Assigned based on the final application
k-Means Clustering
The following pseudocode shows the basic functionality of the k-Means
algorithm:
begin initialize n, c, µ1, µ2, …, µc
do classify n samples according to nearest µi
recompute µi
until no change in µi
return µ1, µ2, …, µc
end
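A minimal NumPy sketch of the pseudocode above; the synthetic two-dimensional
data, the choice c = 3, and the random-sample initialization are illustrative
assumptions rather than part of the algorithm's specification.

# A minimal NumPy sketch of the k-Means pseudocode above.
import numpy as np

def k_means(X, c, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=c, replace=False)]      # initialize mu_1..mu_c
    for _ in range(n_iters):
        # Classify each sample according to the nearest mean.
        labels = np.argmin(((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Recompute each mean; keep the old mean if a cluster loses all its points.
        new_mus = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mus[i]
                            for i in range(c)])
        if np.allclose(new_mus, mus):                        # no change in mu_i
            break
        mus = new_mus
    return mus, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ((0, 0), (4, 0), (2, 3))])
mus, labels = k_means(X, c=3)
print(mus)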
k-Means Clustering
Two-dimensional example with c = 3 clusters.
Shows the initial cluster centers and their associated Voronoi tessellation.
Each of the three Voronoi cells is used to calculate a new cluster center.
Fuzzy k-Means
The algorithm assumes that each sample xj has a fuzzy membership in one or
more clusters.
The algorithm seeks a minimum of a heuristic global cost function:

J_fuz = ∑_{i=1}^{c} ∑_{j=1}^{n} [P̂(ωi | xj)]^b ||xj − µi||²

Where:
b is a free parameter chosen to adjust the “blending” of clusters
b > 1 allows each pattern to belong to multiple clusters (fuzziness)
Fuzzy k-Means
Probabilities of cluster membership for each point are normalized as:

∑_{i=1}^{c} P̂(ωi | xj) = 1,   j = 1,…,n    (30)

Cluster centers are calculated using Eq. 32:

µi = ∑_{j=1}^{n} [P̂(ωi | xj)]^b xj / ∑_{j=1}^{n} [P̂(ωi | xj)]^b    (32)

and memberships are recomputed using Eq. 33:

P̂(ωi | xj) = (1/dij)^(1/(b−1)) / ∑_{r=1}^{c} (1/drj)^(1/(b−1))    (33)

Where: dij = ||xj − µi||²
Fuzzy k-Means
The following is the pseudocode for the Fuzzy k-Means algorithm:
begin initialize n, c, b, µ1, …, µc, P̂(ωi|xj), i = 1,…,c; j = 1,…,n
  normalize P̂(ωi|xj) by Eq. 30
  do recompute µi by Eq. 32
     recompute P̂(ωi|xj) by Eq. 33
  until small change in µi and P̂(ωi|xj)
  return µ1, µ2, …, µc
end
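A minimal NumPy sketch of the loop above, assuming the membership-weighted
center update (Eq. 32) and the inverse-squared-distance membership update
(Eq. 33) given earlier; the choice b = 2, the synthetic data, and the random
initialization are illustrative assumptions.

# A minimal sketch of the Fuzzy k-Means loop above.
import numpy as np

def fuzzy_k_means(X, c, b=2.0, n_iters=100, seed=0, eps=1e-9):
    rng = np.random.default_rng(seed)
    P = rng.random((len(X), c))
    P /= P.sum(axis=1, keepdims=True)                  # normalize memberships (Eq. 30)
    for _ in range(n_iters):
        W = P ** b
        mus = (W.T @ X) / W.sum(axis=0)[:, None]       # Eq. 32: weighted cluster centers
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2) + eps
        P_new = (1.0 / d2) ** (1.0 / (b - 1.0))
        P_new /= P_new.sum(axis=1, keepdims=True)      # Eq. 33: graded memberships
        if np.allclose(P_new, P, atol=1e-5):           # small change in memberships
            break
        P = P_new
    return mus, P

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc, 0.4, size=(80, 2)) for loc in ((0, 0), (3, 3))])
mus, P = fuzzy_k_means(X, c=2)
print(mus)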
Fuzzy k-means
Illustrates the progress of the algorithm.
Means lie near the center of the data during the first iteration, since each
point has non-negligible “membership” in every cluster.
Points near the cluster boundaries can have membership in more than one
cluster.
x-Means
In k-Means the number of clusters is chosen before the algorithm
is applied
In x-Means the Bayesian information criterion (BIC) is used
globally and locally to find the best number of clusters k
BIC is used globally to choose the best model it encounters and
locally to guide all centroid splits
x-Means
The algorithm is supplied:
A data set D = {x1, x2, …, xn} containing n objects in d-dimensional
space
A set of alternative models Mj = {C1, C2, …, Ck} which correspond
to solutions with different values of k
Posterior probabilities P(Mj | D) are used to score the models
x-Means
The BIC is defined as:

BIC(Mj) = l̂j(D) − (pj/2) · log n

Where:
l̂j(D) is the loglikelihood of D according to the jth model, taken at
the maximum likelihood point
pj is the number of parameters in Mj
Under the identical spherical Gaussian assumption, the maximum likelihood
estimate of the variance is:

σ̂² = (1/(n − k)) ∑_{i=1}^{n} ||xi − µ(i)||²

Where µ(i) is the centroid associated with xi
x-Means
The point probabilities are:

P̂(xi) = (R(i)/n) · (2π σ̂²)^(−d/2) · exp(−||xi − µ(i)||² / (2σ̂²))

where R(i) is the number of points assigned to the centroid associated with xi.
Finally, the loglikelihood of the data is:

l(D) = ∑_{i=1}^{n} log P̂(xi)
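A sketch of this BIC score for a given centroid set and assignment, under the
spherical-Gaussian assumptions sketched above. The parameter count, function
name, and synthetic usage data are assumptions of this example, so treat it as
an illustration of the scoring step rather than a definitive implementation.

# Sketch of BIC scoring for a k-means solution: BIC = loglik - (p/2) * log n,
# with a pooled spherical variance estimate sigma^2 = sum_i ||x_i - mu(i)||^2 / (n - k).
import numpy as np

def bic_score(X, mus, labels):
    n, d = X.shape
    k = len(mus)
    resid = X - mus[labels]                                  # x_i - mu(i)
    sigma2 = (resid ** 2).sum() / (n - k)                    # pooled variance estimate
    counts = np.bincount(labels, minlength=k)                # points per centroid, R(i)
    # log P_hat(x_i): cluster weight times spherical Gaussian density at x_i.
    loglik = (np.log(counts[labels] / n)
              - 0.5 * d * np.log(2.0 * np.pi * sigma2)
              - 0.5 * (resid ** 2).sum(axis=1) / sigma2).sum()
    p = (k - 1) + k * d + 1            # (k-1) weights + k*d centroid coords + 1 variance
    return loglik - 0.5 * p * np.log(n)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.5, size=(60, 2)) for m in ((0, 0), (4, 4))])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = np.argmin(((X[:, None] - mus[None]) ** 2).sum(axis=2), axis=1)
print(bic_score(X, mus, labels))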
x-Means
Basic functionality of the algorithm
Given a range for k, [kmin, kmax]
Start with k = kmin
Continue to add centroids as needed until kmax is reached
Centroids are added by splitting some centroids in two according to
BIC
The centroid set with the best score is used as the final output
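A simplified sketch of the outer search described above: it keeps only the
"score each candidate centroid set with BIC and return the best" idea,
rerunning scikit-learn's KMeans for each k instead of locally splitting
centroids as the real x-Means does. The bic_score helper and the synthetic
data are assumptions of this example.

# Simplified split-and-score loop in the spirit of x-Means (not a faithful
# reimplementation): try increasing k, score each solution with BIC, keep the best.
import numpy as np
from sklearn.cluster import KMeans

def bic_score(X, mus, labels):
    n, d = X.shape
    k = len(mus)
    resid = X - mus[labels]
    sigma2 = (resid ** 2).sum() / max(n - k, 1)
    counts = np.bincount(labels, minlength=k)
    loglik = (np.log(counts[labels] / n)
              - 0.5 * d * np.log(2.0 * np.pi * sigma2)
              - 0.5 * (resid ** 2).sum(axis=1) / sigma2).sum()
    return loglik - 0.5 * ((k - 1) + k * d + 1) * np.log(n)

def x_means_like(X, k_min=1, k_max=10):
    best_k, best_bic, best_model = None, -np.inf, None
    for k in range(k_min, k_max + 1):                 # grow the number of centroids
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        bic = bic_score(X, model.cluster_centers_, model.labels_)
        if bic > best_bic:                            # keep the best-scoring centroid set
            best_k, best_bic, best_model = k, bic, model
    return best_k, best_model

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.4, size=(80, 2)) for m in ((0, 0), (3, 0), (0, 3))])
print(x_means_like(X)[0])      # ideally picks k = 3 for this three-blob data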
References
Duda, R., Hart, P., and Stork, D. Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
Gan, G., Ma, C., and Wu, J. Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2007.
Samet, H. K-Nearest Neighbor Finding Using MaxNearestDist. IEEE Trans. Pattern Anal. Mach. Intell. 30, 2 (Feb. 2008), 243-252.
Qiao, Y.-L., Pan, J.-S., and Sun, S.-H. Improved Partial Distance Search for k Nearest-Neighbor Classification. IEEE International Conference on Multimedia and Expo, June 2004, 1275-1278.
Ghahramani, Z. Unsupervised Learning. Advanced Lectures on Machine Learning, 2003, 72-112.