Weighted Clustering

Margareta Ackerman, Shai Ben-David, Simina Brânzei, David Loker
Our work falls within a recent framework for clustering algorithm selection. The framework is based on identifying properties that address the input/output behaviour of algorithms. Algorithms are classified based on intuitive, user-friendly properties, and the classification can then be used to assist users in selecting a clustering algorithm for their specific application. So far, research in this framework has focused on the unweighted partitional (Ackerman, Ben-David, and Loker 2010a), (Bosagh-Zadeh and Ben-David 2009), (Ackerman, Ben-David, and Loker 2010b) and hierarchical settings (Ackerman and Ben-David 2011). This is the first application of the framework to weighted clustering.

Preliminaries

A weight function w over X is a function w : X → R+. Given a domain set X, denote the corresponding weighted domain by w[X], thereby associating each element x ∈ X with weight w(x). A distance function is a symmetric function d : X × X → R+ ∪ {0}, such that d(x, y) = 0 if and only if x = y. We consider weighted data sets of the form (w[X], d), where X is some finite domain set, d is a distance function over X, and w is a weight function over X.

A k-clustering C = {C1, C2, ..., Ck} of a domain set X is a partition of X into 1 < k < |X| disjoint, non-empty subsets of X where ∪i Ci = X. A clustering of X is a k-clustering of X for some 1 < k < |X|. To avoid trivial partitions, clusterings that consist of a single cluster, or in which every cluster contains a single element, are not permitted.

Denote the weight of a cluster Ci ∈ C by w(Ci) = Σ_{x ∈ Ci} w(x). For a clustering C, let |C| denote the number of clusters in C. For x, y ∈ X and a clustering C of X, write x ∼C y if x and y belong to the same cluster in C, and x ≁C y otherwise.

A partitional weighted clustering algorithm is a function that maps a data set (w[X], d) and an integer 1 < k < |X| to a k-clustering of X.

A dendrogram D of X is a pair (T, M) where T is a binary rooted tree and M : leaves(T) → X is a bijection. A hierarchical weighted clustering algorithm is a function that maps a data set (w[X], d) to a dendrogram of X. A set C0 ⊆ X is a cluster in a dendrogram D = (T, M) of X if there exists a node x in T so that C0 = {M(y) | y is a leaf and a descendant of x}. For a hierarchical weighted clustering algorithm A, A(w[X], d) outputs a clustering C = {C1, ..., Ck} if Ci is a cluster in A(w[X], d) for all 1 ≤ i ≤ k. A partitional algorithm A outputs clustering C on (w[X], d) if A(w[X], d, |C|) = C.

For the remainder of this paper, unless otherwise stated, we use the term “clustering algorithm” for “weighted clustering algorithm”.

Finally, given a clustering algorithm A and a data set (X, d), let range(A(X, d)) = {C | ∃w such that A outputs C on (w[X], d)}, i.e. the set of clusterings that A outputs on (X, d) over all possible weight functions.
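To make these definitions concrete, here is a minimal Python sketch (illustrative only; the names WeightedData, cluster_weight, and same_cluster are ours, not part of the paper's formalism) of a weighted data set (w[X], d), the cluster weight w(Ci), and the relation x ∼C y.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Set

Point = Hashable

@dataclass
class WeightedData:
    """A weighted data set (w[X], d): a finite domain X, a positive weight for
    every element, and a symmetric distance with d(x, y) = 0 iff x = y."""
    points: List[Point]
    weight: Dict[Point, float]              # w : X -> R+
    dist: Callable[[Point, Point], float]   # d : X x X -> R+ (symmetric)

def cluster_weight(data: WeightedData, cluster: Set[Point]) -> float:
    """w(Ci): the sum of w(x) over x in Ci."""
    return sum(data.weight[x] for x in cluster)

def same_cluster(clustering: List[Set[Point]], x: Point, y: Point) -> bool:
    """x ~_C y iff x and y lie in the same cluster of C."""
    return any(x in c and y in c for c in clustering)

# Example: three points on a line, the middle one carrying weight 5.
data = WeightedData(points=["a", "b", "c"],
                    weight={"a": 1.0, "b": 5.0, "c": 1.0},
                    dist=lambda p, q: abs("abc".index(p) - "abc".index(q)))
C = [{"a", "b"}, {"c"}]                 # a 2-clustering of X
print(cluster_weight(data, C[0]))       # 6.0
print(same_cluster(C, "a", "b"))        # True
```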
Basic Categories

Different clustering algorithms respond differently to weights. We introduce a formal categorisation of clustering algorithms based on their response to weights. First, we define what it means for a partitional algorithm to be weight-responsive on a clustering. We present an analogous definition for hierarchical algorithms when we study hierarchical algorithms below.

Definition 1 (Weight Responsive). A partitional clustering algorithm A is weight-responsive on a clustering C of (X, d) if
1. there exists a weight function w so that A(w[X], d) = C, and
2. there exists a weight function w' so that A(w'[X], d) ≠ C.

Weight-sensitive algorithms are weight-responsive on all clusterings in their range.

Definition 2 (Weight Sensitive). An algorithm A is weight-sensitive if for all (X, d) and all C ∈ range(A(X, d)), A is weight-responsive on C.

At the other extreme are clustering algorithms that do not respond to weights on any data set. This is the only category that has been considered in previous work, corresponding to “point proportion admissibility” (Fisher and Ness 1971).

Definition 3 (Weight Robust). An algorithm A is weight-robust if for all (X, d) and all clusterings C of (X, d), A is not weight-responsive on C.

Finally, there are algorithms that respond to weights on some clusterings, but not on others.

Definition 4 (Weight Considering). An algorithm A is weight-considering if
• there exists an (X, d) and a clustering C of (X, d) so that A is weight-responsive on C, and
• there exists an (X, d) and a C ∈ range(A(X, d)) so that A is not weight-responsive on C.

To formulate clustering algorithms in the weighted setting, we consider their behaviour on data that allows duplicates. Given a data set (X, d), elements x, y ∈ X are duplicates if d(x, y) = 0 and d(x, z) = d(y, z) for all z ∈ X. In a Euclidean space, duplicates correspond to elements that occur at the same location. We obtain the weighted version of a data set by de-duplicating the data and associating every element with a weight equal to the number of duplicates of that element in the original data. The weighted version of an algorithm partitions the resulting weighted data in the same manner that the unweighted version partitions the original data. As shown throughout the paper, this translation leads to natural formulations of weighted algorithms.
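The translation described above can be sketched directly: de-duplicate the data and record multiplicities as weights. The helper below is only an illustration and assumes that duplicates manifest as exactly equal values, which matches the definition when equal values occur at the same location.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

def to_weighted_version(points: List[Hashable]) -> Tuple[List[Hashable], Dict[Hashable, int]]:
    """De-duplicate a data set and attach to every remaining element a weight
    equal to its number of occurrences in the original data."""
    counts = Counter(points)
    domain = list(counts)     # X without duplicates
    weights = dict(counts)    # w(x) = multiplicity of x in the original data
    return domain, weights

# Example: 1.0 occurs three times, so it receives weight 3 in w[X].
X, w = to_weighted_version([1.0, 1.0, 1.0, 2.5, 4.0, 4.0])
print(X)   # [1.0, 2.5, 4.0]
print(w)   # {1.0: 3, 2.5: 1, 4.0: 2}
```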
Partitional Methods

In this section, we show that partitional clustering algorithms respond to weights in a variety of ways. Many popular partitional clustering paradigms, including k-means, k-median, and min-sum, are weight-sensitive. It is easy to see that methods such as min-diameter and k-center are weight-robust. We begin by analysing the behaviour of a spectral objective function, ratio-cut, which exhibits interesting behaviour on weighted data: it responds to weights unless the data is highly structured.
Ratio-Cut Spectral Clustering

We investigate the behaviour of a spectral objective function, ratio-cut (Von Luxburg 2007), on weighted data. Instead of a distance function, spectral clustering relies on a similarity function, which maps pairs of domain elements to non-negative real numbers that represent how alike the elements are.

The ratio-cut of a clustering C is

rcut(C, w[X], s) = (1/2) · Σ_{Ci ∈ C} [ Σ_{x ∈ Ci, y ∈ X\Ci} s(x, y) · w(x) · w(y) ] / [ Σ_{x ∈ Ci} w(x) ].
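As a concrete reference point, the following sketch evaluates rcut(C, w[X], s) literally from the definition above (illustrative code, not an efficient spectral method; the similarity is supplied as a symmetric function).

```python
from typing import Callable, Dict, Hashable, List, Set

def ratio_cut(clustering: List[Set[Hashable]],
              weight: Dict[Hashable, float],
              sim: Callable[[Hashable, Hashable], float]) -> float:
    """rcut(C, w[X], s): for each cluster Ci, the weighted similarity mass cut by Ci,
    normalised by the weight of Ci, summed over clusters and halved."""
    domain = set().union(*clustering)
    total = 0.0
    for ci in clustering:
        cut = sum(sim(x, y) * weight[x] * weight[y] for x in ci for y in domain - ci)
        total += cut / sum(weight[x] for x in ci)
    return 0.5 * total

# Toy example: heavier points make the cut edges incident to them count for more.
w = {"a": 1.0, "b": 1.0, "c": 2.0}
s = lambda p, q: 1.0 if {p, q} == {"b", "c"} else 0.1
print(ratio_cut([{"a", "b"}, {"c"}], w, s))   # approximately 1.1
```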
The ratio-cut clustering function is rcut(w[X], s, k) = argmin_{C : |C| = k} rcut(C, w[X], s). We prove that this function ignores data weights only when the data satisfies a very strict notion of clusterability. To characterise precisely when ratio-cut responds to weights, we first present a few definitions.

A clustering C of (w[X], s) is perfect if for all x1, x2, x3, x4 ∈ X where x1 ∼C x2 and x3 ≁C x4, we have s(x1, x2) > s(x3, x4). C is separation-uniform if there exists λ so that for all x, y ∈ X where x ≁C y, s(x, y) = λ. Note that neither condition depends on the weight function.

We show that whenever a data set has a clustering that is both perfect and separation-uniform, ratio-cut uncovers that clustering, which implies that ratio-cut is not weight-sensitive. Note that these conditions are satisfied when all between-cluster similarities are set to zero. On the other hand, we show that ratio-cut does respond to weights when either condition fails.
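Both conditions are mechanical to check on a finite data set. The sketch below (illustrative helpers; exact equality is used for separation-uniformity) tests them for a given clustering and, as noted above, consults only the similarities, never the weights.

```python
from itertools import combinations
from typing import Callable, Hashable, List, Set

def _pairs(clustering):
    """Yield (x, y, within) for every unordered pair of distinct points."""
    domain = [p for c in clustering for p in c]
    member = {p: i for i, c in enumerate(clustering) for p in c}
    for x, y in combinations(domain, 2):
        yield x, y, member[x] == member[y]

def is_perfect(clustering: List[Set[Hashable]],
               sim: Callable[[Hashable, Hashable], float]) -> bool:
    """Every within-cluster similarity exceeds every between-cluster similarity."""
    within = [sim(x, y) for x, y, same in _pairs(clustering) if same]
    between = [sim(x, y) for x, y, same in _pairs(clustering) if not same]
    return min(within, default=float("inf")) > max(between, default=float("-inf"))

def is_separation_uniform(clustering: List[Set[Hashable]],
                          sim: Callable[[Hashable, Hashable], float]) -> bool:
    """All between-cluster similarities take a single value lambda."""
    between = {sim(x, y) for x, y, same in _pairs(clustering) if not same}
    return len(between) <= 1
```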
Lemma 1. Given a clustering C of (X, s) where every cluster has more than one point, if C is not separation-uniform then ratio-cut is weight-responsive on C.

Proof. We consider two cases.

Case 1: There is a pair of clusters with different similarities between them. Then there exist C1, C2 ∈ C, x ∈ C1, and y ∈ C2 so that s(x, y) ≥ s(x, z) for all z ∈ C2, and there exists a ∈ C2 so that s(x, y) > s(x, a).

Let w be a weight function such that w(x) = W for some sufficiently large W, and weight 1 is assigned to all other points in X. Since we can set W to be arbitrarily large, when looking at the cost of a cluster it suffices to consider the dominant term in terms of W. We will show that we can improve the cost of C by moving a point from C2 to C1. Note that moving a point from C2 to C1 does not affect the dominant term of clusters other than C1 and C2. Therefore, we consider the cost of these two clusters before and after rearranging points between them.

Let A = Σ_{a ∈ C2} s(x, a) and let m = |C2|. Then the dominant term, in terms of W, of the cost of C2 is W · A/m. The cost of C1 approaches a constant as W → ∞.

Now consider the clustering C' obtained from C by moving y from cluster C2 to cluster C1. The dominant term in the cost of C2 becomes W · (A − s(x, y))/(m − 1), and the cost of C1 still approaches a constant as W → ∞. By choice of x and y, if (A − s(x, y))/(m − 1) < A/m then C' has lower loss than C when W is large enough. Finally, (A − s(x, y))/(m − 1) < A/m holds when A/m < s(x, y), and the latter holds by choice of x and y.

Case 2: Between every pair of clusters, all similarities are the same; however, since C is not separation-uniform, there are clusters C1, C2, C3 ∈ C so that the similarities between C1 and C2 are greater than those between C1 and C3. Let a and b denote the similarities between C1 and C2 and between C1 and C3, respectively.

Let x ∈ C1 and let w be a weight function such that w(x) = W for large W, and weight 1 is assigned to all other points in X. The dominant terms of the cost come from edges that include the point x, contributed by the clusters other than C1. The dominant term of the contribution of cluster C3 is Wb and the dominant term of the contribution of C2 is Wa, totalling Wa + Wb.

Now consider the clustering C' obtained from C by merging C1 with C2 and splitting C3 into two clusters (arbitrarily); C' is still a k-clustering because |C3| ≥ 2. The dominant terms of the cost still come from clusters other than C1 ∪ C2, and the cost of clusters outside C1 ∪ C2 ∪ C3 is unaffected. The dominant term of the cost of each of the two clusters obtained by splitting C3 is Wb, for a total of 2Wb. However, the factor of Wa that C2 previously contributed is no longer present. This changes the coefficient of the dominant term from a + b to 2b, which improves the cost of the clustering because b < a.
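A small numerical check (not part of the proof; the similarity values are invented) illustrates the Case 1 argument: once the weight W on x is large, moving the point y of C2 that is most similar to x into C1 lowers the ratio-cut, while for small W it need not.

```python
def ratio_cut(clustering, weight, sim):
    # Literal evaluation of rcut(C, w[X], s) from its definition.
    domain = set().union(*clustering)
    return 0.5 * sum(
        sum(sim[x][y] * weight[x] * weight[y] for x in c for y in domain - c)
        / sum(weight[x] for x in c)
        for c in clustering)

# C1 = {x, p}, C2 = {y, a}; s(x, y) = 5 exceeds the other similarities between
# C1 and C2, so C = {C1, C2} is not separation-uniform.
sim = {"x": {"p": 10, "y": 5, "a": 1},
       "p": {"x": 10, "y": 1, "a": 1},
       "y": {"x": 5, "p": 1, "a": 10},
       "a": {"x": 1, "p": 1, "y": 10}}
for W in (1, 10, 1000):
    w = {"x": W, "p": 1, "y": 1, "a": 1}
    cost_before = ratio_cut([{"x", "p"}, {"y", "a"}], w, sim)   # clustering C
    cost_after = ratio_cut([{"x", "p", "y"}, {"a"}], w, sim)    # y moved into C1
    print(W, round(cost_before, 2), round(cost_after, 2))
```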
Lemma 2. Given a clustering C of (X, s) where every cluster has more than one point, if C is not perfect then ratio-cut is weight-responsive on C.

The proof of the lemma is included in the appendix (Anonymous 2012).

Lemma 3. Given any data set (w[X], s) that has a perfect, separation-uniform k-clustering C, ratio-cut(w[X], s, k) = C.

Proof. Let (w[X], s) be a weighted data set with a perfect, separation-uniform clustering C = {C1, ..., Ck}. Recall that for any Y ⊆ X, w(Y) = Σ_{y ∈ Y} w(y). Then

rcut(C, w[X], s)
  = (1/2) Σ_{i=1}^{k} [ Σ_{x ∈ Ci} Σ_{y ∈ X\Ci} s(x, y) w(x) w(y) ] / [ Σ_{x ∈ Ci} w(x) ]
  = (1/2) Σ_{i=1}^{k} [ Σ_{x ∈ Ci} Σ_{y ∈ X\Ci} λ w(x) w(y) ] / [ Σ_{x ∈ Ci} w(x) ]
  = (λ/2) Σ_{i=1}^{k} [ Σ_{y ∈ X\Ci} w(y) ] · [ Σ_{x ∈ Ci} w(x) ] / [ Σ_{x ∈ Ci} w(x) ]
  = (λ/2) Σ_{i=1}^{k} Σ_{y ∈ X\Ci} w(y)
  = (λ/2) Σ_{i=1}^{k} [ w(X) − w(Ci) ]
  = (λ/2) ( k · w(X) − Σ_{i=1}^{k} w(Ci) )
  = (λ/2) (k − 1) w(X).

Consider any other k-clustering C' = {C'1, ..., C'k} ≠ C. Since C is both perfect and separation-uniform, all between-cluster similarities in C equal λ, and all within-cluster similarities are greater than λ. From here it follows that all pairwise similarities in the data are at least λ. Since C' is a k-clustering different from C, it must differ from C on at least one between-cluster edge, and that edge must have similarity greater than λ.
So the cost of C' is

rcut(C', w[X], s)
  = (1/2) Σ_{i=1}^{k} [ Σ_{x ∈ C'i} Σ_{y ∈ X\C'i} s(x, y) w(x) w(y) ] / [ Σ_{x ∈ C'i} w(x) ]
  > (1/2) Σ_{i=1}^{k} [ Σ_{x ∈ C'i} Σ_{y ∈ X\C'i} λ w(x) w(y) ] / [ Σ_{x ∈ C'i} w(x) ]
  = (λ/2) (k − 1) w(X) = rcut(C, w[X], s).

So clustering C' has a higher cost than C.

We can now characterise the precise conditions under which ratio-cut responds to weights. Ratio-cut responds to weights on all data sets except those where cluster separation is both very large and highly uniform. Formally,

Theorem 1. Given a clustering C of (X, s) where every cluster has more than one point, ratio-cut is weight-responsive on C if and only if either C is not perfect or C is not separation-uniform.

Theorem 2. k-means is weight-separable.

Proof. Consider any S ⊆ X. Let w be a weight function over X where w(x) = W if x ∈ S, for large W, and w(x) = 1 otherwise. As shown by Ostrovsky et al. (2006), the k-means objective function is equivalent to

Σ_{Ci ∈ C} [ Σ_{x,y ∈ Ci} d(x, y)² · w(x) · w(y) ] / w(Ci).

Let m1 = min_{x,y ∈ X, x ≠ y} d(x, y)² > 0, m2 = max_{x,y ∈ X} d(x, y)², and n = |X|. Consider any k-clustering C in which all the elements of S belong to distinct clusters. Then k-means(C, w[X], d) < k · m2 · (n² + nW). On the other hand, given any k-clustering C' in which at least two elements of S appear in the same cluster, k-means(C', w[X], d) ≥ W² · m1 / (W + n). Since lim_{W→∞} k-means(C', w[X], d) / k-means(C, w[X], d) = ∞, k-means separates all the elements of S for large enough W.

It can also be shown that the well-known min-sum objective function is weight-separable.

Theorem 3. Min-sum, which minimises the objective function Σ_{Ci ∈ C} Σ_{x,y ∈ Ci} d(x, y) · w(x) · w(y), is weight-separable.

Proof. The proof is similar to that of the previous theorem.
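The two objectives are easy to evaluate on weighted data, and a brute-force search over 2-clusterings of a toy example illustrates the separation effect used in the proof: inflating the weights of a chosen subset S eventually forces the optimal clustering to split S. The code below is an illustrative sketch only (exhaustive search on tiny data; the helper names are ours).

```python
from itertools import product

def kmeans_cost(clustering, pts, w):
    """Pairwise form of the weighted k-means objective used above:
    sum over clusters Ci of (sum_{x,y in Ci} d(x,y)^2 w(x) w(y)) / w(Ci)."""
    return sum(
        sum((pts[x] - pts[y]) ** 2 * w[x] * w[y] for x in c for y in c)
        / sum(w[x] for x in c)
        for c in clustering)

def minsum_cost(clustering, pts, w):
    """Weighted min-sum: sum over clusters of sum_{x,y in Ci} d(x,y) w(x) w(y)."""
    return sum(abs(pts[x] - pts[y]) * w[x] * w[y]
               for c in clustering for x in c for y in c)

def best_2_clustering(pts, w, cost):
    """Exhaustive search for the optimal 2-clustering (toy-sized inputs only)."""
    names, best = list(pts), None
    for bits in product([0, 1], repeat=len(names)):
        parts = [{n for n, b in zip(names, bits) if b == i} for i in (0, 1)]
        if all(parts):
            c = cost(parts, pts, w)
            if best is None or c < best[0]:
                best = (c, parts)
    return best[1]

# One-dimensional toy data; S = {s1, s2} lie close together, far from the rest.
pts = {"s1": 0.0, "s2": 0.3, "a": 5.0, "b": 5.2, "c": 5.4}
light = {x: 1 for x in pts}
heavy = dict(light, s1=10**5, s2=10**5)                  # large weight W on S
print(best_2_clustering(pts, light, kmeans_cost))        # keeps s1 and s2 together
print(best_2_clustering(pts, heavy, kmeans_cost))        # large W splits s1 from s2
print(best_2_clustering(pts, heavy, minsum_cost))        # min-sum behaves the same way
```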
Hierarchical Methods

Weight-sensitive, weight-considering, and weight-robust are defined as in the preliminaries section, with the above definition of weight-responsiveness for hierarchical algorithms.

… for some non-negative reals αi. Similarly, ℓAL(X1, X3, d, w') = (W² · d(x1, x3) + β1 · W + β2) / (W² + β3 · W + β4) for some non-negative reals βi. Dividing numerator and denominator by W², we see that ℓAL(X1, X3, d, w') → d(x1, x3) as W → ∞.
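The fragment above is consistent with average linkage being applied with the weighted linkage ℓAL(A, B, d, w) = Σ_{a ∈ A, b ∈ B} d(a, b) · w(a) · w(b) / (w(A) · w(B)); that definition is an assumption here, as it is not reproduced in this excerpt. Under it, the sketch below shows the behaviour described: with weight W on x1 ∈ X1 and x3 ∈ X3, the linkage is a ratio of two quadratics in W and tends to d(x1, x3).

```python
def weighted_average_linkage(A, B, dist, w):
    """Assumed form of the AL linkage on weighted data:
    sum_{a in A, b in B} d(a, b) w(a) w(b) / (w(A) * w(B))."""
    num = sum(dist[a][b] * w[a] * w[b] for a in A for b in B)
    return num / (sum(w[a] for a in A) * sum(w[b] for b in B))

# X1 = {x1, u}, X3 = {x3, v}; give x1 and x3 weight W and watch the linkage
# approach d(x1, x3) = 2.0 as W grows (both numerator and denominator are
# quadratic in W, with leading terms W^2 * d(x1, x3) and W^2).
dist = {"x1": {"x3": 2.0, "v": 3.0},
        "u":  {"x3": 4.0, "v": 5.0}}
for W in (1, 10, 100, 1000):
    w = {"x1": W, "u": 1, "x3": W, "v": 1}
    print(W, weighted_average_linkage({"x1", "u"}, {"x3", "v"}, dist, w))
```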
                      Partitional                 Hierarchical
  Weight Sensitive    k-means, k-medoids,         Ward's method,
                      k-median, min-sum           bisecting k-means
  Weight Considering  Ratio-cut                   Average-linkage
  Weight Robust       Min-diameter, k-center      Single-linkage,
                                                  complete-linkage

Table 1: Classification of weighted clustering algorithms.

We obtain bisecting k-means by setting P to k-means. Other natural choices for P include min-sum and exemplar-based algorithms such as k-median. As shown above, many of these partitional algorithms are weight-separable. We show that whenever P is weight-separable, the P-Divisive algorithm is weight-sensitive. The proof of the next theorem appears in the appendix (Anonymous 2012).

Theorem 6. If P is weight-separable then the P-Divisive algorithm is weight-sensitive.
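As a concrete (and assumed) reading of the P-Divisive scheme with P = k-means, the sketch below recursively 2-splits each cluster with a weighted 2-means step, here brute-forced for tiny inputs, and returns the resulting dendrogram as nested tuples; it is an illustration, not the paper's formal definition. It also shows the weight sensitivity discussed above: changing the weights changes the first split.

```python
from itertools import product

def wkmeans_cost(parts, pts, w):
    # Pairwise form of the weighted k-means objective (see Partitional Methods).
    return sum(
        sum((pts[x] - pts[y]) ** 2 * w[x] * w[y] for x in c for y in c)
        / sum(w[x] for x in c)
        for c in parts if c)

def best_split(cluster, pts, w):
    """A stand-in for P with k = 2: weighted 2-means found by brute force
    (only sensible for very small clusters)."""
    items, best = sorted(cluster), None
    for bits in product([0, 1], repeat=len(items) - 1):
        left = {items[0]} | {x for x, b in zip(items[1:], bits) if b}
        right = set(items) - left
        if right:
            c = wkmeans_cost([left, right], pts, w)
            if best is None or c < best[0]:
                best = (c, (left, right))
    return best[1]

def p_divisive(cluster, pts, w):
    """Recursively split with P to build a dendrogram, returned as nested tuples."""
    if len(cluster) == 1:
        return next(iter(cluster))
    left, right = best_split(cluster, pts, w)
    return (p_divisive(left, pts, w), p_divisive(right, pts, w))

pts = {"a": 0.0, "b": 1.0, "c": 2.3}
print(p_divisive(set(pts), pts, {x: 1 for x in pts}))         # (('a', 'b'), 'c')
print(p_divisive(set(pts), pts, {"a": 100, "b": 1, "c": 1}))  # ('a', ('b', 'c'))
```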
Conclusions

We study the behaviour of clustering algorithms on weighted data, presenting three fundamental categories that describe how such algorithms respond to weights and classifying several well-known algorithms according to these categories. Our results are summarized in Table 1. We note that all of our results immediately translate to the standard setting, by mapping each point with integer weight to the same number of unweighted duplicates.

Our results can be used to aid in the selection of a clustering algorithm. For example, in the facility allocation application discussed in the introduction, where weights are of primal importance, a weight-sensitive algorithm is suitable. Other applications may call for weight-considering algorithms. This can occur when weights (i.e. the number of duplicates) should not be ignored, yet it is still desirable to identify rare instances that constitute small but well-formed outlier clusters. For example, this applies to patient data on potential causes of a disease, where it is crucial to investigate rare instances. While we do not argue that these considerations are always sufficient, they can provide valuable guidelines when clustering data that is weighted or contains element duplicates.

Our analysis also reveals the following interesting phenomenon: algorithms that are known to perform well in practice (in the classical, unweighted setting) tend to be more responsive to weights. For example, k-means is highly responsive to weights, while single linkage, which often performs poorly in practice (Hartigan 1981), is weight-robust.

We also study several k-means heuristics, specifically the Lloyd algorithm with several methods of initialization and the PAM algorithm. These results were omitted due to lack of space, but they are included in the appendix (Anonymous 2012). Our analysis of these heuristics lends further support to the hypothesis that the more commonly applied algorithms are also more responsive to weights.

Acknowledgements

This work was supported in part by the Sino-Danish Center for the Theory of Interactive Computation, funded by the Danish National Research Foundation and the National Science Foundation of China (under grant 61061130540). The authors acknowledge support from the Center for Research in the Foundations of Electronic Markets (CFEM), supported by the Danish Strategic Research Council. This work was also supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Alexander Graham Bell Canada Graduate Scholarship.

References

Ackerman, M., and Ben-David, S. 2011. Discerning linkage-based algorithms among hierarchical clustering methods. In IJCAI.
Ackerman, M.; Ben-David, S.; and Loker, D. 2010a. Characterization of linkage-based clustering. In COLT.
Ackerman, M.; Ben-David, S.; and Loker, D. 2010b. Towards property-based classification of clustering paradigms. In NIPS.
Agarwal, P. K., and Procopiuc, C. M. 1998. Exact and approximation algorithms for clustering. In SODA.
Anonymous. 2012. Weighted Clustering Appendix. http://wikisend.com/download/573754/clustering appendix.pdf.
Arthur, D., and Vassilvitskii, S. 2007. K-means++: The advantages of careful seeding. In SODA.
Balcan, M. F.; Blum, A.; and Vempala, S. 2008. A discriminative framework for clustering via similarity functions. In STOC.
Bosagh-Zadeh, R., and Ben-David, S. 2009. A uniqueness theorem for clustering. In UAI.
Dasgupta, S., and Long, P. M. 2005. Performance guarantees for hierarchical clustering. J. Comput. Syst. Sci. 70(4):555–569.
Everitt, B. S. 1993. Cluster Analysis. John Wiley & Sons Inc.
Fisher, L., and Ness, J. V. 1971. Admissible clustering procedures. Biometrika 58:91–104.
Hartigan, J. 1981. Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc. 76(374):388–394.
Jain, A. K.; Murty, M. N.; and Flynn, P. J. 1999. Data clustering: a review. ACM Comput. Surv. 31(3):264–323.
Kaufman, L., and Rousseeuw, P. J. 2008. Partitioning Around Medoids (Program PAM). John Wiley & Sons, Inc. 68–125.
Ostrovsky, R.; Rabani, Y.; Schulman, L. J.; and Swamy, C. 2006. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS.
Talagrand, M. 1996. A new look at independence. Ann. Probab. 24(1):1–34.
Vapnik, V. 1998. Statistical Learning Theory. New York: Wiley.
Von Luxburg, U. 2007. A tutorial on spectral clustering. J. Stat. Comput. 17(4):395–416.
Wright, W. E. 1973. A formalization of cluster analysis. J. Pattern Recogn. 5(3):273–282.