A Framework For Benchmarking Clustering Algorithms
Marek Gagolewski
SoftwareX 20 (2022) 101270, https://doi.org/10.1016/j.softx.2022.101270
Article history: Received 20 September 2022; Received in revised form 7 November 2022; Accepted 15 November 2022

Keywords: Clustering; Machine learning; Benchmark data; Noise points; External cluster validity; Partition similarity score

Abstract: The evaluation of clustering algorithms can involve running them on a variety of benchmark problems and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at https://clustering-benchmarks.gagolewski.com.
© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
‘‘easy’’ for a method of interest were included. On the other hand, the researchers who generously share their data (e.g., [15,18–21]) unfortunately might not make the interaction with their batteries particularly smooth, as each of them uses a different file format. Furthermore, the existing repositories do not reflect the idea that there might be many equally valid/plausible/useful partitions of the same dataset; see [2,22] for a discussion.

On the other hand, well-agreed-upon benchmark problems have long existed in other machine learning domains (classification and regression datasets from the aforementioned UCI repository [15], but also test functions for evaluating global optimisation solvers, e.g., [23,24]).

In order to overcome these gaps, the current project proposes a consistent framework for benchmarking clustering algorithms. Its description is given in the next section. Then, in Section 3, we describe a Python API (the clustering-benchmarks package available at PyPI; see https://pypi.org/project/clustering-benchmarks/) that makes the interaction therewith relatively easy. Section 4 concludes the paper and proposes a few ideas for the future evolution of this framework.
2. Methodology
We have compiled quite a large suite of example real and simulated benchmark datasets. For reproducibility, the releases of our suite are versioned: e.g., https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0 links to a revision published in September 2022 [25]. Currently, there are nine batteries (collections), each of which features several datasets of different origins, dimensionalities, cluster size imbalance, and degrees of overlap, including, but not limited to, [12,15,18–21,26–32]¹; see the project's homepage for the detailed list.

¹ The original datasets were not equipped with alternative labellings nor with noise point markers; these were added by the current author.

Note that the datasets and the described software are independent of each other. Thanks to this, new datasets can easily be added in the future. Also, users are free to use their own collections or to access the data from within other programming environments. The current framework defines a suggested unified file format, which is detailed on the project's homepage.
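As a rough illustration of what such a benchmark problem looks like on disk, the snippet below reads one dataset directly with NumPy. The file names (a whitespace-separated data matrix in *.data.gz and reference label vectors in *.labels0.gz, *.labels1.gz, ...) reflect the layout of the clustering-data-v1 repository and are an assumption on our part; the authoritative specification is the one given on the project's homepage.

import os.path
import numpy as np

path = os.path.expanduser(os.path.join("~", "Projects", "clustering-data-v1"))
X  = np.loadtxt(os.path.join(path, "wut", "x2.data.gz"), ndmin=2)       # n x d data matrix
y0 = np.loadtxt(os.path.join(path, "wut", "x2.labels0.gz"), dtype=int)  # first reference labelling
print(X.shape, np.bincount(y0))  # number of points/features and the cluster sizes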
Reference partitions. When referring to a particular benchmark problem, we use the convention ‘‘battery/dataset’’, e.g., ‘‘wut/x2’’. Let X be one such dataset, consisting of n points in R^d. Each dataset is equipped with a reference partition assigned by experts. Such a grouping of the points into k ≥ 2 clusters is encoded using a label vector y, where y_i ∈ {1, . . . , k} gives the cluster ID of the ith object. For instance, the left subfigure of Fig. 1 depicts the ground-truth 3-clustering of wut/x2 (which is based on the information about how this dataset was generated from a mixture of three Gaussian distributions).
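To make the encoding concrete, here is a purely hypothetical toy example (not one of the bundled benchmark problems) of a label vector describing n = 6 points split into k = 3 clusters:

import numpy as np
y = np.array([1, 1, 2, 2, 3, 3])  # y[i] is the cluster ID of the (i+1)-th point
k = int(y.max())                  # number of clusters, here: 3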
Running the algorithm in question. Let us consider a clustering algorithm whose quality we would like to assess. When we apply it to X to discover a new k-partition (in an unsupervised manner, i.e., without revealing the true y), we obtain a vector of predicted labels, ŷ, encoding a new grouping. For example, the first row of scatterplots in Fig. 2 depicts the 3-partitions of wut/x2 discovered by three different methods.

Assessing partition similarity. Ideally, we would like to work with algorithms that yield partitions closely matching the reference ones. This should be true on as wide a set of problems as possible. Hence, we need to relate the predicted labels to the reference ones.
We can determine the confusion matrix C, where c_{i,j} denotes the number of points in the ith reference cluster that the algorithm assigned to the jth cluster. Even though such a matrix summarises all the information required to judge the similarity between the two partitions, if we wish to compare the quality of different algorithms, we would rather have it aggregated in the form of a single number. As one of the many external cluster validity indices (see, e.g., [12–14]), we can use the adjusted asymmetric accuracy [11] given by:

$$
\mathrm{AAA}(\mathbf{C})
= \frac{\displaystyle\max_{\sigma} \frac{1}{k}\sum_{i=1}^{k} \frac{c_{i,\sigma(i)}}{c_{i,\cdot}} \;-\; \frac{1}{k}}{1 - \frac{1}{k}}
= 1 - \min_{\sigma} \frac{1}{k} \sum_{i=1}^{k} \frac{c_{i,1} + \cdots + c_{i,k} - c_{i,\sigma(i)}}{\frac{k-1}{k}\left(c_{i,1} + \cdots + c_{i,k}\right)},
$$

which can be thought of as a measure of the average proportion of correctly identified points in each cluster (‘‘above’’ the random assignment). As the actual cluster IDs do not matter (a partition is a set of clusters and sets are, by definition, unordered), the optimal matching between the cluster labels is performed automatically by finding the best permutation σ of the set {1, . . . , k}.
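To make the definition concrete, here is a small, self-contained sketch that builds the confusion matrix and evaluates the above formula by brute force over all permutations (fine for small k, and assuming both partitions use the same number of clusters); in practice, one would simply rely on the score computed by the framework itself, as shown in Section 3.

import itertools
import numpy as np

def confusion_matrix(y_true, y_pred):
    # c[i, j] = number of points in the (i+1)-th reference cluster that the
    # algorithm assigned to the (j+1)-th predicted cluster
    k = int(max(y_true.max(), y_pred.max()))
    c = np.zeros((k, k), dtype=int)
    for i, j in zip(y_true, y_pred):
        c[i - 1, j - 1] += 1
    return c

def aaa(c):
    # adjusted asymmetric accuracy, computed directly from its definition
    k = c.shape[0]
    best = max(
        np.mean(c[np.arange(k), list(sigma)] / c.sum(axis=1))
        for sigma in itertools.permutations(range(k))
    )
    return (best - 1/k) / (1 - 1/k)

y_true = np.array([1, 1, 1, 2, 2, 3]); y_pred = np.array([2, 2, 1, 1, 1, 3])
print(round(aaa(confusion_matrix(y_true, y_pred)), 2))
## 0.83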
There can be many valid partitions. What is more, it is in the very spirit of unsupervised learning that, in many cases, there might be many equally valid ways to split a given dataset. An algorithm should be rewarded for finding a partition that closely matches any of the reference ones. This might require running the method multiple times (unless it is a hierarchical one) to find the clusterings of different cardinalities. Then, the generated outputs are evaluated against all the available reference labellings and the maximal similarity score is reported.
Noise points. Also, to make the clustering problem more difficult, some datasets might feature noise points (e.g., outliers or irrelevant points in between the actual clusters). They are specially marked in the ground-truth vectors: we assigned them cluster IDs of 0; compare the right subfigure of Fig. 1, where they are coloured grey. A clustering algorithm must never be informed about the location of such ‘‘problematic’’ points. Once the partition of the dataset is determined, they are excluded from the computation of the external cluster validity measures. In other words, it does not matter to which clusters the noise points are allocated.
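In code, this exclusion amounts to masking out the points marked as noise in the reference labelling before the score is computed; a minimal sketch, reusing the helper functions from the previous listing and assuming that y_true uses 0 for noise, as described above:

import numpy as np
mask = (y_true != 0)  # drop the points marked as noise in the reference partition
score = aaa(confusion_matrix(y_true[mask], y_pred[mask]))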
3. The Python API

To facilitate the employment of the aforementioned framework, we have implemented an open-source package for Python named clustering-benchmarks. It can be installed from PyPI (https://pypi.org/project/clustering-benchmarks/), e.g., via a call to pip3 install clustering-benchmarks. Then, it can be imported by calling:

import clustbench  # clustering-benchmarks
import os.path, genieclust, sklearn.cluster
# we will need these later

Fetching benchmark data. The example datasets repository [25] (or any custom repository provided by the user) can be queried easily. Let us assume that we store it in the following directory:

data_path = os.path.join("~", "Projects", "clustering-data-v1")  # example

A particular dataset (here, for example: wut/x2) can be accessed by calling:
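The code listing with the actual call did not survive the extraction here; based on the package's documentation, it is the load_dataset function that performs this step (treat the exact argument names as an assumption and consult the API reference):

b = clustbench.load_dataset("wut", "x2", path=data_path)  # battery, dataset, local path (assumed signature)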
Fig. 1. An example benchmark dataset (wut/x2) and the two corresponding reference partitions (k = 3 and k = 4; noise points marked in grey).
Fig. 2. Clusterings of an example dataset (wut/x2) discovered by Genie (g = 0.3) [33,34], k-means, and ITM [35] (k = 3 and k = 4). Confusion matrices and adjusted
asymmetric accuracies (AAA; [11]; comparisons against the reference partitions depicted in Fig. 1) are also reported. Note that the second ground-truth partition
features some noise points: hence, in the k = 4 case, the first row of the confusion matrix is not taken into account.
The above call returns a named tuple, whose data field gives the data matrix, labels gives the list of all ground-truth partitions (encoded as label vectors), and n_clusters gives the corresponding numbers of subsets. For instance, here is a way in which we have generated Fig. 1:

import matplotlib.pyplot as plt  # needed for the subplots
for i in range(len(b.labels)):
    plt.subplot(1, len(b.labels), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=b.labels[i]-1)  # completion assumed: noise (0) maps to -1 and is drawn in grey
Fetching precomputed results. Suppose we would like to study some precomputed clustering results (see https://github.com/gagolews/clustering-results-v1) which we store locally in the following directory:

results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")

The partitions can be fetched by calling:
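The exact listing was likewise lost in extraction; a plausible reconstruction uses the package's load_results helper, which returns a dictionary mapping method names to their precomputed label vectors (the argument list below is an assumption to be verified against the API documentation):

res = clustbench.load_results(
    "Genie", "wut", "x2", b.n_clusters, path=results_path)  # hypothetical reconstruction
print(list(res.keys()))  # e.g., "Genie_G0.3", ..., "Genie_G1.0"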
We thus have got access to data on the Genie [33,34] algorithm with different gini_threshold (g) parameter settings (g = 1.0 gives the single linkage method).

Computing external cluster validity measures. Here is a way to compute the external cluster validity measures:
round(clustbench.get_score(b.labels, res["Genie_G0.3"]), 2)
## 0.87
By default, adjusted asymmetric accuracy (AAA; [11]) is applied, but this might be changed to any other score by setting the metric argument explicitly. As explained above, we compare the predicted clusterings against all the reference partitions (ignoring the noise points), and report the maximal score.
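For instance, to score the very same results with the adjusted Rand index from genieclust's compare_partitions module (a sketch: we assume that any callable accepting two label vectors can be passed as metric; see the get_score documentation):

round(clustbench.get_score(b.labels, res["Genie_G0.3"],
    metric=genieclust.compare_partitions.adjusted_rand_score), 2)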
Applying clustering methods manually. We can use clustbench.fit_predict_many to generate all the partitions required to compare ourselves against the reference labels. Let us test the k-means algorithm as implemented in the scikit-learn package [36]:

m = sklearn.cluster.KMeans()
res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
round(clustbench.get_score(b.labels, res["KMeans"]), 2)
## 0.98
We see that k-means (which specialises in detecting symmetric Gaussian-like blobs) performs better than Genie on this particular dataset; see Fig. 2 for an illustration (also featuring the results generated by the ITM method [35]). The project's homepage and documentation discuss many more functions.
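The same pattern extends to any estimator that follows the scikit-learn API and exposes an n_clusters parameter (an assumption based on how the k-means example above is dispatched); for example, one might also evaluate average-linkage agglomerative clustering (a sketch only; its score is not reported here):

m = sklearn.cluster.AgglomerativeClustering(linkage="average")
res["Agglomerative"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
round(clustbench.get_score(b.labels, res["Agglomerative"]), 2)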
4. Conclusion

The current project is designed to be extensible, so that it can accommodate new datasets and/or label vectors in the future and thus make the evaluation of clustering algorithms much more rigorous. Any contributions are warmly welcome; see https://github.com/gagolews/clustering-benchmarks/issues for the feature request and bug tracker. In particular, we have implemented an interactive standalone application (Colouriser) that can be used for preparing one's own two-dimensional datasets.
Future versions of the benchmark suite will include methods for generating random samples of arbitrary sizes and cluster size distributions that are similar to a given dataset (e.g., with more noise points). Thanks to this, in the case of algorithms that feature many tunable parameters, it will be possible to separate validation datasets (on which we are allowed to learn the ‘‘best’’ settings; see, e.g., [17] and the references therein) from the testing ones (used in the final comparisons), which is a fairly standard approach in other machine learning domains.
Moreover, the framework can be extended to cover overlapping clusterings as well as semi-supervised learning tasks, where an algorithm knows the correct assignment of some of the input points in advance.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

This research was supported by the Australian Research Council Discovery Project ARC DP210100227.

Documentation and data are publicly available at https://clustering-benchmarks.gagolewski.com, https://github.com/gagolews/clustering-data-v1, and https://github.com/gagolews/clustering-results-v1.

A big thank-you to all the researchers who share their datasets with the clustering community.

References

[1] Hennig C. What are the true clusters? Pattern Recognit Lett 2015;64:53–62. http://dx.doi.org/10.1016/j.patrec.2015.04.009.
[2] von Luxburg U, Williamson R, Guyon I. Clustering: Science or art? In: Guyon I, et al., editors. Proc. ICML workshop on unsupervised and transfer learning. Proc. machine learning research, vol. 27, 2012, p. 65–79.
[3] Van Mechelen I, et al. Benchmarking in cluster analysis: A white paper. 2018, https://arxiv.org/pdf/1809.10496.pdf.
[4] Ackerman M, Ben-David S, Brânzei S, Loker D. Weighted clustering: Towards solving the user's dilemma. Pattern Recognit 2021;120:108152. http://dx.doi.org/10.1016/j.patcog.2021.108152.
[5] Xiong H, Li Z. Clustering validation measures. In: Aggarwal C, Reddy C, editors. Data clustering: Algorithms and applications. CRC Press; 2014, p. 571–606.
[6] Tavakkol B, Choi J, Jeong M, Albin S. Object-based cluster validation with densities. Pattern Recognit 2022;121:108223. http://dx.doi.org/10.1016/j.patcog.2021.108223.
[7] Milligan G, Cooper M. An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985;50(2):159–79.
[8] Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 2002;24(12):1650–4. http://dx.doi.org/10.1109/TPAMI.2002.1114856.
[9] Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez J, Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit 2013;46(1):243–56. http://dx.doi.org/10.1016/j.patcog.2012.07.021.
[10] Gagolewski M, Bartoszuk M, Cena A. Are cluster validity measures (in)valid? Inform Sci 2021;581:620–36. http://dx.doi.org/10.1016/j.ins.2021.10.004.
[11] Gagolewski M. Adjusted asymmetric accuracy: A well-behaving external cluster validity measure. 2022, preprint (submitted for publication), https://doi.org/10.48550/arXiv.2209.02935, https://arxiv.org/pdf/2209.02935.pdf.
[12] Rezaei M, Fränti P. Set matching measures for external cluster validity. IEEE Trans Knowl Data Eng 2016;28(8):2173–86. http://dx.doi.org/10.1109/TKDE.2016.2551240.
[13] Wagner S, Wagner D. Comparing clusterings – An overview. Tech. rep. 2006-04, Faculty of Informatics, Universität Karlsruhe (TH); 2006, URL https://i11www.iti.kit.edu/extra/publications/ww-cco-06.pdf.
[14] Horta D, Campello R. Comparing hard and overlapping clusterings. J Mach Learn Res 2015;16(93):2949–97.
[15] Dua D, Graff C. UCI machine learning repository. 2022, http://archive.ics.uci.edu/ml.
[16] Ullmann T, Beer A, Hünemörder M, Seidl T, Boulesteix A-L. Over-optimistic evaluation and reporting of novel cluster algorithms: An illustrative study. Adv Data Anal Classif 2022. http://dx.doi.org/10.1007/s11634-022-00496-5.
[17] Ullmann T, Hennig C, Boulesteix A-L. Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdiscip Rev: Data Min Knowl Dis 2021;12(3):e1444. http://dx.doi.org/10.1002/widm.1444.
[18] Graves D, Pedrycz W. Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems 2010;161:522–43. http://dx.doi.org/10.1016/j.fss.2009.10.021.
[19] Ultsch A. Clustering with SOM: U*C. In: Workshop on self-organizing maps. 2005, p. 75–82.
[20] Thrun M, Ultsch A. Clustering benchmark datasets exploiting the fundamental clustering problems. Data Brief 2020;30:105501. http://dx.doi.org/10.1016/j.dib.2020.105501.
[21] Fränti P, Sieranoja S. K-means properties on six clustering benchmark datasets. Appl Intell 2018;48(12):4743–59. http://dx.doi.org/10.1007/s10489-018-1238-7.
[22] Dasgupta S, Ng V. Single data, multiple clusterings. In: Proc. NIPS workshop clustering: Science or art? Towards principled approaches. 2009.
[23] Jamil M, Yang X-S, Zepernick H-J. Test functions for global optimization: A comprehensive survey. In: Swarm intelligence and bio-inspired computation. 2013, p. 193–222. http://dx.doi.org/10.1016/B978-0-12-405163-8.00008-9.
[24] Weise T, et al. Benchmarking optimization algorithms: An open source framework for the traveling salesman problem. IEEE Comput Intell Mag 2014;9(3):40–52. http://dx.doi.org/10.1109/MCI.2014.2326101.
[25] Gagolewski M, et al. A benchmark suite for clustering algorithms: Version 1.1.0. 2022, http://dx.doi.org/10.5281/zenodo.7088171, URL https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0.
[26] Thrun M, Stier Q. Fundamental clustering algorithms suite. SoftwareX 2021;13:100642. http://dx.doi.org/10.1016/j.softx.2020.100642.
[27] Karypis G, Han E, Kumar V. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer 1999;32(8):68–75. http://dx.doi.org/10.1109/2.781637.
[28] Bezdek J, Keller J, Krishnapuram R, Kuncheva L, Pal N. Will the real iris data please stand up? IEEE Trans Fuzzy Syst 1999;7(3):368–9. http://dx.doi.org/10.1109/91.771092.
[29] McInnes L, Healy J, Astels S. hdbscan: Hierarchical density based clustering. J Open Source Softw 2017;2(11):205. http://dx.doi.org/10.21105/joss.00205.
[30] Fränti P, Virmajoki O. Iterative shrinking method for clustering problems. Pattern Recognit 2006;39(5):761–5.
[31] Sieranoja S, Fränti P. Fast and general density peaks clustering. Pattern Recognit Lett 2019;128:551–8. http://dx.doi.org/10.1016/j.patrec.2019.10.019.
[32] Jain A, Law M. Data clustering: A user's dilemma. Lecture Notes in Comput Sci 2005;3776:1–10.
[33] Gagolewski M, Bartoszuk M, Cena A. Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Inform Sci 2016;363:8–23. http://dx.doi.org/10.1016/j.ins.2016.05.003.
[34] Gagolewski M. genieclust: Fast and robust hierarchical clustering. SoftwareX 2021;15:100722. http://dx.doi.org/10.1016/j.softx.2021.100722.
[35] Müller A, Nowozin S, Lampert C. Information theoretic clustering using minimum spanning trees. In: Proc. German conference on pattern recognition. 2012.
[36] Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res 2011;12(85):2825–30.