A Framework For Benchmarking Clustering Algorithms
Marek Gagolewski
SoftwareX 20 (2022) 101270, https://doi.org/10.1016/j.softx.2022.101270
Article history: Received 20 September 2022; Received in revised form 7 November 2022; Accepted 15 November 2022

Keywords: Clustering; Machine learning; Benchmark data; Noise points; External cluster validity; Partition similarity score

Abstract: The evaluation of clustering algorithms can involve running them on a variety of benchmark problems and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at https://clustering-benchmarks.gagolewski.com.
© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
‘‘easy’’ for a method of interest were included. On the other hand, the researchers who generously share their data (e.g., [15,18–21]) unfortunately might not make the interaction with their batteries particularly smooth, as each of them uses a different file format. Furthermore, the existing repositories do not reflect the idea that there might be many equally valid/plausible/useful partitions of the same dataset; see [2,22] for a discussion.

On the other hand, well-agreed-upon benchmark problems have long existed in other machine learning domains (classification and regression datasets from the aforementioned UCI repository [15], but also test functions for evaluating global optimisation solvers, e.g., [23,24]).

In order to overcome these gaps, the current project proposes a consistent framework for benchmarking clustering algorithms. Its description is given in the next section. Then, in Section 3, we describe a Python API (the clustering-benchmarks package available at PyPI; see https://pypi.org/project/clustering-benchmarks/) that makes the interaction therewith relatively easy. Section 4 concludes the paper and proposes a few ideas for the future evolution of this framework.
2. Methodology
We have compiled quite a large suite of example real and simulated benchmark datasets. For reproducibility, the releases of our suite are versioned: e.g., https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0 links to a revision published in September 2022 [25]. Currently, there are nine batteries (collections), each of which features several datasets of different origins, dimensionalities, cluster size imbalance, and degrees of overlap, including, but not limited to, [12,15,18–21,26–32]¹; see the project's homepage for the detailed list.

¹ The original datasets were not equipped with alternative labellings nor with noise point markers; these were added by the current author.

Note that the datasets and the described software are independent of each other. Thanks to this, new datasets can easily be added in the future. Also, users are free to use their own collections or to access the data from within other programming environments. The current framework defines a suggested unified file format, which is detailed on the project's homepage.
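As a rough illustration of what such a benchmark problem looks like on disk, the snippet below reads one dataset directly with NumPy. The file names (a whitespace-separated data matrix in *.data.gz and reference label vectors in *.labels0.gz, *.labels1.gz, ...) reflect the layout of the clustering-data-v1 repository and are an assumption on our part; the authoritative specification is the one given on the project's homepage.

import os.path
import numpy as np

path = os.path.expanduser(os.path.join("~", "Projects", "clustering-data-v1"))
X  = np.loadtxt(os.path.join(path, "wut", "x2.data.gz"), ndmin=2)       # n x d data matrix
y0 = np.loadtxt(os.path.join(path, "wut", "x2.labels0.gz"), dtype=int)  # first reference labelling
print(X.shape, np.bincount(y0))  # number of points/features and the cluster sizes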
Reference partitions. When referring to a particular benchmark problem, we use the convention ‘‘battery/dataset’’, e.g., ‘‘wut/x2’’. Let X be one such dataset, consisting of n points in R^d. Each dataset is equipped with a reference partition assigned by experts. Such a grouping of the points into k ≥ 2 clusters is encoded using a label vector y, where y_i ∈ {1, . . . , k} gives the cluster ID of the ith object. For instance, the left subfigure of Fig. 1 depicts the ground-truth 3-clustering of wut/x2 (which is based on the information about how this dataset was generated from a mixture of three Gaussian distributions).
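To make the encoding concrete, here is a purely hypothetical toy example (not one of the bundled benchmark problems) of a label vector describing n = 6 points split into k = 3 clusters:

import numpy as np
y = np.array([1, 1, 2, 2, 3, 3])  # y[i] is the cluster ID of the (i+1)-th point
k = int(y.max())                  # number of clusters, here: 3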
Running the algorithm in question. Let us consider a clustering algorithm whose quality we would like to assess. When we apply it to X to discover a new k-partition (in an unsupervised manner, i.e., without revealing the true y), we obtain a vector of predicted labels, ŷ, encoding a new grouping. For example, the first row of scatterplots in Fig. 2 depicts the 3-partitions of wut/x2 discovered by three different methods.

Assessing partition similarity. Ideally, we would like to work with algorithms that yield partitions closely matching the reference ones. This should be true on as wide a set of problems as possible. Hence, we need to relate the predicted labels to the reference ones.
We can determine the confusion matrix C, where c_{i,j} denotes the number of points in the ith reference cluster that the algorithm assigned to the jth cluster. Even though such a matrix summarises all the information required to judge the similarity between the two partitions, if we wish to compare the quality of different algorithms, we would rather have it aggregated in the form of a single number. As one of the many external cluster validity indices (see, e.g., [12–14]), we can use the adjusted asymmetric accuracy [11] given by:

$$
\mathrm{AAA}(\mathbf{C})
= \frac{\displaystyle\max_{\sigma} \frac{1}{k}\sum_{i=1}^{k} \frac{c_{i,\sigma(i)}}{c_{i,\cdot}} \;-\; \frac{1}{k}}{1 - \frac{1}{k}}
= 1 - \min_{\sigma} \frac{1}{k} \sum_{i=1}^{k} \frac{c_{i,1} + \cdots + c_{i,k} - c_{i,\sigma(i)}}{\frac{k-1}{k}\left(c_{i,1} + \cdots + c_{i,k}\right)},
$$

which can be thought of as a measure of the average proportion of correctly identified points in each cluster (‘‘above’’ the random assignment). As the actual cluster IDs do not matter (a partition is a set of clusters and sets are, by definition, unordered), the optimal matching between the cluster labels is performed automatically by finding the best permutation σ of the set {1, . . . , k}.
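To make the definition concrete, here is a small, self-contained sketch that builds the confusion matrix and evaluates the above formula by brute force over all permutations (fine for small k, and assuming both partitions use the same number of clusters); in practice, one would simply rely on the score computed by the framework itself, as shown in Section 3.

import itertools
import numpy as np

def confusion_matrix(y_true, y_pred):
    # c[i, j] = number of points in the (i+1)-th reference cluster that the
    # algorithm assigned to the (j+1)-th predicted cluster
    k = int(max(y_true.max(), y_pred.max()))
    c = np.zeros((k, k), dtype=int)
    for i, j in zip(y_true, y_pred):
        c[i - 1, j - 1] += 1
    return c

def aaa(c):
    # adjusted asymmetric accuracy, computed directly from its definition
    k = c.shape[0]
    best = max(
        np.mean(c[np.arange(k), list(sigma)] / c.sum(axis=1))
        for sigma in itertools.permutations(range(k))
    )
    return (best - 1/k) / (1 - 1/k)

y_true = np.array([1, 1, 1, 2, 2, 3]); y_pred = np.array([2, 2, 1, 1, 1, 3])
print(round(aaa(confusion_matrix(y_true, y_pred)), 2))
## 0.83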
There can be many valid partitions. What is more, it is in the very spirit of unsupervised learning that, in many cases, there might be many equally valid ways to split a given dataset. An algorithm should be rewarded for finding a partition that closely matches any of the reference ones. This might require running the method multiple times (unless it is a hierarchical one) to find the clusterings of different cardinalities. Then, the generated outputs are evaluated against all the available reference labellings and the maximal similarity score is reported.
Noise points. Also, to make the clustering problem more difficult, some datasets might feature noise points (e.g., outliers or irrelevant points in between the actual clusters). They are specially marked in the ground-truth vectors: we assigned them cluster IDs of 0; compare the right subfigure of Fig. 1, where they are coloured grey. A clustering algorithm must never be informed about the location of such ‘‘problematic’’ points. Once the partition of the dataset is determined, they are excluded from the computation of the external cluster validity measures. In other words, it does not matter to which clusters the noise points are allocated.
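In code, this exclusion amounts to masking out the points marked as noise in the reference labelling before the score is computed; a minimal sketch, reusing the helper functions from the previous listing and assuming that y_true uses 0 for noise, as described above:

import numpy as np
mask = (y_true != 0)  # drop the points marked as noise in the reference partition
score = aaa(confusion_matrix(y_true[mask], y_pred[mask]))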
3. The Python API

To facilitate the employment of the aforementioned framework, we have implemented an open-source package for Python named clustering-benchmarks. It can be installed from PyPI (https://pypi.org/project/clustering-benchmarks/), e.g., via a call to pip3 install clustering-benchmarks. Then, it can be imported by calling:

import clustbench  # clustering-benchmarks
import os.path, genieclust, sklearn.cluster
# we will need these later

Fetching benchmark data. The example datasets repository [25] (or any custom repository provided by the user) can be queried easily. Let us assume that we store it in the following directory:

data_path = os.path.join("~", "Projects", "clustering-data-v1")  # example

A particular dataset (here, for example: wut/x2) can be accessed by calling:
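The code listing with the actual call did not survive the extraction here; based on the package's documentation, it is the load_dataset function that performs this step (treat the exact argument names as an assumption and consult the API reference):

b = clustbench.load_dataset("wut", "x2", path=data_path)  # battery, dataset, local path (assumed signature)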
Fig. 1. An example benchmark dataset (wut/x2) and the two corresponding reference partitions (k = 3 and k = 4; noise points marked in grey).
Fig. 2. Clusterings of an example dataset (wut/x2) discovered by Genie (g = 0.3) [33,34], k-means, and ITM [35] (k = 3 and k = 4). Confusion matrices and adjusted
asymmetric accuracies (AAA; [11]; comparisons against the reference partitions depicted in Fig. 1) are also reported. Note that the second ground-truth partition
features some noise points: hence, in the k = 4 case, the first row of the confusion matrix is not taken into account.
The above call returns a named tuple, whose data field gives the data matrix, labels gives the list of all ground-truth partitions (encoded as label vectors), and n_clusters gives the corresponding numbers of subsets. For instance, here is a way in which we have generated Fig. 1:

import matplotlib.pyplot as plt  # needed for the subplots
for i in range(len(b.labels)):
    plt.subplot(1, len(b.labels), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=b.labels[i]-1)  # completion assumed: noise (0) maps to -1 and is drawn in grey
Fetching precomputed results. Suppose we would like to study some precomputed clustering results (see https://github.com/gagolews/clustering-results-v1) which we store locally in the following directory:

results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")

The partitions can be fetched by calling:
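The exact listing was likewise lost in extraction; a plausible reconstruction uses the package's load_results helper, which returns a dictionary mapping method names to their precomputed label vectors (the argument list below is an assumption to be verified against the API documentation):

res = clustbench.load_results(
    "Genie", "wut", "x2", b.n_clusters, path=results_path)  # hypothetical reconstruction
print(list(res.keys()))  # e.g., "Genie_G0.3", ..., "Genie_G1.0"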
We thus have got access to data on the Genie [33,34] algorithm with different gini_threshold (g) parameter settings (g = 1.0 gives the single linkage method).

Computing external cluster validity measures. Here is a way to compute the external cluster validity measures:
round(clustbench.get_score(b.labels, res["Genie_G0.3"]), 2)
## 0.87
By default, adjusted asymmetric accuracy (AAA; [11]) is applied, but this might be changed to any other score by setting the metric argument explicitly. As explained above, we compare the predicted clusterings against all the reference partitions (ignoring the noise points), and report the maximal score.
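For instance, to score the very same results with the adjusted Rand index from genieclust's compare_partitions module (a sketch: we assume that any callable accepting two label vectors can be passed as metric; see the get_score documentation):

round(clustbench.get_score(b.labels, res["Genie_G0.3"],
    metric=genieclust.compare_partitions.adjusted_rand_score), 2)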
Applying clustering methods manually. We can use clustbench.fit_predict_many to generate all the partitions required to compare ourselves against the reference labels. Let us test the k-means algorithm as implemented in the scikit-learn package [36]:

m = sklearn.cluster.KMeans()
res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
round(clustbench.get_score(b.labels, res["KMeans"]), 2)
## 0.98
We see that k-means (which specialises in detecting symmetric Gaussian-like blobs) performs better than Genie on this particular dataset; see Fig. 2 for an illustration (also featuring the results generated by the ITM method [35]). The project's homepage and documentation discuss many more functions.
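The same pattern extends to any estimator that follows the scikit-learn API and exposes an n_clusters parameter (an assumption based on how the k-means example above is dispatched); for example, one might also evaluate average-linkage agglomerative clustering (a sketch only; its score is not reported here):

m = sklearn.cluster.AgglomerativeClustering(linkage="average")
res["Agglomerative"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
round(clustbench.get_score(b.labels, res["Agglomerative"]), 2)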
4. Conclusion

The current project is designed to be extensible, so that it can accommodate new datasets and/or label vectors in the future and thus make the evaluation of clustering algorithms much more rigorous. Any contributions are warmly welcome; see https://github.com/gagolews/clustering-benchmarks/issues for the feature request and bug tracker. In particular, we have implemented an interactive standalone application (Colouriser) that can be used for preparing one's own two-dimensional datasets.
Future versions of the benchmark suite will include methods for generating random samples of arbitrary sizes and cluster size distributions that are similar to a given dataset (e.g., with more noise points). Thanks to this, in the case of algorithms that feature many tunable parameters, it will be possible to separate validation datasets (on which we are allowed to learn the ‘‘best’’ settings; see, e.g., [17] and the references therein) from the testing ones (used in the final comparisons), which is a fairly standard approach in other machine learning domains.
Moreover, the framework can be extended to cover overlapping clusterings as well as semi-supervised learning tasks, where an algorithm knows the correct assignment of some of the input points in advance.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

This research was supported by the Australian Research Council Discovery Project ARC DP210100227.

Documentation and data are publicly available at https://clustering-benchmarks.gagolewski.com, https://github.com/gagolews/clustering-data-v1, and https://github.com/gagolews/clustering-results-v1.

A big thank-you to all the researchers who share their datasets with the clustering community.

References

[1] Hennig C. What are the true clusters? Pattern Recognit Lett 2015;64:53–62. http://dx.doi.org/10.1016/j.patrec.2015.04.009.
[2] von Luxburg U, Williamson R, Guyon I. Clustering: Science or art? In: Guyon I, et al., editors. Proc. ICML workshop on unsupervised and transfer learning. Proc. machine learning research, vol. 27, 2012, p. 65–79.
[3] Van Mechelen I, et al. Benchmarking in cluster analysis: A white paper. 2018, https://arxiv.org/pdf/1809.10496.pdf.
[4] Ackerman M, Ben-David S, Brânzei S, Loker D. Weighted clustering: Towards solving the user's dilemma. Pattern Recognit 2021;120:108152. http://dx.doi.org/10.1016/j.patcog.2021.108152.
[5] Xiong H, Li Z. Clustering validation measures. In: Aggarwal C, Reddy C, editors. Data clustering: Algorithms and applications. CRC Press; 2014, p. 571–606.
[6] Tavakkol B, Choi J, Jeong M, Albin S. Object-based cluster validation with densities. Pattern Recognit 2022;121:108223. http://dx.doi.org/10.1016/j.patcog.2021.108223.
[7] Milligan G, Cooper M. An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985;50(2):159–79.
[8] Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 2002;24(12):1650–4. http://dx.doi.org/10.1109/TPAMI.2002.1114856.
[9] Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez J, Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit 2013;46(1):243–56. http://dx.doi.org/10.1016/j.patcog.2012.07.021.
[10] Gagolewski M, Bartoszuk M, Cena A. Are cluster validity measures (in)valid? Inform Sci 2021;581:620–36. http://dx.doi.org/10.1016/j.ins.2021.10.004.
[11] Gagolewski M. Adjusted asymmetric accuracy: A well-behaving external cluster validity measure. 2022, preprint (submitted for publication), https://doi.org/10.48550/arXiv.2209.02935, https://arxiv.org/pdf/2209.02935.pdf.
[12] Rezaei M, Fränti P. Set matching measures for external cluster validity. IEEE Trans Knowl Data Eng 2016;28(8):2173–86. http://dx.doi.org/10.1109/TKDE.2016.2551240.
[13] Wagner S, Wagner D. Comparing clusterings – An overview. Tech. rep. 2006-04, Faculty of Informatics, Universität Karlsruhe (TH); 2006, URL https://i11www.iti.kit.edu/extra/publications/ww-cco-06.pdf.
[14] Horta D, Campello R. Comparing hard and overlapping clusterings. J Mach Learn Res 2015;16(93):2949–97.
[15] Dua D, Graff C. UCI machine learning repository. 2022, http://archive.ics.uci.edu/ml.
[16] Ullmann T, Beer A, Hünemörder M, Seidl T, Boulesteix A-L. Over-optimistic evaluation and reporting of novel cluster algorithms: An illustrative study. Adv Data Anal Classif 2022. http://dx.doi.org/10.1007/s11634-022-00496-5.
[17] Ullmann T, Hennig C, Boulesteix A-L. Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdiscip Rev: Data Min Knowl Dis 2021;12(3):e1444. http://dx.doi.org/10.1002/widm.1444.
[18] Graves D, Pedrycz W. Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems 2010;161:522–43. http://dx.doi.org/10.1016/j.fss.2009.10.021.
[19] Ultsch A. Clustering with SOM: U*C. In: Workshop on self-organizing maps. 2005, p. 75–82.
[20] Thrun M, Ultsch A. Clustering benchmark datasets exploiting the fundamental clustering problems. Data Brief 2020;30:105501. http://dx.doi.org/10.1016/j.dib.2020.105501.
[21] Fränti P, Sieranoja S. K-means properties on six clustering benchmark datasets. Appl Intell 2018;48(12):4743–59. http://dx.doi.org/10.1007/s10489-018-1238-7.
[22] Dasgupta S, Ng V. Single data, multiple clusterings. In: Proc. NIPS workshop clustering: Science or art? Towards principled approaches. 2009.
[23] Jamil M, Yang X-S, Zepernick H-J. Test functions for global optimization: A comprehensive survey. In: Swarm intelligence and bio-inspired computation. 2013, p. 193–222. http://dx.doi.org/10.1016/B978-0-12-405163-8.00008-9.
[24] Weise T, et al. Benchmarking optimization algorithms: An open source framework for the traveling salesman problem. IEEE Comput Intell Mag 2014;9(3):40–52. http://dx.doi.org/10.1109/MCI.2014.2326101.
[25] Gagolewski M, et al. A benchmark suite for clustering algorithms: Version 1.1.0. 2022, http://dx.doi.org/10.5281/zenodo.7088171, URL https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0.
[26] Thrun M, Stier Q. Fundamental clustering algorithms suite. SoftwareX 2021;13:100642. http://dx.doi.org/10.1016/j.softx.2020.100642.
[27] Karypis G, Han E, Kumar V. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer 1999;32(8):68–75. http://dx.doi.org/10.1109/2.781637.
[28] Bezdek J, Keller J, Krishnapuram R, Kuncheva L, Pal N. Will the real iris data please stand up? IEEE Trans Fuzzy Syst 1999;7(3):368–9. http://dx.doi.org/10.1109/91.771092.
[29] McInnes L, Healy J, Astels S. hdbscan: Hierarchical density based clustering. J Open Source Softw 2017;2(11):205. http://dx.doi.org/10.21105/joss.00205.
[30] Fränti P, Virmajoki O. Iterative shrinking method for clustering problems. Pattern Recognit 2006;39(5):761–5.
[31] Sieranoja S, Fränti P. Fast and general density peaks clustering. Pattern Recognit Lett 2019;128:551–8. http://dx.doi.org/10.1016/j.patrec.2019.10.019.
[32] Jain A, Law M. Data clustering: A user's dilemma. Lecture Notes in Comput Sci 2005;3776:1–10.
[33] Gagolewski M, Bartoszuk M, Cena A. Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Inform Sci 2016;363:8–23. http://dx.doi.org/10.1016/j.ins.2016.05.003.
[34] Gagolewski M. genieclust: Fast and robust hierarchical clustering. SoftwareX 2021;15:100722. http://dx.doi.org/10.1016/j.softx.2021.100722.
[35] Müller A, Nowozin S, Lampert C. Information theoretic clustering using minimum spanning trees. In: Proc. German conference on pattern recognition. 2012.
[36] Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res 2011;12(85):2825–30.