2021 International Conference on Electronics, Communications and Information Technology (ICECIT), 14-16 September 2021, Khulna, Bangladesh.
K-Cosine-Means Clustering Algorithm
Md. Kafi Khan¹, Sakil Sarker², Syed Mahmud Ahmed³, Mozammel H A Khan⁴
Department of Computer Science and Engineering
East West University
Aftabnagar, Dhaka-1212, Bangladesh
¹[email protected], ²[email protected], ³mahmudewu17@gmail.com, ⁴[email protected]
Abstract-The K-means algorithm is one of the most widely used unsupervised clustering techniques in data mining. This paper presents an extension of the K-means algorithm named the K-cosine-means algorithm. While the K-means algorithm initializes the centroids randomly and uses the Euclidean distance measure to assign data points to clusters, our proposed algorithm inherits a systematic approach from K-means++ to initialize the centroids and utilizes cosine similarity to assign data points to clusters. We have performed experiments on both homogeneous datasets (the Iris and Seeds datasets) and a heterogeneous dataset (the Hepatitis dataset). From the experimental results, we have observed better clustering accuracy on the homogeneous datasets compared to other variants of the K-means algorithm, namely, K-means, iK-means, K-means++, WK-means, MWK-means, iWK-means, and iMWK-means. However, for the heterogeneous dataset, we have observed better clustering accuracy compared to the standard K-means, K-means++, and iK-means algorithms.

Index Terms-Centroid initialization, Cosine similarity, K-Cosine-Means clustering, K-Means clustering, Unsupervised learning

I. INTRODUCTION

Due to the astounding proliferation of data caused by modern technologies such as the internet, human limitations make it improbable to manually analyze such vast amounts of information in any meaningful way. Consequently, data mining has gained major traction in recent decades as a means of detecting patterns in such large collections of data [1]. Clustering is an unsupervised data mining method that classifies raw data reasonably and searches for the underlying patterns that exist in datasets [2]. In other words, clustering is the division of datasets into groups based on the mutual similarity among instances; in doing so, it simplifies data and increases comprehensibility. Clustering has been effectively applied to many diverse problems such as anomaly detection [3], data summarization [4], and malware detection [5]. While some information may be lost in the process [6], clustering still remains an important method for gaining knowledge by determining patterns within large amounts of data.

In the data mining literature, clustering algorithms are usually divided into two groups, namely, hierarchical and partitional. Hierarchical algorithms generate dendrograms by continuously merging or dividing clusters, whereas partitional clustering requires prior knowledge about the number of clusters within a dataset before assigning data instances to their corresponding clusters through iterations. Among the partitional algorithms, K-means [7], [8] is the simplest clustering algorithm in terms of the number of criteria it takes into account while clustering. As a result, it is bound to have some shortcomings; the most pre-eminent ones are (i) the user has to make preconceived assumptions about the number of clusters, (ii) it does not always yield optimum results, (iii) the quality of clustering is predetermined by the initial guess of the centroids, (iv) the presence of outliers tends to drag the centroids towards a suboptimal position, and (v) it assumes all the features carry the same weight [9]. Consequently, there has been considerable research centered around addressing some of these shortcomings of the K-means algorithm.

In the works of [10]-[12], the authors focused solely on solving the fifth issue of the K-means algorithm and, in turn, proposed the Weighted K-means (WK-means) algorithm. This algorithm assigns weights to each of the features based on the significance of their impact while clustering. The work of [13] is a further extension of the WK-means algorithm proposed by Huang et al. [10], where cluster-independent weights are used instead of cluster-specific weights. The authors have done extensive research on the subject and coined algorithms such as MWK-means [13] and CMWK-means [9].

The first drawback of the K-means algorithm is tackled by the Intelligent K-means (iK-means) algorithm introduced in [3]. The algorithm acts upon normalized data and selects the data instances farthest away from the center as initial seeds. Furthermore, the clusters are formed by taking into account the distances between entities as well as their distances from the center.

There is still scope for improving K-means and its variants to increase the clustering accuracy. In this paper, we present an improvement of the standard K-means algorithm named the K-cosine-means algorithm. The difference between the two algorithms is that ours uses cosine similarity instead of Euclidean distance to assign data points to clusters. To evaluate the performance of our algorithm, we have applied it to multiple datasets, namely, the Iris, Seeds, and Hepatitis datasets. In doing so, we have witnessed an increase in performance in terms of accuracy compared to K-means and other similar algorithms.

The rest of the paper is organized as follows: In Section II, we discuss our proposed methodology. We present our experimental results with analysis in comparison to previously published results in Section III. Finally, we conclude the paper in Section IV with directions for future work.

978-1-6654-2363-2/21/$31.00 ©2021 IEEE
Algorithm 1 Method for centroid initialization
1: Choose one centroid randomly from the data instances.
2: Calculate the Euclidean distances of all data points that were not chosen as centroids from all of the chosen centroids, and select the minimum distance for each point.
3: Using a weighted probability distribution, choose the next centroid from these data points with a probability proportional to the minimum distance calculated in step 2.
4: Repeat steps 2 and 3 until the required number of centroids are initialized.
II. PROPOSED METHODOLOGY
Fig. 1: Example Dataset (scatter plot of the six data points A-F of Table I on the X and Y axes).
Our proposed algorithm can be divided into two segments, specifically, (A) initializing centroids and (B) clustering.

A. Initializing Centroids

One of the major drawbacks of the K-means algorithm resides in the selection of the initial centroids. We have found that the quality of the initial centroids can deteriorate the accuracy of the clusters up to 47.33% for the Iris dataset. Theoretically, if we select the initial centroids algorithmically rather than randomly, the quality of the clusters has the potential to improve. The K-means++ algorithm proposed by Arthur et al. [14] presents a novel approach for selecting the initial centroids. By maximizing the probability that the initial centroids come from different clusters, the algorithm increases the probability of generating final clusters with comparatively better fitness. The process of selecting the initial centroids is shown in Algorithm 1.

TABLE I: Example Dataset
Data Point   X   Y
A            1   2
B            2   1
C            3   4
D            4   5
E            6   6
F            7   5

TABLE II: Distance of all the data points from {B}
Data Point   B       Minimum
A            1.414   1.414
B            0       0
C            3.162   3.162
D            4.472   4.472
E            6.403   6.403
F            6.403   6.403
TABLE III: Distance of all the data points from {B, E}
Data Point   B       E       Minimum
A            1.414   6.403   1.414
B            0       6.403   0
C            3.162   3.605   3.162
D            4.472   2.236   2.236
E            6.403   0       0
F            6.403   1.414   1.414

Let us consider Fig. 1 and the corresponding Table I as an example of a two-dimensional dataset having six data points and three clusters. So, we need to initialize three centroids, one for each cluster. First, we randomly choose data point B as a centroid. In Table II, we calculate the distances of all the data points from B. In doing so, we get E and F with the highest distance from B. As a result, we can randomly choose either E or F as the second centroid; here, we have chosen E. Finally, in Table III, we calculate the distances of all the data points from the previously selected centroids {B, E} and report the minimum distance for each point. According to the minimum distances, C yields the highest value, so C becomes the third centroid. Thus, after applying this procedure, the initial centroids are {B, C, E}, which clearly belong to the three separate clusters in Fig. 1. Before generating clusters, we have incorporated this method for initializing the centroids.
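The following Python snippet is a minimal sketch of this initialization step under our reading of Algorithm 1; it is illustrative only (the paper does not publish code), and the function name initialize_centroids is our own. Note that the next centroid is sampled with probability proportional to the minimum distance, as Algorithm 1 states, whereas the walk-through above simply picks the farthest point outright.

```python
import numpy as np

def initialize_centroids(points, k, rng=None):
    """Sketch of Algorithm 1: pick k initial centroids from `points` (an (n, d) array)."""
    rng = np.random.default_rng(rng)
    n = len(points)
    chosen = [rng.integers(n)]                      # step 1: first centroid at random
    while len(chosen) < k:
        # step 2: minimum Euclidean distance from every point to the chosen centroids
        diffs = points[:, None, :] - points[chosen][None, :, :]
        min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
        # step 3: sample the next centroid with probability proportional to that distance
        chosen.append(rng.choice(n, p=min_dist / min_dist.sum()))
    return points[chosen]                           # step 4 is handled by the loop

# Example dataset from Table I (points A-F).
data = np.array([[1, 2], [2, 1], [3, 4], [4, 5], [6, 6], [7, 5]], dtype=float)
print(initialize_centroids(data, k=3, rng=1))
```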
B. Clustering

Our K-cosine-means algorithm inherits a few characteristics from the K-means algorithm. However, instead of using the Euclidean distance, we have implemented a cosine-similarity-based measure. Cosine similarity measures the cosine of the angle between two non-zero vectors to determine the similarity between them. Therefore, it can also be defined as the inner product of the two vectors normalized to have a length of 1. Given two non-zero vectors a and b such that $\|a\| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$ and $\|b\| = \sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}$, the cosine similarity between these two vectors is formulated as

$$\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$$
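As a quick numerical check of this formula (our own illustration, not code from the paper), the cosine similarity of two small vectors, here the points A and B from Table I, can be computed directly with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||), defined for non-zero vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])  # point A from Table I
b = np.array([2.0, 1.0])  # point B from Table I
print(cosine_similarity(a, b))  # 0.8
```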
Algorithm 2 Proposed clustering algorithm
1: Define the number of clusters, denoted by K.
2: Using the method for choosing initial centroids shown in Algorithm 1, designate K data points as the initial centroids C1, C2, ..., CK.
3: For each data point x, calculate the cosine similarity between x and all centroids C1, C2, ..., CK.
4: From step 3, assign each data point x to the cluster with centroid Ci such that x has the maximum similarity with Ci amongst all the centroids C1, C2, ..., CK.
5: Average the coordinates of all data points in each cluster and select the averaged coordinates as the new centroid.
6: Repeat steps 3 to 5 until the centroids can no longer be updated.

A higher value of cos θ is desirable since it indicates closeness of the two data points. Our proposed algorithm is shown in Algorithm 2.
Our proposed algorithm is shown in Algorithm 2.
However, we can theoretically predict that the algorithm
may perform poorly in terms of clustering datasets with hetero- different variations of K-means algorithms from [15] and [13].
geneous data. Heterogeneous datasets have different attribute To evaluate the performance of our clustering model, we have
types, e.g. integer, real, and categorical. When categorical data conducted our experiment on real datasets, namely, Iris, Seeds,
are arbitrarily assigned numerical values, it causes a loss of and Hepatitis datasets downloaded from the UCI Machine
information as the randomly assigned numerical values have Learning Repository [16].
no correlation with the original essence of the data. This
problem becomes more prominent when a dataset contains A. Iris Dataset
both real and categorical features. Even though categorical The Iris dataset [17] holds the measurements of petals
data poses a problem while representing in vector space, it and sepals of three different Iris species, namely, Setosa,
is not that pronounced when a dataset contains only categor- Versicolor, and Virginica. This dataset contains 3 classes (K
ical data as all the data were converted following the same = 3), 150 data instances, and 4 real-valued features. Upon
convention. However, when both of the data types are present applying our K-cosine-means algorithm, we have achieved
in the same dataset, as the converted values do not hold the an accuracy of 97.33%. The confusion matrix is displayed
essence of the categorical data, the cosine similarity becomes in Table IV, and comparison between variants of K-means
less meaningful. As a result, data points are not accurately is tabulated in Table V. Compared to these algorithms, our
assigned to their proper clusters. algorithm has produced the maximum accuracy when applied
to the Iris datasets.
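As a toy illustration of this effect (our own example, not taken from the paper), the same pair of records yields different cosine similarities under two equally arbitrary integer encodings of a categorical feature:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two hypothetical records: one real feature (age) and one categorical feature (blood type).
# Encoding 1 maps {A: 1, O: 4}; encoding 2 maps {A: 4, O: 3}. Both choices are arbitrary.
rec1_enc1, rec2_enc1 = np.array([30.0, 1.0]), np.array([32.0, 4.0])
rec1_enc2, rec2_enc2 = np.array([30.0, 4.0]), np.array([32.0, 3.0])

print(cos_sim(rec1_enc1, rec2_enc1))  # ~0.996 under the first encoding
print(cos_sim(rec1_enc2, rec2_enc2))  # ~0.999 under the second encoding
```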
C. Experimental Settings

Our proposed K-cosine-means algorithm was run on a Windows PC with an Intel 8th-generation Core i3 CPU (2 cores with a base frequency of 2.20 GHz), 8 GB of RAM, and a 128 GB NVMe SSD. We have coded the K-cosine-means algorithm in Python with NumPy and Pandas. The seed for the NumPy random number generator was set to 1.
III. RESULT ANALYSIS

In this paper, the performance of the different models is evaluated with the accuracy measure and the confusion matrix. The accuracy measure is defined as

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions Made}} \times 100\% \quad (1)$$

The confusion matrix demonstrates the performance of a model by tabulating its successes and failures in classifying all the classes. For result analysis, we have collected the accuracies of different variants of the K-means algorithm from [15] and [13]. To evaluate the performance of our clustering model, we have conducted our experiments on real datasets, namely, the Iris, Seeds, and Hepatitis datasets downloaded from the UCI Machine Learning Repository [16].

A. Iris Dataset

The Iris dataset [17] holds the measurements of petals and sepals of three different Iris species, namely, Setosa, Versicolor, and Virginica. This dataset contains 3 classes (K = 3), 150 data instances, and 4 real-valued features. Upon applying our K-cosine-means algorithm, we have achieved an accuracy of 97.33%. The confusion matrix is displayed in Table IV, and the comparison with variants of K-means is tabulated in Table V. Compared to these algorithms, our algorithm has produced the maximum accuracy when applied to the Iris dataset.

TABLE IV: Confusion Matrix of Iris Dataset
Actual/Predicted   Setosa   Versicolor   Virginica
Setosa             50       0            0
Versicolor         0        46           4
Virginica          0        0            50

TABLE V: Comparative Accuracy on Iris Dataset
Algorithm          Accuracy
K-Cosine-means     97.33%
WK-means [9]       96.0%
MWK-means [9]      96.7%
iWK-means [13]     96.7%
iMWK-means [13]    96.7%
K-means [9]        89.3%
iK-means [13]      88.7%

B. Hepatitis Dataset

The Hepatitis dataset [18] contains 155 data instances, 18 features, and 2 classes (K = 2). Out of the 18 features of the dataset, 12 are categorical, that is, their values are binary encoded. The confusion matrix and the performance comparison with other algorithms are presented in Tables VI and VII, respectively. While our algorithm did perform better than the standard K-means, K-means++, and iK-means algorithms, it was not able to outperform some of the other K-means variants. This is in line with our earlier prediction that, when applied to datasets that contain a mixture of features such as categorical values and real values, our K-cosine-means approach will display decreased performance.

TABLE VI: Confusion Matrix of Hepatitis Dataset
Actual/Predicted   True   False
(numeric cell values are not legible in the source)

TABLE VII: Comparative Accuracy on Hepatitis Dataset
Algorithm          Accuracy
iMWK-means [13]    84.52%
MWK-means [9]      80.0%
WK-means [9]       80.0%
iWK-means [13]     78.71%
K-Cosine-means     77.50%
iK-means [13]      72.26%
K-means [9]        72.26%

C. Seeds Dataset

The Seeds dataset [19] consists of measurements of three different wheat kernels, namely, Kama, Rosa, and Canadian. This dataset contains 210 data instances, 7 real-valued features, and 3 classes (K = 3).
TABLE VIII: Confusion Matrix of Seeds Dataset
Actual/Predicted   Kama   Rosa   Canadian
Kama               61     4      5
Rosa               2      67     1
Canadian           8      0      62
TABLE IX: Comparative Accuracy on Seeds Dataset
Algorithm          Accuracy
K-Cosine-means     90.47%
K-means [15]       89.20%
K-means++ [15]     89.0%
iK-means [15]      87.10%

We found only a limited number of works in the literature that report their algorithm's accuracy on the Seeds dataset. The confusion matrix and the performance comparison are displayed in Tables VIII and IX, respectively. Our K-cosine-means algorithm has produced better accuracy than the K-means, K-means++, and iK-means algorithms.

Datasets that have a mixture of both real and categorical features are called heterogeneous datasets, and they are susceptible to information loss when their categorical values are converted to numerical values. As we previously predicted, one of the major drawbacks of our algorithm is the absence of a method that can handle such a mixture of data types properly. In our experiment, the Hepatitis dataset was heterogeneous, while the Iris and Seeds datasets were homogeneous. As is evident from our results, our algorithm only falls short when applied to the Hepatitis dataset due to the presence of heterogeneous data. On the other hand, our algorithm performs considerably well when applied to homogeneous datasets.
IV. CONCLUSION

From the empirical analysis gained from our experiments, we can conclude that using the cosine similarity measure to assign data points to clusters does improve the accuracy for some datasets when compared to the standard K-means and other K-means-like algorithms. More specifically, our cosine-similarity-based approach has consistently outperformed these algorithms when applied to homogeneous datasets, namely the Iris and Seeds datasets. These datasets exclusively contain features with the same type of values.

Our K-cosine-means algorithm falls short when the dataset is heterogeneous, i.e., when there are different types of values for different features. This became evident when our algorithm was applied to the heterogeneous Hepatitis dataset; while our cosine-similarity-based approach outperformed the standard K-means and the Intelligent K-means (iK-means) algorithms, it showed poorer performance relative to the other algorithms, namely WK-means, MWK-means, iWK-means, and iMWK-means. Therefore, one potential direction for future work is to increase the accuracy of our algorithm for heterogeneous datasets. Feature selection might improve the accuracy of our algorithm in such cases by pruning certain categorical features. From our preliminary observations, pruning two specific real attributes from the Hepatitis dataset improved the clustering accuracy from 76.25% to 77.5%.

Outlier detection and handling may also be incorporated to improve the clustering accuracy. Furthermore, using an approach similar to K-medians to update the cluster centers might help with arbitrarily shaped clusters.

REFERENCES
[1] O. A. Abbas, "Comparisons between data clustering algorithms," International Arab Journal of Information Technology (IAJIT), vol. 5, no. 3, 2008.
[2] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304, 1998.
[3] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. New York, 2005.
[4] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.
[5] R. Cordeiro de Amorim and P. Komisarczuk, "On partitional clustering of malware," 2012.
[6] J. C. Zak and M. R. Willig, "Fungal biodiversity patterns," Biodiversity of Fungi: Inventory and Monitoring Methods, pp. 59-75, 2004.
[7] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281-297.
[8] G. H. Ball and D. J. Hall, "A clustering technique for summarizing multivariate data," Behavioral Science, vol. 12, no. 2, pp. 153-155, 1967.
[9] R. C. de Amorim, "Constrained clustering with Minkowski weighted k-means," in 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI). IEEE, 2012, pp. 13-17.
[10] J. Z. Huang, J. Xu, M. Ng, and Y. Ye, "Weighting method for feature selection in k-means," in Computational Methods of Feature Selection. Chapman and Hall/CRC, 2007, pp. 209-226.
[11] E. Y. Chan, W. K. Ching, M. K. Ng, and J. Z. Huang, "An optimization algorithm for clustering using weighted dissimilarity measures," Pattern Recognition, vol. 37, no. 5, pp. 943-952, 2004.
[12] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, 2005.
[13] R. C. de Amorim and B. Mirkin, "Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering," Pattern Recognition, vol. 45, no. 3, pp. 1061-1075, 2012.
[14] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007, pp. 1027-1035.
[15] M. A. Masud, M. M. Rahman, S. Bhadra, and S. Saha, "Improved k-means algorithm using density estimation," in 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, 2019, pp. 1-6.
[16] "UC Irvine Machine Learning Repository," https://archive.ics.uci.edu/ml/index.php.
[17] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[18] P. Diaconis and B. Efron, "Computer-intensive methods in statistics," Scientific American, vol. 248, no. 5, pp. 116-131, 1983.
[19] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik, and S. Zak, "Complete gradient clustering algorithm for features analysis of x-ray images," in Information Technologies in Biomedicine. Springer, 2010, pp. 15-24.