
K Means Clustering

K-Means Clustering is an unsupervised learning algorithm that groups unlabeled datasets into a predetermined number of clusters based on similarities. The algorithm iteratively assigns data points to the nearest centroids and recalculates centroids until the clusters stabilize. Unsupervised learning also includes association methods, which identify relationships between variables in large datasets.

Uploaded by priskilla Selvin

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm used to solve
clustering problems in machine learning and data science.

Unsupervised learning
Unsupervised learning is a type of machine learning in which models are trained on an unlabeled
dataset and are left to find structure in that data without any supervision.
Suppose an unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never told which
images show cats and which show dogs, so it has no prior knowledge of the
features of the dataset. Its task is to identify the image features on its
own. It performs this task by clustering the image dataset into groups
according to the similarities between images.

Types of Unsupervised Learning Algorithm:


Unsupervised learning algorithms can be further categorized into two
types of problems:

o Clustering: Clustering is a method of grouping objects into clusters
such that the objects with the most similarities remain in one group
and have few or no similarities with the objects of another group.
Cluster analysis finds the commonalities between the data objects and
categorizes them according to the presence or absence of those
commonalities.
o Association: An association rule is an unsupervised learning method
used for finding relationships between variables in a large database.
It determines the sets of items that occur together in the dataset.
Association rules make marketing strategy more effective: for example,
people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical example of association rule learning is
Market Basket Analysis.
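
The bread-and-butter example can be made concrete with the two usual association rule measures, support and confidence. The following is a minimal sketch only: the transactions are hypothetical toy data, and the helper names support and confidence are chosen here for illustration.

```python
# Hypothetical toy transactions; the items are illustrative, not real sales data.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# How often bread is bought, and how often bread buyers also buy butter.
bread_support = support({"bread"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(bread_support, rule_confidence)
```

A rule such as "bread implies butter" is usually considered interesting only when both its support and its confidence exceed thresholds chosen by the analyst.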

What is K-Means Algorithm?


o K-Means Clustering is an unsupervised learning algorithm that groups
an unlabeled dataset into different clusters. Here K defines the
number of clusters to be created in the process: if K=2, there will
be two clusters, for K=3 there will be three clusters, and so on.
o It is an iterative algorithm that divides the unlabeled dataset into K different clusters
in such a way that each data point belongs to only one group, whose members have similar properties.
o It lets us cluster the data into different groups and is a convenient
way to discover the categories of groups in an unlabeled dataset on
its own, without the need for any labeled training data.
o It is a centroid-based algorithm, where each cluster is associated with
a centroid. The main aim of this algorithm is to minimize the sum of
squared distances between the data points and the centroids of their
clusters.
o The algorithm takes the unlabeled dataset as input, divides the dataset
into K clusters, and repeats the process until the cluster assignments
stop changing. The value of K must be chosen in advance.
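
Written out, the quantity that K-Means tries to minimize is usually taken to be the within-cluster sum of squared distances. The notation below is an assumption introduced here, not part of the original text: C_j is the set of points in cluster j, \mu_j is its centroid, and x_i is a data point.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each assignment step and each centroid update can only lower J or leave it unchanged, which is why the iteration eventually stops. The result is a local minimum of J, not necessarily the best possible clustering.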

The diagram below explains the working of the K-Means Clustering Algorithm:

How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be
points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the
K clusters.
Step-4: Compute the mean of the points in each cluster (the point that
minimizes the within-cluster variance) and place the new centroid there.
Step-5: Repeat the third step, that is, reassign each data point to the
new closest centroid.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
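
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name kmeans, the max_iter and seed parameters, the choice of initial centroids from the data points, and the sample data are all introduced here for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step-2: pick k distinct data points as initial centroids (one common choice).
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step-3 / Step-5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step-6: stop as soon as no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two visibly separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Which cluster receives index 0 depends on the random initial centroids, so only the grouping of the points is meaningful, not the label values themselves.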
Let's understand the above steps by considering the visual plots:
THEORETICAL EXPLANATION - K-Means Algorithm
Suppose we have two variables M1 and M2. The x-y scatter plot of these
two variables is given below:

o Let's take the number of clusters as K=2, which means we will try to
group the data points into two different clusters.
o We need to choose K random points or centroids to form the
clusters. These points can be either points from the dataset or any
other points. Here we select the two points below as the K points,
which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest
K-point or centroid. We compute this with the usual formula for the
distance between two points. With two centroids, this is equivalent to
drawing the perpendicular bisector of the line segment joining the
centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line
are nearer to the K1 or blue centroid, and the points to the right of the line
are closer to the yellow centroid. Let's color them blue and yellow for clear
visualization.
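
The nearest-centroid rule and the line drawn between the centroids give the same answer. The sketch below checks this on a few points. All coordinates are hypothetical, since the original plot values are not available, and the function names are chosen here for illustration.

```python
import math

# Hypothetical coordinates for the two starting centroids and a few data points.
k1 = (2.0, 6.0)  # "blue" centroid
k2 = (8.0, 3.0)  # "yellow" centroid
points = [(1.0, 5.0), (3.0, 7.0), (7.0, 2.0), (9.0, 4.0), (4.0, 4.0)]

def closest_by_distance(p):
    """Color of the nearer centroid, comparing Euclidean distances directly."""
    return "blue" if math.dist(p, k1) <= math.dist(p, k2) else "yellow"

def side_of_bisector(p):
    """Same decision taken from the perpendicular bisector of segment k1-k2.

    The sign of (p - midpoint) . (k2 - k1) says on which side of that line p lies.
    """
    mid = ((k1[0] + k2[0]) / 2, (k1[1] + k2[1]) / 2)
    dot = (p[0] - mid[0]) * (k2[0] - k1[0]) + (p[1] - mid[1]) * (k2[1] - k1[1])
    return "blue" if dot <= 0 else "yellow"

colors = [closest_by_distance(p) for p in points]
print(colors)
print(all(closest_by_distance(p) == side_of_bisector(p) for p in points))
```

With more than two centroids a single line no longer suffices, but the distance comparison still works unchanged.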

o As we need to find the closest cluster, we will repeat the process by
choosing new centroids. To choose the new centroids, we compute the
center of gravity (the mean) of the data points in each cluster, and
obtain the new centroids as below:

o Next, we will reassign each data point to the new centroids. For this, we
repeat the same process of finding the dividing line (the perpendicular
bisector between the two centroids). The line will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the
line, and two blue points are to the right of the line. So, these three points will be
assigned to the new centroids.
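
Detecting such reassignments is a matter of comparing the cluster labels before and after the pass. A small sketch, assuming hypothetical label lists in which one yellow point and two blue points switch sides as described above:

```python
def changed_indices(old_labels, new_labels):
    """Indices of the points whose cluster label differs between two passes."""
    return [i for i, (a, b) in enumerate(zip(old_labels, new_labels)) if a != b]

# Hypothetical labels: points 1 and 2 move blue -> yellow, point 3 moves yellow -> blue.
old = ["blue", "blue", "blue", "yellow", "yellow", "yellow"]
new = ["blue", "yellow", "yellow", "blue", "yellow", "yellow"]

moved = changed_indices(old, new)
needs_another_pass = bool(moved)  # any reassignment means the centroids must be recomputed
print(moved, needs_another_pass)
```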

As reassignment has taken place, we will again go to Step-4, which is
finding new centroids or K-points.
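
Finding a new centroid amounts to averaging the coordinates of the points currently assigned to a cluster. A minimal sketch with hypothetical point coordinates (the function name centroid is chosen here for illustration):

```python
def centroid(cluster):
    """Mean of a non-empty list of (x, y) points, taken coordinate by coordinate."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

# Hypothetical cluster memberships after an assignment pass.
blue = [(1.0, 5.0), (3.0, 7.0), (4.0, 4.0)]
yellow = [(7.0, 2.0), (9.0, 4.0)]

new_k1 = centroid(blue)    # new position of the blue centroid
new_k2 = centroid(yellow)  # new position of the yellow centroid
print(new_k1, new_k2)
```

The new centroid generally does not coincide with any data point, which matches the walkthrough's use of centroids that are not part of the dataset.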

o We will repeat the process by finding the center of gravity of the data
points in each cluster, so the new centroids will be as shown in the below image:
o As we have new centroids, we will again draw the dividing line and
reassign the data points. So, the image will be:

o We can see in the above image that there are no dissimilar data points on
either side of the line, which means our model is formed. Consider the
below image:

As our model is ready, we can now remove the assumed centroids, and
the two final clusters will be as shown in the below image:
