FINAL PROJECT
K-Means Clustering Analysis and its Applications
– DONTHULA SRINISH (EE20BTECH11015)
– NAROPANTH SRIKAR RAO (EE20BTECH11035)
INTRODUCTION:
K-means clustering is a popular unsupervised machine learning algorithm used to group
data points into k clusters based on their similarity. In essence, it finds groups of
similar points in unlabeled data. K-means is commonly used in fields such as image
recognition, image compression, market segmentation, and anomaly detection.
However, it requires careful consideration of the number of clusters to use and can be
sensitive to initial centroid selection.
One important consideration in K-means clustering is choosing the optimal
number of clusters (k) for the data. There are several methods to determine k, such as
the Elbow method, Silhouette method, or gap statistic. One limitation of K-means
clustering is that it assumes spherical clusters with equal variance. If the data has
non-spherical clusters or varying cluster densities, other algorithms such as DBSCAN or
Gaussian mixture models may be more suitable.
In the following picture, we can clearly differentiate the data into three different
clusters, each identified by a different color. Each data point can be categorized
based on the cluster region it falls in. The large circles in the picture are the
cluster centers; they are relocated so that all the data points in the surrounding
region are closer to their own cluster center than to any other cluster center.
ALGORITHM:
The algorithm takes an unlabeled dataset as input, divides it into K clusters, and
repeats the process until the cluster assignments no longer change. The K cluster
centroids are initialized as K random points from the dataset.
The k-means clustering algorithm mainly performs two tasks:
● Determines the best value for K center points or centroids by an iterative process.
● Assigns each data point to its closest center; the data points assigned to the
same center form a cluster.
This algorithm is already implemented in the scikit-learn package, i.e., we can apply
K-means by importing sklearn.cluster.KMeans.
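As a quick illustration of this scikit-learn interface, the following sketch fits K-means to randomly generated toy data (the three Gaussian blobs are stand-ins, not the report's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index for each data point
centers = kmeans.cluster_centers_   # the 3 learned centroids
```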
ELBOW METHOD:
As discussed earlier, the most important thing is to choose the optimal number of
clusters (K) for a given dataset. There are several methods to choose the optimal value
for K, and the Elbow method is one of the most widely used.
The K-Means algorithm finds clusters by minimizing the sum of squared distances
between the data points and their cluster centroids. In the Elbow method, we run the
K-Means algorithm for different values of K and plot the cost (inertia) against K. The
plot forms an elbow-shaped curve, and the optimal K is chosen at the elbow, i.e., the
point where the slope changes most sharply.
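A minimal sketch of the Elbow procedure on synthetic data (the data and the range of K values are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Plot inertia vs K and look for the "elbow", e.g.:
# import matplotlib.pyplot as plt
# plt.plot(list(ks), inertias, marker='o')
# plt.xlabel('K'); plt.ylabel('cost (inertia)'); plt.show()
```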
ANALYSIS OF K-MEANS CLUSTERING:
Firstly, we considered a dataset of different people with their annual income and
spending score (1-100). This is unlabeled data. The people can be grouped as:
● Annual income is high and spending score is high
● Annual income is low and spending score is high
● Annual income is high and spending score is low
● Annual income is low and spending score is low
The rest of the people fall into a separate category where both annual income and
spending score are close to average.
To find the number of groups among the people according to their annual income and
spending score, we first used the elbow method. Implementing the K-Means algorithm
on this dataset using scikit-learn and plotting the cost (inertia) for different values of
K, we obtained:
From the above plot, we can see that the curve changes its slope abruptly at two
different points. Weighing the drop in cost against the number of clusters, we choose K = 5.
By considering K=5, we now implement the K-Means clustering algorithm and plot
different clusters.
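A sketch of this K = 5 fit and plot; random values stand in for the real (annual income, spending score) pairs, so the snippet is self-contained:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen; use plt.show() interactively
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for the real (annual income, spending score) pairs.
rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=(200, 2))

km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=15)          # one color per cluster
plt.scatter(*km.cluster_centers_.T, c='black', s=120)  # centroids in black
plt.xlabel('Annual income')
plt.ylabel('Spending score (1-100)')
plt.savefig('clusters.png')
```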
In this plot, different clusters are marked with different colors, and all five cluster
centroids are marked in black. The red cluster contains the people whose annual
income and spending score are both close to average.
SILHOUETTE METHOD:
The Silhouette method is a clustering evaluation technique used to measure the quality
of a clustering solution. It provides a way to assess how well each data point fits into its
assigned cluster by computing a silhouette score.
The silhouette score for a data point measures how similar it is to its own cluster
compared to other clusters. It is calculated as the difference between the average
distance of a data point to all other points in its own cluster (a) and the average distance
to all points in the nearest neighboring cluster (b), divided by the maximum of the two
distances:
silhouette score = (b - a) / max(a, b)
The silhouette score ranges from -1 to 1, where a score of 1 indicates a well-clustered
data point that is far from other clusters, and a score of -1 indicates a data point that has
been assigned to the wrong cluster. To apply the Silhouette method, we compute the
average silhouette score for all data points in a clustering solution. Higher average
silhouette scores indicate a better clustering solution with well-separated clusters and
appropriate data point assignments.
The Silhouette method is often used in conjunction with other clustering evaluation
techniques, such as the elbow method or the gap statistic, to select the optimal number
of clusters for a given dataset.
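A sketch of this selection procedure with scikit-learn on synthetic data (the five Gaussian blobs stand in for a real dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Five well-separated blobs, so K = 5 should score best.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3], [3, 0], [6, 1.5])])

scores = {}
for k in range(2, 9):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean score over all points

best_k = max(scores, key=scores.get)
```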
Now, let us implement this method on our dataset. The results are as follows:
Clearly, from the silhouette score vs. number of clusters plot, the dataset achieves its
maximum silhouette score at K = 5. The elbow method also gave 5 as the optimal
number of clusters, so the two methods agree for this dataset.
Now, let us apply this K-Means algorithm on Image Compression. Let us see the change
in colors after applying K-Means for different values of K.
IMAGE COMPRESSION:
We can use K-Means Algorithm to compress an Image. An Image consists of pixels where
each pixel has some color. Each pixel is made up of a red, green and blue subpixel that
lights up at different intensities to create different colors. The specific color that a
pixel describes is a blend of the three components of the color spectrum: RGB.
In a straightforward 24-bit color representation of an image, each pixel is represented as
three 8-bit unsigned integers (ranging from 0 to 255) that specify the red, green and blue
intensity values. This encoding is often referred to as the RGB encoding.
We have taken a bird image to compress. We scaled the red, green and blue intensity
values of the pixels to the range [0, 1] by dividing by 255.
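A sketch of this preprocessing step; the file name bird.png is a placeholder, and a random array stands in for the actual image so the snippet is self-contained:

```python
import numpy as np

# In the report the image is read from disk, e.g. with
#   img = plt.imread('bird.png')   # 'bird.png' is a placeholder name
# Here a random 950x950 array of 8-bit RGB values stands in for it.
rng = np.random.default_rng(4)
img = rng.integers(0, 256, size=(950, 950, 3), dtype=np.uint8)

# One row per pixel, one (r, g, b) triple per row, scaled to [0, 1].
X = img.reshape(-1, 3) / 255.0
```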
The total number of pixels in the above image is 950 × 950 = 902,500. Each pixel has
three intensity values: red, green and blue. The intensity values (scaled to [0, 1]) of
the pixels are plotted above.
Now we have data for 902,500 pixels, plotted in 3D. We can group the data points (pixel
intensities) into K clusters (K colors) such that pixels in the same cluster have similar
red, green and blue intensities. To compress the image, we then replace each pixel's
color with the average intensity of its group (the cluster centroid), so the whole image
needs only K distinct colors.
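This replacement step can be sketched with scikit-learn (a small random array stands in for the scaled bird image):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
img = rng.random((32, 32, 3))   # stand-in for the scaled image
X = img.reshape(-1, 3)          # one (r, g, b) row per pixel

K = 8
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(X)

# Each pixel keeps only its cluster index; the image is rebuilt from
# the K centroid colors, so at most K distinct colors remain.
compressed = km.cluster_centers_[km.labels_].reshape(img.shape)
```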
To do this, we have used the K-Means Algorithm which is basically grouping the data
points. Here we are grouping colors with different intensities. To implement this
algorithm, we could use the scikit-learn package, but we implemented it from scratch
using numpy and matplotlib for a better understanding of the algorithm.
We have divided the algorithm into some parts by implementing different functions and
then finally using all of them we can run the algorithm. The functions that we
implemented are:
KMeans_init_centroids(X, K)--> At the start of the K-Means algorithm, we
need to initialize the cluster centroids. This function randomly
initializes K cluster centroids by taking K data points from X. It returns
the K cluster centroids.
find_closest_centroids(X, centroids)--> This function finds the
closest centroid for each data point by computing its distance to
all centroids and taking the centroid with the minimum distance as
the closest one. It returns the index of the closest centroid for
each data point in X.
compute_centroids(X, index, K)--> After finding the closest cluster
centroid for each point in X, we need to update each cluster
centroid as the mean of the data points having the same index. It
returns the updated cluster centroids after each iteration of the
K-Means algorithm.
run_KMeans(X, initial_centroids, max_iters)--> The K-Means algorithm is
carried out in this function. In every iteration, we find the closest
cluster centroid for each data point and then update the cluster
centroids. After max_iters iterations the algorithm is done: the pixels
have been grouped according to the value of K (K colors). It returns the
compressed pixel values of the image.
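The repository linked below contains the actual code; a minimal numpy sketch of the four functions described above might look like this (the function names follow the descriptions, while details such as the empty-cluster guard and the returned tuple are our assumptions):

```python
import numpy as np

def KMeans_init_centroids(X, K, seed=0):
    """Pick K random data points from X as the initial centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx]

def find_closest_centroids(X, centroids):
    """Return, for each point in X, the index of its nearest centroid."""
    # Pairwise squared distances, shape (n_points, K).
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def compute_centroids(X, index, K):
    """Update each centroid to the mean of the points assigned to it."""
    centroids = np.zeros((K, X.shape[1]))
    for k in range(K):
        members = X[index == k]
        if len(members) > 0:       # guard against empty clusters
            centroids[k] = members.mean(axis=0)
    return centroids

def run_KMeans(X, initial_centroids, max_iters):
    """Alternate the assignment and update steps, then return the data
    with each point replaced by its centroid (the compressed values)."""
    centroids = initial_centroids
    K = centroids.shape[0]
    for _ in range(max_iters):
        index = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, index, K)
    index = find_closest_centroids(X, centroids)
    return centroids[index], centroids, index

# Example: compress pixel data X of shape (n_pixels, 3) to K colors:
#   centroids0 = KMeans_init_centroids(X, K=16)
#   compressed, centroids, index = run_KMeans(X, centroids0, max_iters=10)
```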
RESULTS:
After implementing the above functions, we can understand the K-Means algorithm in
detail. Running the algorithm on the above image for different values of K (colors): 2, 4,
8, 16 and 32, we obtained the compressed images above.
From the above images, we can see that the image is compressed for different values of
K, i.e., different numbers of colors. When K = 2, the image is clearly reduced to only two
colors. As K increases, the number of colors (clusters) increases, which is clearly visible
in the images.
CONCLUSION:
➢ To find the optimal number of clusters we can use both elbow method and
silhouette method.
➢ In summary, k-means clustering is a powerful and versatile algorithm that can be
used for a variety of applications. However, it is important to carefully consider
the choice of K and to be aware of the assumptions and limitations of the
algorithm.
➢ K-Means clustering is a powerful technique for compressing images, but the
quality of the resulting compressed image can be highly dependent on the
number of centroids used and the choice of distance metric. Careful
experimentation and tuning are required to obtain the best results.
GITHUB LINK FOR CODES:
➢ https://github.com/Srinish27/Data_Science_Analysis_Project
REFERENCES:
❖ https://en.wikipedia.org/wiki/Silhouette_(clustering)
❖ https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
❖ https://drive.google.com/file/d/1nkQMQzbMMlGXX4h6i9eS8MjDFrUzcVlJ/view?usp=sharing
❖ https://en.wikipedia.org/wiki/Elbow_method_(clustering)