

Clustering | Different Methods, and Applications (Updated 2024)
Sauravkaushik8 Kaushik
13 Jun, 2024 • 9 min read

Introduction

When you first encounter an unsupervised learning problem, it can be
confusing: you aren't seeking specific insights, but rather looking for
structure in the data. The process of identifying similar groups within
a dataset is known as clustering or cluster analysis.

Clustering is one of the most popular techniques in data science,
widely used by data scientists. Entities in each group are
comparatively more similar to each other than to entities of the other
groups.
In this article, I will be taking you through the types of clustering,
different clustering algorithms, and a comparison between two of the
most commonly used methods of clustering in machine learning.
Note: To learn more about clustering and other machine learning
algorithms (both supervised and unsupervised), check out the
following courses:
Applied Machine Learning Course
Certified AI & ML Blackbelt+ Program
Learning Objectives
Learn about clustering in machine learning, one of the most popular
unsupervised classification techniques.
Get to know K means and hierarchical clustering and the difference
between the two.


What Is Clustering in Machine Learning?
Clustering is the task of dividing unlabeled data points into groups
such that data points in the same cluster are more similar to each
other than to those in other clusters. In simple words, the aim of the
clustering process is to segregate groups with similar traits and
assign them into clusters.
Let’s understand this with an example. Suppose you are the head of a
rental store and wish to understand the preferences of your
customers to scale up your business. Is it possible for you to look at
the details of each customer and devise a unique business strategy
for each one of them? Definitely not. But, what you can do is cluster
all of your customers into, say 10 groups based on their purchasing
habits and use a separate strategy for customers in each of these 10
groups. This is what we call clustering.
Now that we understand what clustering is, let's take a look at its
different types.

Types of Clustering in Machine Learning
Clustering broadly divides into two subgroups:
Hard Clustering: Each input data point either fully belongs to a
cluster or not. For instance, in the example above, every customer
is assigned to one group out of the ten.
Soft Clustering: Rather than assigning each input data point to a
distinct cluster, it assigns a probability or likelihood of the data
point being in those clusters. For example, in the given scenario,
each customer receives a probability of being in any of the ten
retail store clusters.
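To make the distinction concrete, here is a minimal sketch in R (the language used for the code later in this article). The hard assignment uses base R's kmeans(); the soft assignment uses the mclust package, which is an assumption here, as the article itself doesn't name a soft clustering library, and the toy data are made up.

#toy data: two loose groups of points in 2-D space
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

#hard clustering: kmeans() gives each point exactly one label
hard <- kmeans(x, centers = 2)
head(hard$cluster)

#soft clustering: a Gaussian mixture (fit via EM) gives membership probabilities
library(mclust)
soft <- Mclust(x, G = 2)
head(round(soft$z, 3))   #one row per point, one probability per cluster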

Different Types of Clustering Algorithms
Since the task of clustering is subjective, the means that can be used
to achieve this goal are plenty. Every methodology follows a different
set of rules for defining the 'similarity' among data points. In fact,
there are more than 100 clustering algorithms known, but only a few
are popularly used. Let's look at them in detail:
Connectivity Models
As the name suggests, these models are based on the notion that the
data points closer in data space exhibit more similarity to each other
than the data points lying farther away. These models can follow two
approaches. In the first approach, they start by classifying all data
points into separate clusters & then aggregating them as the distance
decreases. In the second approach, all data points are classified as a
single cluster and then partitioned as the distance increases. Also,
the choice of distance function is subjective. These models are very
easy to interpret but lack scalability for handling big datasets.
Examples of these models are the hierarchical clustering algorithms
and their variants.
Centroid Models
These clustering algorithms iterate, deriving similarity from the
proximity of a data point to the centroid or cluster center. The k-
Means clustering algorithm, a popular example, falls into this
category. These models necessitate specifying the number of
clusters beforehand, requiring prior knowledge of the dataset. They
iteratively run to discover local optima.
Distribution Models
These clustering models are based on the notion of how probable it is
that all data points in the cluster belong to the same distribution
(for example, the normal/Gaussian distribution). These models often
suffer from overfitting. A popular example of these models is the
Expectation-maximization algorithm, which uses multivariate normal
distributions.
Density Models
These models search the data space for areas of the varied density
of data points in the data space. They isolate different dense regions
and assign the data points within these regions to the same cluster.
Popular examples of density models are DBSCAN and OPTICS. These
models are particularly useful for identifying clusters of arbitrary
shape and detecting outliers, as they can detect and separate points
that are located in sparse regions of the data space, as well as points
that belong to dense regions.
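As a quick illustration of density models, here is a minimal sketch in R, assuming the dbscan package is installed (the article itself gives no code for this model type, and the toy data are made up). Points that fall in sparse regions receive the label 0, i.e., noise.

#two dense blobs plus some scattered outliers
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(runif(20, min = -1, max = 4), ncol = 2))

library(dbscan)
db <- dbscan(x, eps = 0.5, minPts = 5)
table(db$cluster)   #cluster 0 collects the noise points (outliers)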
Now I will be taking you through two of the most popular clustering
algorithms in detail – K Means and Hierarchical. Let’s begin.

K Means Clustering
K means is an iterative clustering algorithm that converges to a
locally optimal assignment of points to clusters. This algorithm works
in these 5 steps:
Step 1:
Specify the desired number of clusters K: Let us choose k=2 for
these 5 data points in 2-D space.

Step 2:
Randomly assign each data point to a cluster: Let’s assign three
points in cluster 1, shown using red color, and two points in cluster 2,
shown using grey color.

Step 3:
Compute cluster centroids: The centroid of data points in the red
cluster is shown using the red cross, and those in the grey cluster
using a grey cross.
Step 4:
Re-assign each point to the closest cluster centroid: Note that only
the data point at the bottom is assigned to the red cluster, even
though it’s closer to the centroid of the grey cluster. Thus, we assign
that data point to the grey cluster.

Step 5:
Re-compute cluster centroids: Now, re-computing the centroids for
both clusters.

Repeat steps 4 and 5 until no improvements are possible: Similarly,
we'll repeat the 4th and 5th steps until the algorithm converges, i.e.,
when there is no further switching of data points between the two
clusters for two successive repeats. This marks the termination of the
algorithm if a stopping criterion is not explicitly specified.
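Here is a minimal sketch of the whole procedure in R using the built-in kmeans() function; the five 2-D points and k = 2 mirror the toy example above, but the coordinates themselves are made up for illustration.

#five toy points in 2-D space
points <- matrix(c(1.0, 1.0,
                   1.5, 2.0,
                   3.0, 4.0,
                   5.0, 7.0,
                   3.5, 5.0), ncol = 2, byrow = TRUE)

#run K Means with k = 2; steps 2-5 above happen inside this call
fit <- kmeans(points, centers = 2)

fit$cluster       #final hard assignment of each point
fit$centers       #final cluster centroids
fit$tot.withinss  #total within-cluster sum of squares being minimized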
Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that
builds a hierarchy of clusters. This algorithm starts with all the
data points assigned to a cluster of their own. Then the two nearest
clusters are merged into the same cluster. In the end, this algorithm
terminates when there is only a single cluster left.
The results of hierarchical clustering can be shown using a
dendrogram. The dendrogram can be interpreted as:

At the bottom, we start with 25 data points, each assigned to a
separate cluster. The two closest clusters are then merged till we
have just one cluster at the top. The height in the dendrogram at
which two clusters are merged represents the distance between the two
clusters in the data space.
The no. of clusters that can best depict the different groups can be
chosen by observing the dendrogram. The best choice of the no. of
clusters is the no. of vertical lines in the dendrogram cut by a
horizontal line that can traverse the maximum distance vertically
without intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4, as
the red horizontal line in the dendrogram below covers the maximum
vertical distance AB.
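To reproduce a dendrogram like the one described, here is a minimal sketch in R using the built-in dist() and hclust() functions; the 25 random points below merely stand in for the data behind the original figure.

#25 random points in 2-D space, as in the description above
set.seed(101)
pts <- matrix(rnorm(50), ncol = 2)

d  <- dist(pts)                       #pairwise Euclidean distances
hc <- hclust(d, method = "complete")  #bottom-up (agglomerative) merging

plot(hc)           #draws the dendrogram
cutree(hc, k = 4)  #cut the tree into 4 clusters, as chosen above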

Important Points for Hierarchical Clustering


This algorithm has been implemented above using a bottom-up
approach. It is also possible to follow a top-down approach
starting with all data points assigned in the same cluster and
recursively performing splits till each data point is assigned a
separate cluster.
The decision to merge two clusters is taken on the basis of the
closeness of these clusters. There are multiple metrics for
deciding the closeness of two clusters:
Euclidean distance: ||a−b||₂ = √(Σᵢ(aᵢ−bᵢ)²)
Squared Euclidean distance: ||a−b||₂² = Σᵢ(aᵢ−bᵢ)²
Manhattan distance: ||a−b||₁ = Σᵢ|aᵢ−bᵢ|
Maximum distance: ||a−b||∞ = maxᵢ|aᵢ−bᵢ|
Mahalanobis distance: √((a−b)ᵀ S⁻¹ (a−b)), where S is the covariance matrix
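The first four metrics are available directly through base R's dist(), and Mahalanobis distance through the built-in mahalanobis() function (which returns the squared distance); a minimal sketch with made-up vectors:

a <- c(1, 2, 3)
b <- c(4, 6, 8)
m <- rbind(a, b)

dist(m, method = "euclidean")   #sqrt(sum((a - b)^2))
sum((a - b)^2)                  #squared Euclidean distance
dist(m, method = "manhattan")   #sum of |a - b|
dist(m, method = "maximum")     #max of |a - b|
mahalanobis(a, center = b, cov = diag(3))   #squared Mahalanobis distance,
                                            #here with an identity covariance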

Difference Between K Means and Hierarchical Clustering
Hierarchical clustering methods can't handle big data well, but K
Means can. This is because the time complexity of K Means is linear,
i.e., O(n), while that of hierarchical clustering is quadratic, i.e.,
O(n²) (see the timing sketch after this list).
Since K Means starts with a random choice of cluster centers, the
results produced by running the algorithm multiple times might
differ. In hierarchical clustering, by contrast, the results are
reproducible.
K Means is found to work well when the shape of the clusters is
hyperspherical (like a circle in 2D or a sphere in 3D).
K Means clustering requires prior knowledge of K, i.e., no. of
clusters you want to divide your data into. But, you can stop at
whatever number of clusters you find appropriate in hierarchical
clustering by interpreting the dendrogram.
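A rough way to feel the scalability difference is to time both algorithms on the same data; the sketch below uses made-up data, and the exact timings will vary by machine.

set.seed(1)
big <- matrix(rnorm(10000), ncol = 2)  #5,000 points in 2-D space

system.time(kmeans(big, centers = 5))  #fast: cost grows roughly linearly with n
system.time(hclust(dist(big)))         #much slower: dist() alone computes
                                       #all n*(n-1)/2 pairwise distances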

Applications of Clustering
Clustering has a large no. of applications spread across
various domains. Some of the most popular applications of clustering
are recommendation engines, market segmentation, social network
analysis, search result grouping, medical imaging, image
segmentation, and anomaly detection.

Improving Supervised Learning Algorithms With Clustering
Clustering is an unsupervised machine learning approach, but can it
also be used to improve the accuracy of supervised machine learning
algorithms, by grouping the data points into similar clusters and
using these cluster labels as independent variables in the supervised
model? Let's find out.
Let’s check out the impact of clustering on the accuracy of our model
for the classification problem using 3000 observations with 100
predictors of stock data to predict whether the stock will go up or
down using R. This dataset contains 100 independent variables from
X1 to X100 representing the profile of a stock and one outcome
variable Y with two levels: 1 for the rise in stock price and -1 for drop
in stock price.
The dataset is available here: Download
Let's first try applying random forest without clustering, in R.
#loading required libraries

library('randomForest')
library('Metrics')

#set random seed
set.seed(101)

#loading dataset

data<-read.csv("train.csv",stringsAsFactors= T)

#checking dimensions of data
dim(data)

## [1] 3000 101

#specifying outcome variable as factor

data$Y<-as.factor(data$Y)

#dividing the dataset into train and test
train<-data[1:2000,]
test<-data[2001:3000,]

#applying randomForest
model_rf<-randomForest(Y~.,data=train)

preds<-predict(object=model_rf,test[,-101])

table(preds)

## preds
## -1 1
## 453 547

#checking AUC

auc(preds,test$Y)

## [1] 0.4522703

So, the AUC we get is 0.45. Now let's create five clusters based on
the values of the independent variables using k-means and reapply
random forest.
#combining test and train

all<-rbind(train,test)

#creating 5 clusters using K- means clustering

Cluster <- kmeans(all[,-101], 5)

#adding clusters as independent variable to the dataset
all$cluster<-as.factor(Cluster$cluster)

#dividing the dataset into train and test
train<-all[1:2000,]
test<-all[2001:3000,]

#applying randomForest
model_rf<-randomForest(Y~.,data=train)

preds2<-predict(object=model_rf,test[,-101])

table(preds2)

## preds2
## -1 1
## 548 452

auc(preds2,test$Y)

## [1] 0.5345908

Whoo! In the above example, even though the final score is still poor,
clustering has given our model a significant boost, from an AUC of
0.45 to slightly above 0.53.
This shows that clustering can indeed be helpful for supervised
machine learning tasks.

Conclusion
In this article, we have discussed the various ways of performing
clustering. We came across applications of clustering for unsupervised
learning in a large no. of domains and also saw how to improve the
accuracy of a supervised machine learning algorithm using clustering.
Although clustering is easy to implement, you need to take care of
some important aspects, like treating outliers in your data and making
sure each cluster has a sufficient population. These aspects of
clustering are dealt with in great detail in this article.
Key Takeaways
Clustering helps to identify patterns in data and is useful for
exploratory data analysis, customer segmentation, anomaly
detection, pattern recognition, and image segmentation.
It is a powerful tool for understanding data and can help to reveal
insights that may not be apparent through other methods of
analysis.
Its types include partition-based, hierarchical, density-based, and
grid-based clustering.
The choice of clustering algorithm and the number of clusters to
use depend on the nature of the data and the specific problem at
hand.


Sauravkaushik8 Kaushik
13 Jun 2024
Saurav is a Data Science enthusiast, currently in the final year of his
graduation at MAIT, New Delhi. He loves to use machine learning and
analytics to solve complex data problems.


Frequently Asked Questions


What is clustering in machine learning?
A. Clustering in machine learning involves grouping similar data points
together based on their features, allowing for pattern discovery without
predefined labels.


Responses From Readers



Ankit Gupta
03 Nov, 2016
Very nice tutorial Saurav!
Richard Warnung
03 Nov, 2016
Nice post! Please correct the last link - it is broken - thanks!
Sai Satheesh G
03 Nov, 2016
I accept that clustering may help in improving the supervised
models. But here in the above: clustering is performed on sample
points (4361 rows). Is that right? But I think the correct way is to
cluster features (X1-X100) and to represent data using cluster
representatives and then perform supervised learning. Can you
please elaborate further? Why samples are being clustered in the…
