Clustering in Data Science
Presentation · July 2020
1 author:
Nilu Singh
Koneru Lakshmaiah Education Foundation
All content following this page was uploaded by Nilu Singh on 04 May 2023.
Clustering in Data Science
Dr. Nilu Singh
School of Computer Applications
Babu Banarasi Das University
Lucknow-UP
Content
• Introduction of Clustering
• Clustering in Machine Learning
• Need of Clustering
• Types of Clustering
• Clustering Methods
• Types of clustering algorithms
• Applications of Clustering
• References
Clustering
• Clustering is a Machine Learning technique
that involves the grouping of data points.
• Given a set of data points, we can use a
clustering algorithm to assign each data
point to a specific group.
Clustering in Machine Learning
• Clustering is a type of unsupervised
learning method.
• It is the task of dividing the population
or data points into a number of groups.
• Goal: data points in the same group are
more similar to each other than to data
points in other groups.
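The "similar within, dissimilar between" idea can be checked numerically. The sketch below is only an illustration: it assumes Euclidean distance as the dissimilarity measure, and the two small groups of 2-D points are made up for the example.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    d = [math.dist(p, q) for pts in clusters for p, q in combinations(pts, 2)]
    return sum(d) / len(d)

def mean_inter_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    d = [math.dist(p, q)
         for a, b in combinations(clusters, 2)
         for p in a for q in b]
    return sum(d) / len(d)

# Hypothetical data: two well-separated groups of 2-D points.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(intra < inter)  # a reasonable grouping keeps intra well below inter
```

For a sensible grouping, the within-group average should be clearly smaller than the between-group average.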
Cont...
• A cluster is a collection of objects
grouped on the basis of the similarity and
dissimilarity between them.
Need of Clustering
• Clustering is important because it reveals
the intrinsic grouping in unlabeled data.
• There is no single, universal criterion
for a good clustering.
• The right criterion depends on the user
and on what satisfies their need.
Types of Clustering
Clustering can be broadly divided into two
subgroups:
Hard Clustering: Each data point either
belongs to a cluster completely or not at
all.
Soft Clustering: Instead of assigning each
data point to exactly one cluster, each
point is given a probability or likelihood
of belonging to each cluster.
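A minimal sketch of the difference, assuming the cluster centers are already known. The centers, the sample point, and the Gaussian kernel width `sigma` are all illustrative assumptions, not part of the slides.

```python
import math

def hard_assign(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))

def soft_assign(point, centers, sigma=1.0):
    """Soft clustering: return one membership weight per center.

    Weights come from an isotropic Gaussian kernel of assumed width
    `sigma` and are normalized so that they sum to 1.
    """
    scores = [math.exp(-math.dist(point, c) ** 2 / (2 * sigma ** 2))
              for c in centers]
    total = sum(scores)
    return [s / total for s in scores]

centers = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centers
point = (1.0, 0.0)                  # hypothetical data point

label = hard_assign(point, centers)    # one cluster index
weights = soft_assign(point, centers)  # one weight per cluster
print(label, weights)
```

The hard result is a single index, while the soft result keeps a degree of membership for every cluster. For points very far from all centers the kernel values can underflow to zero, so a real implementation would normally work in log space.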
Clustering Methods
Density-Based Methods
Hierarchical Based Methods
Partitioning Methods
Grid-based Methods
Types of clustering algorithms
• More than 100 clustering algorithms are
known, but only a few are widely used.
Popular families include:
Connectivity models
Centroid models
Distribution models
Density Models
Cont...
Connectivity models:
• These models are based on the notion that
data points closer together in data space
are more similar to each other than data
points lying farther away.
• These models are very easy to interpret but
lack scalability for handling big datasets.
• Examples are the hierarchical clustering
algorithm and its variants.
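As an illustration of a connectivity model, here is a naive sketch of agglomerative clustering. It assumes single linkage (cluster distance is the closest pair of points) and Euclidean distance; the function name and sample points are hypothetical.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Cluster-to-cluster distance is the smallest pointwise distance
    (single linkage). Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical data: two compact groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
result = single_linkage(points, n_clusters=2)
print(result)
```

Every merge step in this sketch rescans all pairs of clusters, which hints at why such methods become expensive on big datasets.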
Cont...
Centroid models:
• These are iterative clustering algorithms in
which the notion of similarity is derived
from the closeness of a data point to the
centroid of each cluster.
• Ex: K-Means clustering algorithm.
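A minimal sketch of K-Means in its common form (Lloyd's algorithm), assuming Euclidean distance and caller-supplied starting centroids. The sample points and starting centroids are illustrative only.

```python
import math

def kmeans(points, centroids, n_iter=100):
    """Alternate an assignment step and a centroid-update step."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: leave it in place
        if new_centroids == centroids:     # no movement means convergence
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data and starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels, centroids)
```

The result depends on the starting centroids, so practical implementations usually try several initializations.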
Cont...
Distribution models:
• These clustering models are based on the
notion of how probable it is that all data
points in a cluster belong to the same
distribution.
• An example is the Expectation-Maximization
algorithm, commonly used to fit a mixture
of multivariate normal distributions.
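A simplified sketch of Expectation-Maximization, reduced to one dimension to keep it short (the slide refers to the multivariate case). The data, the starting means, the initial unit variances, and the variance floor are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, n_iter=50):
    """EM for a 1-D Gaussian mixture with len(means) components."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k    # assumed starting variances
    weights = [1.0 / k] * k  # assumed equal starting mixture weights
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor against a collapsed component
    return weights, means, variances, resp

# Hypothetical data drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = em_gmm_1d(data, means=[0.0, 6.0])
print(means)
```

Each row of `resp` is a soft assignment: the probability that the point came from each component.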
Cont...
Density Models:
• These models search the data space for
regions of differing data-point density and
treat dense regions, separated by sparser
ones, as clusters.
• Examples of density models are DBSCAN
and OPTICS.
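A compact sketch of the DBSCAN idea, using a brute-force neighbor search. `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point) are the algorithm's usual parameters; the sample points and the chosen values are illustrative assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # may later be claimed as a border point
            continue
        cluster += 1                 # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts: # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike K-Means, the number of clusters is not given in advance, and points in sparse regions are labeled as noise.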
Applications of Clustering
Some of the most popular applications of
clustering are:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
Cont...
Marketing: Clustering can be used to discover
and characterize customer segments for
marketing purposes.
Libraries: It is used to group books on the
basis of their topics and content.
Cont...
City Planning: It is used to group houses
and to study their values based on their
geographical locations and other factors.
Earthquake studies: By clustering
earthquake-affected areas, we can identify
the dangerous zones.
Improving Supervised Learning Algorithms
with Clustering
• Clustering is an unsupervised machine
learning approach.
• It can also be used to improve the accuracy
of supervised machine learning algorithms:
cluster the data points into similar groups,
then use the cluster labels as additional
independent variables in the supervised
machine learning algorithm.
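A minimal sketch of using cluster labels as extra features, assuming centroids have already been obtained (for example from K-Means). The function name, feature rows, and centroid values are hypothetical.

```python
import math

def add_cluster_feature(rows, centroids):
    """Append a one-hot encoding of the nearest centroid to each row."""
    augmented = []
    for row in rows:
        nearest = min(range(len(centroids)),
                      key=lambda k: math.dist(row, centroids[k]))
        one_hot = [1.0 if k == nearest else 0.0
                   for k in range(len(centroids))]
        augmented.append(list(row) + one_hot)
    return augmented

X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]  # hypothetical features
centroids = [(1.0, 1.5), (8.5, 8.0)]                  # hypothetical centroids

X_aug = add_cluster_feature(X, centroids)
print(X_aug[0])  # original features followed by the one-hot cluster label
```

The augmented rows can then be passed to any supervised learner. To avoid leaking information, the centroids would normally be fitted on the training data only. Whether accuracy actually improves depends on the dataset.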
References
https://www.dummies.com/programming/big-data/data-science/clustering-algorithms-used-in-data-science/
https://www.geeksforgeeks.org/clustering-in-machine-learning/
https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
https://medium.com/cracking-the-data-science-interview/an-introduction-to-big-data-clustering-1a911b83e590