Unsupervised Learning
Agenda
● What is Unsupervised Learning?
● Clustering Techniques
● Hierarchical Clustering
● K Means Clustering
Unsupervised Learning
● Unsupervised learning is a type of machine learning in which models are trained on unlabeled datasets and are allowed to act on that data without any supervision
● These algorithms discover hidden patterns or data groupings without the
need for human intervention
● Unsupervised learning models are utilized for three main tasks—clustering,
association, and dimensionality reduction
● Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition
Clustering Techniques
● Grouping unlabeled examples is called clustering
● Clustering is the task of dividing the data points into groups such that data points in the same group are more similar to each other than to data points in other groups
● Hard Clustering
○ Each data point either belongs to a cluster completely or not
● Soft Clustering
○ Instead of assigning each data point to exactly one cluster, a probability or likelihood of that point belonging to each cluster is assigned (a sketch contrasting the two follows below)
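As a rough illustration of the difference, the sketch below (assuming scikit-learn is installed; the blob data and parameters are made up for illustration) uses KMeans for hard, one-label-per-point assignments and GaussianMixture for soft, per-cluster probabilities.

```python
# Minimal sketch: hard vs. soft clustering on illustrative toy data
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two loose 2-D blobs of points
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Hard clustering: each point is assigned to exactly one cluster
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])   # one integer label per point

# Soft clustering: each point gets a probability of belonging to each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)
print(soft_probs[:2])    # one row of per-cluster probabilities per point
```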
Hierarchical Clustering
● We develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram
● The dendrogram is a tree-like structure that records every merge (or split) step the hierarchical clustering (HC) algorithm performs
● The hierarchical clustering technique has two approaches:
○ Agglomerative is a bottom-up approach: the algorithm starts by taking every data point as its own cluster and keeps merging clusters until one cluster is left
○ Divisive is the reverse of the agglomerative algorithm, a top-down approach: it starts with all data points in one cluster and recursively splits it
● Linkage Criteria (see the code sketch after this list):
○ Single Linkage: the distance between two clusters is the shortest distance between their closest points
○ Complete Linkage: the distance between two clusters is the farthest distance between any two points in the two different clusters. It is one of the popular linkage methods as it forms tighter clusters than single linkage
○ Average Linkage: the distances between all pairs of points, one from each cluster, are added up and divided by the total number of pairs to give the average distance between the two clusters
○ Centroid Linkage: the distance between two clusters is the distance between their centroids
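The sketch below is one possible illustration, assuming SciPy and Matplotlib are available; the toy data is made up. Here linkage builds the merge hierarchy (its method argument selects single, complete, average, or centroid linkage), fcluster cuts the tree into flat clusters, and dendrogram draws the tree of merge steps.

```python
# Minimal sketch: agglomerative clustering and its dendrogram
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

# Build the merge tree; method can be "single", "complete", "average", or "centroid"
Z = linkage(X, method="average")

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# The dendrogram visualizes every merge step performed by the algorithm
dendrogram(Z)
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```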
K Means Clustering
● It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties
● It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids
● Cluster centers are initialized randomly, so you may need to run the algorithm several times to get the best possible clusters
● How to get the best / optimum number of clusters (the elbow method; see the sketch below):
○ Plot a curve of the calculated WCSS (Within-Cluster Sum of Squares) values against the number of clusters K
○ The point where the curve bends sharply, like the elbow of an arm, is taken as the best value of K
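A minimal sketch of this elbow procedure, assuming scikit-learn and Matplotlib are available; the synthetic blobs and the range of K are illustrative only. The fitted model's inertia_ attribute is the WCSS for a given K, and n_init reruns the random initialization several times and keeps the best result.

```python
# Minimal sketch: k-means and the elbow method on illustrative toy data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three synthetic blobs of 2-D points
X = np.vstack([rng.normal(c, 0.7, (40, 2)) for c in (0, 5, 10)])

# Compute WCSS for several candidate values of K
wcss = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# Plot WCSS against K; the "elbow" of the curve suggests the best K
plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.show()
```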