
Unsupervised Learning

Contents
• Introduction to Unsupervised Learning
• Tasks in Unsupervised Learning
• Clustering
• Dimensionality reduction
• Anomaly detection
• Association rule mining
• Applications of Unsupervised Learning
• Clustering and Types of Clustering
• K-means Clustering

Introduction to Unsupervised Learning
o A sub-field of machine learning in which patterns are learnt from datasets consisting of samples without labels, i.e., the data contains only features and no target variables
o No notion of dependent and independent variables
o Useful for understanding the distribution of data or the patterns in data, and for extracting valuable information from it
o Tasks include: clustering, dimensionality reduction, anomaly/outlier detection and association rule mining
Clustering
o Finding homogeneous subgroups within the data such that data points (samples) within a subgroup are similar
o Subgroups are referred to as clusters
o Similarity is decided based on some similarity measure
o Gives an intuition about the structure and distribution of data
o Useful in many applications where data points are to be grouped without any target labels

[Figure: scatter plot in the (x1, x2) plane showing groups of similar points]
Dimensionality Reduction
o Mapping features in a higher dimensional space to a lower dimensional space without much loss of information:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \;\rightarrow\; \tilde{\mathbf{x}} = \begin{bmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \vdots \\ \tilde{x}_p \end{bmatrix}, \qquad p < n$$

o Principal component analysis and autoencoders are unsupervised learning techniques used for dimensionality reduction
o Useful in data compression and feature extraction
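A minimal sketch of dimensionality reduction with PCA in scikit-learn; the data here is synthetic and the choice of two components is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples with n = 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Project onto p = 2 principal components (p < n)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance retained per component
```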
Anomaly/Outlier Detection
o Finding unusual or unexpected data points in the dataset that differ from the rest
o Anomalies occur rarely in data, but detecting them is important
o Works under the assumption that the features of an outlier or anomaly point are significantly different from those of normal points
o Anomaly or outlier detection in time series data is another important area of study

[Figure: scatter plot in the (x1, x2) plane with an isolated outlier far from the main group of points]
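The slide names no particular algorithm; as one hedged illustration, here is Isolation Forest from scikit-learn applied to synthetic data with a few injected outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))  # dense "normal" cloud
X_outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))  # a few far-away points
X = np.vstack([X_normal, X_outliers])

model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # +1 = normal, -1 = anomaly

print(np.where(labels == -1)[0])  # indices of the points flagged as anomalies
```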
Association Rule Mining
o Unsupervised learning task for discovering relations (rules) between different variables in a dataset
o Given a set of transactions, these rules predict the occurrence of an item in the transaction based on the occurrence of other items in the transaction
o More suitable for non-numeric, categorical data

Example: Consider two sets of items X and Y. An implication rule can be defined as follows:

$$X \Rightarrow Y$$

which means that if the items in X occur in a transaction, then the items in Y also occur in the transaction with high probability.
Applications of Unsupervised Learning
o Medical:
  o Categorising people into different groups based on different healthcare parameters and medical images
  o Based on the common conditions or diseases that a group of people may possess, certain conclusions about the condition or disease can be drawn
  o Association rule mining can also be used to form rules between symptoms and diseases, which help doctors in diagnosis
o Engineering:
  o Detecting faults (anomalies) in manufacturing or in a process industry
  o Sudden changes in process parameters such as temperature, pressure, power, vibration, etc. can be monitored and analysed using unsupervised learning techniques
Applications of Unsupervised Learning
o Search Engines:
  o Grouping together search results based on a search phrase involves unsupervised learning
  o Google News uses unsupervised learning to categorise articles on the same story from various online news outlets
o Image Grouping:
  o Grouping of pictures in a smartphone or in a social media account
  o Pictures with similar features are grouped together
Applications of Unsupervised Learning
o Market Basket Analysis:
  o Intelligent recommendations to consumers based on association rule mining
  o Data collected from supermarkets or e-commerce websites is mined to find associations between products which are frequently bought together

Consumer     Item 1   Item 2   Item 3
Consumer 1   Eggs     Bread    Jam
Consumer 2   Apple    Bread    Jam
Consumer 3   Apple    Banana   Soup
Consumer 4   Apple    Banana   Eggs
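A minimal sketch of mining the table above for a rule such as {Bread} ⇒ {Jam}; the support and confidence helpers are hypothetical, written here only to make the definitions concrete:

```python
transactions = [
    {"Eggs", "Bread", "Jam"},     # Consumer 1
    {"Apple", "Bread", "Jam"},    # Consumer 2
    {"Apple", "Banana", "Soup"},  # Consumer 3
    {"Apple", "Banana", "Eggs"},  # Consumer 4
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # Estimated P(Y in transaction | X in transaction)
    return support(X | Y, transactions) / support(X, transactions)

X, Y = {"Bread"}, {"Jam"}
print(support(X | Y, transactions))    # 0.5: Bread and Jam co-occur in 2 of 4 baskets
print(confidence(X, Y, transactions))  # 1.0: Jam appears whenever Bread does
```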
Clustering
Clustering
o Finding clusters within the data such that data points within a cluster are similar
o Question: How to decide the similarity between different data points?
o Similarity measures:
  o Distance between points
  o Density of points
  o Probability of belonging to a distribution

[Figure: scatter plot in the (x1, x2) plane showing clusters]
Types of Clustering Algorithms
Types of Clustering Algorithms
o Clustering algorithms can be categorised as follows, based on the similarity metric used to cluster data:
o Distance based metric:
  • Centroid based clustering
  • Hierarchical clustering
o Density based metric:
  • Density based clustering
o Probability based metric:
  • Distribution based clustering
Centroid based Clustering
• Intuition: There is a centroid/centre for each cluster, and all the points in the cluster are at a close distance to the centroid
• In these algorithms, the number of clusters must be decided a priori – a drawback
• K-means clustering is the most popular algorithm among centroid based methods

[Figure: scatter plot in the (x1, x2) plane with a centroid marked at the centre of each cluster]
Hierarchical Clustering
• Useful when a hierarchical structure among the data points is of interest
• Constructs a hierarchy among all the data points and then, based on the hierarchy, puts them into different clusters
• If a hierarchy already exists, it is used to cluster the data; otherwise, distance metrics can be used to cluster the data hierarchically
• No need to choose the number of clusters a priori
• Two approaches:
  • Bottom-up approach – Agglomerative approach – most popular
  • Top-down approach – Divisive approach
Hierarchical Clustering – Bottom-Up Approach
• Steps:
  1. Each data point in the dataset is initially considered a cluster of its own
  2. Compute the distance between all clusters (centroids) based on some distance metric
  3. Merge the two clusters which are closest to one another
  4. Repeat steps 2 and 3 until the desired level of clustering is obtained
• Different heuristics can be used to determine when to stop merging clusters, e.g. once the desired number of clusters is obtained (a code sketch follows below)

[Figure: example merge hierarchy – Level 1: p1, p2, p3, p4, p5, p6; Level 2: {p2, p3}; Level 3: {p4, p5}; Level 4: {p4, p5, p6}]
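A minimal sketch of the bottom-up (agglomerative) approach using SciPy; the six points loosely mirror p1 to p6 in the figure, but their values are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points standing in for p1..p6 (made-up values)
points = np.array([[0.0, 0.0], [1.0, 1.0], [1.1, 1.2],
                   [5.0, 5.0], [5.1, 5.2], [6.0, 5.5]])

# Bottom-up merging; 'ward' merges the pair that least increases within-cluster variance
Z = linkage(points, method="ward")

# Cut the hierarchy once the desired number of clusters (here 2) is reached
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster index per point, e.g. [1 1 1 2 2 2]
```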
Hierarchical Clustering – Top-Down Approach
• Exactly the opposite of the bottom-up approach
• Steps:
  1. Consider all the data points to be in one single cluster
  2. Partition the cluster into two clusters which are least similar, based on a distance metric
  3. Repeat until the desired level of clustering is obtained

[Figure: divisive clustering illustration (image source: Medium.com)]
Hierarchical Clustering – Visualisation
• The hierarchy of clusters can be represented using a dendrogram

[Figure: dendrogram of a hierarchical clustering (image source: Medium.com)]
Distance Measures
• A distance measure is a function which gives the distance between two data points
• If the function returns 0, the two data points are equivalent
• If the distance is low, the points can be considered similar, and vice versa
• The most used distance measures are as follows:
  1. Euclidean distance
  2. Manhattan distance
  3. Cosine distance (or similarity)
Manhattan Distance
• Consider two data points $\mathbf{x}_a = [x_{a1}, x_{a2}]^T$ and $\mathbf{x}_b = [x_{b1}, x_{b2}]^T$ in a two-dimensional vector space
• The Manhattan distance (L1 norm) is given by:

$$d_1(\mathbf{x}_a, \mathbf{x}_b) = |x_{b1} - x_{a1}| + |x_{b2} - x_{a2}|$$

• Works well if the points are arranged in the form of a grid, e.g. the distance between houses arranged in a grid
• Recommended for high dimensional data

[Figure: the two points in the (x1, x2) plane with the horizontal leg |x_b1 − x_a1| and the vertical leg |x_b2 − x_a2| marked]

Ref: The Surprising Behaviour of Distance Metrics in High Dimensions, by z_ai, Towards Data Science
Euclidean Distance
• One of the most popular distance metrics
• The Euclidean distance (L2 norm) is given by:

$$d_2(\mathbf{x}_a, \mathbf{x}_b) = \sqrt{(x_{b1} - x_{a1})^2 + (x_{b2} - x_{a2})^2}$$

• Gives the geometric (straight-line) distance between two points in the vector space
• Not recommended for high dimensional data

[Figure: the two points in the (x1, x2) plane joined by a straight line of length d]

Ref: The Surprising Behaviour of Distance Metrics in High Dimensions, by z_ai, Towards Data Science
Cosine Distance
• Distance is measured in terms of the angle θ between two feature vectors
• The cosine similarity and the cosine distance are given by:

$$\cos\theta = \frac{\mathbf{x}_a \cdot \mathbf{x}_b}{\lVert \mathbf{x}_a \rVert \, \lVert \mathbf{x}_b \rVert}, \qquad d_{\cos}(\mathbf{x}_a, \mathbf{x}_b) = 1 - \cos\theta$$

• Useful when the orientation of the vectors is more important than the distance
• If the vectors point in the same direction, cos θ = 1 and the distance is 0
• If the vectors are orthogonal (unrelated), cos θ = 0 and the distance is 1
• If the vectors point in opposite directions, cos θ = −1 and the distance is 2

[Figure: the vectors x_a and x_b in the (x1, x2) plane with the angle θ between them]
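A minimal sketch computing the three distance measures with NumPy; the two points are made-up values:

```python
import numpy as np

x_a = np.array([1.0, 2.0])
x_b = np.array([4.0, 6.0])

manhattan = np.sum(np.abs(x_b - x_a))          # L1 norm: |3| + |4| = 7
euclidean = np.sqrt(np.sum((x_b - x_a) ** 2))  # L2 norm: sqrt(9 + 16) = 5
cos_sim = x_a @ x_b / (np.linalg.norm(x_a) * np.linalg.norm(x_b))
cosine_dist = 1.0 - cos_sim                    # 0 when the vectors align

print(manhattan, euclidean, cosine_dist)
```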
Density and Distribution based
Clustering
Density based Clustering
• Note: Distance based methods assume that the clusters have a specific shape (spherical or elliptical)
• Density based methods make no assumption about the shape of a cluster
• Intuition: Groups data points with high density into one cluster
• Points not in a high density region are not clustered and are considered outlier points
• Useful:
  • when clusters have varied shapes but are densely populated
  • to separate outliers from the points in dense regions

[Figure: clusters after density based clustering]
Density based Clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most popular density based clustering technique
• Example: separating high-value customers from a large group of customers based on their purchase patterns; the points that DBSCAN leaves non-clustered (noise) represent the high-value customers

[Figure: DBSCAN applied to the whole data; axes: annual grocery purchases (scaled) vs annual electronics purchases (scaled)]
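A minimal DBSCAN sketch with scikit-learn; synthetic data stands in for the customer example, and the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(200, 2))  # the bulk of customers
sparse = rng.uniform(low=2.0, high=4.0, size=(8, 2))   # scattered high spenders
X = np.vstack([dense, sparse])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN labels noise points as -1; here they play the "high-value customer" role
print(np.sum(labels == -1), "points left non-clustered")
```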
Distribution based Clustering
• Groups data points based on their likelihood of belonging to the same probability distribution
• Each cluster is assumed to be drawn from a different distribution (with different parameters)
• The distribution needs to be assumed – Gaussian, binomial, etc.
• Can be used only when it is known that the data comes from well-known distributions
• The Gaussian mixture model is an example of a distribution based clustering algorithm
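A minimal Gaussian mixture sketch with scikit-learn; the two Gaussian blobs are synthetic, matching the assumption that each cluster is drawn from its own distribution:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
               rng.normal(loc=4.0, scale=0.8, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard cluster assignments
probs = gmm.predict_proba(X)  # soft membership probability per cluster

print(labels[:5])
print(probs[:2].round(3))
```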
K-Means Clustering
K-Means Clustering
• A centroid based clustering technique which uses the Euclidean distance as its distance metric
• The n data points are clustered into k clusters
• A data point belongs to the cluster to whose centre it is nearest
• The centre in this case is the mean vector of all the data points (sample vectors) in the cluster

[Figure: scatter plot in the (x1, x2) plane with points grouped around cluster centres]
Steps in K-Means Clustering
1. Select the number of clusters (k) into which the data is to be grouped
2. Randomly initialise the centre of each cluster (heuristics can be used, or this initialisation can be done multiple times)
3. Compute the Euclidean distance from each centre to each of the data points in the dataset
4. Assign each data point to the cluster to whose centre it is closest
5. After grouping, re-compute the centre of each cluster by taking the mean of all its data points
6. Repeat steps 3 to 5 until the cluster centres don't change much
• Mathematically, the sum of squared distances of the data points to their cluster centres is being minimised (a code sketch of these steps follows below)
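A minimal from-scratch sketch of steps 1 to 6 in NumPy; the data, k = 2 and the stopping rule are illustrative choices:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 3-4: assign each point to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centre as the mean of its assigned points
        # (empty-cluster handling omitted for brevity)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

X = np.random.default_rng(1).normal(size=(100, 2))
centres, labels = k_means(X, k=2)
print(centres)
```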
K-Means Clustering - Example

Sample   Feature x   Feature y
1        1           1
2        1.5         1.5
3        1           0.5
4        0.8         1.2
5        3.3         3.1
6        2.58        3.68
7        3.5         2.8
8        3           3

[Figure: scatter plot of the eight samples in the (x, y) plane]
K-Means Clustering - Example
Step 1: Let k = 2. Randomly choose two points as the cluster centres:

          Mean x   Mean y
Centre 1  1        1
Centre 2  3        3

Step 2: Compute the distances and group each sample with the closest centre:

Sample   Distance to centre 1   Distance to centre 2   Cluster
1        0                      2.8284271              1
2        0.7071068              2.1213203              1
3        0.5                    3.2015621              1
4        0.2828427              2.8425341              1
5        3.1144823              0.3162278              2
6        3.111077               0.7992496              2
7        3.0805844              0.5385165              2
8        2.8284271              0                      2

[Figure: scatter plot showing group 1 and group 2]
K-Means Clustering - Example
Step 3: Compute the new centres (mean of the samples in the respective clusters) and repeat step 2:

          Mean x   Mean y
Centre 1  1.075    1.05
Centre 2  3.095    3.145

Step 4: If the change in the means is negligible, or no samples are reassigned, stop the process:

Sample   Distance to centre 1   Distance to centre 2   Cluster
1        0.0901388              2.9983412              1
2        0.6189709              2.2912988              1
3        0.5550901              3.374174               1
4        0.3132491              3.0083301              1
5        3.0254132              0.2098809              2
6        3.0301691              0.7425968              2
7        2.9905058              0.5320244              2
8        2.7400958              0.1733494              2

[Figure: scatter plot showing the two groups and the new group means]
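As a hedged cross-check of the worked example, scikit-learn's KMeans can be initialised at the same two starting centres so that its result matches the tables above:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 1.5], [1, 0.5], [0.8, 1.2],
              [3.3, 3.1], [2.58, 3.68], [3.5, 2.8], [3, 3]])

init = np.array([[1.0, 1.0], [3.0, 3.0]])  # the two initial centres from step 1
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)           # [0 0 0 0 1 1 1 1]
print(km.cluster_centers_)  # [[1.075 1.05 ], [3.095 3.145]]
```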
K-Means Clustering - Illustration

[Figure: step-by-step illustration of k-means iterations (source: Wikipedia)]
Determining Number of Clusters (k)
 The elbow method is generally used to estimate the optimal value of k for k-means clustering
 The value of k is varied from 2 to 10 (say), and for each value of k the sum of distances of the samples from their centres is computed and plotted
 In the plot, the point where the curve starts to plateau (the "elbow") indicates the optimal number of clusters (a code sketch follows below)

[Figure: sum of distances plotted against k, flattening out after the elbow]
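A minimal elbow-method sketch; inertia_ is scikit-learn's sum of squared distances of the samples to their closest centre, and the data is synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Sum of squared distances (inertia)")
plt.show()  # look for the point where the curve starts to plateau
```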
Silhouette score
Silhouette score
 Used to evaluate the quality of the clusters created by clustering algorithms
 It measures how well samples are clustered together with other samples that are similar to them
 The following distances are required to calculate the silhouette score:
   The mean distance between a data point (sample) and all other data points in the same cluster – denoted by a
   The mean distance between a data point (sample) and all the data points of the nearest cluster – denoted by b

[Figure: cluster 1 and cluster 2 in the (x1, x2) plane, where cluster 2 is the cluster nearest to cluster 1]
Silhouette score
• The silhouette score, S, is calculated for each sample using the following formula:

$$S = \frac{b - a}{\max(a, b)}$$

• The silhouette score varies from −1 to +1
• If the score is close to +1, the cluster is dense and well separated from other clusters
• A value near 0 represents overlapping clusters, with samples very close to the decision boundary of the neighbouring clusters
• A negative score indicates that the samples might have been assigned to the wrong clusters
• Silhouette scores can be plotted and used to select the optimal value of k (the number of clusters) in k-means clustering (a code sketch follows below)
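A minimal sketch comparing candidate values of k with scikit-learn's silhouette_score; the data is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```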
Summary
 Unsupervised learning involves techniques which extract useful information from unlabelled data
 Clustering, dimensionality reduction, association rule mining and anomaly detection are well-known unsupervised learning tasks
 Many types of clustering exist, based on different metrics such as distance, density and probability
 K-means clustering is the simplest and most popular among clustering techniques
 K-means clusters the data into k clusters based on the Euclidean distance of the points to the centre (mean) of the cluster
 The silhouette score measures the quality of the clusters formed by distance-based clustering methods
THANK YOU
