Module 3 Clustering

Cluster analysis, or clustering, is a data mining method that groups similar data points together to form clusters, facilitating the organization of unlabelled data. Various clustering methods include partitioning, hierarchical, density-based, grid-based, and model-based approaches, each with unique techniques and applications. For instance, the DBSCAN method identifies density-connected points to form clusters while filtering out noise, making it effective for datasets with arbitrary shapes.

Uploaded by

nazalmhd02

Mod3 – Introduction to Clustering

Cluster analysis

Cluster analysis, also known as clustering, is a method of data mining
that groups similar data points together.

The goal of cluster analysis is to divide a dataset into groups (or
clusters) such that the data points within each group are more similar
to each other than to data points in other groups.

The given data is divided into different groups by combining similar
objects into a group. This group is a cluster. A cluster is a collection of
similar data which is grouped together.

Cluster: A collection of data objects similar (or
related) to one another within the same group and
dissimilar (or unrelated) to the objects in other
groups

Cluster analysis - Finding similarities between
data according to the characteristics found in the
data and grouping similar data objects into clusters

For example, consider a dataset of vehicles that contains information
about different vehicles such as cars, buses, bicycles, etc. Because this is unsupervised
learning, there are no class labels such as car or bike for the vehicles; all the
data is mixed together and is not structured.

Our task is to group the unlabelled data so that each group can then be given a
label, and this can be done using clusters.

The main idea of cluster analysis is to arrange all the data points by
forming clusters, such as a cars cluster which contains all the cars, a bikes cluster which
contains all the bikes, etc.

Put simply, clustering partitions unlabelled data into groups of similar objects.
Different similarity measures can be used.

Clustering Methods:

Partitioning methods

Hierarchical Clustering Methods

Density-based Methods

Grid-based methods

Model-based methods
Partitioning methods
Given a database of n objects or data tuples, a partitioning method
constructs k partitions of the data, where each partition represents a
cluster and k ≤ n. That is, it classifies the data into k groups, which
together satisfy the following requirements:
(1) each group must contain at least one object, and
(2) each object must belong to exactly one group.
Partitioning methods use a technique called iterative
relocation, which moves objects from one group to
another to improve the partitioning.
PAM (Partitioning Around Medoids) or K-Medoids
i x y
X1 2 6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6

Apply the K-Medoids clustering algorithm to form two clusters.

Use Manhattan distance to measure the distance between each data point and each medoid.
Steps:
– Select two medoids: C1 = (3,4), C2 = (7,4)
– To find the distances, use the Manhattan distance formula, where
(x1,y1) and (x2,y2) are two data points:

Manhattan Distance = |x1-x2| + |y1-y2|
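
The assignment step for the table above can be sketched in Python as follows. This is only a minimal illustration of one step, assuming each point goes to its nearest medoid and that the cost is the sum of those distances; the names manhattan and assign are hypothetical helpers, not part of the original slides.

```python
# Data points from the table above.
points = {
    "X1": (2, 6), "X2": (3, 4), "X3": (3, 8), "X4": (4, 7), "X5": (6, 2),
    "X6": (6, 4), "X7": (7, 3), "X8": (7, 4), "X9": (8, 5), "X10": (7, 6),
}
# Initial medoids chosen in the steps above.
medoids = {"C1": (3, 4), "C2": (7, 4)}


def manhattan(p, q):
    """Manhattan distance |x1-x2| + |y1-y2| between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def assign(points, medoids):
    """Assign every point to its nearest medoid and sum the distances."""
    clusters = {name: [] for name in medoids}
    total_cost = 0
    for label, p in points.items():
        nearest = min(medoids, key=lambda m: manhattan(p, medoids[m]))
        clusters[nearest].append(label)
        total_cost += manhattan(p, medoids[nearest])
    return clusters, total_cost


clusters, total_cost = assign(points, medoids)
print(clusters)    # C1: X1..X4, C2: X5..X10
print(total_cost)  # 20
```

A full PAM run would then try swapping a medoid with a non-medoid point and keep the swap only if the total cost decreases; that swap loop is omitted here.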
Hierarchical Clustering Methods:
- Clustering is done by hierarchical decomposition.
- Objects are grouped into a tree of clusters.
There are two types of hierarchical method, depending on whether the hierarchical decomposition is formed
bottom-up (by merging) or top-down (by splitting):

Agglomerative approach: Bottom-up approach. Initially, each object is in a separate cluster (the bottom of the hierarchy).
It then successively merges the clusters that are closest to one another until all of the objects are
merged into one cluster (the top-most level of the hierarchy), or until a termination condition holds (such
as a desired number of clusters being obtained or the diameter of each cluster being within a certain
threshold).

Divisive approach: Top-down approach. It starts with all
objects in one cluster. In each successive iteration, a cluster
is split into smaller clusters until eventually each object is in
a cluster of its own, or until a termination condition holds (such as a
desired number of clusters being obtained or the diameter of
each cluster being within a certain threshold).
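
The agglomerative (bottom-up) approach can be sketched in Python as below. This is a rough illustration only: it assumes Euclidean distance, single-link distance between clusters (the closest pair of points), and "desired number of clusters" as the termination condition. The slides do not fix any of these choices, and the function names and sample points are hypothetical.

```python
import math
from itertools import combinations


def single_link(a, b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, k):
    """Start with one cluster per point and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of clusters (i < j) with the smallest single-link distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


sample = [(1, 1), (1, 2), (5, 5), (5, 6), (10, 10)]
print(agglomerative(sample, 3))
```

With k = 3 the two close pairs are merged and (10, 10) stays alone. Setting k = 1 continues merging up to the top-most level of the hierarchy.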
Density-based Methods
Density-based methods are based on the notion of
density (the number of objects or data points in a region). They continue growing a cluster
as long as the density in the “neighbourhood” exceeds some
threshold.

Such methods can be used to filter out noise (outliers) and
discover clusters of arbitrary shape.
DBSCAN

Density-Based Spatial Clustering of Applications with Noise.

It has 2 inputs (Eps and MinPts):

Eps (ε) - radius of the circle formed with a data object as its center.

MinPts - minimum number of data points required inside that circle.

There are 3 types of data points:
– Core point: has at least MinPts data points within its Eps-radius circle.
– Border point: not a core point, but a neighbour of a core point (it lies within Eps of one).
– Noise point: neither a core point nor a border point.
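
The three point types can be illustrated with a short Python sketch. It assumes Euclidean distance and that a point counts itself as one of the points inside its own circle; classify_points and the sample data are hypothetical and serve only as an illustration.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    # For each point, the indices of all points within radius eps (itself included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, n in enumerate(neighbors) if len(n) >= min_pts}
    kinds = []
    for i, n in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in n):
            kinds.append("border")  # not core, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds


sample = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (8, 8)]
print(classify_points(sample, eps=1.1, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```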
DBSCAN - Procedure
● A density-based cluster is defined as a group of density-connected
points, so the task is to find groups of density-connected points.
● The algorithm works as follows:

● For each point xi, compute the distance between xi and the other
points.
● Find all neighbour points within distance Eps of xi.
Each point with a neighbour count greater than or equal to MinPts is
marked as a core point and as visited. (In this step, core points are
identified.)
● For each core point that is not already assigned to a
cluster, create a new cluster. Recursively find all its
density-connected points and assign them to the
same cluster as the core point.
● Iterate through the remaining unvisited points in
the dataset.
● Points that do not belong to any cluster are
treated as outliers or noise.
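
The procedure above can be sketched in Python roughly as follows. This is a simplified illustration, not an optimized implementation: it assumes Euclidean distance, counts a point as its own neighbour, and uses an explicit stack instead of recursion. The names dbscan, region_query and NOISE, and the sample points, are hypothetical.

```python
import math

NOISE = -1  # label used for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster_id += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        stack = list(neighbors)
        while stack:
            j = stack.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                stack.extend(j_neighbors)
    return labels


sample = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2),
          (8, 8), (8, 9), (9, 8), (5, 9)]
print(dbscan(sample, eps=1.1, min_pts=3))
# [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters (two here) is discovered by the algorithm and is not given as an input.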


● Advantages
– Applicable to spatial databases.
– Discovers clusters of arbitrary shape.
– Good efficiency on large databases, i.e., on databases of significantly more than just a few
thousand objects.
– Minimal requirements of domain knowledge to determine the input parameters, because
appropriate values are often not known in advance when dealing with large databases.
– Only two parameters are required.
– The number of clusters does not need to be specified by the user.
– Since it has a concept of noise, it works well even with noisy datasets.

● Disadvantages
– Not good at handling high-dimensional data.


Grid-based methods

Grid-based methods: Grid-based methods quantize the object space
into a finite number of cells that form a grid structure. All of the
clustering operations are performed on the grid structure (i.e., on the
quantized space). The main advantage of this approach is its fast
processing time, which is typically independent of the number of data
objects and dependent only on the number of cells in each dimension in
the quantized space.
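
The quantization step can be illustrated with a small Python sketch. It assumes 2-D points and square cells of a fixed size; the function name quantize, the cell size and the sample points are hypothetical choices for illustration only.

```python
from collections import Counter


def quantize(points, cell_size):
    """Map each 2-D point to its grid cell and count the points per cell."""
    return Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )


sample = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
cells = quantize(sample, cell_size=4)
print(cells[(1, 1)])  # 4 points fall into cell (1, 1)
print(len(cells))     # 5 non-empty cells
```

After this step, any further clustering operation works on the five non-empty cells and their counts rather than on the ten original points, which is where the speed advantage comes from.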
Model-based methods

Model-based methods: Model-based methods hypothesize a model for each of
the clusters and find the best fit of the data to the given model.

EM is an algorithm that performs expectation maximization analysis based on
statistical modeling.

COBWEB is a conceptual learning algorithm that performs probability analysis
and takes concepts as a model for clusters.

SOM (or self-organizing feature map) is a neural network-based algorithm.
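
As an illustration of the EM idea mentioned above, the following Python sketch fits a mixture of two 1-D Gaussians. The choice of a Gaussian mixture, the initial values, the variance floor, and the names gaussian_pdf and em_two_gaussians are all assumptions made for this example; the slides do not specify a particular model.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = [min(data), max(data)]  # simple initial guess (assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            likes = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(2)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate the parameters from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a zero variance
            weights[k] = nk / len(data)
    return means, variances, weights


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_two_gaussians(data)
print(means)  # close to 1.0 and 5.0
```

Each fitted component plays the role of one cluster: a point belongs to the component with the highest responsibility for it.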
ROCK Clustering algorithm

ROCK is Robust Clustering using linKs.

ROCK belongs to the class of agglomerative
hierarchical clustering algorithms.

ROCK works for categorical attributes.
