Clustering Algorithm and Analysis

This document covers a topic related to information technology.

Uploaded by

Sidra n
Clustering Algorithm and Its Applications in Data Mining

Presented by
Sidra Siddiqa BIT-21-04
Tasneem Gull BIT-21-10
Amna Noor BIT-21-04
Irsa Malik BIT-21-32
UmeTehreem BIT-21-82
Maryam Iqbal BIT-21-86
Mahnoor BIT-21-34
Clustering (Introduction)
 Clustering is a type of unsupervised machine learning.
 Cluster analysis is one of the most important research fields in data mining.
It can find useful information in data and is widely used in fields such as
market research, data analysis, pattern recognition, image processing,
artificial intelligence, and Web document classification.
 It is distinguished from supervised learning by the fact that there is no a priori
output (i.e. no labels).
 The task is to learn the classification/grouping from the data
 A cluster is a collection of objects which are similar in some way
 Example: a group of people clustered based on their height and weight
 Normally, clusters are created using distance measures
 Two or more objects belong to the same cluster if they are “close” according
to a given distance (in this case a geometrical distance such as Euclidean or
Manhattan).
 For example, bread, eggs, butter, and milk may be grouped into one cluster.
 Some possible applications of clustering:
 business intelligence applications
 biological applications
 data reduction – reduce data that are homogeneous (similar)
 find “natural clusters” and describe their unknown properties
 find useful and suitable groupings
Basic Concepts of Cluster Analysis
 Cluster analysis is a method of studying individuals based on their
own characteristics, with the purpose of grouping similar things
together. Its principle is that individuals in the same category
have high similarity, while individuals in different categories
have low similarity (that is, the difference between them is
greater).
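 Assume the data set contains n data objects, each described by p attributes.
 The data can then be written as an n × p data matrix:

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{pmatrix}
$$

where $x_{if}$ represents the f-th attribute value of the i-th object in
the data set. The matrix collects the attribute records of every
object in the data set.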
 Calculate the mean absolute deviation of attribute f:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

 where $m_f$ is the mean of attribute f:

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

 The normalized measure is:

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$
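The normalization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `standardize` and the sample column are hypothetical, and a column whose values are all equal (so that s_f = 0) is rejected rather than handled.

```python
def standardize(column):
    """Return z_if = (x_if - m_f) / s_f for one attribute column."""
    n = len(column)
    m_f = sum(column) / n                          # mean of the attribute
    s_f = sum(abs(x - m_f) for x in column) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are equal, so s_f is zero")
    return [(x - m_f) / s_f for x in column]


# Hypothetical attribute column: m_f = 5 and s_f = 2.
print(standardize([2, 4, 6, 8]))  # [-1.5, -0.5, 0.5, 1.5]
```

Dividing by the mean absolute deviation instead of the standard deviation does not square the deviations, so a single outlier inflates the scale less.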
 The corresponding distance between two objects $x = (x_1, \dots, x_n)$ and
$y = (y_1, \dots, y_n)$ has the following common forms.
 Euclidean distance formula:

$$ d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} $$

 Manhattan distance formula:

$$ d(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n| $$

 Minkowski distance formula:

$$ d(x, y) = \left(|x_1 - y_1|^q + |x_2 - y_2|^q + \cdots + |x_n - y_n|^q\right)^{1/q} $$

 where q is a positive integer. When q = 1, it reduces to the Manhattan
distance; when q = 2, it reduces to the Euclidean distance.
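A short Python sketch shows how one Minkowski function covers both special cases. The function name `minkowski` and the two sample points are assumptions made for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


a, b = (0, 0), (3, 4)          # hypothetical points
print(minkowski(a, b, q=1))    # Manhattan: 3 + 4 = 7.0
print(minkowski(a, b, q=2))    # Euclidean: sqrt(9 + 16) = 5.0
```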
 The dissimilarity between two objects described entirely by
discrete variables can be calculated by the simple matching method
as follows:

$$ d(x, y) = \frac{p - m}{p} $$

where m is the number of attributes on which objects x and y have the
same value, and p is the total number of attributes.
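As a minimal sketch, the simple matching dissimilarity can be computed like this. The function name and the two example objects are hypothetical.

```python
def simple_matching(x, y):
    """Dissimilarity (p - m) / p for two objects with discrete attributes."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    p = len(x)                                  # total number of attributes
    m = sum(a == b for a, b in zip(x, y))       # attributes that match
    return (p - m) / p


# Two hypothetical objects that differ in one of three attributes.
x = ("red", "small", "round")
y = ("red", "large", "round")
print(simple_matching(x, y))  # (3 - 2) / 3, about 0.33
```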
Cluster Analysis Algorithm
 First, feature selection. Features must be chosen appropriately
to include as much of the task-related information as possible.
 Second, the similarity measure, used to quantify how
“similar” or “dissimilar” two feature vectors are.
 Third, the clustering algorithm. Having chosen an appropriate
similarity measure, this step involves selecting a particular
clustering algorithm to reveal the clustering structure in the
data set.
 Fourth, result verification. Once the result is obtained
using the clustering algorithm, its validity needs to be verified.
 Fifth, result determination. In many cases, experts in the field of
application must use other experimental data and analysis to
judge the clustering results and finally draw the correct
conclusions.
 Given the number of clusters k and the objective function F
Clustering Algorithm Process
 The clustering algorithm process: Feature Selection → Similarity Measure →
Clustering Algorithm → Result Verification → Result Determination
K-means Algorithm Implementation Process
 The K-means algorithm is a widely used method of rapid cluster
analysis. It executes efficiently and can handle large volumes of
sample data.
 In this research design the sample size is not large, so
processing time is not the primary consideration; K-means clustering
is nevertheless a suitable choice. It provides a cluster analysis
function that can cluster samples or variables across a variety of
data types.
Given K, the K-means algorithm is implemented in four steps:
1. Choose K points at random as cluster centres (centroids).
2. Assign each instance to its closest cluster centre using a chosen
distance measure (usually Euclidean or Manhattan).
3. Calculate the centroid of each cluster and use it as the new cluster
centre (one measure of the centroid is the mean).
4. Go back to Step 2; stop when the cluster centres no longer change.
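The four steps can be sketched in Python for one-dimensional data, where Euclidean and Manhattan distance both reduce to the absolute difference. Several details are assumptions made for illustration: the function name `kmeans_1d`, the optional `initial_centres` argument (so that a run can be repeated), the `max_iter` safety limit, and the rule that an empty cluster keeps its previous centre.

```python
import random


def kmeans_1d(data, k, initial_centres=None, max_iter=100, seed=0):
    """Cluster one-dimensional numbers with the four K-means steps."""
    # Step 1: choose K starting centres (at random unless supplied).
    if initial_centres is None:
        centres = random.Random(seed).sample(sorted(set(data)), k)
    else:
        centres = list(initial_centres)

    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each instance to its closest centre.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centres[j]))
            clusters[nearest].append(x)

        # Step 3: the mean of each cluster becomes the new centre.
        new_centres = [
            sum(c) / len(c) if c else centres[j]
            for j, c in enumerate(clusters)
        ]

        # Step 4: stop once the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters


# Hypothetical data with two obvious groups.
centres, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], k=2, initial_centres=[1, 12])
print(centres)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```

Because step 1 is random, different starting centres can lead to different final clusters; fixing `seed` or passing `initial_centres` only makes a single run reproducible.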
K-means algorithm: an example
 Say we have the data {20, 3, 9, 10, 9, 3, 1, 8, 5, 3,
24, 2, 14, 7, 8, 23, 6, 12, 18} and we are asked to
use K-means to cluster these data into 3 groups.
 Assume we use Manhattan distance.
 Step one: Choose K points at random to be cluster centres.
 Say 6, 12, 18 are chosen.
Step two: Assign each instance to its closest cluster centre using
Manhattan distance.
For instance:
20 is assigned to cluster 3
3 is assigned to cluster 1
Step two continued: 9 is equally close to the centres of clusters 1
and 2, so it can be assigned to either; let us say that it is
arbitrarily assigned to cluster 2.
Repeat for all the rest of the instances.
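Step two can be carried out for every instance with a short Python sketch. The helper name `assign` is hypothetical, and the tie rule (a tie goes to the later cluster) is an assumption chosen only so that 9 lands in cluster 2, matching the arbitrary choice made in the example.

```python
data = [20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2, 14, 7, 8, 23, 6, 12, 18]
centres = [6, 12, 18]  # the centres chosen in step one


def assign(data, centres):
    """Return one list of instances per centre (Manhattan distance in 1-D)."""
    clusters = [[] for _ in centres]
    for x in data:
        best = 0
        for j in range(1, len(centres)):
            # "<=" sends ties to the later cluster, so 9 goes to cluster 2
            # as in the example; this tie rule is an assumption.
            if abs(x - centres[j]) <= abs(x - centres[best]):
                best = j
        clusters[best].append(x)
    return clusters


clusters = assign(data, centres)
for number, members in enumerate(clusters, start=1):
    print(f"cluster {number}: {members}")
# cluster 1: [3, 3, 1, 8, 5, 3, 2, 7, 8, 6]
# cluster 2: [9, 10, 9, 14, 12]
# cluster 3: [20, 24, 23, 18]
```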
Step three: Calculate the centroid (i.e. mean) of each cluster and use
it as the new cluster centre.
End of iteration 1.
Step four: Iterate (repeat steps 2 and 3) until the cluster centres do
not change any more.
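Steps three and four can be checked numerically with a small Python sketch. The helper name `kmeans_step` is hypothetical. Two rules are assumptions: a tie goes to the later cluster (so that 9 joins cluster 2, as in the example), and an empty cluster keeps its old centre.

```python
data = [20, 3, 9, 10, 9, 3, 1, 8, 5, 3, 24, 2, 14, 7, 8, 23, 6, 12, 18]


def kmeans_step(data, centres):
    """One pass of steps two and three: assign, then recompute the means."""
    clusters = [[] for _ in centres]
    for x in data:
        best = 0
        for j in range(1, len(centres)):
            # Ties go to the later cluster (an assumption).
            if abs(x - centres[j]) <= abs(x - centres[best]):
                best = j
        clusters[best].append(x)
    # An empty cluster keeps its old centre (an assumption).
    return [sum(c) / len(c) if c else old for c, old in zip(clusters, centres)]


centres = [6, 12, 18]
first = kmeans_step(data, centres)
print(first)  # [4.6, 10.8, 21.25]

# Step four: repeat until the centres stop changing.
history = [centres, first]
while history[-1] != history[-2]:
    history.append(kmeans_step(data, history[-1]))
final = history[-1]
print(final)
```

Under these assumptions, the first iteration moves the centres from 6, 12, 18 to 4.6, 10.8, 21.25, and the loop ends once a further pass leaves them unchanged.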
Conclusion
 With the development of society, science,
and technology, big data has received
more and more attention, and the amount
of information that people can use keeps
increasing. However, users’ ability to
process and understand this information
remains the same. How to accurately find
the parts of interest in these huge volumes
of data, and how to classify this
information, involves a new direction, that
is, data mining research. The text proposes
a method of research and analysis using
