The document describes CURE, a clustering algorithm for large datasets. CURE uses multiple representative points to model each cluster, capturing varying shapes and sizes. It is robust to outliers and can identify non-spherical clusters unlike previous approaches. The algorithm and enhancements for scaling to large data are explained.

CURE: An Efficient Clustering
Algorithm for Large Databases


Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim
Presentation by: Vuk Malbasa
For CIS664, Prof. Vasilis Megalooekonomou
Overview
• Introduction
• Previous Approaches
• Drawbacks of previous approaches
• CURE: Approach
• Enhancements for Large Datasets
• Conclusions
Introduction
• Clustering problem: given a set of points,
separate them into clusters so that data points
within a cluster are more similar to each other
than to points in different clusters.
• Traditional clustering techniques either favor
clusters with spherical shapes and similar sizes
or are fragile to the presence of outliers.
• CURE is robust to outliers and identifies clusters
with non-spherical shapes, and wide variances
in size.
• Each cluster is represented by a fixed number of
well scattered points.
Introduction
• CURE is a hierarchical clustering
technique where each partition is nested
into the next partition in the sequence.
• CURE is an agglomerative algorithm
where disjoint clusters are successively
merged until the desired number of
clusters is reached.
Previous Approaches
• At each step in agglomerative clustering, the
clusters merged are the ones for which some
distance metric is minimized.
• This distance metric can be:
– Distance between the means of the clusters, dmean
– Average distance between all points in the clusters, dave
– Maximal distance between points in the clusters, dmax
– Minimal distance between points in the clusters, dmin
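The four metrics above can be sketched as follows for clusters given as lists of 1-D points. This is an illustrative sketch (the function names are ours, not from the paper):

```python
# The four inter-cluster distance metrics, for 1-D point clusters.
from itertools import product

def d_mean(a, b):
    # Distance between the cluster means.
    return abs(sum(a) / len(a) - sum(b) / len(b))

def d_ave(a, b):
    # Average distance over all cross-cluster point pairs.
    return sum(abs(p - q) for p, q in product(a, b)) / (len(a) * len(b))

def d_max(a, b):
    # Largest cross-cluster distance (complete link).
    return max(abs(p - q) for p, q in product(a, b))

def d_min(a, b):
    # Smallest cross-cluster distance (single link).
    return min(abs(p - q) for p, q in product(a, b))

c1, c2 = [0.0, 1.0, 2.0], [5.0, 6.0]
print(d_mean(c1, c2), d_ave(c1, c2), d_max(c1, c2), d_min(c1, c2))
# 4.5 4.5 6.0 3.0
```

An agglomerative algorithm repeatedly merges the cluster pair minimizing whichever of these metrics it uses.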
Drawbacks of previous approaches
• In situations where clusters vary in size,
the dave, dmax and dmean distance metrics
will split large clusters into parts.
• Non-spherical clusters will be split by dmean.
• Clusters connected by outliers will be
merged if the dmin metric is used.
• None of the stated approaches works well
in the presence of non-spherical clusters
or outliers.
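The dmin chaining drawback can be shown with a toy sketch (our own illustrative code, not from the paper): a line of outliers between two well-separated clusters fuses them under single-link merging.

```python
# Toy demo of the chaining drawback of d_min (single link) on 1-D points.
def single_link_merge(points, threshold):
    # Start with singleton clusters; merge any two clusters whose
    # closest points are within `threshold`, until no merge applies.
    clusters = [[p] for p in sorted(points)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dmin = min(abs(p - q) for p in clusters[i] for q in clusters[j])
                if dmin <= threshold:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

a, b = [0.0, 1.0, 2.0], [10.0, 11.0, 12.0]
bridge = [4.0, 6.0, 8.0]  # outliers lying between the two clusters
print(len(single_link_merge(a + b, 2.0)))           # 2 clusters
print(len(single_link_merge(a + b + bridge, 2.0)))  # 1 cluster: chained together
```

Without the bridge the two clusters stay apart; with it, single link walks across the outliers and merges everything.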
CURE: Approach
• CURE is positioned between the centroid-
based (dave) and all-points (dmin) extremes.
• A constant number of well-scattered
points is used to capture the shape and
extent of a cluster.
• The points are shrunk towards the centroid
of the cluster by a factor α.
• These well-scattered and shrunk points
serve as the representatives of the cluster.
CURE: Approach
• The scattered-points approach alleviates the
shortcomings of dave and dmin:
– Since multiple representatives are used, the
splitting of large clusters is avoided.
– Multiple representatives allow for the discovery of
non-spherical clusters.
– The shrinking phase affects outliers more
than other points, since their distance from the
centroid is decreased more than that of
regular points.
CURE: Approach
• Initially, since all points are in separate clusters, each cluster is
defined by the single point it contains.
• Clusters are merged until they contain at least c points.
• The first scattered point in a cluster is the one farthest away
from the cluster's centroid.
• Each further scattered point is chosen so that its distance from the
previously chosen scattered points is maximal.
• Once the c well-scattered points are found, they are shrunk by
the factor α (r = p + α*(mean − p)).
• After clusters have c representatives, the distance between two
clusters is the distance between the closest pair of representatives,
one from each cluster.
• Every time two clusters are merged, their representatives are re-
calculated.
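The representative-point step above can be sketched in 2-D: pick the scattered points farthest-first, then shrink each toward the centroid using the slide's formula r = p + α*(mean − p). A minimal sketch with illustrative names (not the paper's code):

```python
# Sketch of CURE's representative selection and shrinking for a 2-D cluster.
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def representatives(cluster, c, alpha):
    # Centroid (mean) of the cluster.
    mean = (sum(p[0] for p in cluster) / len(cluster),
            sum(p[1] for p in cluster) / len(cluster))
    # First scattered point: farthest from the centroid.
    scattered = [max(cluster, key=lambda p: dist(p, mean))]
    # Further points: maximize distance to the already-chosen set.
    while len(scattered) < min(c, len(cluster)):
        scattered.append(max((p for p in cluster if p not in scattered),
                             key=lambda p: min(dist(p, s) for s in scattered)))
    # Shrink each scattered point toward the centroid: r = p + alpha*(mean - p).
    return [(p[0] + alpha * (mean[0] - p[0]),
             p[1] + alpha * (mean[1] - p[1])) for p in scattered]

cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0)]
print(representatives(cluster, 2, 0.5))  # [(0.5, 0.0), (1.5, 0.0)]
```

With α = 0.5 the two extreme points are pulled halfway to the centroid (1, 0), which is what dampens the influence of outliers on merge decisions.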
Enhancements for Large Datasets
• Random sampling
– Filters outliers and allows the dataset to fit into
memory
• Partitioning
– First cluster within each partition, then merge the
partially clustered partitions
• Labeling Data on Disk
– The final labeling phase can be done by nearest-
neighbor assignment to the already chosen cluster
representatives
• Handling outliers
– Outliers are partially eliminated and spread out by
random sampling; the remainder are identified because
they belong to small clusters that grow slowly
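The labeling phase above amounts to a nearest-neighbor lookup against the representatives only, so the full dataset never needs to be clustered in memory. A minimal sketch in 1-D, with made-up cluster names and values:

```python
# Sketch of the final labeling phase: each on-disk point gets the label
# of the cluster owning its nearest representative point.
reps = {"A": [1.0, 2.0], "B": [10.0, 11.0]}  # shrunk representatives per cluster

def label(point):
    # Nearest-neighbor search over all representatives of all clusters.
    return min(reps, key=lambda c: min(abs(point - r) for r in reps[c]))

labels = [label(p) for p in [0.5, 3.0, 9.0, 12.0]]
print(labels)  # ['A', 'A', 'B', 'B']
```

Because each cluster keeps several representatives rather than one centroid, this assignment follows non-spherical cluster boundaries better than a nearest-centroid rule would.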
Conclusions
• CURE can identify clusters that are not
only spherical but also ellipsoidal
• CURE is robust to outliers
• CURE correctly clusters data with large
differences in cluster size
• Running time for a low-dimensional
dataset with s points is O(s²)
• Using partitioning and sampling, CURE can
be applied to large datasets
Thanks!