Clustering Data Science
Lesson 12—Clustering
© Copyright 2015, Simplilearn. All rights reserved.
Objectives
Here’s how clustering and classification differ from each other:
Purpose
Clustering: Segment the data to assign each training example to a segment called a cluster.
Classification: Predict the class to which a given training example belongs.
Use cases of clustering include:
• Grouping the content of a website or the products in a retail business
• Segmenting customers or users into different groups on the basis of their metadata and behavioral characteristics
• Segmenting communities in ecology
• Finding clusters of similar genes
• Creating image segments to be used in image analysis applications
Calculation:
Each centroid is updated by taking the mean of all the points in its cluster.
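The assignment and update steps can be sketched in R as follows. This is a minimal illustration on made-up two-dimensional points; the names pts, centroids, and assign_nearest are hypothetical and not part of the lesson.

```r
# Six made-up points in two dimensions (illustrative data only)
pts <- matrix(c(1, 1,
                1, 2,
                2, 1,
                8, 8,
                9, 8,
                8, 9), ncol = 2, byrow = TRUE)

# Two arbitrary starting centroids, one per row
centroids <- matrix(c(0, 0,
                      10, 10), ncol = 2, byrow = TRUE)

# Assignment step: index of the nearest centroid for every point
assign_nearest <- function(pts, centroids) {
  apply(pts, 1, function(p) {
    # t(centroids) has one centroid per column; p is recycled down each column
    which.min(colSums((t(centroids) - p)^2))
  })
}
cluster <- assign_nearest(pts, centroids)

# Update step: each centroid becomes the mean of the points in its cluster
new_centroids <- t(sapply(seq_len(nrow(centroids)), function(k) {
  colMeans(pts[cluster == k, , drop = FALSE])
}))
print(new_centroids)
```

A full K-means run would repeat these two steps until the assignments stop changing.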
It is as follows:
To perform K-means clustering, use the kmeans function in the R package stats. Two important features of this function are:
nstart: A parameter used for reducing the algorithm's sensitivity to the random selection of the initial clusters/cluster means
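To illustrate the role of nstart, the sketch below fits kmeans twice on simulated data, once with a single random start and once with 25. The data, the seeds, and the object names (x, fit_single, fit_multi) are assumptions made for this example only.

```r
set.seed(42)
# Simulated data: three made-up groups in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# One random start: the result depends on the initial cluster means
set.seed(1)
fit_single <- kmeans(x, centers = 3, nstart = 1)

# Twenty-five random starts: kmeans keeps the run with the smallest
# total within-cluster sum of squares
set.seed(1)
fit_multi <- kmeans(x, centers = 3, nstart = 25)

# Compare the total within-cluster sum of squares of the two fits
c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss)
```

On well-separated data like this, both fits may well agree; the benefit of a larger nstart tends to show when a single start lands in a poor local solution.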
Introduction
The monthly, seasonally adjusted unemployment rates, from January 1976 to August 2010, for 50 U.S. states were captured. The graph below shows the time series plots of three states: Iowa (green), New York (red), and California (black).
Problem
Assume:
• Each state is characterized by a feature vector with p = 416.
• New York and California form a cluster.
You need to calculate the 416 monthly averages with two observations each.
Solution in R
## read the data; series are stored column-wise with labels in the first row
raw <- read.csv("C:/DataMining/Data/unempstates.csv")
## transpose the data so that we have 50 rows (states) and 416 columns (time periods)
rawt <- t(raw)
## k-means clustering in 416 dimensions
set.seed(1)
grpunemp2 <- kmeans(rawt, centers=2, nstart=10)
sort(grpunemp2$cluster)
grpunemp3 <- kmeans(rawt, centers=3, nstart=10)
sort(grpunemp3$cluster)
grpunemp4 <- kmeans(rawt, centers=4, nstart=10)
sort(grpunemp4$cluster)
grpunemp5 <- kmeans(rawt, centers=5, nstart=10)
sort(grpunemp5$cluster)
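The object returned by kmeans also stores the cluster means in its centers component. The sketch below uses simulated data in place of unempstates.csv (the matrix sim and its dimensions are made up for illustration) and checks that a row of centers is the column-wise average of the units assigned to that cluster, which is the kind of calculation described in the problem statement.

```r
set.seed(1)
# Simulated stand-in for the transposed data: 10 units, 6 time periods
sim <- rbind(matrix(rnorm(30, mean = 5), nrow = 5),
             matrix(rnorm(30, mean = 9), nrow = 5))

fit <- kmeans(sim, centers = 2, nstart = 10)

# Recompute the mean of cluster 1 by hand: one average per column
manual_mean <- colMeans(sim[fit$cluster == 1, , drop = FALSE])

# Compare with the centroid stored by kmeans
all.equal(unname(fit$centers[1, ]), unname(manual_mean))
```

Applied to the case study, the same check on grpunemp2$centers would give one average per month for each cluster.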
Hierarchical clustering:
• Clusters n units/objects, each with p features, into smaller groups
• Creates a hierarchy of clusters as a dendrogram
Agglomerative clustering approach: Bottom-up
Divisive clustering approach: Top-down
Distance Measure
A distance measure between two objects with feature vectors x_i = (x_i1, x_i2, ..., x_ip) and x_j = (x_j1, x_j2, ..., x_jp) is non-negative and symmetric, and satisfies the triangle inequality d(x_i, x_j) ≤ d(x_i, x_k) + d(x_j, x_k).
For categorical variables, the distance can be defined from the number of matches and mismatches.
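These properties can be checked numerically. The sketch below uses Euclidean distance for numeric features and, as an assumption, the share of mismatching features for categorical ones (one common choice; the lesson does not give a specific formula). All vectors and the helper names euclid and mismatch_dist are made up for illustration.

```r
# Three made-up objects with p = 3 numeric features each
x_i <- c(1, 2, 3)
x_j <- c(4, 6, 3)
x_k <- c(0, 0, 0)

# Euclidean distance between two feature vectors
euclid <- function(a, b) sqrt(sum((a - b)^2))

d_ij <- euclid(x_i, x_j)   # sqrt(3^2 + 4^2 + 0^2) = 5

d_ij >= 0                                       # non-negative
d_ij == euclid(x_j, x_i)                        # symmetric
d_ij <= euclid(x_i, x_k) + euclid(x_j, x_k)     # triangle inequality

# Categorical features: distance as the proportion of mismatches
a <- c("red", "small", "round")
b <- c("red", "large", "round")
mismatch_dist <- function(u, v) sum(u != v) / length(u)
mismatch_dist(a, b)   # 1 mismatch out of 3 features
```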
In this process:
• An n × n distance matrix is considered, where the number in the ith row and jth column is the distance
between the ith and jth units.
• The distance matrix is symmetric with zeros in the diagonal.
• Rows and columns are merged as clusters and the distances between them are updated.
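The distance matrix and the choice of the first merge can be sketched as follows. The four units and their labels are made up, and picking the closest pair by scanning the matrix is only an illustration of the idea, not how clustering functions are implemented internally.

```r
# Four made-up units with two features each
units <- matrix(c(0, 0,
                  0, 1,
                  5, 5,
                  9, 9), ncol = 2, byrow = TRUE,
                dimnames = list(c("A", "B", "C", "D"), NULL))

# n x n matrix of pairwise Euclidean distances
D <- as.matrix(dist(units))

isSymmetric(D)       # the matrix is symmetric
all(diag(D) == 0)    # zeros on the diagonal

# The first merge joins the pair with the smallest off-diagonal distance
D_off <- D
diag(D_off) <- Inf
closest <- which(D_off == min(D_off), arr.ind = TRUE)[1, ]
rownames(D)[closest]
```

After the merge, the two rows and columns of the merged pair would be replaced by a single row and column holding the updated distances to the new cluster.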
In the stats package, use the hclust function.
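A minimal sketch of hclust is shown below. It uses the built-in USArrests data set rather than the lesson's data, and the scaling step, the average linkage, and the cut into three clusters are choices made for this example only.

```r
# Built-in data set: arrest rates for the 50 U.S. states, standardized
x <- scale(USArrests)

# hclust expects a dissimilarity object, such as the output of dist
hc <- hclust(dist(x), method = "average")

plot(hc)                      # dendrogram
groups <- cutree(hc, k = 3)   # cut the tree into three clusters
table(groups)                 # cluster sizes
```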
Introduction
The protein intakes in 25 European countries were captured from 9 food sources, as given in the table below:
Problem
You need to determine whether the listed 25 countries can be separated into a smaller number of clusters.
Solution in R
library(cluster)
food <- read.csv("C:/DataMining/Data/protein.csv")
## agglomerative clustering on the raw data with Euclidean distances
foodagg <- agnes(food, diss=FALSE, metric="euclidean")
plot(foodagg) ## dendrogram
The correct answers are a and b.
Explanation: The k-means clustering technique attempts to partition a set of data points into k distinct clusters, and it is available as the kmeans function in the R package stats.
a. As we move down the hierarchy, all units start in one cluster and splits are
performed recursively.
b. They represent a top-down approach.
The correct answers are a and b.
Explanation: These follow a top-down procedure and, as we move down the hierarchy, all units start in one cluster and splits are performed recursively.
QUIZ 4
Which statements about the DBSCAN clustering technique are true? Select all that apply.
The correct answers are a, b, c, and d.
Explanation: All the given statements are true for the DBSCAN clustering technique.
Let us summarize the topics covered in this lesson:
• Clustering is a type of unsupervised learning that forms clusters of similar objects automatically.
• K-means clustering tries to partition a set of data points into K distinct clusters.
• Hierarchical clustering clusters n units/objects, each with p features, into smaller groups.
• DBSCAN clustering is inspired by the natural clustering approach.