
Data Science with R

Lesson 12—Clustering

© Copyright 2015, Simplilearn. All rights reserved.
Objectives

After completing this lesson, you will be able to:
• Describe clustering and its use cases
• List the types of clustering models
• Explain K-Means Clustering and its algorithm
• Discuss Hierarchical clustering and its algorithm



Topic 1: Meaning and Uses of Clustering



Introduction to Clustering

It is a type of unsupervised learning that:


• Forms clusters of similar objects automatically
• Segments the data so that each training example is assigned to a
segment



Clustering vs. Classification

Here’s how clustering and classification are different from each other:

Feature: Type of learning method
• Clustering: Unsupervised
• Classification: Supervised

Feature: Purpose
• Clustering: Segment the data to assign each training example to a segment called a cluster
• Classification: Predict the class to which a given training example belongs



Use Cases of Clustering

These include:
• Grouping the content of a website or the products in a retail business
• Segmenting customers or users into different groups on the basis of their metadata and behavioral characteristics
• Segmenting communities in ecology
• Finding clusters of similar genes
• Creating image segments to be used in image analysis applications



Topic 2: Clustering Models



Clustering Models

Some examples of clustering models are:


• K-means clustering
• Hierarchical Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (rarely used)



Topic 3: K-Means Clustering



K-means Clustering

K-means tries to:


• Partition a set of data points into K distinct clusters
• Find clusters that minimize the within-cluster sum of squares (WCSS), that is, the sum of squared distances between each point and the centroid of its cluster

Calculation:
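
A standard way to write this objective is sketched here. The notation is an assumption, not taken from the slide: $C_k$ is the set of points in cluster $k$ and $\mu_k$ is the mean (centroid) of those points.

$$
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$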



K-means Clustering Algorithm

It includes the following steps:

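• The k centroids are initialized at randomly chosen points
• Every point in the dataset is assigned to the cluster with the nearest centroid
• All centroids are updated by taking the mean of all the points in that cluster
• The assignment and update steps are repeated until no point changes its cluster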



Pseudocode of K-means

It is as follows:

• Create k points for starting centroids
• While any point is changing a cluster assignment
    • for every point in our dataset:
        • for every centroid
            • calculate the distance between the centroid and point
        • assign the point to the cluster with the lowest distance
    • for every cluster calculate the mean of the points in that cluster
        • assign the centroid to the mean
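
The pseudocode can be turned into a small R function. This is a minimal sketch and not the implementation behind R's own kmeans: the name simple_kmeans, the max_iter argument, and the six toy points are assumptions made for illustration, and the starting centroids are simply k rows sampled from the data.

```r
# Minimal from-scratch k-means following the pseudocode (illustrative only).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # create k starting centroids by sampling k distinct rows of the data
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every point to every centroid (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # assign each point to the cluster with the lowest distance
    new_assignment <- max.col(-d, ties.method = "first")
    # stop as soon as no point changes its cluster assignment
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # move each centroid to the mean of the points in its cluster
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Toy data: two well-separated groups of three points each (made up)
set.seed(1)
toy <- rbind(cbind(c(0, 0.1, 0.2), c(0, 0.1, 0)),
             cbind(c(10, 10.1, 10.2), c(10, 10, 10.1)))
fit <- simple_kmeans(toy, k = 2)
print(fit$cluster)
```

The first three points end up in one cluster and the last three in the other. Which group is called 1 and which is called 2 depends on the random starting centroids.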



K-means Clustering Using R

To perform K-means clustering, use the kmeans function in the R package stats. Two important features of this function are:

nstart: A parameter used for reducing the algorithm's sensitivity to the random selection of the initial clusters/cluster means

Cluster labels: Labels can change from one run to the other
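
Both points can be seen on a small example. The sketch below uses synthetic data (two made-up groups of 20 points, not data from this lesson) and runs kmeans twice with nstart = 10. Because the numeric labels are arbitrary, the two runs are compared with a cross-table rather than label by label.

```r
set.seed(42)
# two synthetic, well-separated groups of 20 points each (illustrative data)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# nstart = 10: kmeans tries 10 random sets of starting centers and keeps the best fit
fit_a <- kmeans(x, centers = 2, nstart = 10)
fit_b <- kmeans(x, centers = 2, nstart = 10)

# The partitions agree if each row of the table has exactly one non-zero cell,
# even when the label 1 in one run corresponds to the label 2 in the other.
agreement <- table(fit_a$cluster, fit_b$cluster)
print(agreement)

# total within-cluster sum of squares of each fit
print(c(fit_a$tot.withinss, fit_b$tot.withinss))
```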



K-means Clustering—Case Study

Consider the given case study:

Introduction: The monthly, seasonally adjusted unemployment rates for 50 U.S. states were captured from January 1976 to August 2010.

[Figure: time series plots of three states: Iowa (green), New York (red), and California (black)]



K-means Clustering—Case Study (contd.)

Consider the given case study:

Problem: You need to group the states into clusters.

Assume:
• Each state is characterized by a feature vector with p = 416.
• New York and California form a cluster.

To find the centroid of this cluster, you need to calculate the 416 monthly averages, each over two observations.
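
In R, such a centroid is just a column-wise mean. The sketch below uses random numbers as a stand-in for the real series, so the rates matrix and its values are assumptions; only the shape (2 states by 416 months) follows the case study.

```r
set.seed(1)
# stand-in data: 2 states x 416 months of made-up unemployment rates
rates <- matrix(runif(2 * 416, min = 3, max = 12), nrow = 2,
                dimnames = list(c("NY", "CA"), NULL))

# centroid of the two-state cluster: one average per month, each over 2 observations
centroid <- colMeans(rates)
length(centroid)
```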



K-means Clustering—Case Study (contd.)

Consider the given case study:

Solution in R:

## read the data; series are stored column-wise with labels in the first row
raw <- read.csv("C:/DataMining/Data/unempstates.csv")
## transpose the data so that we have 50 rows (states) and 416 columns (time periods)
rawt <- t(raw)
## k-means clustering in 416 dimensions
set.seed(1)
grpunemp2 <- kmeans(rawt, centers = 2, nstart = 10)
sort(grpunemp2$cluster)
grpunemp3 <- kmeans(rawt, centers = 3, nstart = 10)
sort(grpunemp3$cluster)
grpunemp4 <- kmeans(rawt, centers = 4, nstart = 10)
sort(grpunemp4$cluster)
grpunemp5 <- kmeans(rawt, centers = 5, nstart = 10)
sort(grpunemp5$cluster)



Demo—Perform Clustering Using K-means

This demo will show the steps to do clustering using k-means.



Topic 4: Hierarchical Clustering



Hierarchical Clustering

It:
• Clusters n units/objects, each with p features, into smaller groups
• Creates a hierarchy of clusters as a dendrogram

Important Points about Dendrograms:


• Units in the same cluster are joined by a horizontal line.
• The leaves at the bottom represent individual units.
• They are useful as they provide a visual representation of clusters.



Hierarchical Clustering Algorithms

They are of two types:

Agglomerative Algorithms
• Method: Start at the individual leaves and successively merge clusters together
• Approach: Bottom-up

Divisive Algorithms
• Method: Start at the root and recursively split the clusters
• Approach: Top-down



Requirements of Hierarchical Clustering Algorithms

Two requirements are:

Distance Measure: For categorical variables, it can be defined from their number of matches and mismatches.

Linkage Creation: It determines the choice of clusters to be merged.

A distance measure between two objects with feature vectors $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \dots, x_{jp})$ is non-negative and symmetric and satisfies the triangle inequality $d(x_i, x_j) \le d(x_i, x_k) + d(x_j, x_k)$.
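
These properties can be checked numerically. The sketch below uses three made-up points and the Euclidean distance from R's dist function. It then computes a simple matching distance for two made-up categorical records as the share of mismatching attributes, which is one possible way to build a distance from matches and mismatches.

```r
# three illustrative objects with p = 2 numeric features (made-up values)
x <- rbind(xi = c(1, 2), xj = c(4, 6), xk = c(0, 0))
d <- as.matrix(dist(x, method = "euclidean"))

non_negative <- all(d >= 0)
symmetric    <- isSymmetric(d)
triangle_ok  <- d["xi", "xj"] <= d["xi", "xk"] + d["xj", "xk"]
print(c(non_negative, symmetric, triangle_ok))

# categorical features: share of attributes on which two objects disagree
u <- c("red", "small", "round")
v <- c("red", "large", "round")
mismatch <- mean(u != v)   # 1 mismatch out of 3 attributes
print(mismatch)
```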



Agglomerative Clustering Process

In this process:
• An n × n distance matrix is considered, where the number in the ith row and jth column is the distance
between the ith and jth units.
• The distance matrix is symmetric with zeros in the diagonal.
• Rows and columns are merged as clusters and the distances between them are updated.
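
One merge step on a tiny distance matrix is sketched below. The three points are made up, and the update rule used is single linkage (the distance from a merged cluster to another unit is the smaller of the two original distances). That rule is an assumption chosen for illustration; other linkage rules update the distances differently.

```r
# three illustrative units on a line: a = 0, b = 1, c = 5
pts <- matrix(c(0, 1, 5), ncol = 1, dimnames = list(c("a", "b", "c"), NULL))
d <- as.matrix(dist(pts))   # symmetric 3 x 3 matrix with zeros on the diagonal

# find the closest pair of distinct units
diag(d) <- Inf              # ignore self-distances during the search
pair <- which(d == min(d), arr.ind = TRUE)[1, ]
i <- min(pair)
j <- max(pair)

# merge row/column j into row/column i using single linkage
merged <- pmin(d[i, ], d[j, ])
d[i, ] <- merged
d[, i] <- merged
d <- d[-j, -j, drop = FALSE]
rownames(d)[i] <- "a+b"
colnames(d)[i] <- "a+b"
diag(d) <- 0
print(d)
```

After the step, the matrix has two rows and columns: the merged cluster of a and b, and c, at distance 4 from each other.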

In the R package cluster, use the “agnes” function.

In the stats package, use the “hclust” function.
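
A minimal hclust sketch follows. It uses the USArrests data set that ships with R instead of the lesson's data, and the scaling step, the average linkage, and the choice of 4 clusters are all assumptions made for illustration.

```r
# USArrests (50 states, 4 features) ships with R; it stands in for the lesson's data
x <- scale(USArrests)                 # standardize so that no feature dominates
d <- dist(x, method = "euclidean")    # pairwise distances between the 50 states
hc <- hclust(d, method = "average")   # agglomerative clustering, average linkage
plot(hc)                              # dendrogram

groups <- cutree(hc, k = 4)           # cut the tree into 4 clusters
table(groups)
```

cutree returns one cluster label per state, so the result can be tabulated or attached back to the data.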



Hierarchical Clustering—Case Study

Consider the given case study:

Introduction: The protein intakes in 25 European countries were captured from 9 food sources.

[Table: protein intake by country and food source]



Hierarchical Clustering—Case Study (contd.)

Consider the given case study:

Problem: You need to determine whether the listed 25 countries can be separated into a smaller number of clusters.



Hierarchical Clustering—Case Study (contd.)

Consider the given case study:

Solution in R:

library(cluster)
food <- read.csv("C:/DataMining/Data/protein.csv")
## agglomerative clustering on Euclidean distances between the rows of food
foodagg <- agnes(food, diss = FALSE, metric = "euclidean")
plot(foodagg) ## dendrogram



Demo—Perform Hierarchical Clustering

This demo will show the steps to do hierarchical clustering.



Quiz



QUIZ 1
Identify correct statements about k-means clustering. Select all that apply.

a. It attempts to partition a set of data points into k distinct clusters.

b. It can be run with the kmeans function in the R package named “stats.”

c. It yields single points scattered around the datasets as outliers.

d. It presents a hierarchy of clusters as a dendrogram.


The correct answers are a and b.

Explanation: The k-means clustering technique attempts to partition a set of data points into k distinct clusters, and it can be run with the kmeans function in the R package named “stats”.



QUIZ 2
Which of the following statements are true of Hierarchical clustering? Select all that apply.

a. It is inspired by the natural clustering approach.

b. Its algorithms are only of the agglomerative type.

c. It yields single points scattered around the datasets as outliers.

d. It presents a hierarchy of clusters as a dendrogram.


The correct answer is d.

Explanation: Hierarchical clustering presents a hierarchy of clusters as a dendrogram.



QUIZ 3
Identify accurate statements about divisive hierarchical procedures. Select all that apply.

a. As we move down the hierarchy, all units start in one cluster and splits are performed recursively.

b. They represent a top-down approach.

c. They represent a bottom-up approach.

d. They offer high configurability.


The correct answers are a and b.

Explanation: These follow a top-down procedure and as we move down the hierarchy, all units start in one cluster and splits are performed recursively.
QUIZ 4
Which statements about the DBSCAN clustering technique are true? Select all that apply.

a. It is available in R's fpc package.

b. It offers high configurability.

c. It yields single points scattered around the datasets as outliers.

d. A neighborhood of a point p is the set of all points whose distance from p is less than a predetermined value, called Eps.


The correct answers are a, b, c, and d.

Explanation: All the given statements are true for the DBSCAN clustering technique.



Summary

Let us summarize the topics covered in this lesson:
• Clustering is a type of unsupervised learning that forms clusters of similar objects automatically.
• K-means clustering tries to partition a set of data points into K distinct clusters.
• Hierarchical clustering clusters n units/objects, each with p features, into smaller groups.
• DBSCAN clustering is inspired by the natural clustering approach.



This concludes “Clustering.”
The next lesson is “Association.”
