Data Science with R

Lesson 12—Clustering

© Copyright 2015, Simplilearn. All rights reserved.

Objectives

After completing this lesson, you will be able to:

• Describe clustering and its use cases
• List the types of clustering models
• Explain K-means clustering and its algorithm
• Discuss hierarchical clustering and its algorithm


Topic 1: Meaning and Uses of Clustering


Introduction to Clustering

Clustering is a type of unsupervised learning that:

• Forms clusters of similar objects automatically
• Segments the data so that each training example is assigned to a
segment


Clustering vs. Classification

Here’s how clustering and classification are different from each other:

Feature | Clustering | Classification
Type of learning method | Unsupervised | Supervised
Purpose | Segment the data to assign each training example to a segment called a cluster | Predict the class to which a given training example belongs


Use Cases of Clustering

Use cases include:
• Grouping the content of a website or the products of a retail business
• Segmenting customers or users into groups on the basis of their metadata and behavioral characteristics
• Segmenting communities in ecology
• Finding clusters of similar genes
• Creating image segments for use in image analysis applications


Topic 2: Clustering Models


Clustering Models

Some examples of clustering models are:

• K-means clustering
• Hierarchical Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which is used less often than the other two


Topic 3: K-Means Clustering


K-means Clustering

K-means tries to:

• Partition a set of data points into K distinct clusters
• Find the clusters that minimize the within-cluster sum of squares (WCSS), that is, the sum of squared distances from each point to the mean of its cluster

Calculation:
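
One standard way to write this objective is sketched below. The notation is an assumption made here, not taken from the slide: C_k denotes the set of points in the k-th cluster and \mu_k denotes its mean.

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```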


K-means Clustering Algorithm

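It includes the following steps:

1. Each of the k centroids is initialized at a randomly chosen point.

2. Every point in the dataset is assigned to the cluster of its nearest centroid.

3. Every centroid is updated to the mean of all the points in its cluster.

Steps 2 and 3 repeat until no point changes its cluster assignment.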


Pseudocode of K-means

It is as follows:

• Create k points for starting centroids
• While any point is changing its cluster assignment:
    • For every point in the dataset:
        • For every centroid:
            • Calculate the distance between the centroid and the point
        • Assign the point to the cluster with the lowest distance
    • For every cluster:
        • Calculate the mean of the points in that cluster
        • Assign the centroid to the mean
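
The pseudocode above can be sketched in R roughly as follows. This is a minimal illustration, not the implementation behind R's built-in kmeans function. The function name simple_kmeans, the max_iter safeguard, and the toy data are all assumptions introduced here.

```r
# Minimal k-means sketch that follows the pseudocode step by step.
# simple_kmeans and max_iter are hypothetical names used for illustration.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Create k starting centroids by picking k distinct rows at random
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # Assign each point to the cluster with the lowest distance
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop once no point changes its cluster assignment
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Toy data: two made-up groups of 20 points each in two dimensions
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centroids are random, a single run can settle on a poor local solution, which is the motivation for running several random starts.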


K-means Clustering Using R

To perform K-means clustering, use the kmeans function in the R package stats. Two important points about this function are:

nstart: A parameter that reduces the algorithm's sensitivity to the random selection of the initial cluster means

Cluster labels: The labels are arbitrary and can change from one run to the next
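
A short sketch of both points, using made-up data. The variable names and the choice of 3 centers are assumptions for illustration only.

```r
# Made-up data: 100 points in two dimensions drawn around two centers
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# One random start versus the best of 25 random starts
fit1  <- kmeans(x, centers = 3, nstart = 1)
fit25 <- kmeans(x, centers = 3, nstart = 25)

# Total within-cluster sum of squares for each fit (lower is better)
c(single_start = fit1$tot.withinss, many_starts = fit25$tot.withinss)

# Label numbers are arbitrary, so compare the partitions with a
# cross-tabulation rather than by checking whether the labels are equal
table(fit1$cluster, fit25$cluster)
```

With nstart greater than 1, kmeans keeps the run with the lowest total within-cluster sum of squares among the random starts it tried.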


K-means Clustering—Case Study

Consider the given case study:

Introduction
The monthly, seasonally adjusted unemployment rates for 50 U.S. states were captured from January 1976 to August 2010. The graph below shows the time series plots of three states: Iowa (green), New York (red), and California (black).


K-means Clustering—Case Study (contd.)

Consider the given case study:

Problem
You need to group the states into clusters.

Assume:
• Each state is characterized by a feature vector with p = 416 monthly rates.
• New York and California form a cluster.

The centroid of that cluster is then the vector of 416 monthly averages, each computed from two observations.
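
A toy sketch of that centroid calculation. The numbers are made up, and only 4 time periods stand in for the 416 months. They are not values from the unemployment data.

```r
# Made-up rates for illustration only: 2 states x 4 time periods
cluster_members <- rbind(NewYork    = c(10, 9, 8, 7),
                         California = c(8, 9, 10, 9))

# The centroid is the per-period average over the states in the cluster
centroid <- colMeans(cluster_members)
centroid
```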


K-means Clustering—Case Study (contd.)

Consider the given case study:

Solution in R

## read the data; series are stored column-wise with labels in the first row
raw <- read.csv("C:/DataMining/Data/unempstates.csv")
## transpose the data so that we have 50 rows (states) and 416 columns (time periods)
rawt <- t(raw)
## k-means clustering in 416 dimensions, for k = 2, 3, 4, and 5
set.seed(1)
grpunemp2 <- kmeans(rawt, centers = 2, nstart = 10)
sort(grpunemp2$cluster)
grpunemp3 <- kmeans(rawt, centers = 3, nstart = 10)
sort(grpunemp3$cluster)
grpunemp4 <- kmeans(rawt, centers = 4, nstart = 10)
sort(grpunemp4$cluster)
grpunemp5 <- kmeans(rawt, centers = 5, nstart = 10)
sort(grpunemp5$cluster)


Demo—Perform Clustering Using K-means

This demo will show the steps to do clustering using k-means.


Topic 4: Hierarchical Clustering


Hierarchical Clustering

Hierarchical clustering:
• Clusters n units/objects, each with p features, into smaller groups
• Creates a hierarchy of clusters as a dendrogram

Important Points about Dendrograms:

• Units in the same cluster are joined by a horizontal line.
• The leaves at the bottom represent individual units.
• They are useful as they provide a visual representation of clusters.
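
A minimal sketch of building and reading a dendrogram in R. It uses the built-in USArrests data purely as convenient sample data. The complete linkage method and the choice of 4 clusters are assumptions made for illustration.

```r
# USArrests ships with R: n = 50 states, p = 4 features
x <- scale(USArrests)                       # standardize the features
hc <- hclust(dist(x), method = "complete")  # build the hierarchy of clusters

# Dendrogram: the leaves at the bottom are the individual states
plot(hc, cex = 0.6)

# Cut the tree to obtain a fixed number of clusters
groups <- cutree(hc, k = 4)
table(groups)
```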


Hierarchical Clustering Algorithms

They are of two types:

Agglomerative Algorithms
• Method: Start at the individual leaves and successively merge clusters together
• Approach: Bottom-up

Divisive Algorithms
• Method: Start at the root and recursively split the clusters
• Approach: Top-down


Requirements of Hierarchical Clustering Algorithms

Two requirements are:

Distance measure: For categorical variables, it can be defined from their number of matches and mismatches.

Linkage creation: It determines the choice of clusters to be merged.

Note: A distance measure between two objects with feature vectors $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ is non-negative and symmetric and satisfies the triangle inequality $d(x_i, x_j) \le d(x_i, x_k) + d(x_j, x_k)$.
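
The properties above can be checked on a small example. The three objects and the categorical vectors below are made up for illustration. The mismatch proportion shown is one simple way to define a distance from matches and mismatches, not necessarily the one the lesson has in mind.

```r
# Three made-up objects with p = 2 numeric features each
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
d <- as.matrix(dist(x, method = "euclidean"))
d

all(d >= 0)       # non-negative
isSymmetric(d)    # symmetric
d["a", "c"] <= d["a", "b"] + d["b", "c"]   # triangle inequality for one triple

# Categorical features: distance as the proportion of mismatches
u <- c("red", "small", "round")
v <- c("red", "large", "round")
mismatch_dist <- mean(u != v)
mismatch_dist
```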


Agglomerative Clustering Process

In this process:
• An n × n distance matrix is considered, where the number in the ith row and jth column is the distance
between the ith and jth units.
• The distance matrix is symmetric with zeros in the diagonal.
• Rows and columns are merged as clusters and the distances between them are updated.
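
A single merge step of this process can be sketched by hand. The four units are made up, and single linkage (keeping the smaller of the two distances) is an assumed choice of linkage. Other linkages update the distances differently.

```r
# Four made-up units with two features each
x <- rbind(a = c(0, 0), b = c(0, 1), c = c(5, 5), d = c(7, 5))
dmat <- as.matrix(dist(x))   # symmetric n x n matrix with zeros on the diagonal

# Find the closest pair of distinct units
masked <- dmat
diag(masked) <- Inf
pair <- which(masked == min(masked), arr.ind = TRUE)[1, ]
i <- min(pair)
j <- max(pair)
merged_name <- paste(rownames(dmat)[c(i, j)], collapse = "+")

# Merge rows and columns i and j, then update distances to the new cluster
merged <- pmin(dmat[i, ], dmat[j, ])   # single linkage (assumed)
dmat[i, ] <- merged
dmat[, i] <- merged
dmat[i, i] <- 0
dmat <- dmat[-j, -j]
rownames(dmat)[i] <- colnames(dmat)[i] <- merged_name
dmat
```

Repeating this step until one cluster remains yields the full hierarchy.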

• In the R package cluster, use the agnes function.

• In the stats package, use the hclust function.
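
A brief sketch of calling both functions on the same data. The built-in USArrests data, average linkage, and the choice of 3 clusters are assumptions made for illustration.

```r
library(cluster)

x <- scale(USArrests)   # built-in sample data, standardized

# stats package: hclust works on a distance object
hc <- hclust(dist(x), method = "average")

# cluster package: agnes can compute the distances itself from the data
ag <- agnes(x, diss = FALSE, metric = "euclidean", method = "average")

# Cut both trees into 3 clusters and compare the partitions
groups_hc <- cutree(hc, k = 3)
groups_ag <- cutree(as.hclust(ag), k = 3)
table(groups_hc, groups_ag)
```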


Hierarchical Clustering—Case Study

Consider the given case study:

Introduction
The protein intakes in 25 European countries were captured from 9 food sources, as given in the table below:


Hierarchical Clustering—Case Study (contd.)

Consider the given case study:

Problem
You need to determine whether the listed 25 countries can be separated into a smaller number of clusters.


Hierarchical Clustering—Case Study (contd.)

Consider the given case study:

Solution in R

library(cluster)
food <- read.csv("C:/DataMining/Data/protein.csv")
## agglomerative clustering with Euclidean distances
foodagg <- agnes(food, diss = FALSE, metric = "euclidean")
plot(foodagg) ## dendrogram


Demo—Perform Hierarchical Clustering

This demo will show the steps to do hierarchical clustering.


Quiz


QUIZ 1
Identify the correct statements about k-means clustering. Select all that apply.

a. It attempts to partition a set of data points into k distinct clusters.

b. It is available through the kmeans function in the R package named “stats.”

c. It yields single points scattered around the datasets as outliers.

d. It presents a hierarchy of clusters as a dendrogram.

The correct answers are a and b.

Explanation: The k-means clustering technique attempts to partition a set of data points into k distinct clusters, and it is available through the kmeans function in the R package named “stats.”

QUIZ 2
Which of the following statements are true of hierarchical clustering? Select all that apply.

a. It is inspired by the natural clustering approach.

b. Its algorithms are only of the agglomerative type.

c. It yields single points scattered around the datasets as outliers.

d. It presents a hierarchy of clusters as a dendrogram.

The correct answer is d.

Explanation: Hierarchical clustering presents a hierarchy of clusters as a dendrogram.

QUIZ 3
Identify the accurate statements about divisive hierarchical procedures. Select all that apply.

a. All units start in one cluster, and splits are performed recursively as we move down the hierarchy.

b. They represent a top-down approach.

c. They represent a bottom-up approach.

d. They offer high configurability.

The correct answers are a and b.

Explanation: These procedures are top-down: all units start in one cluster, and splits are performed recursively as we move down the hierarchy.
QUIZ 4
Which statements about the DBSCAN clustering technique are true? Select all that apply.

a. It is available in R's fpc package.

b. It offers high configurability.

c. It yields single points scattered around the datasets as outliers.

d. A neighborhood of a point p is the set of all points whose distance from p is less than a predetermined value, called Eps.

The correct answers are a, b, c, and d.

Explanation: All the given statements are true for the DBSCAN clustering technique.

Summary

Let us summarize the topics covered in this lesson:

• Clustering is a type of unsupervised learning that forms clusters of similar objects automatically.
• K-means clustering tries to partition a set of data points into K distinct clusters.
• Hierarchical clustering clusters n units/objects, each with p features, into smaller groups.
• DBSCAN clustering is inspired by the natural clustering approach.

This concludes “Clustering.”
The next lesson is “Association.”
