Machine Learning is…
… learning from data
… on its own
… discovering hidden patterns
… data-driven decisions
Supervised Learning
Purpose
Given a dataset {(x_i, y_i) ∈ X × Y, i = 1, …, N}, learn the dependencies between X and Y.
► Example: Learn the links between cardiac risk and food habits. x_i describes one person by d features concerning their food habits; y_i is a binary category (risky, not risky).
► The labels y_i are essential for the learning process.
► Methods: K-Nearest Neighbors, SVM, Decision Tree, … (a minimal sketch follows below)
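A minimal scikit-learn sketch of supervised learning with K-Nearest Neighbors; the features, labels, and dataset sizes are synthetic stand-ins invented for this illustration, not the slide's actual cardiac-risk data.

    # Hypothetical supervised-learning example on synthetic "food habits" data.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))              # 200 people, d = 5 invented features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # invented binary label: risky / not risky

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))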
Unsupervised Learning
Purpose
From observations {x_i ∈ X, i = 1, …, N}, learn the organisation of X and discover homogeneous subsets.
► Example: Categorize customers. x_i encodes a customer with features describing their social condition and behavior.
► Methods: Hierarchical clustering, K-Means, Reinforcement learning, …
Semi-supervised Learning
Purpose
Within a dataset, only a small part of the samples have a corresponding label, i.e. {(x_1, y_1), …, (x_k, y_k), x_{k+1}, …, x_N}. The goal is to infer the classes of the unlabeled data.
► Example: Filter webpages. The number of webpages is tremendous; only a few of them can be labeled by an expert.
► Methods: Bayesian methods, SVM, Graph Neural Networks, … (see the sketch below)
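As a hedged sketch of this setting, the snippet below uses scikit-learn's LabelSpreading (one possible semi-supervised method, not necessarily one of those listed above) on synthetic blobs where most labels are hidden; every name and number is illustrative.

    # Semi-supervised sketch: only a handful of samples keep their labels,
    # the rest are marked -1 (unlabeled) and inferred by LabelSpreading.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
    y_partial = np.full_like(y_true, -1)                              # -1 means "no label"
    keep = np.random.default_rng(0).choice(len(y_true), size=15, replace=False)
    y_partial[keep] = y_true[keep]                                    # keep only 15 "expert" labels

    model = LabelSpreading().fit(X, y_partial)
    print("accuracy on all samples:", (model.transduction_ == y_true).mean())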
Supervised vs. Unsupervised
• Supervised Approaches
  • Target (what the model is predicting) is provided
  • 'Labeled' data
  • Classification & regression are supervised.
• Unsupervised Approaches
  • Target is unknown or unavailable
  • 'Unlabeled' data
  • Cluster analysis & association analysis are unsupervised.
Categories of Machine Learning Techniques
Supervised (target is available): Classification, Regression
Unsupervised (target is not available): Cluster Analysis, Association Analysis
Classification
Goal: Predict a category.
[Figure: weather photos labeled Sunny, Windy, Rainy, Cloudy]
Image source: http://www.davidson.k12.nc.us/parents students/inclement_weather
Regression
Goal: Predict a numeric value.
Cluster Analysis
Goal: Organize similar items into groups.
[Figure: customers grouped as Seniors, Adults, Teenagers]
Image source: http://www.monetate.com/blog/the-intrinsic-value-of-customer-segmentation
Association Analysis
Goal: Find rules to capture associations between items.
scikit-learn
• Open-source library for Machine Learning in Python
• Built on top of NumPy, SciPy, matplotlib
• Active community for development
• Improved continuously by developers
Preprocessing Tools
• Utility functions for
  • Transforming raw feature vectors into a suitable format
• Provides an API for (see the sketch below)
  • Scaling of features: remove the mean and scale to unit variance
  • Normalization to unit norm
  • Binarization to turn data into a 0/1 format
  • One-Hot Encoding for categorical features
  • Handling of missing values
  • Generating higher-order features
  • Building custom transformations
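A minimal sketch of two of these preprocessing utilities, StandardScaler and OneHotEncoder; the tiny height/weight/color arrays are made up purely to show the API.

    # Preprocessing sketch: standard scaling for numeric features,
    # one-hot encoding for a categorical feature.
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X_num = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]])   # e.g. height, weight
    X_cat = np.array([["red"], ["blue"], ["red"]])                    # a categorical feature

    X_scaled = StandardScaler().fit_transform(X_num)           # zero mean, unit variance per column
    X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()  # one column per category

    print(X_scaled)
    print(X_onehot)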
Different Tasks
► Supervised Learning
► Unsupervised Learning
► Semi Supervised Learning
The scikit-learn documentation provides organized tutorials and detailed guides.
http://scikit-learn.org/stable/documentation.html
Dimensionality Reduction
• Enables you to reduce the number of features while preserving most of the variance (a PCA sketch follows the list below)
• scikit-learn has capabilities for:
• Principal Component Analysis (PCA)
• Singular Value Decomposition
• Factor Analysis
• Independent Component Analysis
• Matrix Factorization
• Latent Dirichlet Allocation
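A short sketch of one of these techniques, PCA; the random data and the choice of two components are arbitrary and only demonstrate the API.

    # Dimensionality reduction sketch: project 10-dimensional data onto 2 principal components.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(100, 10))   # 100 samples, 10 features
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                  # (100, 2)
    print(pca.explained_variance_ratio_)    # share of variance kept by each component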
Model Selection
• Provides methods for cross-validation
• Library functions for tuning hyperparameters
• Model evaluation mechanisms to measure model performance
• Plotting methods for visualizing scores to evaluate models (see the sketch below)
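A hedged sketch of cross-validation and hyperparameter tuning with scikit-learn; the KNN model, the iris dataset, and the parameter grid are arbitrary choices for illustration.

    # Model selection sketch: cross-validation score, then a grid search over n_neighbors.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # 5-fold cross-validation for a fixed model
    print(cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean())

    # Hyperparameter tuning: try several values of n_neighbors
    grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)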
Summary of scikit-learn
• Extensive set of tools for the full Machine Learning pipeline
• Dependable due to community support
• Provides an easy-to-use API for training models and making predictions
• Collects the most popular algorithms in one place
Clustering
Clustering
http://scikit-learn.org/stable/modules/clustering.html#clustering
• sklearn.cluster provides algorithms for grouping unlabeled data (a minimal usage sketch follows below)
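A minimal usage sketch of sklearn.cluster, fitting KMeans on synthetic 2-D blobs; the number of clusters and the dataset are invented for the example.

    # Clustering sketch: group unlabeled 2-D points with KMeans.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(km.labels_[:10])       # cluster assigned to the first 10 samples
    print(km.cluster_centers_)   # coordinates of the 3 centroids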
Cluster Analysis Overview
Goal: Organize similar items into groups
Cluster Analysis Examples
• Segment customer base into groups
• Characterize different weather patterns for a region
• Group news articles into topics
• Discover crime hot spots
• NLP: Group sets of similar texts
• Documents: Automatic categorization (driver's license, ID, passport)
• Marketing: Client profiles
Cluster Analysis
• Divides data into clusters
• Similar items are placed in the same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized
Distance – main focus
Euclidean Distance
Distance – other methods
[Figure: two points A and B compared under Cosine Similarity and Manhattan Distance]
► Minkowski Distance (norm p)
► Manhattan distance (p = 1)
(a short computation sketch follows below)
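A small sketch computing the distances named on these slides with scipy.spatial.distance; the two vectors are arbitrary.

    # Distance sketch: Euclidean (p = 2), Manhattan (p = 1), general Minkowski, cosine.
    import numpy as np
    from scipy.spatial.distance import cityblock, cosine, euclidean, minkowski

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 0.0, 4.0])

    print(euclidean(a, b))         # Minkowski distance with p = 2
    print(cityblock(a, b))         # Manhattan distance, p = 1
    print(minkowski(a, b, p=3))    # general Minkowski norm
    print(1 - cosine(a, b))        # cosine similarity (scipy returns the cosine distance)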
Distance d_m(x_1, x_2) I
► Euclidean distance (p = 2)
Distance d_m(x_1, x_2) II
► Matrix-based distance, with W symmetric positive definite: d_W(x_1, x_2) = √((x_1 − x_2)ᵀ W (x_1 − x_2))
► Mahalanobis distance: W = C⁻¹ with C the covariance matrix.
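A sketch of the Mahalanobis case, where W is the inverse covariance matrix estimated from synthetic data; the dataset and its size are arbitrary.

    # Mahalanobis distance sketch: W = C^-1, the inverse of the sample covariance matrix.
    import numpy as np
    from scipy.spatial.distance import mahalanobis

    X = np.random.default_rng(0).normal(size=(100, 3))
    C = np.cov(X, rowvar=False)      # covariance matrix of the data
    W = np.linalg.inv(C)             # W = C^-1

    print(mahalanobis(X[0], X[1], W))   # distance between the first two samples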
Distance between discrete values
► Let x_1 ∈ {c_1, …, c_k} and x_2 ∈ {d_1, …, d_h}
► Contingency table A(x_1, x_2) = [a_ij]
► a_ij: number of times x_1 = c_i AND x_2 = d_j
► Hamming distance: number of positions in which the two vectors differ
► Jaccard distance: d_J(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|
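A small sketch of these two discrete distances on binary vectors using scipy (note that scipy's hamming returns the fraction, not the count, of differing positions); the vectors are arbitrary.

    # Discrete distance sketch: Hamming and Jaccard on binary vectors.
    import numpy as np
    from scipy.spatial.distance import hamming, jaccard

    u = np.array([1, 0, 1, 1, 0, 1], dtype=bool)
    v = np.array([1, 1, 1, 0, 0, 1], dtype=bool)

    print(hamming(u, v))   # fraction of positions where u and v differ
    print(jaccard(u, v))   # 1 - |intersection| / |union| of the True entries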
Distance properties
Four properties of a metric
d_m : X × X → [0, ∞)
1. Non-negativity: d_m(x, y) ≥ 0
2. Symmetry: d_m(x, y) = d_m(y, x)
3. Identity: d_m(x, y) = 0 ⇔ x = y
4. Triangle inequality: d_m(x, y) ≤ d_m(x, z) + d_m(z, y)
Distance between Clusters
How to estimate d_m(C_1, C_2)?
Illustration
[Figure: four cluster-to-cluster distances illustrated: Single Linkage, Complete Linkage, Average Linkage, Centers of gravity]
How to evaluate the quality of a clustering?
error = distance between a sample and its centroid
squared error = error²
Sum the squared errors between all samples of a cluster and its centroid
Sum over all clusters: WSSE, the Within-Cluster Sum of Squared Errors (= intra-cluster inertia J_w)
(a computation sketch follows below)
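A sketch that computes the WSSE by hand on synthetic blobs and checks it against the inertia_ value reported by scikit-learn's KMeans; the data and k are arbitrary.

    # WSSE sketch: sum of squared distances of each sample to its cluster centroid.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    wsse = sum(((X[km.labels_ == k] - center) ** 2).sum()
               for k, center in enumerate(km.cluster_centers_))
    print(wsse, km.inertia_)   # the two values should agree up to rounding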
If WSSE_1 < WSSE_2, cluster set 1 is numerically better.
Caveats:
• This does not mean that cluster set 1 is more 'correct' than cluster set 2
• Larger values of k will always reduce WSSE
A Good Clustering
[Figure: four clusters C_1, …, C_4 with their centers of gravity g_1, …, g_4 and the global center g]
Total Inertia = Intra-Cluster Inertia + Inter-Cluster Inertia
Good partition? Minimise the intra-cluster inertia and maximise the inter-cluster inertia.
A Good Partition
[Figure: two partitions compared; left: high inter-cluster inertia and low intra-cluster inertia; right: low inter-cluster inertia and high intra-cluster inertia]
Terms used: Similarity and Dissimilarity
► Dissimilarity d_m: small value → points are close (e.g. a distance)
  d_m(x, z) = ‖x − z‖₂²
► Similarity s_m: big value → points are close (e.g. RBF)
  s_m(x, z) = exp(−‖x − z‖² / σ)
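A sketch of both notions: the squared Euclidean distance as a dissimilarity and an RBF kernel as a similarity; scikit-learn's rbf_kernel uses a gamma parameter that plays the role of 1/σ, and the points and gamma value below are arbitrary.

    # Dissimilarity vs. similarity sketch.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    x = np.array([[0.0, 0.0]])
    z = np.array([[1.0, 2.0]])

    dissimilarity = np.sum((x - z) ** 2)            # ||x - z||_2^2, small = close
    similarity = rbf_kernel(x, z, gamma=0.5)[0, 0]  # exp(-gamma * ||x - z||^2), big = close

    print(dissimilarity, similarity)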
Normalizing Input Variables
[Figure: Weight and Height features rescaled to comparable ranges (scaled values)]
Cluster Analysis Notes
• Unsupervised: there is no 'correct' clustering
• Clusters don't come with labels
• Interpretation and analysis are required to make sense of clustering results!
Uses of Cluster Results
• Data segmentation
• Analysis of each segment can provide insights
[Figure: customers segmented into science-fiction, non-fiction, and children's book buyers]
Uses of Cluster Results
• Categories for classifying new data
• New sample assigned to closest cluster
• Label of closest cluster used to classify the new sample
Uses of Cluster Results
• Labeled data for classification
• Cluster samples used as labeled data
[Figure: samples from the science-fiction cluster used as labeled data for those customers]
Uses of Cluster Results
• Basis for anomaly detection
• Cluster outliers are anomalies
[Figure: cluster outliers flagged as anomalies that require further analysis]
• Organize similar items into groups
• Analyzing clusters often leads to useful insights about the data
• Clusters require analysis and interpretation
Questions raised:
► Data Nature: Binary, texts, numeric, trees, …
► Similarity between data
► What is a cluster?
► What is a good cluster?
► How many clusters?
► Which algorithm?
► Evaluation of clustering results
Clustering Methods
► Many methods exist …
► Hierarchical Clustering
  ► Agglomerative Clustering
    ► Distances used
    ► Agglomeration strategies
► Partitioning Clustering
  ► K-means and derivatives
  ► DBSCAN
  ► Spectral Clustering
  ► …
► Model-based Clustering
  ► Gaussian Mixture Models
  ► One-Class SVM
k-Means Clustering
k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
    Assign each sample to the closest centroid
    Calculate the mean of each cluster to determine the new centroid
Until some stopping criterion is reached
(a NumPy sketch of this loop follows below)
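A minimal NumPy sketch of that loop, written from the pseudocode above (for real use, sklearn.cluster.KMeans is the obvious choice); it assumes no cluster ever becomes empty, and all sizes and seeds are arbitrary.

    # Hand-rolled k-means sketch following the pseudocode above.
    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # k initial centroids
        for _ in range(n_iter):
            # assign each sample to the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each centroid as the mean of its cluster
            # (simplification: assumes no cluster becomes empty)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
                break
            centroids = new_centroids
        return labels, centroids

    X = np.random.default_rng(1).normal(size=(200, 2))
    labels, centroids = kmeans(X, k=3)
    print(centroids)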
K-means for Clustering
Purpose
► D = {x_i ∈ R^d}, i = 1, …, N
► Clustering into K < N clusters C_k
Brute Force
1. Build all possible partitions
2. Evaluate each clustering and keep the best one
Problem
The number of possible clusterings increases exponentially.
For N = 10 and K = 4, there are 34,105 possible clusterings!
K-means for Clustering
• A better solution
► Minimize the intra-cluster inertia J_w = Σ_k Σ_{x_i ∈ C_k} ‖x_i − µ_k‖² with respect to µ_k, k = 1, …, K
► Use of a heuristic: we will obtain a good clustering, but not necessarily the best one according to J_w
K-means for Clustering
A famous algorithm: K-means
1. Start from centers of gravity µ_k, k = 1, …, K
2. Assign each x_i to the closest cluster C_l
3. Recompute µ_k for each C_k, k = 1, …, K
4. Repeat until convergence
K-means algorithm
[Figure: formal statement of the K-means algorithm]
K-Means: illustration
Clustering in K = 2 clusters
[Figure: six panels showing the ground truth, the initialisation, and the clusters obtained at iterations 1, 2, 3 and 5]
Choosing Initial Centroids
Issue:
Final clusters are sensitive to the initial centroids.
Solution:
Run k-means multiple times with different random initial centroids, and choose the best result (see the n_init sketch below).
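A sketch of this multi-start strategy as exposed in scikit-learn through the n_init parameter; the dataset and the numbers of restarts are arbitrary.

    # Multi-start sketch: KMeans restarts from several random initialisations
    # and keeps the run with the lowest inertia (WSSE).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    single_run = KMeans(n_clusters=4, n_init=1, random_state=0).fit(X)
    multi_run = KMeans(n_clusters=4, n_init=20, random_state=0).fit(X)

    print(single_run.inertia_, multi_run.inertia_)   # the multi-start inertia should not be worse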
Choosing Value for k
• Approaches:
  • Visualization
  • Application-dependent
  • Data-driven
How to choose the number of clusters?
K clusters
► Hard problem; depends on the data
► Fixed a priori by the application
► Search for the best partition for different K > 1; look for a break in the decrease of J_w(K)
► Constrain the density and/or volume of clusters
► Use criteria to evaluate clusterings:
  ► Compute a clustering for each K = 1, …, K_max
  ► Compute the criterion J(K)
  ► Choose K* as the K with the best criterion value
Elbow Method for Choosing k
[Figure: the 'elbow' in the curve suggests the value for k should be 3]
(a sketch of the computation follows below)
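A sketch of how such an elbow plot can be produced: fit KMeans for a range of k values and plot the WSSE (KMeans' inertia_); the dataset and the range of k are arbitrary.

    # Elbow method sketch: WSSE (inertia) as a function of k.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    ks = range(1, 10)
    wsse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(ks, wsse, marker="o")
    plt.xlabel("k")
    plt.ylabel("WSSE (inertia)")
    plt.show()   # look for the 'elbow' where the curve flattens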
K-Means: Discussion
► J_w decreases at each iteration
► It converges towards a local minimum of J_w
► Quick convergence
► Initialisation of the µ_k:
  ► Randomly within the domain of the x_i
  ► Pick K points at random among X
► Different initializations lead to different clusterings
Stopping Criteria
When to stop iterating?
• No changes to the centroids
• Number of samples changing clusters is below a threshold
Some criteria
[Figures: examples of criteria for evaluating clusterings, shown over two slides]
Interpreting Results
• Examine cluster centroids
• How are clusters different?
[Figure: compare centroids to see how the clusters differ]
K-Means Summary
• Classic algorithm for cluster analysis
• Simple to understand and implement, and efficient
• The value of k must be specified
• Final clusters are sensitive to the initial centroids