Machine Learning is…
… learning from data
… on its own
… discovering hidden patterns
… data-driven decisions
Supervised Learning
Purpose
Given a dataset {(x_i, y_i) ∈ X × Y, i = 1, …, N}, learn the dependencies between X and Y.
► Example: Learn the links between cardiac risk and food habits. x_i describes one person by d features concerning their food habits; y_i is a binary category (risky, not risky).
► The labels y_i are essential for the learning process.
► Methods: K-Nearest Neighbors, SVM, Decision Tree, … (a minimal sketch follows below)
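A minimal scikit-learn sketch of supervised learning with K-Nearest Neighbors; the features, labels, and dataset sizes are synthetic stand-ins invented for this illustration, not the slide's actual cardiac-risk data.

    # Hypothetical supervised-learning example on synthetic "food habits" data.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))              # 200 people, d = 5 invented features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # invented binary label: risky / not risky

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))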
Unsupervised Learning
Purpose
From observations {x_i ∈ X, i = 1, …, N}, learn the organisation of X and discover homogeneous subsets.
► Example: Categorize customers. x_i encodes a customer with features describing their social condition and behavior.
► Methods: Hierarchical clustering, K-Means, Reinforcement learning, …
Semi-supervised Learning
Purpose
Within a dataset, only a small part of the samples have a corresponding label, i.e. {(x_1, y_1), …, (x_k, y_k), x_{k+1}, …, x_N}. The goal is to infer the classes of the unlabeled data.
► Example: Filter webpages. The number of webpages is tremendous; only a few of them can be labeled by an expert.
► Methods: Bayesian methods, SVM, Graph Neural Networks, … (see the sketch below)
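As a hedged sketch of this setting, the snippet below uses scikit-learn's LabelSpreading (one possible semi-supervised method, not necessarily one of those listed above) on synthetic blobs where most labels are hidden; every name and number is illustrative.

    # Semi-supervised sketch: only a handful of samples keep their labels,
    # the rest are marked -1 (unlabeled) and inferred by LabelSpreading.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
    y_partial = np.full_like(y_true, -1)                              # -1 means "no label"
    keep = np.random.default_rng(0).choice(len(y_true), size=15, replace=False)
    y_partial[keep] = y_true[keep]                                    # keep only 15 "expert" labels

    model = LabelSpreading().fit(X, y_partial)
    print("accuracy on all samples:", (model.transduction_ == y_true).mean())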
Supervised vs. Unsupervised
• Supervised Approaches
  • Target (what the model is predicting) is provided
  • 'Labeled' data
  • Classification & regression are supervised.
• Unsupervised Approaches
  • Target is unknown or unavailable
  • 'Unlabeled' data
  • Cluster analysis & association analysis are unsupervised.
Categories of Machine Learning Techniques
Supervised (target is available): Classification, Regression
Unsupervised (target is not available): Cluster Analysis, Association Analysis
Classification
Goal: Predict a category.
[Figure: weather photos labeled Sunny, Windy, Rainy, Cloudy]
Image source: http://www.davidson.k12.nc.us/parents students/inclement_weather
Regression
Goal: Predict a numeric value.
Cluster Analysis
Goal: Organize similar items into groups.
[Figure: customers grouped as Seniors, Adults, Teenagers]
Image source: http://www.monetate.com/blog/the-intrinsic-value-of-customer-segmentation
Association Analysis
Goal: Find rules to capture associations between items.
scikit-learn
• Open-source library for Machine Learning in Python
• Built on top of NumPy, SciPy, matplotlib
• Active community for development
• Improved continuously by developers
Preprocessing Tools
• Utility functions for
  • Transforming raw feature vectors into a suitable format
• Provides an API for (see the sketch below)
  • Scaling of features: remove the mean and scale to unit variance
  • Normalization to unit norm
  • Binarization to turn data into a 0/1 format
  • One-Hot Encoding for categorical features
  • Handling of missing values
  • Generating higher-order features
  • Building custom transformations
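A minimal sketch of two of these preprocessing utilities, StandardScaler and OneHotEncoder; the tiny height/weight/color arrays are made up purely to show the API.

    # Preprocessing sketch: standard scaling for numeric features,
    # one-hot encoding for a categorical feature.
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X_num = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]])   # e.g. height, weight
    X_cat = np.array([["red"], ["blue"], ["red"]])                    # a categorical feature

    X_scaled = StandardScaler().fit_transform(X_num)           # zero mean, unit variance per column
    X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()  # one column per category

    print(X_scaled)
    print(X_onehot)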
Different Tasks
► Supervised Learning
► Unsupervised Learning
► Semi Supervised Learning
The scikit-learn documentation provides organized tutorials and detailed guides.
http://scikit-learn.org/stable/documentation.html
Dimensionality Reduction
• Enables you to reduce the number of features while preserving most of the variance (a PCA sketch follows the list below)
• scikit-learn has capabilities for:
• Principal Component Analysis (PCA)
• Singular Value Decomposition
• Factor Analysis
• Independent Component Analysis
• Matrix Factorization
• Latent Dirichlet Allocation
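A short sketch of one of these techniques, PCA; the random data and the choice of two components are arbitrary and only demonstrate the API.

    # Dimensionality reduction sketch: project 10-dimensional data onto 2 principal components.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(100, 10))   # 100 samples, 10 features
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                  # (100, 2)
    print(pca.explained_variance_ratio_)    # share of variance kept by each component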
Model Selection
• Provides methods for cross-validation
• Library functions for tuning hyperparameters
• Model evaluation mechanisms to measure model performance
• Plotting methods for visualizing scores to evaluate models (see the sketch below)
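A hedged sketch of cross-validation and hyperparameter tuning with scikit-learn; the KNN model, the iris dataset, and the parameter grid are arbitrary choices for illustration.

    # Model selection sketch: cross-validation score, then a grid search over n_neighbors.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # 5-fold cross-validation for a fixed model
    print(cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean())

    # Hyperparameter tuning: try several values of n_neighbors
    grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)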
Summary of scikit-learn
• Extensive set of tools for the full Machine Learning pipeline
• Dependable due to community support
• Provides an easy-to-use API for training models and making predictions
• Collects the most popular algorithms in one place
Clustering
Clustering
http://scikit-learn.org/stable/modules/clustering.html#clustering
• sklearn.cluster provides algorithms for grouping unlabeled data (a minimal usage sketch follows below)
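A minimal usage sketch of sklearn.cluster, fitting KMeans on synthetic 2-D blobs; the number of clusters and the dataset are invented for the example.

    # Clustering sketch: group unlabeled 2-D points with KMeans.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(km.labels_[:10])       # cluster assigned to the first 10 samples
    print(km.cluster_centers_)   # coordinates of the 3 centroids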
Cluster Analysis Overview
Goal: Organize similar items into groups
Cluster Analysis Examples
• Segment customer base into groups
• Characterize different weather patterns for a region
• Group news articles into topics
• Discover crime hot spots
• NLP: Group sets of similar texts
• Documents: Automatic categorization (driver's license, ID, passport)
• Marketing: Client profiles
Cluster Analysis
• Divides data into clusters
• Similar items are placed in the same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized
Distance – main focus
Euclidean Distance
Distance – other methods
[Figure: two points A and B compared under Cosine Similarity and Manhattan Distance]
► Minkowski Distance (norm p)
► Manhattan distance (p = 1)
(a short computation sketch follows below)
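A small sketch computing the distances named on these slides with scipy.spatial.distance; the two vectors are arbitrary.

    # Distance sketch: Euclidean (p = 2), Manhattan (p = 1), general Minkowski, cosine.
    import numpy as np
    from scipy.spatial.distance import cityblock, cosine, euclidean, minkowski

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 0.0, 4.0])

    print(euclidean(a, b))         # Minkowski distance with p = 2
    print(cityblock(a, b))         # Manhattan distance, p = 1
    print(minkowski(a, b, p=3))    # general Minkowski norm
    print(1 - cosine(a, b))        # cosine similarity (scipy returns the cosine distance)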
Distance d_m(x_1, x_2) I
► Euclidean distance (p = 2)
Distance d_m(x_1, x_2) II
► Matrix-based distance, with W symmetric positive definite: d_W(x_1, x_2) = √((x_1 − x_2)ᵀ W (x_1 − x_2))
► Mahalanobis distance: W = C⁻¹ with C the covariance matrix.
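A sketch of the Mahalanobis case, where W is the inverse covariance matrix estimated from synthetic data; the dataset and its size are arbitrary.

    # Mahalanobis distance sketch: W = C^-1, the inverse of the sample covariance matrix.
    import numpy as np
    from scipy.spatial.distance import mahalanobis

    X = np.random.default_rng(0).normal(size=(100, 3))
    C = np.cov(X, rowvar=False)      # covariance matrix of the data
    W = np.linalg.inv(C)             # W = C^-1

    print(mahalanobis(X[0], X[1], W))   # distance between the first two samples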
Distance between discrete values
► Let x_1 ∈ {c_1, …, c_k} and x_2 ∈ {d_1, …, d_h}
► Contingency table A(x_1, x_2) = [a_ij]
► a_ij: number of times x_1 = c_i AND x_2 = d_j
► Hamming distance: number of positions in which the two vectors differ
► Jaccard distance: d_J(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|
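A small sketch of these two discrete distances on binary vectors using scipy (note that scipy's hamming returns the fraction, not the count, of differing positions); the vectors are arbitrary.

    # Discrete distance sketch: Hamming and Jaccard on binary vectors.
    import numpy as np
    from scipy.spatial.distance import hamming, jaccard

    u = np.array([1, 0, 1, 1, 0, 1], dtype=bool)
    v = np.array([1, 1, 1, 0, 0, 1], dtype=bool)

    print(hamming(u, v))   # fraction of positions where u and v differ
    print(jaccard(u, v))   # 1 - |intersection| / |union| of the True entries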
Distance properties
Four properties of a metric
d_m : X × X → [0, ∞)
1. Non-negativity: d_m(x, y) ≥ 0
2. Symmetry: d_m(x, y) = d_m(y, x)
3. Identity: d_m(x, y) = 0 ⇔ x = y
4. Triangle inequality: d_m(x, y) ≤ d_m(x, z) + d_m(z, y)
Distance between Clusters
How to estimate d_m(C_1, C_2)?
Illustration
[Figure: four cluster-to-cluster distances illustrated: Single Linkage, Complete Linkage, Average Linkage, Centers of gravity]
How to evaluate the quality of a clustering?
error = distance between a sample and its centroid
squared error = error²
Sum the squared errors between all samples of a cluster and its centroid
Sum over all clusters: WSSE, the Within-Cluster Sum of Squared Errors (= intra-cluster inertia J_w)
(a computation sketch follows below)
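A sketch that computes the WSSE by hand on synthetic blobs and checks it against the inertia_ value reported by scikit-learn's KMeans; the data and k are arbitrary.

    # WSSE sketch: sum of squared distances of each sample to its cluster centroid.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    wsse = sum(((X[km.labels_ == k] - center) ** 2).sum()
               for k, center in enumerate(km.cluster_centers_))
    print(wsse, km.inertia_)   # the two values should agree up to rounding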
If WSSE_1 < WSSE_2, cluster set 1 is numerically better.
Caveats:
• This does not mean that cluster set 1 is more 'correct' than cluster set 2
• Larger values of k will always reduce WSSE
A Good Clustering
[Figure: four clusters C_1, …, C_4 with their centers of gravity g_1, …, g_4 and the global center g]
Total Inertia = Intra-Cluster Inertia + Inter-Cluster Inertia
Good partition? Minimise the intra-cluster inertia and maximise the inter-cluster inertia.
A Good Partition
[Figure: two partitions compared; left: high inter-cluster inertia and low intra-cluster inertia; right: low inter-cluster inertia and high intra-cluster inertia]
Terms used: Similarity and Dissimilarity
► Dissimilarity d_m: small value → points are close (e.g. a distance)
  d_m(x, z) = ‖x − z‖₂²
► Similarity s_m: big value → points are close (e.g. RBF)
  s_m(x, z) = exp(−‖x − z‖² / σ)
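A sketch of both notions: the squared Euclidean distance as a dissimilarity and an RBF kernel as a similarity; scikit-learn's rbf_kernel uses a gamma parameter that plays the role of 1/σ, and the points and gamma value below are arbitrary.

    # Dissimilarity vs. similarity sketch.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    x = np.array([[0.0, 0.0]])
    z = np.array([[1.0, 2.0]])

    dissimilarity = np.sum((x - z) ** 2)            # ||x - z||_2^2, small = close
    similarity = rbf_kernel(x, z, gamma=0.5)[0, 0]  # exp(-gamma * ||x - z||^2), big = close

    print(dissimilarity, similarity)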
Normalizing Input Variables
[Figure: Weight and Height features rescaled to comparable ranges (scaled values)]
Cluster Analysis Notes
• Unsupervised: there is no 'correct' clustering
• Clusters don't come with labels
• Interpretation and analysis are required to make sense of clustering results!
Uses of Cluster Results
• Data segmentation
• Analysis of each segment can provide insights
[Figure: customers segmented into science-fiction, non-fiction, and children's book buyers]
Uses of Cluster Results
• Categories for classifying new data
• New sample assigned to closest cluster
• Label of closest cluster used to classify the new sample
Uses of Cluster Results
• Labeled data for classification
• Cluster samples used as labeled data
[Figure: samples from the science-fiction cluster used as labeled data for those customers]
Uses of Cluster Results
• Basis for anomaly detection
• Cluster outliers are anomalies
[Figure: cluster outliers flagged as anomalies that require further analysis]
• Organize similar items into groups
• Analyzing clusters often leads to useful insights about the data
• Clusters require analysis and interpretation
Questions raised:
► Data Nature: Binary, texts, numeric, trees, …
► Similarity between data
► What is a cluster?
► What is a good cluster?
► How many clusters?
► Which algorithm?
► Evaluation of clustering results
Clustering Methods
► Many methods exist …
► Hierarchical Clustering
  ► Agglomerative Clustering
    ► Distances used
    ► Agglomeration strategies
► Partitioning Clustering
  ► K-means and derivatives
  ► DBSCAN
  ► Spectral Clustering
  ► …
► Model-based Clustering
  ► Gaussian Mixture Models
  ► One-Class SVM
k-Means Clustering
k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
    Assign each sample to the closest centroid
    Calculate the mean of each cluster to determine the new centroid
Until some stopping criterion is reached
(a NumPy sketch of this loop follows below)
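A minimal NumPy sketch of that loop, written from the pseudocode above (for real use, sklearn.cluster.KMeans is the obvious choice); it assumes no cluster ever becomes empty, and all sizes and seeds are arbitrary.

    # Hand-rolled k-means sketch following the pseudocode above.
    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # k initial centroids
        for _ in range(n_iter):
            # assign each sample to the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each centroid as the mean of its cluster
            # (simplification: assumes no cluster becomes empty)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
                break
            centroids = new_centroids
        return labels, centroids

    X = np.random.default_rng(1).normal(size=(200, 2))
    labels, centroids = kmeans(X, k=3)
    print(centroids)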
K-means for Clustering
Purpose
► D = {x_i ∈ R^d}, i = 1, …, N
► Clustering into K < N clusters C_k
Brute Force
1. Build all possible partitions
2. Evaluate each clustering and keep the best one
Problem
The number of possible clusterings increases exponentially.
For N = 10 and K = 4, there are 34,105 possible clusterings!
K-means for Clustering
• A better solution
► Minimize the intra-cluster inertia J_w = Σ_k Σ_{x_i ∈ C_k} ‖x_i − µ_k‖² with respect to µ_k, k = 1, …, K
► Use of a heuristic: we will obtain a good clustering, but not necessarily the best one according to J_w
K-means for Clustering
A famous algorithm: K-means
1. Start from centers of gravity µ_k, k = 1, …, K
2. Assign each x_i to the closest cluster C_l
3. Recompute µ_k for each C_k, k = 1, …, K
4. Repeat until convergence
K-means algorithm
[Figure: formal statement of the K-means algorithm]
K-Means: illustration
Clustering in K = 2 clusters
[Figure: six panels showing the ground truth, the initialisation, and the clusters obtained at iterations 1, 2, 3 and 5]
Choosing Initial Centroids
Issue:
Final clusters are sensitive to the initial centroids.
Solution:
Run k-means multiple times with different random initial centroids, and choose the best result (see the n_init sketch below).
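A sketch of this multi-start strategy as exposed in scikit-learn through the n_init parameter; the dataset and the numbers of restarts are arbitrary.

    # Multi-start sketch: KMeans restarts from several random initialisations
    # and keeps the run with the lowest inertia (WSSE).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    single_run = KMeans(n_clusters=4, n_init=1, random_state=0).fit(X)
    multi_run = KMeans(n_clusters=4, n_init=20, random_state=0).fit(X)

    print(single_run.inertia_, multi_run.inertia_)   # the multi-start inertia should not be worse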
Choosing Value for k
• Approaches:
  • Visualization
  • Application-dependent
  • Data-driven
How to choose the number of clusters?
K clusters
► Hard problem; depends on the data
► Fixed a priori by the application
► Search for the best partition for different K > 1; look for a break in the decrease of J_w(K)
► Constrain the density and/or volume of clusters
► Use criteria to evaluate clusterings:
  ► Compute a clustering for each K = 1, …, K_max
  ► Compute the criterion J(K)
  ► Choose K* as the K with the best criterion value
Elbow Method for Choosing k
[Figure: the 'elbow' in the curve suggests the value for k should be 3]
(a sketch of the computation follows below)
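A sketch of how such an elbow plot can be produced: fit KMeans for a range of k values and plot the WSSE (KMeans' inertia_); the dataset and the range of k are arbitrary.

    # Elbow method sketch: WSSE (inertia) as a function of k.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    ks = range(1, 10)
    wsse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(ks, wsse, marker="o")
    plt.xlabel("k")
    plt.ylabel("WSSE (inertia)")
    plt.show()   # look for the 'elbow' where the curve flattens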
K-Means: Discussion
► J_w decreases at each iteration
► It converges towards a local minimum of J_w
► Quick convergence
► Initialisation of the µ_k:
  ► Randomly within the domain of the x_i
  ► Pick K points at random among X
► Different initializations lead to different clusterings
Stopping Criteria
When to stop iterating?
• No changes to the centroids
• Number of samples changing clusters is below a threshold
Some criteria
[Figures: examples of criteria for evaluating clusterings, shown over two slides]
Interpreting Results
• Examine cluster centroids
• How are clusters different?
[Figure: compare centroids to see how the clusters differ]
K-Means Summary
• Classic algorithm for cluster analysis
• Simple to understand and implement, and efficient
• The value of k must be specified
• Final clusters are sensitive to the initial centroids