Clustering
k-means Clustering
https://en.wikipedia.org/wiki/K-means_clustering
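k-means partitions the rows of a data matrix into k clusters so as to minimize the within-cluster sum of squared distances to each cluster's mean. As a rough illustration (a sketch added here, not part of the original notebook), that objective can be computed directly with NumPy:

import numpy as np

def within_cluster_ss(X, labels, centers):
    # Within-cluster sum of squares: the quantity k-means tries to minimize
    return sum(((X[labels == k] - centers[k]) ** 2).sum()
               for k in range(len(centers)))

# Toy example: two obvious clusters on a line
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(within_cluster_ss(X, labels, centers))  # 0.04: each cluster contributes 0.02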
UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets.php
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
In [2]: style.use('default')
In [3]: iris_df = sns.load_dataset('iris')
iris_df
Out[3]: sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
150 rows × 5 columns
In [4]: iris_df['species'].value_counts()
Out[4]: virginica 50
setosa 50
versicolor 50
Name: species, dtype: int64
In [5]: plt.figure(figsize = (7,7))
sns.scatterplot(data = iris_df, x = 'sepal_length', y = 'sepal_width', hue = 'species')
Out[5]: <AxesSubplot:xlabel='sepal_length', ylabel='sepal_width'>
In [6]: plt.figure(figsize = (7,7))
sns.scatterplot(data = iris_df, x = 'sepal_length', y = 'petal_width', hue = 'species')
Out[6]: <AxesSubplot:xlabel='sepal_length', ylabel='petal_width'>
In [7]: plt.figure(figsize = (5,5))
sns.scatterplot(data = iris_df, x = 'petal_length', y = 'petal_width', hue = 'species')
Out[7]: <AxesSubplot:xlabel='petal_length', ylabel='petal_width'>
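Each of the three scatterplots above shows a single pair of features. As an optional shortcut (not part of the original notebook), seaborn's pairplot draws every pairwise combination at once, which makes it easy to see which feature pairs separate the species most cleanly:

# Optional: all pairwise feature relationships in one grid, colored by species
sns.pairplot(iris_df, hue = 'species', height = 2.0)
plt.show()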
Raw Coding the k-means Clustering Algorithm
In [8]: from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=0, num_iter=100):
    # 1. Randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    iteration = 1
    while True:
        # 2a. Assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        # 2b. Find new centers from means of points
        new_centers = np.array([X[labels == k].mean(0)
                                for k in range(n_clusters)])
        # 2c. Check for convergence (or stop after num_iter iterations)
        print(num_iter, iteration)
        iteration += 1
        if iteration > num_iter:
            break
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels

X = iris_df.iloc[:, :-1].to_numpy()
centers = []
labels = []
for i in [1, 2, 5, 10]:
    out_center, out_label = find_clusters(X, 3, num_iter=i, rseed=0)
    centers.append(out_center)
    labels.append(out_label)
1 1
2 1
2 2
5 1
5 2
5 3
5 4
5 5
10 1
10 2
10 3
10 4
10 5
10 6
10 7
10 8
10 9
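Step 2a above relies on pairwise_distances_argmin, which returns, for each row of the first array, the index of the nearest row in the second array. A small self-contained example (added here for illustration):

import numpy as np
from sklearn.metrics import pairwise_distances_argmin

points = np.array([[0.0, 0.0],
                   [0.2, 0.1],
                   [5.0, 5.0]])
centers = np.array([[0.0, 0.0],
                    [5.0, 5.0]])

# Index of the closest center for each point -> array([0, 0, 1])
print(pairwise_distances_argmin(points, centers))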
In [9]: iris_df
Out[9]: sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
150 rows × 5 columns
In [10]: import matplotlib.gridspec as gridspec

fig2 = plt.figure(constrained_layout=True, figsize = (7,7))
spec2 = gridspec.GridSpec(ncols=2, nrows=2, figure=fig2)

f2_ax1 = fig2.add_subplot(spec2[0, 0])
sns.scatterplot(ax = f2_ax1, x = iris_df['sepal_length'], y = iris_df['petal_width'], hue = labels[0])
f2_ax1.scatter(centers[0][:, 0], centers[0][:, -1], marker = '*', color = 'royalblue')
f2_ax1.set_title('Number of Iterations = 1')

f2_ax2 = fig2.add_subplot(spec2[0, 1])
sns.scatterplot(ax = f2_ax2, x = iris_df['sepal_length'], y = iris_df['petal_width'], hue = labels[1])
f2_ax2.scatter(centers[1][:, 0], centers[1][:, -1], marker = '*', color = 'royalblue')
f2_ax2.set_title('Number of Iterations = 2')

f2_ax3 = fig2.add_subplot(spec2[1, 0])
sns.scatterplot(ax = f2_ax3, x = iris_df['sepal_length'], y = iris_df['petal_width'], hue = labels[2])
f2_ax3.scatter(centers[2][:, 0], centers[2][:, -1], marker = '*', color = 'royalblue')
f2_ax3.set_title('Number of Iterations = 5')

f2_ax4 = fig2.add_subplot(spec2[1, 1])
sns.scatterplot(ax = f2_ax4, x = iris_df['sepal_length'], y = iris_df['petal_width'], hue = labels[3])
f2_ax4.scatter(centers[3][:, 0], centers[3][:, -1], marker = '*', color = 'royalblue')
f2_ax4.set_title('Number of Iterations = 10')

fig2.suptitle('Clustering at various iterations')
plt.show()
Using sklearn
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
In [11]: from sklearn.cluster import KMeans
kmc = KMeans(n_clusters=3, max_iter=600, algorithm = 'full')
X = iris_df.iloc[:, :-1]
kmc.fit(X)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:882: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
  f"KMeans is known to have a memory leak on Windows "
Out[11]: KMeans(algorithm='full', max_iter=600, n_clusters=3)
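Once fitted, the KMeans object can label new observations with predict and reports the final within-cluster sum of squares as inertia_. The measurements below are made up purely for illustration:

# Hypothetical new flowers: (sepal_length, sepal_width, petal_length, petal_width)
new_samples = [[5.0, 3.4, 1.5, 0.2],   # setosa-like
               [6.7, 3.1, 5.6, 2.4]]   # virginica-like
print(kmc.predict(new_samples))   # cluster index assigned to each new sample
print(kmc.inertia_)               # within-cluster sum of squares after fitting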
In [12]: kmc.cluster_centers_
Out[12]: array([[5.006 , 3.428 , 1.462 , 0.246 ],
[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[6.85 , 3.07368421, 5.74210526, 2.07105263]])
In [13]: kmc.labels_
Out[13]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1])
In [14]: pd.crosstab(kmc.labels_, iris_df['species'])
Out[14]: species setosa versicolor virginica
row_0
0 50 0 0
1 0 48 14
2 0 2 36
Metrics
https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
In [15]: from sklearn.metrics import silhouette_score
cluster_df = pd.DataFrame(kmc.labels_, columns = ['Cluster ID'])
# cluster_df
full_cluster_df = pd.concat([X.reset_index(drop = True), cluster_df], axis = 1)
full_cluster_df
silhouette_score(full_cluster_df, kmc.labels_, metric='euclidean')
Out[15]: 0.6128676734836785
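Note that full_cluster_df still contains the Cluster ID column, so the score above treats the cluster label itself as an extra feature. Silhouette is normally computed on the measurement columns alone, and since the true species are known here, a label-aware metric such as the adjusted Rand index can also be reported. A minimal sketch (the values will differ from Out[15]):

from sklearn.metrics import silhouette_score, adjusted_rand_score

# Silhouette on the four measurement columns only (no cluster-ID column)
print(silhouette_score(X, kmc.labels_, metric = 'euclidean'))

# Agreement between the k-means assignment and the true species labels
print(adjusted_rand_score(iris_df['species'], kmc.labels_))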
In [ ]: