
Commit 797fd1a

ArturoAmorQ, betatim, jeremiedbb, and glemaitre committed
DOC Rework k-means assumptions example (#24970)
Co-authored-by: Tim Head <[email protected]>
Co-authored-by: Jérémie du Boisberranger <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
1 parent 020fda0 commit 797fd1a

File tree

2 files changed: +149 −40 lines


doc/modules/clustering.rst

Lines changed: 1 addition & 1 deletion
@@ -170,7 +170,7 @@ It suffers from various drawbacks:
 k-means clustering can alleviate this problem and speed up the
 computations.
 
-.. image:: ../auto_examples/cluster/images/sphx_glr_plot_kmeans_assumptions_001.png
+.. image:: ../auto_examples/cluster/images/sphx_glr_plot_kmeans_assumptions_002.png
    :target: ../auto_examples/cluster/plot_kmeans_assumptions.html
    :align: center
    :scale: 50
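
The clustering.rst paragraph around this hunk (restated in the example's new "Final remarks" section below) recommends running a dimensionality reduction algorithm before k-means when Euclidean distances become inflated in high dimensions. A minimal sketch of that pattern, assuming a hypothetical 100-feature blob dataset and an arbitrary n_components=10; it is an illustration, not code from this commit:

# Sketch (not part of the commit): PCA before KMeans on a hypothetical
# high-dimensional dataset; the shapes and n_components are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X_high, _ = make_blobs(n_samples=1500, n_features=100, centers=3, random_state=170)

# Project onto a few principal components, then cluster in the reduced space.
pca_kmeans = make_pipeline(
    PCA(n_components=10),
    KMeans(n_clusters=3, n_init=10, random_state=170),
)
labels = pca_kmeans.fit_predict(X_high)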

examples/cluster/plot_kmeans_assumptions.py

Lines changed: 148 additions & 39 deletions
@@ -3,67 +3,176 @@
 Demonstration of k-means assumptions
 ====================================
 
-This example is meant to illustrate situations where k-means will produce
-unintuitive and possibly unexpected clusters. In the first three plots, the
-input data does not conform to some implicit assumption that k-means makes and
-undesirable clusters are produced as a result. In the last plot, k-means
-returns intuitive clusters despite unevenly sized blobs.
+This example is meant to illustrate situations where k-means produces
+unintuitive and possibly undesirable clusters.
 
 """
 
 # Author: Phil Roth <[email protected]>
+# Arturo Amor <[email protected]>
 # License: BSD 3 clause
 
-import numpy as np
-import matplotlib.pyplot as plt
+# %%
+# Data generation
+# ---------------
+#
+# The function :func:`~sklearn.datasets.make_blobs` generates isotropic
+# (spherical) gaussian blobs. To obtain anisotropic (elliptical) gaussian blobs
+# one has to define a linear `transformation`.
 
-from sklearn.cluster import KMeans
+import numpy as np
 from sklearn.datasets import make_blobs
 
-plt.figure(figsize=(12, 12))
-
 n_samples = 1500
 random_state = 170
+transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
+
 X, y = make_blobs(n_samples=n_samples, random_state=random_state)
+X_aniso = np.dot(X, transformation)  # Anisotropic blobs
+X_varied, y_varied = make_blobs(
+    n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state
+)  # Unequal variance
+X_filtered = np.vstack(
+    (X[y == 0][:500], X[y == 1][:100], X[y == 2][:10])
+)  # Unevenly sized blobs
+y_filtered = [0] * 500 + [1] * 100 + [2] * 10
 
-# Incorrect number of clusters
-y_pred = KMeans(n_clusters=2, n_init="auto", random_state=random_state).fit_predict(X)
+# %%
+# We can visualize the resulting data:
 
-plt.subplot(221)
-plt.scatter(X[:, 0], X[:, 1], c=y_pred)
-plt.title("Incorrect Number of Blobs")
+import matplotlib.pyplot as plt
 
-# Anisotropicly distributed data
-transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
-X_aniso = np.dot(X, transformation)
-y_pred = KMeans(n_clusters=3, n_init="auto", random_state=random_state).fit_predict(
-    X_aniso
-)
+fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
 
-plt.subplot(222)
-plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
-plt.title("Anisotropicly Distributed Blobs")
+axs[0, 0].scatter(X[:, 0], X[:, 1], c=y)
+axs[0, 0].set_title("Mixture of Gaussian Blobs")
 
-# Different variance
-X_varied, y_varied = make_blobs(
-    n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state
-)
-y_pred = KMeans(n_clusters=3, n_init="auto", random_state=random_state).fit_predict(
-    X_varied
-)
+axs[0, 1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y)
+axs[0, 1].set_title("Anisotropically Distributed Blobs")
+
+axs[1, 0].scatter(X_varied[:, 0], X_varied[:, 1], c=y_varied)
+axs[1, 0].set_title("Unequal Variance")
+
+axs[1, 1].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_filtered)
+axs[1, 1].set_title("Unevenly Sized Blobs")
+
+plt.suptitle("Ground truth clusters").set_y(0.95)
+plt.show()
+
+# %%
+# Fit models and plot results
+# ---------------------------
+#
+# The previously generated data is now used to show how
+# :class:`~sklearn.cluster.KMeans` behaves in the following scenarios:
+#
+# - Non-optimal number of clusters: in a real setting there is no uniquely
+#   defined **true** number of clusters. An appropriate number of clusters has
+#   to be decided from data-based criteria and knowledge of the intended goal.
+# - Anisotropically distributed blobs: k-means consists of minimizing sample's
+#   euclidean distances to the centroid of the cluster they are assigned to. As
+#   a consequence, k-means is more appropriate for clusters that are isotropic
+#   and normally distributed (i.e. spherical gaussians).
+# - Unequal variance: k-means is equivalent to taking the maximum likelihood
+#   estimator for a "mixture" of k gaussian distributions with the same
+#   variances but with possibly different means.
+# - Unevenly sized blobs: there is no theoretical result about k-means that
+#   states that it requires similar cluster sizes to perform well, yet
+#   minimizing euclidean distances does mean that the more sparse and
+#   high-dimensional the problem is, the higher is the need to run the algorithm
+#   with different centroid seeds to ensure a global minimal inertia.
 
-plt.subplot(223)
-plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
-plt.title("Unequal Variance")
+from sklearn.cluster import KMeans
+
+common_params = {
+    "n_init": "auto",
+    "random_state": random_state,
+}
+
+fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
+
+y_pred = KMeans(n_clusters=2, **common_params).fit_predict(X)
+axs[0, 0].scatter(X[:, 0], X[:, 1], c=y_pred)
+axs[0, 0].set_title("Non-optimal Number of Clusters")
+
+y_pred = KMeans(n_clusters=3, **common_params).fit_predict(X_aniso)
+axs[0, 1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
+axs[0, 1].set_title("Anisotropically Distributed Blobs")
+
+y_pred = KMeans(n_clusters=3, **common_params).fit_predict(X_varied)
+axs[1, 0].scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
+axs[1, 0].set_title("Unequal Variance")
+
+y_pred = KMeans(n_clusters=3, **common_params).fit_predict(X_filtered)
+axs[1, 1].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
+axs[1, 1].set_title("Unevenly Sized Blobs")
+
+plt.suptitle("Unexpected KMeans clusters").set_y(0.95)
+plt.show()
+
+# %%
+# Possible solutions
+# ------------------
+#
+# For an example on how to find a correct number of blobs, see
+# :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.
+# In this case it suffices to set `n_clusters=3`.
+
+y_pred = KMeans(n_clusters=3, **common_params).fit_predict(X)
+plt.scatter(X[:, 0], X[:, 1], c=y_pred)
+plt.title("Optimal Number of Clusters")
+plt.show()
+
+# %%
+# To deal with unevenly sized blobs one can increase the number of random
+# initializations. In this case we set `n_init=10` to avoid finding a
+# sub-optimal local minimum. For more details see :ref:`kmeans_sparse_high_dim`.
 
-# Unevenly sized blobs
-X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
 y_pred = KMeans(n_clusters=3, n_init=10, random_state=random_state).fit_predict(
     X_filtered
 )
-
-plt.subplot(224)
 plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
-plt.title("Unevenly Sized Blobs")
+plt.title("Unevenly Sized Blobs \nwith several initializations")
+plt.show()
+
+# %%
+# As anisotropic and unequal variances are real limitations of the k-means
+# algorithm, here we propose instead the use of
+# :class:`~sklearn.mixture.GaussianMixture`, which also assumes gaussian
+# clusters but does not impose any constraints on their variances. Notice that
+# one still has to find the correct number of blobs (see
+# :ref:`sphx_glr_auto_examples_mixture_plot_gmm_selection.py`).
+#
+# For an example on how other clustering methods deal with anisotropic or
+# unequal variance blobs, see the example
+# :ref:`sphx_glr_auto_examples_cluster_plot_cluster_comparison.py`.
 
+from sklearn.mixture import GaussianMixture
+
+fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
+
+y_pred = GaussianMixture(n_components=3).fit_predict(X_aniso)
+ax1.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
+ax1.set_title("Anisotropically Distributed Blobs")
+
+y_pred = GaussianMixture(n_components=3).fit_predict(X_varied)
+ax2.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
+ax2.set_title("Unequal Variance")
+
+plt.suptitle("Gaussian mixture clusters").set_y(0.95)
 plt.show()
+
+# %%
+# Final remarks
+# -------------
+#
+# In high-dimensional spaces, Euclidean distances tend to become inflated
+# (not shown in this example). Running a dimensionality reduction algorithm
+# prior to k-means clustering can alleviate this problem and speed up the
+# computations (see the example
+# :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`).
+#
+# In the case where clusters are known to be isotropic, have similar variance
+# and are not too sparse, the k-means algorithm is quite effective and is one of
+# the fastest clustering algorithms available. This advantage is lost if one has
+# to restart it several times to avoid convergence to a local minimum.
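
The "Possible solutions" section added above points to the silhouette-analysis example for choosing the number of clusters but does not reproduce it. A rough sketch of that idea, assuming an arbitrary candidate range of 2 to 6 clusters (not code from this commit):

# Sketch (not part of the commit): choose n_clusters by maximizing the
# silhouette score over an assumed candidate range.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1500, random_state=170)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=170).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # expected to select 3 for these blobs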
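
The same section notes that :class:`~sklearn.mixture.GaussianMixture` still requires choosing the number of components and refers readers to the GMM selection example. One common approach is to minimize an information criterion such as the BIC; the candidate range below is an assumption for illustration (not code from this commit):

# Sketch (not part of the commit): pick n_components for GaussianMixture by
# minimizing the Bayesian information criterion (BIC).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1500, random_state=170)
X_aniso = np.dot(X, [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]])

bic_per_k = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=170).fit(X_aniso)
    bic_per_k.append(gm.bic(X_aniso))

best_k = int(np.argmin(bic_per_k)) + 1  # candidates start at k=1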

0 commit comments
