Merged
Commits
61 commits
f25a0eb
First cut at basic single linkage internals
lmcinnes Jul 15, 2017
2ed4799
Refer to correct dist_metrics package
lmcinnes Jul 15, 2017
acfbddf
Add csgraph sparse implementation for single linkage
lmcinnes Jul 15, 2017
2d5a95e
Add fast labelling/conversion from MST to single linkage tree; remove…
lmcinnes Jul 15, 2017
b5fa65b
Ensure existing tests cover single linkage
lmcinnes Jul 15, 2017
2d25d1c
Name cingle linkage labelling correctly.
lmcinnes Jul 15, 2017
0a14920
Iterating toward correct solution. Still have to get n_clusters, comp…
lmcinnes Jul 15, 2017
71a3c98
Get n_components correct.
lmcinnes Jul 15, 2017
801ffa1
Update docstrings.
lmcinnes Jul 15, 2017
c84496f
Fix the parents array when we don't get the "full tree"
lmcinnes Jul 15, 2017
8b291ad
Add single linkage to agglomerative clustering example.
lmcinnes Jul 15, 2017
fc97792
Add single linkage to digits agglomerative clustering example.
lmcinnes Jul 15, 2017
b187fb5
Update documentation to reflect the addition of single linkage.
lmcinnes Jul 15, 2017
aa50b07
Update documentation to reflect the addition of single linkage.
lmcinnes Jul 15, 2017
5d838bc
Pep8 fix for class declaration in cython
lmcinnes Jul 15, 2017
b5ba340
Fix heading in clustering docs
lmcinnes Jul 15, 2017
67e63a1
Update the digits clustering text to reflect the new reality.
lmcinnes Jul 15, 2017
73b8f4c
Provide a more complete comparison of the different linkage methods, …
lmcinnes Jul 15, 2017
2895849
We don't need connectivity here, and we can ignore issues with warnin…
lmcinnes Jul 15, 2017
3fc770f
Add an explicit test that single linkage successfully works on exampl…
lmcinnes Jul 15, 2017
c83c896
Update docs with a more complete comparison on linkage methods (scale…
lmcinnes Jul 15, 2017
e9234be
List formatting in example linkage comparison.
lmcinnes Jul 15, 2017
3e1017e
Flake8 fixes.
lmcinnes Jul 16, 2017
9ec7534
Flake8 fixes.
lmcinnes Jul 16, 2017
f5b9077
More Flake8 fixes.
lmcinnes Jul 16, 2017
345ddd7
Fix agglomerative plot example with correct subplot spec
lmcinnes Jul 16, 2017
d0f709b
Explicitly test linkages (including single) produce results identical…
lmcinnes Jul 16, 2017
3eed324
Fix comment on why we sort (consistency)
lmcinnes Jul 16, 2017
0e1b511
Merge branch 'master' into single_linkage_clustering
lmcinnes Jul 24, 2017
55f4d72
Fix indentation issue on line 799
lmcinnes Nov 23, 2017
d6d6e65
Docstring for single_linkage_label
lmcinnes Nov 23, 2017
a0613eb
Various fixes for jnothman's detailed comments.
lmcinnes Nov 28, 2017
5f9207e
Merge branch 'master' into single_linkage_clustering
lmcinnes Nov 28, 2017
6f8af80
Further corrections in cython (memoryviews all around in UnionFind)
lmcinnes Nov 28, 2017
627eed3
Update WhatsNew for single linkage clustering.
lmcinnes Nov 28, 2017
47f7e96
Merge branch 'master' into single_linkage_clustering
lmcinnes Dec 12, 2017
c6eaf47
Resync with master to get doc fixes
lmcinnes Dec 13, 2017
d5ffddd
Merge remote-tracking branch 'origin/single_linkage_clustering' into …
lmcinnes Dec 13, 2017
b737aac
Address Jake's concerns.
lmcinnes Jan 16, 2018
2a3e59c
Merge branch 'master' into single_linkage_clustering
lmcinnes Jan 16, 2018
3a8d505
Handle true zero distances by setting them to "epsilon" distances
lmcinnes Jan 16, 2018
0cca718
Merge remote-tracking branch 'origin/single_linkage_clustering' into …
lmcinnes Jan 16, 2018
cb35449
Missed the memory view direct assignment fix.
lmcinnes Jan 16, 2018
b9c23e1
Missed .data in array fancy indexing for epsilon in place of zero val…
lmcinnes Jan 16, 2018
276d265
Add test for identical points messing with sparse linkage clustering.
lmcinnes Jan 17, 2018
d33db41
Missing comma in test data declaration
lmcinnes Jan 17, 2018
f8b818e
Merge branch 'master' into single_linkage_clustering
lmcinnes Jan 17, 2018
7bbaf7f
Correct arguments to _fix_connectivity
lmcinnes Jan 17, 2018
1ec7beb
Flake8 fixes for new test.
lmcinnes Jan 17, 2018
cbd9b80
More flake8 fixes for new test.
lmcinnes Jan 17, 2018
219b2e5
More flake8 fixes for new test.
lmcinnes Jan 18, 2018
239e8f8
Test all the linkage methods for identical point issues
lmcinnes Jan 18, 2018
5e4c22d
Remove comment; fix epsilon values
lmcinnes Jan 18, 2018
7aae411
Cast precomputed distances to float64 for consistency
lmcinnes Jan 18, 2018
3d42400
Turn bounds checking off; add docsting warning.
lmcinnes Jan 18, 2018
df6b9ce
Function spacing formatting issue
lmcinnes Jan 18, 2018
8e0b38c
Make public and private versions of labelling.
lmcinnes Jan 18, 2018
5abc614
more efficient is sorted check
lmcinnes Jan 18, 2018
3f73d98
Explicit cast to cover all bases
lmcinnes Jan 18, 2018
9bb2355
Address various issue in documentation and examples.
lmcinnes Jan 20, 2018
66571ec
COSMIT: cosmetic changes
GaelVaroquaux Jan 22, 2018
33 changes: 14 additions & 19 deletions doc/modules/clustering.rst
@@ -567,30 +567,24 @@ considers at each step all the possible merges.
number of features. It is a dimensionality reduction tool, see
:ref:`data_reduction`.

Different linkage type: Ward, complete and average linkage
-----------------------------------------------------------
Different linkage type: Ward, complete, average, and single linkage
--------------------------------------------------------------------

:class:`AgglomerativeClustering` supports Ward, average, and complete
:class:`AgglomerativeClustering` supports Ward, single, average, and complete
linkage strategies.

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_digits_linkage_001.png
:target: ../auto_examples/cluster/plot_digits_linkage.html
.. image:: ../auto_examples/cluster/images/sphx_glr_plot_linkage_comparison_001.png
:target: ../auto_examples/cluster/plot_linkage_comparison.html
:scale: 43

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_digits_linkage_002.png
:target: ../auto_examples/cluster/plot_digits_linkage.html
:scale: 43

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_digits_linkage_003.png
:target: ../auto_examples/cluster/plot_digits_linkage.html
:scale: 43


Agglomerative cluster has a "rich get richer" behavior that leads to
uneven cluster sizes. In this regard, complete linkage is the worst
uneven cluster sizes. In this regard, single linkage is the worst
strategy, and Ward gives the most regular sizes. However, the affinity
(or distance used in clustering) cannot be varied with Ward, thus for non
Euclidean metrics, average linkage is a good alternative.
Euclidean metrics, average linkage is a good alternative. Single linkage,
while not robust to noisy data, can be computed very efficiently and can
therefore be useful to provide hierarchical clustering of larger datasets.
Single linkage can also perform well on non-globular data.
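
As a quick editorial illustration of this behaviour (a sketch, not part of the
diff; it assumes a scikit-learn build that already includes ``linkage='single'``
from this PR, and the dataset parameters below are arbitrary illustration
choices), the cluster-size imbalance is easy to inspect directly:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# noisy, partially overlapping blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=0)

for linkage in ('ward', 'average', 'complete', 'single'):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    # cluster sizes: Ward tends to be the most balanced, single linkage the least
    print(linkage, np.bincount(labels))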

.. topic:: Examples:

@@ -652,15 +646,16 @@ enable only merging of neighboring pixels on an image, as in the

* :ref:`sphx_glr_auto_examples_cluster_plot_agglomerative_clustering.py`

.. warning:: **Connectivity constraints with average and complete linkage**
.. warning:: **Connectivity constraints with single, average and complete linkage**

Connectivity constraints and complete or average linkage can enhance
Connectivity constraints and single, complete or average linkage can enhance
the 'rich getting richer' aspect of agglomerative clustering,
particularly so if they are built with
:func:`sklearn.neighbors.kneighbors_graph`. In the limit of a small
number of clusters, they tend to give a few macroscopically occupied
clusters and almost empty ones. (see the discussion in
:ref:`sphx_glr_auto_examples_cluster_plot_agglomerative_clustering.py`).
Single linkage is the most brittle linkage option with regard to this issue.
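
For reference, a minimal sketch of such a constraint (editorial, not part of
the diff; :func:`sklearn.neighbors.kneighbors_graph` and the ``connectivity``
parameter already exist, while ``linkage='single'`` assumes this PR):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=200, noise=.05, random_state=0)

# restrict merges to a sparse k-nearest-neighbours graph
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
model = AgglomerativeClustering(n_clusters=2, linkage='single',
                                connectivity=connectivity)
labels = model.fit_predict(X)

With a very small ``n_neighbors`` the effect described in the warning above
becomes easier to reproduce.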

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_agglomerative_clustering_001.png
:target: ../auto_examples/cluster/plot_agglomerative_clustering.html
@@ -682,7 +677,7 @@ enable only merging of neighboring pixels on an image, as in the
Varying the metric
-------------------

Average and complete linkage can be used with a variety of distances (or
Single, average and complete linkage can be used with a variety of distances (or
affinities), in particular Euclidean distance (*l2*), Manhattan distance
(or Cityblock, or *l1*), cosine distance, or any precomputed affinity
matrix.
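
As an editorial sketch of the above (not part of the diff; the keyword is
``affinity`` in the API targeted by this PR, and the data below is an
arbitrary illustration choice):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Manhattan (l1) distance with average linkage
average_l1 = AgglomerativeClustering(n_clusters=3, linkage='average',
                                     affinity='manhattan').fit(X)

# precomputed cosine distances with the new single linkage
D = pairwise_distances(X, metric='cosine')
single_pre = AgglomerativeClustering(n_clusters=3, linkage='single',
                                     affinity='precomputed').fit(D)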
11 changes: 6 additions & 5 deletions doc/whats_new/v0.20.rst
@@ -78,14 +78,15 @@ Model evaluation
- Added the :func:`metrics.balanced_accuracy_score` metric and a corresponding
``'balanced_accuracy'`` scorer for binary classification.
:issue:`8066` by :user:`xyguo` and :user:`Aman Dalmia <dalmia>`.

- Added :class:`multioutput.RegressorChain` for multi-target
regression. :issue:`9257` by :user:`Kumar Ashutosh <thechargedneutron>`.

- Added the :class:`preprocessing.TransformedTargetRegressor` which transforms
the target y before fitting a regression model. The predictions are mapped
back to the original space via an inverse transform. :issue:`9041` by
`Andreas Müller`_ and :user:`Guillaume Lemaitre <glemaitre>`.
Clustering

- :class:`cluster.AgglomerativeClustering` now supports Single Linkage
clustering via ``linkage='single'``. :issue:`9372` by
:user:`Leland McInnes <lmcinnes>` and :user:`Steve Astels <sastels>`.
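
A minimal editorial sketch of the new keyword (assumes scikit-learn 0.20 or
later, i.e. a build with this change merged):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
# two well separated groups -> one cluster each under single linkage
print(AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X))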


Enhancements
............
32 changes: 18 additions & 14 deletions examples/cluster/plot_agglomerative_clustering.py
@@ -9,17 +9,18 @@
Two consequences of imposing a connectivity can be seen. First clustering
with a connectivity matrix is much faster.

Second, when using a connectivity matrix, average and complete linkage are
unstable and tend to create a few clusters that grow very quickly. Indeed,
average and complete linkage fight this percolation behavior by considering all
the distances between two clusters when merging them. The connectivity
graph breaks this mechanism. This effect is more pronounced for very
sparse graphs (try decreasing the number of neighbors in
kneighbors_graph) and with complete linkage. In particular, having a very
small number of neighbors in the graph, imposes a geometry that is
close to that of single linkage, which is well known to have this
percolation instability.
"""
Second, when using a connectivity matrix, single, average and complete
linkage are unstable and tend to create a few clusters that grow very
quickly. Indeed, average and complete linkage fight this percolation behavior
by considering all the distances between two clusters when merging them (
while single linkage exaggerates the behaviour by considering only the
shortest distance between clusters). The connectivity graph breaks this
mechanism for average and complete linkage, making them resemble the more
brittle single linkage. This effect is more pronounced for very sparse graphs
(try decreasing the number of neighbors in kneighbors_graph) and with
complete linkage. In particular, having a very small number of neighbors in
the graph, imposes a geometry that is close to that of single linkage,
which is well known to have this percolation instability. """
# Authors: Gael Varoquaux, Nelle Varoquaux
# License: BSD 3 clause

@@ -52,8 +53,11 @@
 for connectivity in (None, knn_graph):
     for n_clusters in (30, 3):
         plt.figure(figsize=(10, 4))
-        for index, linkage in enumerate(('average', 'complete', 'ward')):
-            plt.subplot(1, 3, index + 1)
+        for index, linkage in enumerate(('average',
+                                         'complete',
+                                         'ward',
+                                         'single')):
+            plt.subplot(1, 4, index + 1)
             model = AgglomerativeClustering(linkage=linkage,
                                             connectivity=connectivity,
                                             n_clusters=n_clusters)
@@ -62,7 +66,7 @@
             elapsed_time = time.time() - t0
             plt.scatter(X[:, 0], X[:, 1], c=model.labels_,
                         cmap=plt.cm.spectral)
-            plt.title('linkage=%s (time %.2fs)' % (linkage, elapsed_time),
+            plt.title('linkage=%s\n(time %.2fs)' % (linkage, elapsed_time),
                       fontdict=dict(verticalalignment='top'))
             plt.axis('equal')
             plt.axis('off')
12 changes: 7 additions & 5 deletions examples/cluster/plot_digits_linkage.py
@@ -12,8 +12,10 @@

What this example shows us is the behavior "rich getting richer" of
agglomerative clustering that tends to create uneven cluster sizes.
This behavior is especially pronounced for the average linkage strategy,
that ends up with a couple of singleton clusters.
This behavior is pronounced for the average linkage strategy,
that ends up with a couple of singleton clusters, while in the case
of single linkage we get a single central cluster with all other clusters
being drawn from noise points around the fringes.
"""

# Authors: Gael Varoquaux
@@ -69,7 +71,7 @@ def plot_clustering(X_red, X, labels, title=None):
     if title is not None:
         plt.title(title, size=17)
     plt.axis('off')
-    plt.tight_layout()
+    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
 
 #----------------------------------------------------------------------
 # 2D embedding of the digits dataset
@@ -79,11 +81,11 @@ def plot_clustering(X_red, X, labels, title=None):

 from sklearn.cluster import AgglomerativeClustering
 
-for linkage in ('ward', 'average', 'complete'):
+for linkage in ('ward', 'average', 'complete', 'single'):

Member:
With the added subplot, the figure got a bit more narrow and the titles are not well separated. I think that it would be useful to add a "\n" in the title between the name of the linkage and the timing.

@lmcinnes lmcinnes (Contributor Author) Jan 20, 2018:
Sorry, but I am not quite clear exactly what you would like (certainly the titles on the plots are a little tight). I've made some adjustments, but would welcome any further clarification as I suspect I am missing something here.

Edit: Ah -- you are referring to examples/cluster/plot_agglomerative_clustering.py I suspect. I can certainly fix that.

     clustering = AgglomerativeClustering(linkage=linkage, n_clusters=10)
     t0 = time()
     clustering.fit(X_red)
-    print("%s : %.2fs" % (linkage, time() - t0))
+    print("%s :\t%.2fs" % (linkage, time() - t0))
 
     plot_clustering(X_red, X, clustering.labels_, "%s linkage" % linkage)

149 changes: 149 additions & 0 deletions examples/cluster/plot_linkage_comparison.py
@@ -0,0 +1,149 @@
"""
================================================================
Comparing different hierarchical linkage methods on toy datasets
================================================================

This example shows characteristics of different linkage
methods for hierarchical clustering on datasets that are
"interesting" but still in 2D.

The main observations to make are:

- single linkage is fast, and can perform well on
non-globular data, but it performs poorly in the
presence of noise.
- average and complete linkage perform well on
cleanly separated globular clusters, but have mixed
results otherwise.
- Ward is the most effective method for noisy data.

While these examples give some intuition about the
algorithms, this intuition might not apply to very high
dimensional data.
"""
print(__doc__)

import time
import warnings

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice

np.random.seed(0)

######################################################################
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times

n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

# Anisotropicly distributed data
random_state = 170
X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
aniso = (X_aniso, y)

# blobs with varied variances
varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.5],
                             random_state=random_state)

######################################################################
# Run the clustering and plot

# Set up cluster parameters
plt.figure(figsize=(9 * 1.3 + 2, 14.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

plot_num = 1

default_base = {'n_neighbors': 10,
                'n_clusters': 3}

datasets = [
    (noisy_circles, {'n_clusters': 2}),
    (noisy_moons, {'n_clusters': 2}),
    (varied, {'n_neighbors': 2}),
    (aniso, {'n_neighbors': 2}),
    (blobs, {}),
    (no_structure, {})]

for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # update parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # ============
    # Create cluster objects
    # ============
    ward = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='ward')
    complete = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='complete')
    average = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='average')
    single = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='single')

    clustering_algorithms = (
        ('Single Linkage', single),
        ('Average Linkage', average),
        ('Complete Linkage', complete),
        ('Ward Linkage', ward),
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()

        # catch warnings related to kneighbors_graph
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the " +
                "connectivity matrix is [0-9]{1,2}" +
                " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning)
            algorithm.fit(X)

        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(np.int)
        else:
            y_pred = algorithm.predict(X)

        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                             '#f781bf', '#a65628', '#984ea3',
                                             '#999999', '#e41a1c', '#dede00']),
                                      int(max(y_pred) + 1))))
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])

        plt.xlim(-2.5, 2.5)
        plt.ylim(-2.5, 2.5)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()