[MRG+2] Single linkage clustering #9372
Merged
GaelVaroquaux
merged 61 commits into
scikit-learn:master
from
lmcinnes:single_linkage_clustering
Jan 22, 2018
f25a0eb
First cut at basic single linkage internals
lmcinnes 2ed4799
Refer to correct dist_metrics package
lmcinnes acfbddf
Add csgraph sparse implementation for single linkage
lmcinnes 2d5a95e
Add fast labelling/conversion from MST to single linkage tree; remove…
lmcinnes b5fa65b
Ensure existing tests cover single linkage
lmcinnes 2d25d1c
Name cingle linkage labelling correctly.
lmcinnes 0a14920
Iterating toward correct solution. Still have to get n_clusters, comp…
lmcinnes 71a3c98
Get n_components correct.
lmcinnes 801ffa1
Update docstrings.
lmcinnes c84496f
Fix the parents array when we don't get the "full tree"
lmcinnes 8b291ad
Add single linkage to agglomerative clustering example.
lmcinnes fc97792
Add single linkage to digits agglomerative clustering example.
lmcinnes b187fb5
Update documentation to reflect the addition of single linkage.
lmcinnes aa50b07
Update documentation to reflect the addition of single linkage.
lmcinnes 5d838bc
Pep8 fix for class declaration in cython
lmcinnes b5ba340
Fix heading in clustering docs
lmcinnes 67e63a1
Update the digits clustering text to reflect the new reality.
lmcinnes 73b8f4c
Provide a more complete comparison of the different linkage methods, …
lmcinnes 2895849
We don't need connectivity here, and we can ignore issues with warnin…
lmcinnes 3fc770f
Add an explicit test that single linkage successfully works on exampl…
lmcinnes c83c896
Update docs with a more complete comparison on linkage methods (scale…
lmcinnes e9234be
List formatting in example linkage comparison.
lmcinnes 3e1017e
Flake8 fixes.
lmcinnes 9ec7534
Flake8 fixes.
lmcinnes f5b9077
More Flake8 fixes.
lmcinnes 345ddd7
Fix agglomerative plot example with correct subplot spec
lmcinnes d0f709b
Explicitly test linkages (including single) produce results identical…
lmcinnes 3eed324
Fix comment on why we sort (consistency)
lmcinnes 0e1b511
Merge branch 'master' into single_linkage_clustering
lmcinnes 55f4d72
Fix indentation issue on line 799
lmcinnes d6d6e65
Docstring for single_linkage_label
lmcinnes a0613eb
Various fixes for jnothman's detailed comments.
lmcinnes 5f9207e
Merge branch 'master' into single_linkage_clustering
lmcinnes 6f8af80
Further corrections in cython (memoryviews all around in UnionFind)
lmcinnes 627eed3
Update WhatsNew for single linkage clustering.
lmcinnes 47f7e96
Merge branch 'master' into single_linkage_clustering
lmcinnes c6eaf47
Resync with master to get doc fixes
lmcinnes d5ffddd
Merge remote-tracking branch 'origin/single_linkage_clustering' into …
lmcinnes b737aac
Address Jake's concerns.
lmcinnes 2a3e59c
Merge branch 'master' into single_linkage_clustering
lmcinnes 3a8d505
Handle true zero distances by setting them to "epsilon" distances
lmcinnes 0cca718
Merge remote-tracking branch 'origin/single_linkage_clustering' into …
lmcinnes cb35449
Missed the memory view direct assignment fix.
lmcinnes b9c23e1
Missed .data in array fancy indexing for epsilon in place of zero val…
lmcinnes 276d265
Add test for identical points messing with sparse linkage clustering.
lmcinnes d33db41
Missing comma in test data declaration
lmcinnes f8b818e
Merge branch 'master' into single_linkage_clustering
lmcinnes 7bbaf7f
Correct arguments to _fix_connectivity
lmcinnes 1ec7beb
Flake8 fixes for new test.
lmcinnes cbd9b80
More flake8 fixes for new test.
lmcinnes 219b2e5
More flake8 fixes for new test.
lmcinnes 239e8f8
Test all the linkage methods for identical point issues
lmcinnes 5e4c22d
Remove comment; fix epsilon values
lmcinnes 7aae411
Cast precomputed distances to float64 for consistency
lmcinnes 3d42400
Turn bounds checking off; add docsting warning.
lmcinnes df6b9ce
Function spacing formatting issue
lmcinnes 8e0b38c
Make public and private versions of labelling.
lmcinnes 5abc614
more efficient is sorted check
lmcinnes 3f73d98
Explicit cast to cover all bases
lmcinnes 9bb2355
Address various issue in documentation and examples.
lmcinnes 66571ec
COSMIT: cosmetic changes
GaelVaroquaux
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,149 @@ | ||
| """ | ||
| ================================================================ | ||
| Comparing different hierarchical linkage methods on toy datasets | ||
| ================================================================ | ||
|
|
||
| This example shows characteristics of different linkage | ||
| methods for hierarchical clustering on datasets that are | ||
| "interesting" but still in 2D. | ||
|
|
||
| The main observations to make are: | ||
|
|
||
| - single linkage is fast, and can perform well on | ||
| non-globular data, but it performs poorly in the | ||
| presence of noise. | ||
| - average and complete linkage perform well on | ||
| cleanly separated globular clusters, but have mixed | ||
| results otherwise. | ||
| - Ward is the most effective method for noisy data. | ||
|
|
||
| While these examples give some intuition about the | ||
| algorithms, this intuition might not apply to very high | ||
| dimensional data. | ||
| """ | ||
| print(__doc__) | ||
|
|
||
| import time | ||
| import warnings | ||
|
|
||
| import numpy as np | ||
| import matplotlib.pyplot as plt | ||
|
|
||
| from sklearn import cluster, datasets | ||
| from sklearn.preprocessing import StandardScaler | ||
| from itertools import cycle, islice | ||
|
|
||
| np.random.seed(0) | ||
|
|
||
| ###################################################################### | ||
| # Generate datasets. We choose the size big enough to see the scalability | ||
| # of the algorithms, but not too big to avoid too long running times | ||
|
|
||
| n_samples = 1500 | ||
| noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, | ||
| noise=.05) | ||
| noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05) | ||
| blobs = datasets.make_blobs(n_samples=n_samples, random_state=8) | ||
| no_structure = np.random.rand(n_samples, 2), None | ||
|
|
||
| # Anisotropically distributed data | |
| random_state = 170 | ||
| X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state) | ||
| transformation = [[0.6, -0.6], [-0.4, 0.8]] | ||
| X_aniso = np.dot(X, transformation) | ||
| aniso = (X_aniso, y) | ||
|
|
||
| # blobs with varied variances | ||
| varied = datasets.make_blobs(n_samples=n_samples, | ||
| cluster_std=[1.0, 2.5, 0.5], | ||
| random_state=random_state) | ||
|
|
||
| ###################################################################### | ||
| # Run the clustering and plot | ||
|
|
||
| # Set up cluster parameters | ||
| plt.figure(figsize=(9 * 1.3 + 2, 14.5)) | ||
| plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, | ||
| hspace=.01) | ||
|
|
||
| plot_num = 1 | ||
|
|
||
| default_base = {'n_neighbors': 10, | ||
| 'n_clusters': 3} | ||
|
|
||
| datasets = [ | ||
| (noisy_circles, {'n_clusters': 2}), | ||
| (noisy_moons, {'n_clusters': 2}), | ||
| (varied, {'n_neighbors': 2}), | ||
| (aniso, {'n_neighbors': 2}), | ||
| (blobs, {}), | ||
| (no_structure, {})] | ||
|
|
||
| for i_dataset, (dataset, algo_params) in enumerate(datasets): | ||
| # update parameters with dataset-specific values | ||
| params = default_base.copy() | ||
| params.update(algo_params) | ||
|
|
||
| X, y = dataset | ||
|
|
||
| # normalize dataset for easier parameter selection | ||
| X = StandardScaler().fit_transform(X) | ||
|
|
||
| # ============ | ||
| # Create cluster objects | ||
| # ============ | ||
| ward = cluster.AgglomerativeClustering( | ||
| n_clusters=params['n_clusters'], linkage='ward') | ||
| complete = cluster.AgglomerativeClustering( | ||
| n_clusters=params['n_clusters'], linkage='complete') | ||
| average = cluster.AgglomerativeClustering( | ||
| n_clusters=params['n_clusters'], linkage='average') | ||
| single = cluster.AgglomerativeClustering( | ||
| n_clusters=params['n_clusters'], linkage='single') | ||
|
|
||
| clustering_algorithms = ( | ||
| ('Single Linkage', single), | ||
| ('Average Linkage', average), | ||
| ('Complete Linkage', complete), | ||
| ('Ward Linkage', ward), | ||
| ) | ||
|
|
||
| for name, algorithm in clustering_algorithms: | ||
| t0 = time.time() | ||
|
|
||
| # catch warnings related to kneighbors_graph | ||
| with warnings.catch_warnings(): | ||
| warnings.filterwarnings( | ||
| "ignore", | ||
| message="the number of connected components of the " + | ||
| "connectivity matrix is [0-9]{1,2}" + | ||
| " > 1. Completing it to avoid stopping the tree early.", | ||
| category=UserWarning) | ||
| algorithm.fit(X) | ||
|
|
||
| t1 = time.time() | ||
| if hasattr(algorithm, 'labels_'): | ||
| y_pred = algorithm.labels_.astype(int) | |
| else: | ||
| y_pred = algorithm.predict(X) | ||
|
|
||
| plt.subplot(len(datasets), len(clustering_algorithms), plot_num) | ||
| if i_dataset == 0: | ||
| plt.title(name, size=18) | ||
|
|
||
| colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a', | ||
| '#f781bf', '#a65628', '#984ea3', | ||
| '#999999', '#e41a1c', '#dede00']), | ||
| int(max(y_pred) + 1)))) | ||
| plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred]) | ||
|
|
||
| plt.xlim(-2.5, 2.5) | ||
| plt.ylim(-2.5, 2.5) | ||
| plt.xticks(()) | ||
| plt.yticks(()) | ||
| plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'), | ||
| transform=plt.gca().transAxes, size=15, | ||
| horizontalalignment='right') | ||
| plot_num += 1 | ||
|
|
||
| plt.show() |
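
The docstring's first observation (single linkage can do well on non-globular data) can be checked numerically without any plotting. The following is a minimal sketch, not part of this PR: it assumes scikit-learn 0.20 or later (where `linkage='single'` is available), uses noise-free two-moons data, and scores each linkage with the adjusted Rand index. The sample size and the `scores` dictionary are illustrative choices.

```python
# Minimal sketch: single vs. Ward linkage on noise-free two-moons data.
# Assumes scikit-learn >= 0.20, where linkage='single' is available.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half circles with no added noise
X, y = make_moons(n_samples=200, noise=0.0, random_state=0)

scores = {}
for linkage in ("single", "ward"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    # Adjusted Rand index: 1.0 means the clusters match the true moons
    scores[linkage] = adjusted_rand_score(y, labels)
    print("%s linkage: ARI = %.2f" % (linkage, scores[linkage]))
```

Raising `noise` in `make_moons` is one way to probe the docstring's caveat that single linkage degrades in the presence of noise.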
With the added subplot, the figure became a bit narrower and the titles are not well separated. I think it would be useful to add a "\n" in the title between the name of the linkage and the timing.
Sorry, but I am not quite clear exactly what you would like (certainly the titles on the plots are a little tight). I've made some adjustments, but would welcome any further clarification as I suspect I am missing something here.
Edit: Ah, you are referring to `examples/cluster/plot_agglomerative_clustering.py`, I suspect. I can certainly fix that.
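
The reviewer's suggestion, a "\n" between the linkage name and the timing, might look roughly like the sketch below. The exact title format in `plot_agglomerative_clustering.py` is not shown in this thread, so the `format_title` helper and its format string are hypothetical, and the timing passed in is a placeholder rather than a measurement.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def format_title(linkage, elapsed_time):
    # Hypothetical helper: puts the timing on its own line so that
    # titles in narrow subplots do not run into each other.
    return "linkage=%s\n(time %.2fs)" % (linkage, elapsed_time)


fig, axes = plt.subplots(1, 4, figsize=(10, 3))
for ax, linkage in zip(axes, ("average", "complete", "ward", "single")):
    ax.set_title(format_title(linkage, 0.0), size=12)  # placeholder timing
    ax.axis("off")

titles = [ax.get_title() for ax in axes]
```

Splitting the title over two lines trades a little vertical space for horizontal room, which matters once a fourth subplot is added to the row.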