[MRG+2] Single linkage clustering #9372

lmcinnes · 2017-07-15T20:51:15Z

Work regarding issue #4103 -- addition of single linkage as a linkage option for hierarchical clustering.

These are the minimal changes required to get single linkage working. We take advantage of scipy.cluster for the basic single linkage, and scipy.sparse.csgraph for the connectivity-constrained case. Both of these could be improved upon at some point in the future if required.
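
For readers following along, here is a minimal sketch of the two building blocks mentioned above. It is not the code in this PR; the toy data `X` is an assumption made for illustration. It only shows that the merge heights from `scipy.cluster.hierarchy.linkage(..., method="single")` coincide with the sorted edge weights of the minimum spanning tree of the distance graph.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Toy 1-D data (an assumption for illustration only).
X = np.array([[0.0], [1.0], [3.0], [7.0]])

# Dense case: scipy computes the single linkage tree directly.
Z = linkage(X, method="single")

# Graph case: build the full distance graph and take its MST.
distance_graph = csr_matrix(squareform(pdist(X)))
mst = minimum_spanning_tree(distance_graph)

# Sorted MST edge weights should match the merge heights in column 2 of Z.
mst_weights = np.sort(mst.data)
```

With a connectivity constraint, the graph would contain only the allowed edges rather than all pairwise distances.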

I believe I caught most cases in the documentation where linkage strategies are mentioned and single linkage is relevant, but I admit I may not have caught them all.

amueller

needs tests ;)

amueller · 2017-07-15T20:52:21Z

doc/modules/clustering.rst

 (or distance used in clustering) cannot be varied with Ward, thus for non
-Euclidean metrics, average linkage is a good alternative.
+Euclidean metrics, average linkage is a good alternative. Single linkage,
+while not robust to noisy data, can computed very efficiently and can


be computed

amueller · 2017-07-15T20:55:18Z

sklearn/cluster/hierarchical.py

+        mst_array = mst_array[np.argsort(mst_array.T[2]),:]
+
+        # Convert edge list into standard hierarchical clustering format
+        single_linkage_tree = _hierarchical.single_linkage_label(mst_array)


is this faster than connected_components?

Oh, I guess we're computing the linkage tree here? Are we using this later? I feel like we usually are ok with just the clustering, but I don't use these algorithms often.

This converts the MST into scipy.cluster.hierarchy format. It just uses a union find so is pretty fast, and I already had code that does this (and was well tested), so it seemed the most efficient way to get something working. I am open to other options.
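
A pure-Python sketch of the idea, for illustration only: the PR's actual routine lives in Cython, and the function name `mst_to_linkage` is hypothetical. It assumes the MST edges arrive as rows of `(i, j, weight)` already sorted by weight, and uses a union-find to emit rows in the scipy.cluster.hierarchy layout `(left, right, distance, size)`.

```python
import numpy as np


def mst_to_linkage(mst_edges):
    """Convert weight-sorted MST edges (i, j, weight) to a linkage array.

    Hypothetical helper, not the PR's implementation.
    """
    n_samples = mst_edges.shape[0] + 1
    n_nodes = 2 * n_samples - 1
    parent = np.full(n_nodes, -1, dtype=np.intp)  # -1 marks a root
    size = np.ones(n_nodes, dtype=np.intp)        # leaves have size 1

    def find(node):
        # Walk up to the root, then compress the path behind us.
        root = node
        while parent[root] != -1:
            root = parent[root]
        while node != root:
            nxt = parent[node]
            parent[node] = root
            node = nxt
        return root

    result = np.zeros((n_samples - 1, 4))
    next_label = n_samples  # merged clusters are labelled n_samples, ...
    for row, (i, j, weight) in enumerate(mst_edges):
        left, right = find(int(i)), find(int(j))
        merged_size = size[left] + size[right]
        result[row] = (left, right, weight, merged_size)
        parent[left] = parent[right] = next_label
        size[next_label] = merged_size
        next_label += 1
    return result


edges = np.array([[0, 1, 1.0], [1, 2, 2.0], [2, 3, 4.0]])
tree = mst_to_linkage(edges)
```

Each MST edge produces exactly one merge, so the loop runs `n_samples - 1` times and each `find` is close to constant time.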

amueller · 2017-07-15T20:55:38Z

sklearn/cluster/hierarchical.py

+
+        # Compute parents
+        parent = np.arange(n_nodes, dtype=np.intp)
+        for i, (left, right) in enumerate(children_, n_samples):


Isn't this slow?

Relatively speaking, no. I can cythonize it if you are concerned.

if no, then please don't ;) I'd rather avoid cython if it's not necessary.
Can you maybe post a benchmark of this against the other linkage criteria?

Here is a scaling performance comparison for the sparse matrix case:

Most of the time in single linkage is actually spent fixing the connectivity matrix -- I'm factoring that out of the test and will post that soon, but it already looks pretty good.

The dense case is simply scipy.cluster.hierarchy, so I presume the performance of single linkage there is already demonstrated.

Here we are with the connectivity factored out (performed beforehand so it doesn't need to occur in the fit method).

sorry can you explain the difference between the two graphs again? The cythonization of the tree creation?

Both graphs use the same code in hierarchical.py; the difference is in the data passed in. In the first graph the sparse matrix that is passed in is not guaranteed to have a single connected component, which then requires some pre-processing to "fix" that by adding extra entries to the matrix. This pre-processing is common to all the linkage approaches, but isn't part of the algorithms per se (the fit method will do it if it is required). The second graph simply ensures that the required pre-processing is done before the fit method is called, so the timing is only for the actual linkage methods and no longer includes the common pre-processing step (which can be time consuming).
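
As a rough illustration of when that pre-processing is needed (not the code used by the fit method), one can count the connected components of a connectivity matrix with `scipy.sparse.csgraph.connected_components`. The small adjacency matrix below is made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Two disjoint pairs of points: {0, 1} and {2, 3} (illustrative data).
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])
connectivity = csr_matrix(adjacency)

n_components, labels = connected_components(connectivity, directed=False)

# More than one component means extra entries would have to be added
# before a single tree over all samples can be built.
needs_fixing = n_components > 1
```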

The iteration and tuple unpacking is likely to be slow relative to using children_.tolist(), but it's unimportant
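
A small sketch of the loop under discussion, using `children_.tolist()` as suggested. The diff hunk above is truncated, so the loop body here (`parent[left] = parent[right] = i`) is an assumption, as is the toy `children_` array.

```python
import numpy as np

# Toy merge list for 4 samples (assumed data): row k merges two nodes
# into the new node n_samples + k.
children_ = np.array([[0, 1], [4, 2], [5, 3]])
n_samples = 4
n_nodes = 2 * n_samples - 1

# Every node starts as its own parent; the root keeps that value.
parent = np.arange(n_nodes, dtype=np.intp)

# Iterating over plain Python lists avoids per-row NumPy scalar overhead.
for i, (left, right) in enumerate(children_.tolist(), n_samples):
    parent[left] = parent[right] = i
```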

amueller · 2017-07-15T21:00:56Z

I guess this is more in line with how the other algorithms are implemented, so maybe @GaelVaroquaux has a more informed opinion. Are we supporting a pre-specified graph here?

lmcinnes · 2017-07-15T21:03:50Z

With regard to tests: I modified the existing test suite to exercise single linkage in all the general linkage tests, so it is definitely getting tested. Did you want some new tests specific to single linkage? What did you have in mind?

With regard to pre-specified graphs: yes those are supported, presuming I am interpreting what you mean by that correctly (a connectivity (sparse) matrix constraining things).

GaelVaroquaux · 2017-07-15T21:04:11Z

sklearn/cluster/_hierarchical.pyx

+################################################################################
+# Efficient labelling/conversion of MSTs to single linkage hierarchies
+
+cdef class UnionFind (object):


To respect pep8, I'd rather avoid the space before the "(" above.

Done. Sorry about that.

GaelVaroquaux · 2017-07-15T21:06:09Z

doc/modules/clustering.rst


-Different linkage type: Ward, complete and average linkage
+Different linkage type: Ward, complete, average and single linkage
 -----------------------------------------------------------


You need to update the length of the line below.

Done. Thanks!

GaelVaroquaux · 2017-07-15T21:08:07Z

I guess this is more in line with how the other algorithms are implemented, so maybe @GaelVaroquaux has a more informed opinion.

What's the question here? I fail to follow.

amueller · 2017-07-15T21:08:51Z

I'm not sure what the current tests for the linkage criteria are. It would be nice if there was a test with an example that has obviously different solutions for the different linkage criteria. There's also an example for them, I think it would be nice to add single linkage here.

@GaelVaroquaux the question was whether we want to create a linkage tree, but I suppose the answer is yes.

GaelVaroquaux · 2017-07-15T21:10:48Z

@GaelVaroquaux the question was whether we want to create a linkage tree, but I suppose the answer is yes.

I think so too.

lmcinnes · 2017-07-15T21:13:41Z

I'll see if I can craft a couple of test examples that are suitable. I'll also see if I can write a simple example demonstrating the effects of the different linkages (I already added single linkage to the existing examples where it made sense).

lmcinnes · 2017-07-15T21:18:42Z

If I make an example using the current "cluster comparison on toy datasets" framework that demonstrates the results under different linkage methods, would that be a suitable example?

amueller · 2017-07-15T21:23:24Z

sure, sounds good. I imagine the example that's already there doesn't show the differences that clearly?

lmcinnes · 2017-07-15T21:29:07Z

The current example is on a 2D embedding of the digits dataset. That isn't terrible, but it doesn't highlight the differences particularly well. In fact, because it is a noisy embedding, single linkage just picks out the noise points on the fringe and calls everything else one cluster. It certainly demonstrates the shortcomings, but fails to demonstrate the cases where it can be useful.

lmcinnes · 2018-01-18T16:17:56Z

Rather, it is the public version that is never used (although it could be in the future by interested parties); the cpdef'd version is called from hierarchical.py as _hierarchical._single_linkage_label with an array that should definitely be valid (assuming the MST code works correctly). I'm pretty flexible on this, I'm just not sure what the "right" solution is. The current version provides a valid function should the user ever want to reuse the internals, but makes use of the faster unvalidated version where we know we are correct. That seems to cover most cases, and suffers only from an extraneous "future-proofing" function.

jakevdp · 2018-01-18T16:22:02Z

OK, sounds good.

lmcinnes · 2018-01-18T16:24:40Z

Thanks for all the help @jakevdp and @jnothman, it was greatly appreciated, and I gained a lot from working with both of you in ironing out the many fine details.

jakevdp · 2018-01-18T16:36:51Z

sklearn/cluster/_hierarchical.pyx

+        raise ValueError("Input MST array is not a validly formatted MST array")
+
+    if not np.all(np.sort(L[:, 2]) == L[:, 2]):
+        raise ValueError("Input MST array must be sorted by weight")


Sorry, one more comment: I'd change this to

is_sorted = lambda x: np.all(x[:-1] <= x[1:])
if not is_sorted(L[:, 2]):
    raise ...

no need to actually perform a full (expensive) sort to check if the array is sorted.

You could do it all in one line, of course, but I think using the lambda function makes it easier for someone to skim the code and understand what's going on.
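
A quick sanity check of the suggestion, on made-up weights: the pairwise comparison is a single linear pass and agrees with the sort-based check, including for empty and single-element arrays.

```python
import numpy as np


def is_sorted(x):
    # Compare each element with its successor: O(n), no sorting needed.
    return bool(np.all(x[:-1] <= x[1:]))


def is_sorted_via_sort(x):
    # The original check: O(n log n) because it sorts a copy first.
    return bool(np.all(np.sort(x) == x))


weights = np.array([0.5, 1.0, 1.0, 3.2])  # illustrative values
cases = [weights, weights[::-1], np.array([]), np.array([2.0])]
agreement = [is_sorted(c) == is_sorted_via_sort(c) for c in cases]
```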

jakevdp · 2018-01-18T16:52:53Z

The last remaining issue is that if the connectivity matrix has integer types, our strategy for creating epsilon values will lead to weird results. If that's ever a possibility, we should proactively convert the connectivity matrix to float64, because that's what's eventually done within the minimum_spanning_tree function anyway.

lmcinnes · 2018-01-18T16:57:31Z

Assuming that _single_linkage_tree is only called internally, I believe this is already resolved: the connectivity data is populated either with the results of paired_distances, which is float64, or with an explicit cast to float64 of the content of X (in the case that affinity='precomputed'). I'm happy to add a cast in _single_linkage_tree to ensure safety if it gets called any other way in the future.
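
To make the integer concern concrete, here is a small sketch. It assumes the epsilon strategy means replacing zero weights with machine epsilon; that detail is not shown in this thread, and the arrays are made up. Writing a tiny float into an integer array truncates it back to zero, whereas casting the sparse matrix to float64 first keeps it.

```python
import numpy as np
from scipy.sparse import csr_matrix

eps = np.finfo(np.float64).eps

# Integer weights: the epsilon is truncated to 0 on assignment.
int_data = np.array([0, 2, 5])
int_data[int_data == 0] = eps

# Casting first preserves the tiny positive value.
connectivity = csr_matrix(np.array([[0, 2], [2, 0]]))  # integer dtype
connectivity = connectivity.astype(np.float64)
float_data = np.array([0.0, 2.0, 5.0])
float_data[float_data == 0] = eps
```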

jakevdp · 2018-01-18T17:01:20Z

Makes sense. Thanks for all the great work on this @lmcinnes!

GaelVaroquaux · 2018-01-19T10:54:53Z

doc/modules/clustering.rst


-Different linkage type: Ward, complete and average linkage
-----------------------------------------------------------
+Different linkage type: Ward, complete, average and single linkage


Nitpick: Oxford comma "... average, and single".

GaelVaroquaux · 2018-01-19T10:57:19Z

examples/cluster/plot_linkage_comparison.py

+# ============
+plt.figure(figsize=(9 * 2 + 3, 12.5))
+plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
+                    hspace=.01)


Cosmetic: it seems that the aspect ratio of the figure isn't great:
https://17238-843222-gh.circle-artifacts.com/0/doc/auto_examples/cluster/plot_linkage_comparison.html

GaelVaroquaux · 2018-01-19T10:59:30Z

examples/cluster/plot_linkage_comparison.py

+# ============
+# Generate datasets. We choose the size big enough to see the scalability
+# of the algorithms, but not too big to avoid too long running times
+# ============


Should we rather use sphinx-gallery separators here: insert a line of continuous "#" longer than 70 chars before the block and only before. Sphinx-gallery will use this to create html rendering and Jupyter notebooks.
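
For illustration, a block in that style might look like the following. The dataset code is a placeholder assumption, not the example's actual content; the point is the single line of "#" placed before the comment block and not after it.

```python
import numpy as np

###############################################################################
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times

n_samples = 1500  # placeholder size, assumed for this sketch
rng = np.random.RandomState(0)
X = rng.rand(n_samples, 2)
```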

GaelVaroquaux · 2018-01-19T11:03:37Z

examples/cluster/plot_digits_linkage.py

 from sklearn.cluster import AgglomerativeClustering

-for linkage in ('ward', 'average', 'complete'):
+for linkage in ('ward', 'average', 'complete', 'single'):


With the added subplot, the figure got a bit more narrow and the titles are not well separated. I think that it would be useful to add a "\n" in the title between the name of the linkage and the timing.

Sorry, but I am not quite clear exactly what you would like (certainly the titles on the plots are a little tight). I've made some adjustments, but would welcome any further clarification as I suspect I am missing something here.

Edit: Ah -- you are referring to examples/cluster/plot_agglomerative_clustering.py I suspect. I can certainly fix that.

GaelVaroquaux · 2018-01-19T12:30:24Z

Comments as I go (sorry, as I can get interrupted any time, I will paste the comments here in separate messages):

In the documentation, in the clustering.rst, in the warning box "Connectivity constraints with average and complete linkage", "single" should be added to the list of linkages that are brittle.

In the corresponding example, "plot_agglomerative_clustering.py", "single" also needs to be added to each enumeration containing complete and average. It would be useful to stress that single linkage is even more brittle than complete and average linkage.

GaelVaroquaux · 2018-01-19T12:36:14Z

sklearn/cluster/_hierarchical.pyx

+    return result_arr
+
+
+def single_linkage_label(L):


This seems very much like a pure Python function (it has no typing information). Any reason to have it in a Cython file?

I was keeping it together with the private cythonized function above -- this is the future-proof wrapper in case future users wish to use the routine safely (i.e. with appropriate checks). It is not currently called at all. I am certainly happy to move it, but it is not clear to me what the appropriate place is at this time.

GaelVaroquaux · 2018-01-19T12:54:00Z

sklearn/cluster/hierarchical.py

+    from scipy.sparse.csgraph import minimum_spanning_tree
+
+    # explicitly cast connectivity to ensure safety
+    connectivity = connectivity.astype('float64')


Would it be possible to use fused types and support both float64 and float32, in the interest of memory? We've been slowly but surely working toward support for both 32 and 64 bit in scikit-learn. It also tends to make code faster (more of the data fits in the CPU cache).

I would certainly be willing to look into that (I don't currently know how, but presume it can't be too hard). As @jakevdp pointed out, however, the first thing the minimum_spanning_tree will do is convert to float64 so ultimately it is going to end up converted in a few lines anyway; by converting here we can do the finessing of zero distances without getting into type specific difficulties (in particular integer types). Is it worth using the fused type here and then letting scipy do the conversion in minimum_spanning_tree?

GaelVaroquaux · 2018-01-19T12:57:56Z

Awesome PR!!! Thank you so much.

I made a bunch of comments, but I finished reviewing it. Most of them are minor. The comment on the fused type is the only non trivial one.

I am +1 for merge after those comments have been addressed.

lmcinnes · 2018-01-19T23:40:04Z

I've commented on points of discussion; the rest are straightforward and I'll try to get to them tonight or tomorrow.

Most of them were addressed easily, but clarification of a few points would be helpful. Thanks for all the feedback.

GaelVaroquaux · 2018-01-22T12:41:35Z

doc/modules/clustering.rst


-Different linkage type: Ward, complete, average and single linkage
+Different linkage type: Ward, complete, average, and single linkage
 ------------------------------------------------------------------


Sorry to always come back with a comment, but haven't you forgotten to extend the line below?

GaelVaroquaux · 2018-01-22T12:52:28Z

@lmcinnes : point taken on the suggestion of cython fused types.

I'll address the last cosmetic comments myself and merge. We have enough +1s on this one. Thanks a lot, this is a big deal!!

GaelVaroquaux · 2018-01-22T13:24:14Z

I've pushed the cosmetic changes. I'll merge once travis is green.

GaelVaroquaux · 2018-01-22T13:56:23Z

All checks have passed. Merging! Hurray!

amueller · 2018-01-22T21:20:54Z

OH YEAH!!!

GaelVaroquaux · 2018-01-22T21:25:41Z

OH YEAH!!!

Wait! I need a tiny review on a doc miss: #10520

lmcinnes added 13 commits July 15, 2017 10:40

First cut at basic single linkage internals

f25a0eb

Refer to correct dist_metrics package

2ed4799

Add csgraph sparse implementation for single linkage

acfbddf

Add fast labelling/conversion from MST to single linkage tree; remove…

2d5a95e

… uneeded single_linkage.pyx file.

Ensure existing tests cover single linkage

b5fa65b

Name cingle linkage labelling correctly.

2d25d1c

Iterating toward correct solution. Still have to get n_clusters, comp…

0a14920

…ute_full_tree=False working

Get n_components correct.

71a3c98

Update docstrings.

801ffa1

Fix the parents array when we don't get the "full tree"

c84496f

Add single linkage to agglomerative clustering example.

8b291ad

Add single linkage to digits agglomerative clustering example.

fc97792

Update documentation to reflect the addition of single linkage.

b187fb5

amueller reviewed Jul 15, 2017

GaelVaroquaux reviewed Jul 15, 2017

Update documentation to reflect the addition of single linkage.

aa50b07

GaelVaroquaux reviewed Jul 15, 2017

lmcinnes added 2 commits July 15, 2017 16:06

Pep8 fix for class declaration in cython

5d838bc

Fix heading in clustering docs

b5ba340

lmcinnes added 2 commits July 15, 2017 16:51

Update the digits clustering text to reflect the new reality.

67e63a1

Provide a more complete comparison of the different linkage methods, …

73b8f4c

…highlighting the relative strengths and weaknesses.

jakevdp approved these changes Jan 18, 2018

jakevdp reviewed Jan 18, 2018

more efficient is sorted check

5abc614

Explicit cast to cover all bases

3f73d98

jakevdp changed the title ~~[MRG+1] Single linkage clustering~~ [MRG+2] Single linkage clustering Jan 18, 2018

GaelVaroquaux reviewed Jan 19, 2018

Address various issue in documentation and examples.

9bb2355

GaelVaroquaux reviewed Jan 22, 2018

COSMIT: cosmetic changes

66571ec

GaelVaroquaux merged commit 8233829 into scikit-learn:master Jan 22, 2018

qinhanmin2014 mentioned this pull request Jan 23, 2018

Implement single linkage clustering #4103

Closed
