[MRG+2] Single linkage clustering #9372
… uneeded single_linkage.pyx file.
…ute_full_tree=False working
amueller
left a comment
needs tests ;)
doc/modules/clustering.rst
Outdated
| (or distance used in clustering) cannot be varied with Ward, thus for non | ||
| Euclidean metrics, average linkage is a good alternative. | ||
| Euclidean metrics, average linkage is a good alternative. Single linkage, | ||
| while not robust to noisy data, can computed very efficiently and can |
be computed
sklearn/cluster/hierarchical.py
Outdated
| mst_array = mst_array[np.argsort(mst_array.T[2]),:] | ||
|
|
||
| # Convert edge list into standard hierarchical clustering format | ||
| single_linkage_tree = _hierarchical.single_linkage_label(mst_array) |
Is this faster than connected_components?
Oh, I guess we're computing the linkage tree here? Are we using this later? I feel like we usually are ok with just the clustering, but I don't use these algorithms often.
This converts the MST into scipy.cluster.hierarchy format. It just uses a union find so is pretty fast, and I already had code that does this (and was well tested), so it seemed the most efficient way to get something working. I am open to other options.
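For readers following along, here is a minimal pure-Python sketch of the idea: walking weight-sorted MST edges with a union-find to produce a scipy.cluster.hierarchy-style linkage matrix. The names `UnionFind` and `mst_to_linkage` are illustrative assumptions and this is not the Cython code in the PR.

```python
import numpy as np


class UnionFind:
    """Minimal union-find that assigns a fresh label to every merge."""

    def __init__(self, n_samples):
        # Leaves are 0..n_samples-1; merged clusters get n_samples, n_samples+1, ...
        self.parent = np.full(2 * n_samples - 1, -1, dtype=np.intp)
        self.size = np.concatenate([np.ones(n_samples, dtype=np.intp),
                                    np.zeros(n_samples - 1, dtype=np.intp)])
        self.next_label = n_samples

    def find(self, node):
        # Walk up to the current root, then compress the path behind us.
        root = node
        while self.parent[root] != -1:
            root = self.parent[root]
        while node != root:
            nxt = self.parent[node]
            self.parent[node] = root
            node = nxt
        return root

    def union(self, a, b):
        # a and b must be roots; both become children of a new cluster label.
        new = self.next_label
        self.size[new] = self.size[a] + self.size[b]
        self.parent[a] = new
        self.parent[b] = new
        self.next_label += 1
        return new


def mst_to_linkage(mst_edges):
    """Convert (n_samples - 1, 3) rows of (u, v, weight), sorted by weight,
    into rows of (cluster_a, cluster_b, distance, merged_size)."""
    mst_edges = np.asarray(mst_edges, dtype=np.float64)
    n_samples = mst_edges.shape[0] + 1
    uf = UnionFind(n_samples)
    linkage = np.zeros((n_samples - 1, 4))
    for i, (u, v, weight) in enumerate(mst_edges):
        a, b = uf.find(int(u)), uf.find(int(v))
        linkage[i] = (a, b, weight, uf.size[a] + uf.size[b])
        uf.union(a, b)
    return linkage
```

Each MST edge costs two near-constant-time `find` calls and one `union`, which is why the conversion is cheap relative to building the MST itself.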
sklearn/cluster/hierarchical.py
Outdated
|
|
||
| # Compute parents | ||
| parent = np.arange(n_nodes, dtype=np.intp) | ||
| for i, (left, right) in enumerate(children_, n_samples): |
Isn't this slow?
Relatively speaking, no. I can cythonize it if you are concerned.
if no, then please don't ;) I'd rather avoid cython if it's not necessary.
Can you maybe post a benchmark of this against the other linkage criteria?
Here is a scaling performance comparison for the sparse matrix case:

Most of the time in single linkage is actually spent fixing the connectivity matrix -- I'm factoring that out of the test and will post that soon, but it already looks pretty good.
The dense case is simply scipy.cluster.hierarchy, so I presume the performance of single linkage there is already demonstrated.
sorry can you explain the difference between the two graphs again? The cythonization of the tree creation?
Both graphs use the same code in hierarchy_.py; the difference is in the data passed in. In the first graph the sparse matrix that is passed in is not guaranteed to have a single connected component, which then requires some pre-processing to "fix" that by adding extra entries to the matrix. This pre-processing is common to all the linkage approaches, but isn't part of the algorithms per se (the fit method will do it if it is required). The second graph simply ensures that the required pre-processing is done before the fit method is called, so the timing covers only the actual linkage methods and no longer includes the common pre-processing step (which can be time consuming).
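To make the distinction concrete, a rough sketch of how one could check up front whether a connectivity matrix would trigger that pre-processing. `needs_fixing` is a hypothetical helper written for illustration, not a scikit-learn function; it relies only on scipy.sparse.csgraph.connected_components.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def needs_fixing(connectivity):
    """True if the (undirected) connectivity graph has more than one component."""
    n_components, _ = connected_components(connectivity, directed=False)
    return n_components > 1


# Two disjoint edges (0-1 and 2-3): two components, so fixing would be needed.
split = csr_matrix(np.array([[0, 1, 0, 0],
                             [1, 0, 0, 0],
                             [0, 0, 0, 1],
                             [0, 0, 1, 0]]))

# A path 0-1-2-3: a single component, so the timing covers only the linkage step.
path = csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]]))
```

Benchmarks for the second graph would correspond to inputs like `path`, where the check returns False before the fit is timed.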
The iteration and tuple unpacking is likely to be slow relative to using children_.tolist(), but it's unimportant
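A small sketch of the tolist() variant, for reference. The loop body is not visible in the diff excerpt above, so the assignments below are an assumption about what it does (each child records the index of the node that merged it); `compute_parents` is an illustrative name.

```python
import numpy as np


def compute_parents(children, n_samples):
    """Return parent[i] for every node of a merge tree.

    children: (n_samples - 1, 2) integer array; row k lists the two nodes
    merged to create node n_samples + k. The root stays its own parent.
    """
    n_nodes = n_samples + len(children)
    parent = np.arange(n_nodes, dtype=np.intp)
    # tolist() yields plain Python ints, so unpacking does not create
    # a NumPy row view and two NumPy scalars on every iteration.
    for i, (left, right) in enumerate(children.tolist(), n_samples):
        parent[left] = i
        parent[right] = i
    return parent
```

Either form produces the same array; the difference is only per-iteration overhead.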
|
I guess this is more in line with how the other algorithms are implemented, so maybe @GaelVaroquaux has a more informed opinion. Are we supporting a pre-specified graph here? |
|
With regard to tests: I modified the existing test suite to exercise single linkage in all the general linkage tests, so it is definitely getting tested. Did you want some new tests specific to single linkage? What did you have in mind? With regard to pre-specified graphs: yes those are supported, presuming I am interpreting what you mean by that correctly (a connectivity (sparse) matrix constraining things). |
sklearn/cluster/_hierarchical.pyx
Outdated
| ################################################################################ | ||
| # Efficient labelling/conversion of MSTs to single linkage hierarchies | ||
|
|
||
| cdef class UnionFind (object): |
To respect pep8, I'd rather avoid the space before the "(" above.
Done. Sorry about that.
doc/modules/clustering.rst
Outdated
|
|
||
| Different linkage type: Ward, complete and average linkage | ||
| Different linkage type: Ward, complete, average and single linkage | ||
| ----------------------------------------------------------- |
You need to update the length of the line below.
Done. Thanks!
|
I guess this is more in line with how the other algorithms are implemented, so maybe @GaelVaroquaux has a more informed opinion.
What's the question here? I fail to follow.
|
|
I'm not sure what the current tests for the linkage criteria are. It would be nice if there was a test with an example that has obviously different solutions for the different linkage criteria. There's also an example for them, I think it would be nice to add single linkage here. @GaelVaroquaux the question was whether we want to create a linkage tree, but I suppose the answer is yes.
|
@GaelVaroquaux the question was whether we want to create a linkage tree, but I suppose the answer is yes.
I think so too.
|
|
I'll see if I can craft a couple of test examples that are suitable. I'll also see if I can write a simple example demonstrating the effects of the different linkages (I already added single linkage to the existing examples where it made sense).
|
If I make an example using the current cluster comparison on toy datasets framework demonstrating the results under different linkage methods would that be a suitable example? |
|
sure, sounds good. I imagine the example that's already there doesn't show the differences that clearly? |
|
The current example is on a 2D embedding of the digits dataset. That isn't terrible, but it doesn't highlight the differences particularly well. In fact, because it is a noisy embedding, single linkage just picks out the noise points on the fringe and calls everything else one cluster. It certainly demonstrates the shortcomings, but fails to demonstrate the cases where it can be useful.
…highlighting the relative strengths and weaknesses.
|
Rather it is the public version that is never used (although could be in the future by interested parties), the |
|
OK, sounds good. |
| raise ValueError("Input MST array is not a validly formatted MST array") | ||
|
|
||
| if not np.all(np.sort(L[:, 2]) == L[:, 2]): | ||
| raise ValueError("Input MST array must be sorted by weight") |
Sorry, one more comment: I'd change this to
    is_sorted = lambda x: np.all(x[:-1] <= x[1:])
    if not is_sorted(L[:, 2]):
        raise ...
no need to actually perform a full (expensive) sort to check if the array is sorted.
You could do it all in one line, of course, but I think using the lambda function makes it easier for someone to skim the code and understand what's going on.
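Spelled out as a self-contained sketch (written with `def` rather than a lambda, and with `check_mst_sorted` as an illustrative wrapper name), the suggestion compares each element with its successor in a single O(n) pass instead of sorting in O(n log n):

```python
import numpy as np


def is_sorted(x):
    """True if x is in non-decreasing order, using adjacent comparisons only."""
    x = np.asarray(x)
    return bool(np.all(x[:-1] <= x[1:]))


def check_mst_sorted(L):
    """Raise if the weight column (index 2) of an MST edge array is unsorted."""
    L = np.asarray(L)
    if not is_sorted(L[:, 2]):
        raise ValueError("Input MST array must be sorted by weight")
```

For arrays without NaNs this gives the same answer as comparing against `np.sort`, without allocating a sorted copy.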
|
The last remaining issue is that if the connectivity matrix has integer types, our strategy for creating epsilon values will lead to weird results. If that's ever a possibility, we should proactively convert the connectivity matrix to float64, because that's what's eventually done within the |
|
Assuming that |
|
Makes sense. Thanks for all the great work on this @lmcinnes! |
doc/modules/clustering.rst
Outdated
|
|
||
| Different linkage type: Ward, complete and average linkage | ||
| ----------------------------------------------------------- | ||
| Different linkage type: Ward, complete, average and single linkage |
Nitpick: Oxford comma "... average, and single".
| # ============ | ||
| plt.figure(figsize=(9 * 2 + 3, 12.5)) | ||
| plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, | ||
| hspace=.01) |
Cosmetic: it seems that the aspect ratio of the figure isn't great:
https://17238-843222-gh.circle-artifacts.com/0/doc/auto_examples/cluster/plot_linkage_comparison.html
| # ============ | ||
| # Generate datasets. We choose the size big enough to see the scalability | ||
| # of the algorithms, but not too big to avoid too long running times | ||
| # ============ |
Should we rather use sphinx-gallery separators here: insert a line of continuous "#" longer than 70 chars before the block and only before. Sphinx-gallery will use this to create html rendering and Jupyter notebooks.
| from sklearn.cluster import AgglomerativeClustering | ||
|
|
||
| for linkage in ('ward', 'average', 'complete'): | ||
| for linkage in ('ward', 'average', 'complete', 'single'): |
With the added subplot, the figure got a bit more narrow and the titles are not well separated. I think that it would be useful to add a "\n" in the title between the name of the linkage and the timing.
Sorry, but I am not quite clear exactly what you would like (certainly the titles on the plots are a little tight). I've made some adjustments, but would welcome any further clarification as I suspect I am missing something here.
Edit: Ah -- you are referring to examples/cluster/plot_agglomerative_clustering.py I suspect. I can certainly fix that.
|
Comments as I go (sorry, as I can get interrupted any time, I will paste the comments here in separate messages): In the documentation, in clustering.rst, in the warning box "Connectivity constraints with average and complete linkage", "single" should be added to the list of linkages that are brittle. In the corresponding example, "plot_agglomerative_clustering.py", "single" also needs to be added to each enumeration containing complete and average. It would be useful to stress that single linkage is even more brittle than complete and average linkage.
| return result_arr | ||
|
|
||
|
|
||
| def single_linkage_label(L): |
This seems very much like a pure Python function (it has no typing information). Any reason to have it in a Cython file?
I was keeping it together with the private cythonized function above -- this is the future-proof wrapper in case future users wish to use the routine safely (i.e. with appropriate checks). It is not currently called at all. I am certainly happy to move it, but it is not clear to me what the appropriate place is at this time.
| from scipy.sparse.csgraph import minimum_spanning_tree | ||
|
|
||
| # explicitly cast connectivity to ensure safety | ||
| connectivity = connectivity.astype('float64') |
Would it be possible to use fused types and support both float64 and float32, in the interest of memory? We've been slowly but surely trying to make it so that scikit-learn supports both 32 and 64 bit. It also tends to make code faster (more of the data fits in CPU cache).
I would certainly be willing to look into that (I don't currently know how, but presume it can't be too hard). As @jakevdp pointed out, however, the first thing the minimum_spanning_tree will do is convert to float64 so ultimately it is going to end up converted in a few lines anyway; by converting here we can do the finessing of zero distances without getting into type specific difficulties (in particular integer types). Is it worth using the fused type here and then letting scipy do the conversion in minimum_spanning_tree?
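A minimal sketch of the cast-then-MST step being discussed, assuming a sparse connectivity matrix whose stored values are edge weights. `sorted_mst_edges` is an illustrative name and the zero-distance handling from the PR is deliberately left out.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree


def sorted_mst_edges(connectivity):
    """Return MST edges as float64 rows (u, v, weight), sorted by weight."""
    # Cast up front so that integer inputs behave like floats in any
    # adjustment made to the weights before the MST is computed.
    connectivity = connectivity.astype(np.float64)
    mst = minimum_spanning_tree(connectivity).tocoo()
    edges = np.vstack([mst.row, mst.col, mst.data]).T
    return edges[np.argsort(edges[:, 2])]


# Integer-typed graph: 0-1 (weight 1), 1-2 (3), 2-3 (2), 0-3 (10).
graph = csr_matrix(np.array([[0, 1, 0, 10],
                             [0, 0, 3, 0],
                             [0, 0, 0, 2],
                             [0, 0, 0, 0]]))
edges = sorted_mst_edges(graph)
```

The output is already in the weight-sorted edge-list form that the union-find conversion expects.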
|
Awesome PR!!! Thank you so much. I made a bunch of comments, but I finished reviewing it. Most of them are minor. The comment on the fused type is the only non trivial one. I am +1 for merge after those comments have been addressed.
|
I've commented on points of discussion; the rest are straightforward and I'll try to get to them tonight or tomorrow. Most of them were addressed easily, but clarification of a few points would be helpful. Thanks for all the feedback. |
doc/modules/clustering.rst
Outdated
|
|
||
| Different linkage type: Ward, complete, average and single linkage | ||
| Different linkage type: Ward, complete, average, and single linkage | ||
| ------------------------------------------------------------------ |
Sorry to always come back with a comment, but haven't you forgotten to extend the line below?
|
@lmcinnes : point taken on the suggestion of cython fused types. I'll address the last cosmetic comments myself and merge. We have enough +1s on this one. Thanks a lot, this is a big deal!! |
|
I've pushed the cosmetic changes. I'll merge once travis is green. |
|
All checks have passed. Merging! Hurray! |
|
OH YEAH!!! |
|
OH YEAH!!!
Wait! I need a tiny review on a doc miss:
#10520
|

Work regarding issue #4103 -- addition of single linkage as a linkage option for hierarchical clustering.
These are the minimal changes required to get single linkage working. We take advantage of scipy.cluster for the basic single linkage, and scipy.sparse.csgraph for the connectivity-constrained case. Both of these could be improved upon at some point in the future if required.
I believe I caught most cases in the documentation where linkage strategies are mentioned and single linkage is relevant, but I admit I may not have caught them all.
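As a rough illustration of the unconstrained (dense) path, this is what single linkage via scipy.cluster.hierarchy looks like on a toy dataset. The data and the choice of two flat clusters are assumptions made for the example, not taken from the PR.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Single linkage merges clusters by their closest pair of points.
Z = linkage(X, method="single")

# Cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The final merge in `Z` happens at distance 10, the gap between the closest points of the two pairs, which is the defining behaviour of single linkage.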