Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@lmcinnes
Copy link
Contributor

Work regarding issue #4103 -- addition of single linkage as a linkage option for hierarchical clustering.

This is the minimal changes required to get single linkage working. We take advantage of scipy.cluster for the basic single linkage, and scipy.sparse.csgraph for the connectivity-constrained case. Both of these could be improved upon at some point in the future if required.

I believe I caught most cases in the documentation where linkage strategies are mentioned and single linkage is relevant, but I admit I may not have caught them all.

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs tests ;)

(or distance used in clustering) cannot be varied with Ward, thus for non
Euclidean metrics, average linkage is a good alternative.
Euclidean metrics, average linkage is a good alternative. Single linkage,
while not robust to noisy data, can computed very efficiently and can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be computed

mst_array = mst_array[np.argsort(mst_array.T[2]),:]

# Convert edge list into standard hierarchical clustering format
single_linkage_tree = _hierarchical.single_linkage_label(mst_array)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this faster then connected_components?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I guess we're computing the linkage tree here? Are we using this later? I feel like we usually are ok with just the clustering, but I don't use these algorithms often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This converts the MST into scipy.cluster.hierarchy format. It just uses a union find so is pretty fast, and I already had code that does this (and was well tested), so it seemed the most efficient way to get something working. I am open to other options.


# Compute parents
parent = np.arange(n_nodes, dtype=np.intp)
for i, (left, right) in enumerate(children_, n_samples):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this slow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relatively speaking, no. I can cythonize it if you are concerned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if no, then please don't ;) I'd rather avoid cython if it's not necessary.
Can you maybe post a benchmark of this against the other linkage criteria?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a scaling performance comparison for the sparse matrix case:
image
Most of the time is single linkage is actually spent fixing the connectivity matrix -- I'm factoring that out of the test and will post that soon, but it already looks pretty good.

The dense case is simply scipy.cluster.hierarchy, so I presume the performance of single linkage there is already demonstrated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are with the connectivity factored out (performed beforehand so it doesn't need to occur in the fit method).

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry can you explain the difference between the two graphs again? The cythonization of the tree creation?

Copy link
Contributor Author

@lmcinnes lmcinnes Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both graphs use the same code in hierarchy_.py, the difference is in the data passed to in. In the first graph the sparse matrix that is passed in is not guaranteed to have a single connected component, which then requires some pre-processing to "fix" that by adding extra entries to the matrix. This pre-processing is common to all the linkage approaches, but isn't part of the algorithms per se (but the fit method will do it if it is required). The second graph simple ensures that the required pre-processing is done before the fit method is called, and the timing is thus only for the actual linkage methods, and no longer includes the common pre-processing step (which can be time consuming).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iteration and tuple unpacking is likely to be slow relative to using children_.tolist(), but it's unimportant

@amueller
Copy link
Member

I guess this is more in line with how the other algorithms are implemented, so maybe @GaelVaroquaux has a more informed opinion. Are we supporting a pre-specified graph here?

@lmcinnes
Copy link
Contributor Author

With regard to tests: I modified the existing test suite to exercise single linkage in all the general linkage tests, so it is definitely getting tested. Did you want some new tests specific to single linkage? What did you have in mind?

With regard to pre-specified graphs: yes those are supported, presuming I am interpreting what you mean by that correctly (a connectivity (sparse) matrix constraining things).

################################################################################
# Efficient labelling/conversion of MSTs to single linkage hierarchies

cdef class UnionFind (object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To respect pep8, I'd rather avoid the space before the "(" above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Sorry about that.


Different linkage type: Ward, complete and average linkage
Different linkage type: Ward, complete, average and single linkage
-----------------------------------------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to update the length of the line below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jul 15, 2017 via email

@amueller
Copy link
Member

I'm not sure what the current tests for the linkage critera are. It would be nice if there was a test with an example that has obvious different solutions for the different linkage critera. There's also an example for them, I think it would be nice to add single linkage here.

@GaelVaroquaux the question was whether we want to create a linkage tree, but I suppose the answer is yes.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jul 15, 2017 via email

@lmcinnes
Copy link
Contributor Author

I'll see if I can craft a couple of test examples that are suitable. I'll also see If I can write a simple example demonstrating the effects of the different linkages (I already added single linkage to the existing examples where it made sense).

@lmcinnes
Copy link
Contributor Author

If I make an example using the current cluster comparison on toy datasets framework demonstrating the results under different linkage methods would that be a suitable example?

@amueller
Copy link
Member

sure, sounds good. I imagine the example that's already there doesn't show the differences that clearly?

@lmcinnes
Copy link
Contributor Author

lmcinnes commented Jul 15, 2017

The current example is on a 2D embedding of the digits dataset. That isn't terrible, but it doesn't highlight the different particularly. In fact because it is a noisy embedding single linkage just picks out the noise points on the fringe and calls everything else one cluster. It certainly demonstrates the shortcomings, but fails to demonstrate the cases where it can be useful.

@lmcinnes
Copy link
Contributor Author

Rather it is the public version that is never used (although could be in the future by interested parties), the cpdefd version is called from hierarchal.py as _hierarchical._single_linkage_label with an array that should definitely be valid (assuming the MST code works correctly). I'm pretty flexible on this, I'm just not sure what the "right" solution is. The current version provides a valid function should the user ever want to reuse the internals, but makes use of the faster unvalidated version where we know we are correct. That seems to cover most cases, and suffers only from an extraneous "future-proofing" function.

@jakevdp
Copy link
Member

jakevdp commented Jan 18, 2018

OK, sounds good.

@lmcinnes
Copy link
Contributor Author

Thanks for all the help @jakevdp and @jnothman, it was greatly appreciated, and I gained a lot from working with both of you in ironing out the many fine details.

raise ValueError("Input MST array is not a validly formatted MST array")

if not np.all(np.sort(L[:, 2]) == L[:, 2]):
raise ValueError("Input MST array must be sorted by weight")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one more comment: I'd change this to

is_sorted = lambda x: np.all(x[:-1] <= x[1:])
if not is_sorted(L[:, 2]):
   raise ...

no need to actually perform a full (expensive) sort to check if the array is sorted.

Copy link
Member

@jakevdp jakevdp Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do it all in one line, of course, but I think using the lambda function makes it easier for someone to skim the code and understand what's going on.

@jakevdp
Copy link
Member

jakevdp commented Jan 18, 2018

The last remaining issue is that if the connectivity matrix has integer types, our strategy for creating epsilon values will lead to weird results. If that's ever a possibility, we should proactively convert the connectivity matrix to float64, because that's what's eventually done within the minimum_spanning_tree function anyway.

@lmcinnes
Copy link
Contributor Author

Assuming that _single_linkage_tree is only called internally then I believe this is resolved already in that connectivity data is either populated with the results of paired_distances which is float64, or with an explicit cast to float64 of the content of X (in the case that affinity='precomputed'. I'm happy to add a cast in _single_linkage_tree to ensure safety if it gets called any other way in the future.

@jakevdp
Copy link
Member

jakevdp commented Jan 18, 2018

Makes sense. Thanks for all the great work on this @lmcinnes!

@jakevdp jakevdp changed the title [MRG+1] Single linkage clustering [MRG+2] Single linkage clustering Jan 18, 2018

Different linkage type: Ward, complete and average linkage
-----------------------------------------------------------
Different linkage type: Ward, complete, average and single linkage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Oxford coma "... average, and single".

# ============
plt.figure(figsize=(9 * 2 + 3, 12.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
hspace=.01)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic: it seems that the aspect ratio of the figure isn't great:
https://17238-843222-gh.circle-artifacts.com/0/doc/auto_examples/cluster/plot_linkage_comparison.html

# ============
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
# ============
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rather use sphinx-gallery separators here: insert a line of continuous "#" longer than 70 chars before the block and only before. Sphinx-gallery will use this to create html rendering and Jupyter notebooks.

from sklearn.cluster import AgglomerativeClustering

for linkage in ('ward', 'average', 'complete'):
for linkage in ('ward', 'average', 'complete', 'single'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the added subplot, the figure got a bit more narrow and the titles are not well separated. I think that it would be useful to add a "\n" in the title between the name of the linkage and the timing.

Copy link
Contributor Author

@lmcinnes lmcinnes Jan 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I am not quite clear exactly what you would like (certainly the titles on the plots are a little tight). I've made some adjustments, but would welcome any further clarification as I suspect I am missing something here.

Edit: Ah -- you are referring to examples/cluster/plot_agglomerative_clustering.py I suspect. I can certainly fix that.

@GaelVaroquaux
Copy link
Member

Comments as I go (sorry, as I can get interrupted any time, I will paste the comments here in separate messages):

In the documentation, in the clustering.rst, in the warning box "Connectivity constraints with average and complete linkage", "single" should be added to the list of linkages that are brittle.

In the corresponding example, "plot_agglomerative_clustering.py", there also needs to add "single" to each enumeration containing complete and average. It would be useful to stress that single linkage is even more brittle than complete and average linkage.

return result_arr


def single_linkage_label(L):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very much like a pure Python function (it has not typing information). Any reason to have it in a Cython file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was keeping it together with the private cythonized function above -- this is the future-proof wrapper in case future users wish to use the routine safely (i.e. with appropriate checks). It is not currently called at all. I am certainly happy to move it, but it is not clear to me what the appropriate place is at this time.

from scipy.sparse.csgraph import minimum_spanning_tree

# explicitly cast connectivity to ensure safety
connectivity = connectivity.astype('float64')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to use fused type and support both float64 and float32, in the interest of memory? We've been slowly but surely trying to make it so that in scikit-learn there is support of 32 and 64 bit in the interest of memory. It also tends to make code faster (data fits more in CPU cache).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would certainly be willing to look into that (I don't currently know how, but presume it can't be too hard). As @jakevdp pointed out, however, the first thing the minimum_spanning_tree will do is convert to float64 so ultimately it is going to end up converted in a few lines anyway; by converting here we can do the finessing of zero distances without getting into type specific difficulties (in particular integer types). Is it worth using the fused type here and then letting scipy do the conversion in minimum_spanning_tree?

@GaelVaroquaux
Copy link
Member

Awesome PR!!! Thanks you so much.

I made a bunch of comments, but I finished reviewing it. Most of them are minor. The comment on the fused type is the only non trivial one.

I am +1 for merge after those comments have been addressed.

@lmcinnes
Copy link
Contributor Author

lmcinnes commented Jan 19, 2018

I've commented on points of discussion; the rest are straightforward and I'll try to get to them tonight or tomorrow.

Most of them were addressed easily, but clarification of a few points would be helpful. Thanks for all the feedback.


Different linkage type: Ward, complete, average and single linkage
Different linkage type: Ward, complete, average, and single linkage
------------------------------------------------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to always come back with a comment, but haven't you forgotten to extend the line below?

@GaelVaroquaux
Copy link
Member

@lmcinnes : point taken on the suggestion of cython fused types.

I'll address the last cosmetic comments myself and merge. We have enough +1s on this one. Thanks a lot, this is a big deal!!

@GaelVaroquaux
Copy link
Member

I've pushed the cosmetic changes. I'll merge once travis is green.

@GaelVaroquaux
Copy link
Member

All checks have passed. Merging! Hurray!

@GaelVaroquaux GaelVaroquaux merged commit 8233829 into scikit-learn:master Jan 22, 2018
@amueller
Copy link
Member

OH YEAH!!!

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 22, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants