FEA Generalize the use of precomputed sparse distance matr… #10482
Conversation
The changes to neighbors/base.py more-or-less duplicate a limited version of the work in #10206. Please consider adopting at least the tests from neighbors/tests/test_neighbors.py there, if not the implementation (but that is more WIP).
I think this is a good design. It makes using precomputed sparse matrices in DBSCAN much clearer, and it avoids complicated new API design proposed in #8999. It should be easy for existing ANN implementations to be adapted to this design.
The key caveat is that it requires all required neighborhoods to be computed in advance. This may be a memory burden still for something like DBSCAN (although our current implementation is already memory heavy). It also doesn't handle the case that an estimator repeatedly fits a nearest neighbor object, but I think those are rare.
It would be really good if we could illustrate plugging in an approximate nearest neighbors estimator, but it would likely include adding a pyann or similar dependency, unless we attempt to write a very short LSH just to prove the point.
But I think for consistency we need to have RadiusNeighborsTransformer and KNeighborsTransformer separately.
I also think we should adopt or adapt as much of the existing work at #10206 as possible. We've long iterated on it already.
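To make the point about adapting ANN implementations concrete, here is a minimal, hypothetical sketch of the contract such a plug-in transformer would expose: fit builds the index, transform returns a sparse distance graph. The `ann_index_builder` callable and its `query` method are placeholders for whatever ANN library is used, not a real API:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin


class ANNGraphTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper around any approximate nearest neighbors index."""

    def __init__(self, ann_index_builder, n_neighbors=5):
        self.ann_index_builder = ann_index_builder
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        # build the (approximate) index on the training samples
        self.index_ = self.ann_index_builder(X)
        self.n_samples_fit_ = X.shape[0]
        return self

    def transform(self, X):
        # placeholder API: query returns, for each row of X, the indices and
        # distances of its n_neighbors nearest training samples
        indices, distances = self.index_.query(X, self.n_neighbors)
        n_queries = X.shape[0]
        indptr = np.arange(0, n_queries * self.n_neighbors + 1, self.n_neighbors)
        # sparse distance graph of shape (n_queries, n_samples_fit), to be
        # consumed downstream by an estimator with metric='precomputed'
        return csr_matrix((np.ravel(distances), np.ravel(indices), indptr),
                          shape=(n_queries, self.n_samples_fit_))
```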
Nice work @TomDLT and thanks for the summary in #10463.
The API with a transformer looks quite nice, and caching the NN can certainly help for repeated estimator fits with different parameters. However, as far as ANNs are concerned, I don't find it immediately obvious that pre-computing a sparse distance matrix with ANN and e.g. using it in DBSCAN has performance equivalent to directly using the ANN in the estimator. For instance, I would be curious to know the results of the following benchmark. Take some low dimensional data where we know that kd_tree will perform well (e.g. …
I'm not sure what you're trying to show about ANN with that, @rth. Currently DBSCAN calculates the nearest neighbors of all X once and computes the clustering from that. There is some overhead when passing in sparse D rather than just passing in features X: unpacking, thresholding on radius and repacking. This is not how it was originally implemented, mind you, and it was optimised for speed rather than memory (see e.g. #5275). Perhaps with a *NeighborsTransformer, we could go back to that original low-memory implementation, but get most of the speed benefits as long as the user provides the precomputed sparse matrix.

One disadvantage of this design as opposed to #8999 is that there the host algorithm can pass in a …

One advantage of this design as opposed to #8999 is that it may allow you to precompute more neighbors than you need, then estimate downstream with various parameters (e.g. DBSCAN.eps); thus it works well with caching pipelines. Of course you could do that with #8999 by implementing your own precomputed neighborhood wrapper.

Regardless of all this, I think it would be good to give an example of, or at least describe, the use of this for exploiting approximate nearest neighbor approaches. Writing up a simple (but not super-efficient) gaussian projection hash, or a minhash, might not be excessive.
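For concreteness, here is a small sketch of the code path being discussed, precomputing the sparse radius-neighbors graph and handing it to DBSCAN with metric='precomputed'; the data and the eps value are placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

X = np.random.RandomState(0).rand(100, 3)  # placeholder data
eps = 0.3

# precompute the sparse distance matrix D once...
D = radius_neighbors_graph(X, radius=eps, mode='distance')

# ...then cluster from D: DBSCAN unpacks each row, thresholds on eps and
# repacks, which is the overhead mentioned above
labels = DBSCAN(eps=eps, metric='precomputed').fit_predict(D)
```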
Thanks for the feedback. Additional TODO:
I suggest we do something like:

```python
import warnings

import numpy as np
from scipy.sparse import issparse
from sklearn.exceptions import EfficiencyWarning
from sklearn.utils import check_array


def _check_precomputed(D, n_neighbors=None):
    if not issparse(D):
        return check_array(D, ensure_min_features=n_neighbors)
    if D.format not in ('csr', 'csc', 'coo', 'lil'):
        raise TypeError('Sparse matrix in {!r} format is not supported due to '
                        'its handling of explicit zeros'.format(D.format))
    copied = D.format != 'csr'
    D = D.tocsr()
    if n_neighbors is not None and np.diff(D.indptr).min() < n_neighbors:
        raise ValueError('Provide at least {} precomputed sparse neighbors. '
                         'Got {}'.format(n_neighbors, np.diff(D.indptr).min()))
    # check that each row is sorted by distance: out-of-order adjacent pairs
    # are only allowed where a new row starts (positions in indptr)
    out_of_order = D.data[:-1] > D.data[1:]
    if out_of_order.sum() != out_of_order.take(D.indptr[1:-1], mode='clip').sum():
        warnings.warn('Precomputed sparse input was not sorted by data.',
                      EfficiencyWarning)
        if not copied:
            # do not modify the user's matrix in place
            D = D.copy()
        for start, stop in zip(D.indptr, D.indptr[1:]):
            order = np.argsort(D.data[start:stop], kind='mergesort')
            D.data[start:stop] = D.data[start:stop][order]
            D.indices[start:stop] = D.indices[start:stop][order]
    return D
```

It's a pity the sorting check takes …
Nice example and benchmark! Seeing as it's too slow to run with benefits in the example gallery anyway, keeping it there and not requiring the dependency looks like a reasonable idea.
Conflicts: sklearn/manifold/t_sne.py sklearn/manifold/tests/test_t_sne.py
Everything is red (and that might be my fault)
We kind of hide this implementation detail in our …
After the additional commits that add details about KNeighborsTransformer, I think this PR is ready to merge.
Minor versionadded in docstrings
Awesome! I think you're right that making the compatible API clear would be great. But note that all we need is for the external tool to provide *at least* as many neighbors as are required downstream. So the downstream estimator should actually raise an alarm if something isn't sufficient upstream. ANN approaches may not actually need the constraint of providing *exactly* k neighbors. For example, LSH would be able to just return those samples whose hash matched the query without necessarily limiting it to the top k.
Awesome!
I added a bit more detail in the documentation; tell me if you think something else should be added.
I like this. As a user, I would always include one more just to keep things simple. (Or a few more to have the flexibility to explore parameters downstream.)
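For instance (a hypothetical sketch; the parameter values are only illustrative), precomputing more neighbors than the downstream estimator needs lets the same cached graph serve several downstream settings:

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

# precompute 10 neighbors per sample, more than the 5 used below, so the
# same graph can be reused while exploring the downstream parameter
graph = KNeighborsTransformer(n_neighbors=10, mode='distance')
clf = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
model = make_pipeline(graph, clf)
```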
Thank you @TomDLT!
Thanks a lot! Great job!!
Now it opens up a lot of opportunities to precompute neighbors with state-of-the-art approximate NN implementations such as annoy or https://github.com/kakao/n2.
It's absolutely "annoying". :)

I'm looking forward to seeing benchmarks on real-world applications.
Oh boy! Big congrats @TomDLT for never giving up on this one! 🍻
I think it's a brilliant solution to something that's bugged us for a while, and yes, amazing how Tom has persisted! Let's see what people do with it!
Awesome!
This PR implements solution (A) proposed in #10463.
Fixes #2792, #7426, #9691, #9994, #10463, #13596
Closes #3922, #7876, #8999, #10206
Gives a workaround for #8611
It generalizes the use of precomputed sparse distance matrices, and proposes to use pipelines to chain nearest neighbors with other estimators.
More precisely, it introduces the new estimators KNeighborsTransformer and RadiusNeighborsTransformer, which transform an input X into a (weighted) graph of its nearest neighbors. The output is a sparse distance/affinity matrix.

The proposed use case is the following:
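A sketch of such a pipeline, consistent with the description that follows; the parameter values, the cache location, and the TSNE settings here are illustrative assumptions rather than the PR's exact snippet:

```python
from tempfile import mkdtemp

from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

perplexity = 30
# with method='barnes_hut', TSNE only uses about 3 * perplexity nearest
# neighbors, so the transformer must provide at least that many
n_neighbors = 3 * perplexity + 1

est_chain = make_pipeline(
    KNeighborsTransformer(n_neighbors=n_neighbors, mode='distance'),
    TSNE(metric='precomputed', perplexity=perplexity, init='random'),
    memory=mkdtemp())  # cache the neighbors graph between fits

est_compact = TSNE(metric='euclidean', perplexity=perplexity, init='random')
```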
In this example, est_chain and est_compact are equivalent, but est_chain gives more control over the nearest neighbors estimator. Moreover, it can use the pipeline caching properties to reuse the same nearest neighbors matrix for different TSNE parameters. Finally, it is possible to replace KNeighborsTransformer with a custom nearest neighbors estimator returning the graph of connections with the nearest neighbors.

This pipeline works with:
- TSNE, Isomap, SpectralEmbedding
- DBSCAN, SpectralClustering, KNeighborsClassifier, RadiusNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsRegressor, LocalOutlierFactor

This pipeline cannot work with:

- LocallyLinearEmbedding (which needs the original X)

Possible improvements: