FEA Generalize the use of precomputed sparse distance matr… #10482

Merged: 132 commits, Sep 18, 2019

Conversation

@TomDLT (Member) commented Jan 16, 2018

This PR implements solution (A) proposed in #10463.

Fixes #2792, #7426, #9691, #9994, #10463, #13596
Closes #3922, #7876, #8999, #10206
Gives a workaround for #8611

It generalizes the use of precomputed sparse distance matrices, and proposes to use pipelines to chain nearest neighbors with other estimators.
More precisely, it introduces two new estimators, KNeighborsTransformer and RadiusNeighborsTransformer, which transform an input X into a (weighted) graph of its nearest neighbors. The output is a sparse distance/affinity matrix.

The proposed use case is the following:

from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsTransformer
from sklearn.manifold import TSNE

est_chain = make_pipeline(
    KNeighborsTransformer(n_neighbors=n_neighbors, mode='distance',
                          metric=metric, include_self=False),
    TSNE(metric='precomputed', method="barnes_hut"),
    memory='path/to/cache')

est_compact = TSNE(metric=metric, method="barnes_hut")

In this example est_chain and est_compact are equivalent, but est_chain gives more control over the nearest neighbors estimator. Moreover, it can use the pipeline's caching properties to reuse the same nearest neighbors matrix for different TSNE parameters. Finally, it is possible to replace KNeighborsTransformer with a custom nearest neighbors estimator that returns the graph of connections to the nearest neighbors.
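
To make the caching point concrete, here is a minimal sketch (not part of this PR; the dataset, cache directory, and parameter values are placeholders) where the KNeighborsTransformer fit is cached on disk, so only the TSNE step is re-fitted when its parameters change:

from tempfile import mkdtemp

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X, _ = make_blobs(n_samples=200, random_state=0)
cache_dir = mkdtemp()

for perplexity in [5, 15, 30]:
    est = make_pipeline(
        KNeighborsTransformer(n_neighbors=100, mode='distance'),
        TSNE(metric='precomputed', method='barnes_hut', init='random',
             perplexity=perplexity),
        memory=cache_dir)
    # The first iteration computes and caches the neighbors graph;
    # later iterations reuse it and only re-run TSNE.
    embedding = est.fit_transform(X)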


This pipeline works with:

  • TSNE, Isomap, SpectralEmbedding
  • DBSCAN, SpectralClustering
  • KNeighborsClassifier, RadiusNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsRegressor, LocalOutlierFactor

This pipeline cannot work with:

  • LocallyLinearEmbedding (which needs the original X)

Possible improvements:

  • Add a kernel parameter to transform distances into affinities (a rough sketch of this conversion follows this list)
  • Add a symmetric option for affinities (to remove the warning in SpectralEmbedding and SpectralClustering)
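
As a rough illustration of these two ideas (not part of this PR; gamma is a placeholder kernel bandwidth), the conversion can already be done by hand on the transformer output:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsTransformer

X, _ = make_blobs(n_samples=100, random_state=0)
distance_graph = KNeighborsTransformer(n_neighbors=10, mode='distance').fit_transform(X)

# RBF kernel applied to the stored distances only; the sparsity pattern is kept.
gamma = 1.0
affinity_graph = distance_graph.copy()
affinity_graph.data = np.exp(-gamma * affinity_graph.data ** 2)

# Symmetrize, as in the second item, before e.g. SpectralClustering(affinity='precomputed').
affinity_graph = 0.5 * (affinity_graph + affinity_graph.T)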

@jnothman (Member) left a comment

The changes to neighbors/base.py more-or-less duplicate a limited version of the work in #10206. Please consider adopting at least the tests from neighbors/tests/test_neighbors.py there, if not the implementation (but that is more WIP).

@jnothman (Member) left a comment

I think this is a good design. It makes using precomputed sparse matrices in DBSCAN much clearer, and it avoids complicated new API design proposed in #8999. It should be easy for existing ANN implementations to be adapted to this design.

The key caveat is that it requires all required neighborhoods to be computed in advance. This may be a memory burden still for something like DBSCAN (although our current implementation is already memory heavy). It also doesn't handle the case that an estimator repeatedly fits a nearest neighbor object, but I think those are rare.

It would be really good if we could illustrate plugging in an approximate nearest neighbors estimator, but it would likely include adding a pyann or similar dependency, unless we attempt to write a very short LSH just to prove the point.

But I think for consistency we need to have RadiusNeighborsTransformer and KNeighborsTransformer separately.

I also think we should adopt or adapt as much of the existing work at #10206 as possible. We've long iterated on it already.

@rth (Member) commented Jan 23, 2018

Nice work @TomDLT and thanks for the summary in #10463.

> The key caveat is that it requires all required neighborhoods to be computed in advance. This may be a memory burden still for something like DBSCAN (although our current implementation is already memory heavy).
> It would be really good if we could illustrate plugging in an approximate nearest neighbors estimator, but it would likely include adding a pyann or similar dependency, unless we attempt to write a very short LSH just to prove the point.

The API with a transformer looks quite nice, and caching the NN can certainly help for repeated estimator fits with different parameters. However, as far as ANNs are concerned, I don't find it immediately obvious that pre-computing a sparse distance matrix with an ANN and, e.g., using it in DBSCAN has performance equivalent to using the ANN directly in the estimator.

For instance, I would be curious to know the results of the following benchmark. Take some low-dimensional data where we know that kd_tree will perform well (e.g. n_features=2). Use it to pre-compute the sparse distance matrix and fit DBSCAN. Compare the performance to using kd_tree directly in DBSCAN. This would allow estimating the overhead of this approach, without necessarily having to add an ANN implementation.
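
A rough sketch of such a benchmark (the dataset size and eps are arbitrary placeholders), comparing DBSCAN on raw 2-D features against DBSCAN on a precomputed sparse radius-neighbors graph built with a kd_tree:

from time import time

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(20000, 2)
eps = 0.01

t0 = time()
labels_direct = DBSCAN(eps=eps, algorithm='kd_tree').fit_predict(X)
print('direct:      %.2fs' % (time() - t0))

t0 = time()
# Precompute the sparse distance graph with a kd_tree, then feed it to DBSCAN.
nn = NearestNeighbors(radius=eps, algorithm='kd_tree').fit(X)
graph = nn.radius_neighbors_graph(mode='distance')
labels_precomputed = DBSCAN(eps=eps, metric='precomputed').fit_predict(graph)
print('precomputed: %.2fs' % (time() - t0))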

@jnothman (Member)

I'm not sure what you're trying to show about ANN with that, @rth. Currently DBSCAN calculates the nearest neighbors of all X once and computes the clustering from that. There is some overhead when passing in sparse D rather than just passing in features X: unpacking, thresholding on radius and repacking.

This is not how it was originally implemented, mind you, and was optimised for speed rather than memory (see e.g. #5275). Perhaps with a *NeighborsTransformer, we could go back to that original low memory implementation, but get most of the speed benefits as long as the user provides the precomputed sparse matrix.

One disadvantage of this design as opposed to #8999 is that there the host algorithm can pass in a radius or an n_neighbors, which would save time relative to precomputing and then filtering/checking the precomputed input. We also do not currently require that the precomputed input is in sorted order, so there is an O(n_train_samples log n_train_samples) cost per query in order to sort it. We could impose some constraints on precomputed sparse matrix input, corresponding to *neighbors_graph(None, mode='distance') output (or check it and raise a warning but accept it) if we'd rather: it needs to be in CSR format with each row's data increasing, and at train time the diagonal needs to be implicit, not stored zeros.

One advantage of this design as opposed to #8999 is that it may allow you to precompute more neighbors than you need, then estimate downstream with various parameters (e.g. DBSCAN.eps); thus it works well with caching pipelines. Of course you could do that with #8999 by implementing your own precomputed neighborhood wrapper.

Regardless of all this, I think it would be good to give an example of, or at least describe, the use of this for exploiting approximate nearest neighbor approaches. Writing up a simple (but not super-efficient) Gaussian projection hash, or a MinHash, might not be excessive.
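
For illustration only, here is a deliberately naive sketch of such a Gaussian projection hash (not this PR's code; all names and parameter values are arbitrary): points are bucketed by quantized random projections, and exact distances are computed only within each bucket, yielding a sparse distance graph in the same spirit as the transformers' output.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances

def gaussian_lsh_graph(X, n_neighbors=5, n_projections=8, bucket_width=1.0, random_state=0):
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    # Hash: quantized projections onto random Gaussian directions.
    buckets = np.floor(X @ rng.randn(n_features, n_projections) / bucket_width)
    groups = {}
    for i, key in enumerate(map(tuple, buckets)):
        groups.setdefault(key, []).append(i)
    rows, cols, vals = [], [], []
    for key, members in groups.items():
        members = np.asarray(members)
        # Exact distances only within the bucket (hence "approximate" overall).
        dist = pairwise_distances(X[members])
        for pos, i in enumerate(members):
            order = np.argsort(dist[pos])[:n_neighbors]  # includes i itself at distance 0
            rows.extend([i] * len(order))
            cols.extend(members[order])
            vals.extend(dist[pos, order])
    return csr_matrix((vals, (rows, cols)), shape=(n_samples, n_samples))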

@TomDLT (Member, Author) commented Jan 23, 2018

Thanks for the feedback.

additional TODO:

  • Separate into RadiusNeighborsTransformer and KNeighborsTransformer
  • Consider adding a constraint on graph sorting order
  • Adapt as much of [MRG] Modifies T-SNE for sparse matrix #10206 as possible, in particular unpacking with different n_neighbors/radius
  • Add small speed benchmark to quantify unpacking overhead
  • Add example with ANN

@scikit-learn scikit-learn deleted a comment from sklearn-lgtm Jan 23, 2018
@jnothman (Member) commented Jan 23, 2018

I suggest we do something like:

import warnings

import numpy as np
from scipy.sparse import issparse

from sklearn.exceptions import EfficiencyWarning
from sklearn.utils import check_array


def _check_precomputed(D, n_neighbors=None):
    if not issparse(D):
        return check_array(D, ensure_min_features=n_neighbors or 1)
    if D.format not in ('csr', 'csc', 'coo', 'lil'):
        raise TypeError('Sparse matrix in {!r} format is not supported due to '
                        'its handling of explicit zeros'.format(D.format))
    copied = D.format != 'csr'
    D = D.tocsr()
    if n_neighbors is not None and np.diff(D.indptr).min() < n_neighbors:
        raise ValueError('Provide at least {} precomputed sparse neighbors. '
                         'Got {}'.format(n_neighbors, np.diff(D.indptr).min()))
    # Check that each row's data is sorted: out-of-order pairs are only allowed
    # at row boundaries (between the last entry of a row and the first of the next).
    out_of_order = D.data[:-1] > D.data[1:]
    if out_of_order.sum() != out_of_order.take(D.indptr[1:-1] - 1, mode='clip').sum():
        warnings.warn('Precomputed sparse input was not sorted by data.',
                      EfficiencyWarning)
        if not copied:
            D = D.copy()
        # Sort each row by distance, keeping data and indices in sync.
        for start, stop in zip(D.indptr, D.indptr[1:]):
            order = np.argsort(D.data[start:stop], kind='mergesort')
            D.data[start:stop] = D.data[start:stop][order]
            D.indices[start:stop] = D.indices[start:stop][order]
    return D

It's a pity the sorting check takes O(nnz) memory and time

@jnothman (Member) left a comment

Nice example and benchmark! Seeing as it's too slow to show its benefits when run in the example gallery anyway, keeping it there and not requiring the dependency looks like a reasonable idea.

TomDLT and others added 4 commits January 26, 2018 11:52
Conflicts:
	sklearn/manifold/t_sne.py
	sklearn/manifold/tests/test_t_sne.py
@jnothman (Member)

Everything is red (and that might be my fault)

@thomasjpfan (Member) commented Sep 18, 2019

We kind of hide this implementation detail in our KNeighborsTransformer, but a custom *NeighborsTransformer would need to know when to do this (which is only shown in the example).

@thomasjpfan (Member) left a comment

After the additional commits that add details about KNeighborsTransformer, I think this PR is ready to merge.

@thomasjpfan (Member) left a comment

Minor versionadded in docstrings

@jnothman (Member) commented Sep 18, 2019 via email

@TomDLT (Member, Author) commented Sep 18, 2019

> I think this PR is ready to merge.

Awesome !

> it is unclear when to set n_neighbors = self.n_neighbors + 1 in the implementation of a custom *NeighborsTransformer

  • This happens only for KNeighborsTransformer, since RadiusNeighborsTransformer does not need an n_neighbors parameter.
  • This is not really a strong API specification, since passing too many neighbors will always work (the following estimators will just filter out the extra neighbors, as sketched after this list). Passing too few neighbors will raise an error.
  • The mode == 'distance' distinction is only for internal compatibility of n_neighbors definitions. A custom estimator does not need to make this distinction.
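
A minimal sketch of that filtering behaviour (not from the PR; the numbers are arbitrary): the transformer stores 10 neighbors per sample, while the downstream classifier only uses 5 of them.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=0)
est = make_pipeline(
    KNeighborsTransformer(n_neighbors=10, mode='distance'),
    KNeighborsClassifier(n_neighbors=5, metric='precomputed'))
# The extra precomputed neighbors are simply ignored by the classifier;
# providing fewer than 5 would raise an error instead.
predictions = est.fit(X, y).predict(X)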

I added a bit more detail in the documentation; tell me if you think something else should be added.

@thomasjpfan (Member)

> To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.

I like this. As a user, I would always include one more just to keep things simple. (Or a few more to have the flexibility to explore parameters downstream.)

@thomasjpfan thomasjpfan changed the title [MRG+1] Generalize the use of precomputed sparse distance matrices, for estimators which use nearest neighbors FEA Generalize the use of precomputed sparse distance matr… Sep 18, 2019
@thomasjpfan thomasjpfan merged commit e52e9c8 into scikit-learn:master Sep 18, 2019
@thomasjpfan (Member)

Thank you @TomDLT !

@GaelVaroquaux (Member) commented Sep 18, 2019 via email

@ogrisel (Member) commented Sep 18, 2019

Now this opens up a lot of opportunities to precompute neighbors with state-of-the-art approximate NN implementations such as annoy or https://github.com/kakao/n2.
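
For instance, here is a minimal sketch (assuming the third-party annoy package; the class and parameter names are illustrative, not from this PR) of wrapping such an index as a drop-in replacement for KNeighborsTransformer:

import numpy as np
from annoy import AnnoyIndex
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin

class AnnoyTransformer(BaseEstimator, TransformerMixin):
    """Approximate nearest neighbors graph built with annoy."""

    def __init__(self, n_neighbors=5, n_trees=10, metric='euclidean'):
        self.n_neighbors = n_neighbors
        self.n_trees = n_trees
        self.metric = metric

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=np.float32)
        self.n_samples_fit_ = X.shape[0]
        self.index_ = AnnoyIndex(X.shape[1], self.metric)
        for i, row in enumerate(X):
            self.index_.add_item(i, row.tolist())
        self.index_.build(self.n_trees)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=np.float32)
        rows, cols, vals = [], [], []
        for i, row in enumerate(X):
            # Query one extra neighbor, following the convention discussed above.
            indices, distances = self.index_.get_nns_by_vector(
                row.tolist(), self.n_neighbors + 1, include_distances=True)
            rows.extend([i] * len(indices))
            cols.extend(indices)
            vals.extend(distances)
        return csr_matrix((vals, (rows, cols)),
                          shape=(X.shape[0], self.n_samples_fit_))

Such a wrapper can then be placed at the head of the pipelines shown at the top of this PR.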

@GaelVaroquaux (Member) commented Sep 18, 2019 via email

@agramfort (Member)

oh boy ! big congrats @TomDLT for never giving up on this one ! 🍻

@jnothman (Member) commented Sep 18, 2019 via email

@TomDLT (Member, Author) commented Sep 18, 2019

Awesome !
Thanks a lot for all the comments and reviews !

Labels: High Priority, Waiting for Reviewer
Successfully merging this pull request may close these issues: Add option to Isomap for using a precomputed neighborhood graph
10 participants