FEA Generalize the use of precomputed sparse distance matr… #10482

Merged: 132 commits, Sep 18, 2019

Conversation

@TomDLT (Member) commented Jan 16, 2018

This PR implements solution (A) proposed in #10463.

Fixes #2792, #7426, #9691, #9994, #10463, #13596
Closes #3922, #7876, #8999, #10206
Gives a workaround for #8611

It generalizes the use of precomputed sparse distance matrices, and proposes to use pipelines to chain nearest neighbors with other estimators.
More precisely, it introduces two new estimators, KNeighborsTransformer and RadiusNeighborsTransformer, which transform an input X into a (weighted) graph of its nearest neighbors. The output is a sparse distance/affinity matrix.

The proposed use case is the following:

from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsTransformer
from sklearn.manifold import TSNE

est_chain = make_pipeline(
    KNeighborsTransformer(n_neighbors=n_neighbors, mode='distance',
                          metric=metric, include_self=False),
    TSNE(metric='precomputed', method="barnes_hut"),
    memory='path/to/cache')

est_compact = TSNE(metric=metric, method="barnes_hut")

In this example est_chain and est_compact are equivalent, but est_chain gives more control over the nearest neighbors estimator. Moreover, it can use the pipeline's caching properties to reuse the same nearest neighbors matrix for different TSNE parameters. Finally, it is possible to replace KNeighborsTransformer with a custom nearest neighbors estimator that returns the graph of connections to the nearest neighbors.
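
To make the caching point concrete, here is a minimal sketch (not part of this PR; the dataset, cache directory, and parameter values are placeholders) where the KNeighborsTransformer fit is cached on disk, so only the TSNE step is re-fitted when its parameters change:

from tempfile import mkdtemp

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X, _ = make_blobs(n_samples=200, random_state=0)
cache_dir = mkdtemp()

for perplexity in [5, 15, 30]:
    est = make_pipeline(
        KNeighborsTransformer(n_neighbors=100, mode='distance'),
        TSNE(metric='precomputed', method='barnes_hut', init='random',
             perplexity=perplexity),
        memory=cache_dir)
    # The first iteration computes and caches the neighbors graph;
    # later iterations reuse it and only re-run TSNE.
    embedding = est.fit_transform(X)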


This pipeline works with:

  • TSNE, Isomap, SpectralEmbedding
  • DBSCAN, SpectralClustering
  • KNeighborsClassifier, RadiusNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsRegressor, LocalOutlierFactor

This pipeline cannot work with:

  • LocallyLinearEmbedding (which needs the original X)

Possible improvements:

  • Add a kernel parameter to transform distances into affinities (a rough sketch of this conversion follows this list)
  • Add a symmetric option for affinities (to remove the warning in SpectralEmbedding and SpectralClustering)
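
As a rough illustration of these two ideas (not part of this PR; gamma is a placeholder kernel bandwidth), the conversion can already be done by hand on the transformer output:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsTransformer

X, _ = make_blobs(n_samples=100, random_state=0)
distance_graph = KNeighborsTransformer(n_neighbors=10, mode='distance').fit_transform(X)

# RBF kernel applied to the stored distances only; the sparsity pattern is kept.
gamma = 1.0
affinity_graph = distance_graph.copy()
affinity_graph.data = np.exp(-gamma * affinity_graph.data ** 2)

# Symmetrize, as in the second item, before e.g. SpectralClustering(affinity='precomputed').
affinity_graph = 0.5 * (affinity_graph + affinity_graph.T)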

@jnothman (Member) left a comment

The changes to neighbors/base.py more-or-less duplicate a limited version of the work in #10206. Please consider adopting at least the tests from neighbors/tests/test_neighbors.py there, if not the implementation (but that is more WIP).

@jnothman (Member) left a comment

I think this is a good design. It makes using precomputed sparse matrices in DBSCAN much clearer, and it avoids complicated new API design proposed in #8999. It should be easy for existing ANN implementations to be adapted to this design.

The key caveat is that it requires all required neighborhoods to be computed in advance. This may be a memory burden still for something like DBSCAN (although our current implementation is already memory heavy). It also doesn't handle the case that an estimator repeatedly fits a nearest neighbor object, but I think those are rare.

It would be really good if we could illustrate plugging in an approximate nearest neighbors estimator, but it would likely include adding a pyann or similar dependency, unless we attempt to write a very short LSH just to prove the point.

But I think for consistency we need to have RadiusNeighborsTransformer and KNeighborsTransformer separately.

I also think we should adopt or adapt as much of the existing work at #10206 as possible. We've long iterated on it already.

@rth (Member) commented Jan 23, 2018

Nice work @TomDLT and thanks for the summary in #10463.

> The key caveat is that it requires all required neighborhoods to be computed in advance. This may be a memory burden still for something like DBSCAN (although our current implementation is already memory heavy).
> It would be really good if we could illustrate plugging in an approximate nearest neighbors estimator, but it would likely include adding a pyann or similar dependency, unless we attempt to write a very short LSH just to prove the point.

The API with a transformer looks quite nice, and caching the NN can certainly help for repeated estimator fits with different parameters. However, as far as ANNs are concerned, I don't find it immediately obvious that pre-computing a sparse distance matrix with an ANN and, e.g., using it in DBSCAN has performance equivalent to using the ANN directly in the estimator.

For instance, I would be curious to know the results of the following benchmark. Take some low-dimensional data where we know that kd_tree will perform well (e.g. n_features=2). Use it to pre-compute the sparse distance matrix and fit DBSCAN. Compare the performance to using kd_tree directly in DBSCAN. This would allow estimating the overhead of this approach, without necessarily having to add an ANN implementation.
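
A rough sketch of such a benchmark (the dataset size and eps are arbitrary placeholders), comparing DBSCAN on raw 2-D features against DBSCAN on a precomputed sparse radius-neighbors graph built with a kd_tree:

from time import time

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(20000, 2)
eps = 0.01

t0 = time()
labels_direct = DBSCAN(eps=eps, algorithm='kd_tree').fit_predict(X)
print('direct:      %.2fs' % (time() - t0))

t0 = time()
# Precompute the sparse distance graph with a kd_tree, then feed it to DBSCAN.
nn = NearestNeighbors(radius=eps, algorithm='kd_tree').fit(X)
graph = nn.radius_neighbors_graph(mode='distance')
labels_precomputed = DBSCAN(eps=eps, metric='precomputed').fit_predict(graph)
print('precomputed: %.2fs' % (time() - t0))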

@jnothman (Member)

I'm not sure what you're trying to show about ANN with that, @rth. Currently DBSCAN calculates the nearest neighbors of all X once and computes the clustering from that. There is some overhead when passing in sparse D rather than just passing in features X: unpacking, thresholding on radius and repacking.

This is not how it was originally implemented, mind you, and was optimised for speed rather than memory (see e.g. #5275). Perhaps with a *NeighborsTransformer, we could go back to that original low memory implementation, but get most of the speed benefits as long as the user provides the precomputed sparse matrix.

One disadvantage of this design as opposed to #8999 is that there the host algorithm can pass in a radius or an n_neighbors, which would save time relative to precomputing and then filtering/checking the precomputed input. We also do not currently require that the precomputed input is in sorted order, so there is an O(n_train_samples log n_train_samples) cost per query in order to sort it. We could impose some constraints on precomputed sparse matrix input, corresponding to *neighbors_graph(None, mode='distance') output (or check it and raise a warning but accept it) if we'd rather: it needs to be in CSR format with each row's data increasing, and at train time the diagonal needs to be implicit, not stored zeros.

One advantage of this design as opposed to #8999 is that it may allow you to precompute more neighbors than you need, then estimate downstream with various parameters (e.g. DBSCAN.eps); thus it works well with caching pipelines. Of course you could do that with #8999 by implementing your own precomputed neighborhood wrapper.

Regardless of all this, I think it would be good to give an example of, or at least describe, the use of this for exploiting approximate nearest neighbor approaches. Writing up a simple (but not super-efficient) Gaussian projection hash, or a MinHash, might not be excessive.
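
For illustration only, here is a deliberately naive sketch of such a Gaussian projection hash (not this PR's code; all names and parameter values are arbitrary): points are bucketed by quantized random projections, and exact distances are computed only within each bucket, yielding a sparse distance graph in the same spirit as the transformers' output.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances

def gaussian_lsh_graph(X, n_neighbors=5, n_projections=8, bucket_width=1.0, random_state=0):
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    # Hash: quantized projections onto random Gaussian directions.
    buckets = np.floor(X @ rng.randn(n_features, n_projections) / bucket_width)
    groups = {}
    for i, key in enumerate(map(tuple, buckets)):
        groups.setdefault(key, []).append(i)
    rows, cols, vals = [], [], []
    for key, members in groups.items():
        members = np.asarray(members)
        # Exact distances only within the bucket (hence "approximate" overall).
        dist = pairwise_distances(X[members])
        for pos, i in enumerate(members):
            order = np.argsort(dist[pos])[:n_neighbors]  # includes i itself at distance 0
            rows.extend([i] * len(order))
            cols.extend(members[order])
            vals.extend(dist[pos, order])
    return csr_matrix((vals, (rows, cols)), shape=(n_samples, n_samples))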

@TomDLT (Member, Author) commented Jan 23, 2018

Thanks for the feedback.

additional TODO:

  • Separate into RadiusNeighborsTransformer and KNeighborsTransformer
  • Consider adding a constraint on graph sorting order
  • Adapt as much of [MRG] Modifies T-SNE for sparse matrix #10206 as possible, in particular unpacking with different n_neighbors/radius
  • Add small speed benchmark to quantify unpacking overhead
  • Add example with ANN

@scikit-learn scikit-learn deleted a comment from sklearn-lgtm Jan 23, 2018
@jnothman (Member) commented Jan 23, 2018

I suggest we do something like:

import warnings

import numpy as np
from scipy.sparse import issparse

from sklearn.exceptions import EfficiencyWarning
from sklearn.utils import check_array


def _check_precomputed(D, n_neighbors=None):
    if not issparse(D):
        return check_array(D, ensure_min_features=n_neighbors or 1)
    if D.format not in ('csr', 'csc', 'coo', 'lil'):
        raise TypeError('Sparse matrix in {!r} format is not supported due to '
                        'its handling of explicit zeros'.format(D.format))
    copied = D.format != 'csr'
    D = D.tocsr()
    if n_neighbors is not None and np.diff(D.indptr).min() < n_neighbors:
        raise ValueError('Provide at least {} precomputed sparse neighbors. '
                         'Got {}'.format(n_neighbors, np.diff(D.indptr).min()))
    # Check that each row's data is sorted: out-of-order pairs are only allowed
    # at row boundaries (between the last entry of a row and the first of the next).
    out_of_order = D.data[:-1] > D.data[1:]
    if out_of_order.sum() != out_of_order.take(D.indptr[1:-1] - 1, mode='clip').sum():
        warnings.warn('Precomputed sparse input was not sorted by data.',
                      EfficiencyWarning)
        if not copied:
            D = D.copy()
        # Sort each row by distance, keeping data and indices in sync.
        for start, stop in zip(D.indptr, D.indptr[1:]):
            order = np.argsort(D.data[start:stop], kind='mergesort')
            D.data[start:stop] = D.data[start:stop][order]
            D.indices[start:stop] = D.indices[start:stop][order]
    return D

It's a pity the sorting check takes O(nnz) memory and time

@jnothman (Member) left a comment

Nice example and benchmark! Seeing as it's too slow to show its benefits when run in the example gallery anyway, keeping it there and not requiring the dependency looks like a reasonable idea.

TomDLT and others added 4 commits January 26, 2018 11:52
Conflicts:
	sklearn/manifold/t_sne.py
	sklearn/manifold/tests/test_t_sne.py
@jnothman (Member)

Everything is red (and that might be my fault)

@thomasjpfan (Member) commented Sep 18, 2019

We kind of hide this implementation detail in our KNeighborsTransformer, but a custom *NeighborsTransformer would need to know when to do this (which is only shown in the example).

@thomasjpfan (Member) left a comment

After the additional commits that add details about KNeighborsTransformer, I think this PR is ready to merge.

@thomasjpfan (Member) left a comment

Minor versionadded in docstrings

@jnothman (Member) commented Sep 18, 2019 via email

@TomDLT (Member, Author) commented Sep 18, 2019

> I think this PR is ready to merge.

Awesome !

> it is unclear when to set n_neighbors = self.n_neighbors + 1 in the implementation of a custom *NeighborsTransformer

  • This happens only for KNeighborsTransformer, since RadiusNeighborsTransformer does not need an n_neighbors parameter.
  • This is not really a strong API specification, since passing too many neighbors will always work (the following estimators will just filter out the extra neighbors, as sketched after this list). Passing too few neighbors will raise an error.
  • The mode == 'distance' distinction is only for internal compatibility of n_neighbors definitions. A custom estimator does not need to make this distinction.
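
A minimal sketch of that filtering behaviour (not from the PR; the numbers are arbitrary): the transformer stores 10 neighbors per sample, while the downstream classifier only uses 5 of them.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=0)
est = make_pipeline(
    KNeighborsTransformer(n_neighbors=10, mode='distance'),
    KNeighborsClassifier(n_neighbors=5, metric='precomputed'))
# The extra precomputed neighbors are simply ignored by the classifier;
# providing fewer than 5 would raise an error instead.
predictions = est.fit(X, y).predict(X)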

I added a bit more detail in the documentation; tell me if you think something else should be added.

@thomasjpfan (Member)

> To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.

I like this. As a user, I would always include one more just to keep things simple. (Or a few more to have the flexibility to explore parameters downstream.)

@thomasjpfan thomasjpfan changed the title [MRG+1] Generalize the use of precomputed sparse distance matrices, for estimators which use nearest neighbors FEA Generalize the use of precomputed sparse distance matr… Sep 18, 2019
@thomasjpfan thomasjpfan merged commit e52e9c8 into scikit-learn:master Sep 18, 2019
@thomasjpfan (Member)

Thank you @TomDLT !

@GaelVaroquaux (Member) commented Sep 18, 2019 via email

@ogrisel (Member) commented Sep 18, 2019

Now this opens up a lot of opportunities to precompute neighbors with state-of-the-art approximate NN implementations such as annoy or https://github.com/kakao/n2.
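
For instance, here is a minimal sketch (assuming the third-party annoy package; the class and parameter names are illustrative, not from this PR) of wrapping such an index as a drop-in replacement for KNeighborsTransformer:

import numpy as np
from annoy import AnnoyIndex
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin

class AnnoyTransformer(BaseEstimator, TransformerMixin):
    """Approximate nearest neighbors graph built with annoy."""

    def __init__(self, n_neighbors=5, n_trees=10, metric='euclidean'):
        self.n_neighbors = n_neighbors
        self.n_trees = n_trees
        self.metric = metric

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=np.float32)
        self.n_samples_fit_ = X.shape[0]
        self.index_ = AnnoyIndex(X.shape[1], self.metric)
        for i, row in enumerate(X):
            self.index_.add_item(i, row.tolist())
        self.index_.build(self.n_trees)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=np.float32)
        rows, cols, vals = [], [], []
        for i, row in enumerate(X):
            # Query one extra neighbor, following the convention discussed above.
            indices, distances = self.index_.get_nns_by_vector(
                row.tolist(), self.n_neighbors + 1, include_distances=True)
            rows.extend([i] * len(indices))
            cols.extend(indices)
            vals.extend(distances)
        return csr_matrix((vals, (rows, cols)),
                          shape=(X.shape[0], self.n_samples_fit_))

Such a wrapper can then be placed at the head of the pipelines shown at the top of this PR.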

@GaelVaroquaux (Member) commented Sep 18, 2019 via email

@agramfort (Member)

oh boy ! big congrats @TomDLT for never giving up on this one ! 🍻

@jnothman (Member) commented Sep 18, 2019 via email

@TomDLT (Member, Author) commented Sep 18, 2019

Awesome !
Thanks a lot for all the comments and reviews !

Labels: High Priority, Waiting for Reviewer
Successfully merging this pull request may close these issues: Add option to Isomap for using a precomputed neighborhood graph
10 participants