
[MRG+1] DBSCAN: faster, weighted samples, and sparse input #3994


Merged
5 commits merged into scikit-learn:master on Dec 25, 2014

Conversation

jnothman
Member

DBSCAN now supports sparse matrix input and weighted samples (a compact representation of density when duplicate points exist; also useful for BIRCH's global clustering, where weighting is possible).

I have also vectorized the implementation. This reduces DBSCAN's runtime on the cluster comparison toy examples from ~0.6s each to ~0.02s each on my machine.

(This could be sped up further by allowing a dual-tree radius_neighbors lookup that reuses the computed tree. The toy examples have n_features=2, so neighbor calculation does not take up as much of the overall time as it might otherwise.)
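
For context, a minimal usage sketch of the new options (the data, weights, and parameters below are illustrative, not taken from the patch):

    import numpy as np
    from scipy import sparse
    from sklearn.cluster import DBSCAN

    # A point duplicated five times can be stored once with weight 5;
    # a sample is a core point when its neighbourhood's total weight
    # reaches min_samples.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0]])
    weights = np.array([5.0, 1.0, 1.0])

    db = DBSCAN(eps=0.5, min_samples=4).fit(X, sample_weight=weights)
    print(db.labels_)  # the isolated point at (10, 10) is noise: -1

    # Sparse input is also accepted:
    DBSCAN(eps=0.5, min_samples=4).fit(sparse.csr_matrix(X))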

@jnothman
Member Author

Clustering make_blobs(5000, n_features=100, centers=10)[0] (with eps=45, min_samples=5) takes 13.4s in master and 3.2s in this branch.

Note that because this permutes over a different array, results will not be identical to previous versions for a fixed random state.
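
A rough sketch of that benchmark (random_state is illustrative; as noted above, results depend on it):

    from time import time
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    X = make_blobs(5000, n_features=100, centers=10, random_state=0)[0]
    t0 = time()
    DBSCAN(eps=45, min_samples=5).fit(X)
    print("clustered in %.1fs" % (time() - t0))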

@jnothman changed the title from [MRG] ENH vectorize DBSCAN implementation to [MRG] DBSCAN: faster, weighted samples, and sparse input on Dec 24, 2014
@jnothman
Member Author

I have extended this PR beyond its initial purpose: DBSCAN now supports sample weights. Ping @robertlayton

@jnothman
Member Author

Hmm... just realised this leaves fit_predict without a sample_weight parameter. Is overriding the best solution?

@ogrisel
Member

ogrisel commented Dec 24, 2014

Hmm... just realised this leaves fit_predict without a sample_weight parameter. Is overriding the best solution?

I think so.
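
A minimal sketch of the override pattern under discussion (the class here is hypothetical; the actual DBSCAN code may differ):

    from sklearn.base import BaseEstimator, ClusterMixin

    class WeightedClusterer(BaseEstimator, ClusterMixin):
        def fit(self, X, y=None, sample_weight=None):
            ...  # compute and store self.labels_
            return self

        def fit_predict(self, X, y=None, sample_weight=None):
            # ClusterMixin's default fit_predict does not forward
            # sample_weight, so override it to pass the weights
            # through to fit.
            self.fit(X, sample_weight=sample_weight)
            return self.labels_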

    distance_matrix = True
    # Calculate neighborhood for all samples. This leaves the original point
    # in, which needs to be considered later (i.e. point i is the
    # neighborhood of point i. While True, its useless information)
Member

point i is in the neighborhood of point i. While true, it is useless information).

@ogrisel
Member

ogrisel commented Dec 24, 2014

This looks great. +1 for merge. Don't forget to update the whats_new.rst file prior to merging.

@ogrisel changed the title from [MRG] DBSCAN: faster, weighted samples, and sparse input to [MRG+1] DBSCAN: faster, weighted samples, and sparse input on Dec 24, 2014
@jnothman
Member Author

Thanks for the review (and the +1), @ogrisel! Your comments have been addressed.

@agramfort
Member

+1 for merge. Nice job @jnothman!

agramfort added a commit that referenced this pull request on Dec 25, 2014:
[MRG+1] DBSCAN: faster, weighted samples, and sparse input
@agramfort merged commit 56ca8f9 into scikit-learn:master on Dec 25, 2014
@jnothman
Member Author

Hmm... I just noticed that DBSCAN isn't really well-tested; in particular, the boundary case where min_samples equals the number of points within the radius isn't tested, despite the note:

    # Calculate neighborhood for all samples. This leaves the original point
    # in, which needs to be considered later (i.e. point i is the
    # neighborhood of point i. While True, its useless information)

This PR broke the previous behaviour, though in my understanding it correctly matches the min_samples docstring (this PR requires n_neighbors > min_samples for a core point, while the previous code used >=). I think we need a follow-up PR to restore the previous behaviour, make the docstring more precise, and test this boundary case.
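
A sketch of the missing boundary test (data and expectations are illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Point 0 has exactly min_samples points (itself included) within eps.
    X = np.array([[0.0], [0.4], [-0.4]])
    labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

    # Previous behaviour (>=): point 0 is a core point, one cluster.
    # This PR (>): point 0 is not core, so all points are noise (-1).
    print(labels)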
