[MRG+1] DBSCAN: faster, weighted samples, and sparse input #3994
Note: because this permutes over a different array, results will not be identical to previous versions for a fixed random state.
I have extended this PR from its initial purpose, such that DBSCAN now supports sample weights. Ping @robertlayton
Hmm... just realised this leaves
I think so.
distance_matrix = True
# Calculate neighborhood for all samples. This leaves the original point
# in, which needs to be considered later (i.e. point i is the
# neighborhood of point i. While True, its useless information)
point i is in the neighborhood of point i. While true, it is useless information).
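To make the point in the comment above concrete, here is a minimal pure-Python sketch of a brute-force radius query in which every point ends up in its own neighborhood. It is not the code from this PR: `radius_neighborhoods` is a hypothetical helper, and the `>=` core-point test (with the point itself counted) is an assumed convention used only for illustration.

```python
import math


def radius_neighborhoods(points, eps):
    """Return, for each point i, the indices j with dist(i, j) <= eps.

    Point i is always in its own neighborhood because dist(i, i) == 0.
    """
    return [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
neighborhoods = radius_neighborhoods(points, eps=1.5)

# Assumed convention: the point itself counts towards min_samples,
# and a point is core when its neighborhood size is >= min_samples.
min_samples = 2
is_core = [len(nbrs) >= min_samples for nbrs in neighborhoods]
```

With these three points, the first two have exactly `min_samples` members in their neighborhoods (themselves plus each other), so whether they are core depends entirely on how the self-neighbor and the boundary comparison are handled.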
This looks great. +1 for merge. Don't forget to update the
Thanks for the review (and the +1), @ogrisel! Your comments have been addressed.
+1 for merge. Nice job, @jnothman!
Hmm... I just noticed that DBSCAN isn't really well-tested, and in particular the boundary case of min_samples == number within radius isn't tested, despite the note:
This PR broke the previous behaviour but correctly matched the
DBSCAN now supports sparse matrix input, and weighted samples (a compact representation of density when duplicate points exist; also useful when weighting is possible for BIRCH's global clustering).
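As a rough illustration of why weights act as a compact representation of duplicates, the sketch below compares a core-point test on data with repeated points against the same test on deduplicated data with multiplicities as weights. This is a pure-Python sketch, not the implementation in this PR: `core_mask` is a hypothetical helper, and it assumes a point is core when the total weight within `eps` (itself included) is at least `min_samples`.

```python
import math


def core_mask(points, eps, min_samples, weights=None):
    """Flag core points: total weight within eps (self included) >= min_samples."""
    if weights is None:
        weights = [1.0] * len(points)
    mask = []
    for p in points:
        # Sum the weights of all points within eps of p, including p itself.
        total = sum(w for q, w in zip(points, weights) if math.dist(p, q) <= eps)
        mask.append(total >= min_samples)
    return mask


# Expanded data: the point (0, 0) appears three times.
expanded = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
# Compact data: one copy of each distinct point, with its multiplicity as weight.
compact = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
weights = [3.0, 1.0, 1.0]

expanded_mask = core_mask(expanded, eps=1.5, min_samples=4)
compact_mask = core_mask(compact, eps=1.5, min_samples=4, weights=weights)
```

Under that assumption, each distinct point gets the same core/non-core decision in both representations, while the weighted version only has to scan three points instead of five.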
I have also vectorized the implementation. This reduces the runtime of the cluster comparison toy examples under DBSCAN from ~0.6 s each to 0.02 s each on my machine.
(This could be sped up further by allowing a dual-tree radius_neighbors lookup that reuses the computed tree. The toy examples have n_features=2, so neighbor calculation does not take up as much of the overall time as it might otherwise.)