
[MRG+1] DBSCAN: faster, weighted samples, and sparse input #3994


Merged
5 commits merged into scikit-learn:master on Dec 25, 2014

Conversation

jnothman
Member

DBSCAN now supports sparse matrix input and weighted samples (a compact representation of density when duplicate points exist; also useful for BIRCH's global clustering, where weighting is possible).

I have also vectorized the implementation. This reduces DBSCAN's runtime on the cluster comparison toy examples from ~0.6s each to ~0.02s each on my machine.

(This could be sped up further by allowing a dual-tree radius_neighbors lookup that reuses the computed tree. The toy examples have n_features=2, so neighbor calculation does not take up as much of the overall time as it might otherwise.)
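
For context, a minimal usage sketch of the new options (the data, weights, and parameters below are illustrative, not taken from the patch):

    import numpy as np
    from scipy import sparse
    from sklearn.cluster import DBSCAN

    # A point duplicated five times can be stored once with weight 5;
    # a sample is a core point when its neighbourhood's total weight
    # reaches min_samples.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0]])
    weights = np.array([5.0, 1.0, 1.0])

    db = DBSCAN(eps=0.5, min_samples=4).fit(X, sample_weight=weights)
    print(db.labels_)  # the isolated point at (10, 10) is noise: -1

    # Sparse input is also accepted:
    DBSCAN(eps=0.5, min_samples=4).fit(sparse.csr_matrix(X))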

@jnothman
Member Author

Clustering make_blobs(5000, n_features=100, centers=10)[0] (with eps=45, min_samples=5) takes 13.4s in master and 3.2s in this branch.

Note that because this permutes over a different array, results will not be identical to previous versions for a fixed random state.
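
A rough sketch of that benchmark (random_state is illustrative; as noted above, results depend on it):

    from time import time
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    X = make_blobs(5000, n_features=100, centers=10, random_state=0)[0]
    t0 = time()
    DBSCAN(eps=45, min_samples=5).fit(X)
    print("clustered in %.1fs" % (time() - t0))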

@jnothman changed the title from [MRG] ENH vectorize DBSCAN implementation to [MRG] DBSCAN: faster, weighted samples, and sparse input on Dec 24, 2014
@jnothman
Member Author

I have extended this PR beyond its initial purpose: DBSCAN now supports sample weights. Ping @robertlayton

@jnothman
Member Author

Hmm... just realised this leaves fit_predict without a sample_weight parameter. Is overriding the best solution?

@ogrisel
Member

ogrisel commented Dec 24, 2014

Hmm... just realised this leaves fit_predict without a sample_weight parameter. Is overriding the best solution?

I think so.
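
A minimal sketch of the override pattern under discussion (the class here is hypothetical; the actual DBSCAN code may differ):

    from sklearn.base import BaseEstimator, ClusterMixin

    class WeightedClusterer(BaseEstimator, ClusterMixin):
        def fit(self, X, y=None, sample_weight=None):
            ...  # compute and store self.labels_
            return self

        def fit_predict(self, X, y=None, sample_weight=None):
            # ClusterMixin's default fit_predict does not forward
            # sample_weight, so override it to pass the weights
            # through to fit.
            self.fit(X, sample_weight=sample_weight)
            return self.labels_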

    distance_matrix = True
    # Calculate neighborhood for all samples. This leaves the original point
    # in, which needs to be considered later (i.e. point i is the
    # neighborhood of point i. While True, its useless information)
Member

point i is in the neighborhood of point i. While true, it is useless information).

@ogrisel
Member

ogrisel commented Dec 24, 2014

This looks great. +1 for merge. Don't forget to update the whats_new.rst file prior to merging.

@ogrisel changed the title from [MRG] DBSCAN: faster, weighted samples, and sparse input to [MRG+1] DBSCAN: faster, weighted samples, and sparse input on Dec 24, 2014
@jnothman
Member Author

Thanks for the review (and the +1), @ogrisel! Your comments have been addressed.

@agramfort
Member

+1 for merge. Nice job @jnothman!

agramfort added a commit that referenced this pull request on Dec 25, 2014:
[MRG+1] DBSCAN: faster, weighted samples, and sparse input
@agramfort merged commit 56ca8f9 into scikit-learn:master on Dec 25, 2014
@jnothman
Member Author

Hmm... I just noticed that DBSCAN isn't really well-tested; in particular, the boundary case where min_samples equals the number of points within the radius isn't tested, despite the note:

    # Calculate neighborhood for all samples. This leaves the original point
    # in, which needs to be considered later (i.e. point i is the
    # neighborhood of point i. While True, its useless information)

This PR broke the previous behaviour, though in my understanding it correctly matches the min_samples docstring (this PR requires n_neighbors > min_samples for a core point, while the previous code used >=). I think we need a follow-up PR to restore the previous behaviour, make the docstring more precise, and test this boundary case.
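
A sketch of the missing boundary test (data and expectations are illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Point 0 has exactly min_samples points (itself included) within eps.
    X = np.array([[0.0], [0.4], [-0.4]])
    labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

    # Previous behaviour (>=): point 0 is a core point, one cluster.
    # This PR (>): point 0 is not core, so all points are noise (-1).
    print(labels)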
