[MRG+1] Add DBSCAN support for additional metric params #8139
Conversation
f8a83fe to 5cd2357
Thanks for the contribution
@@ -50,6 +50,11 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
        must be square. X may be a sparse matrix, in which case only "nonzero"
        elements may be considered neighbors for DBSCAN.

    metric_params : dict, optional (default = None)
I think "optional" is sufficient when the semantics of the default are not clear.
@@ -184,6 +190,11 @@ class DBSCAN(BaseEstimator, ClusterMixin):
        .. versionadded:: 0.17
           metric *precomputed* to accept precomputed sparse matrix.

    metric_params : dict, optional (default = None)
I think "optional" is sufficient when the semantics of the default are not clear.
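To make the new parameter concrete, here is a minimal sketch of what the docstring addition enables. The toy data and the chosen `eps`/`min_samples` values are illustrative, not taken from the PR:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated 1-D groups; hypothetical toy data for illustration.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Extra keyword arguments for the metric (here Minkowski's ``p``)
# can be forwarded via ``metric_params`` instead of a custom callable.
db = DBSCAN(eps=0.5, min_samples=2, metric='minkowski',
            metric_params={'p': 1}).fit(X)
print(db.labels_)  # two clusters, no noise points
```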
    # Different eps to other test, because distance is not normalised.
    eps = 0.8
    min_samples = 10
    p = 2
This isn't a very strong test, as p=2 is the default for minkowski in our nearest neighbors implementation. Trying p=1 would be more demonstrative. Or both.
Changing to p=1.
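The point of the reviewer's comment can be illustrated with plain distance arithmetic, independent of the code under review: only a non-default `p` reveals whether `metric_params` is actually honored.

```python
# Minkowski distance for a toy pair of 2-D points.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 2))  # 5.0 -- indistinguishable from the Euclidean default
print(minkowski(a, b, 1))  # 7.0 -- differs, so p=1 exercises the new code path
```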
    # number of clusters, ignoring noise if present
    n_clusters_1 = len(set(labels)) - int(-1 in labels)
    assert_equal(n_clusters_1, n_clusters)
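For context on what this assertion checks: it only compares cluster counts, not label assignments. A self-contained sketch with a hypothetical label vector:

```python
# Hypothetical label vector: three clusters plus one noise point.
# In scikit-learn's DBSCAN, noise samples are labelled -1.
labels = [0, 0, 1, 1, -1, 2]

# Count distinct labels, subtracting one if the noise label is present.
n_clusters = len(set(labels)) - int(-1 in labels)
print(n_clusters)  # 3
```

Any relabelling of the clusters (e.g. swapping 0 and 2) passes this check unchanged, which is why the reviewer calls it a weak assertion.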
I would've thought that the complete set of labels was identical. Why are we using as weak an assertion as this?
That struck me as odd too until I just read up on DBSCAN and its implementation:
The algorithm is non-deterministic, but the core samples will always belong to the same clusters (although the labels may be different). The non-determinism comes from deciding to which cluster a non-core sample belongs. A non-core sample can have a distance lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned to whichever cluster is generated first, where the order is determined randomly. Other than the ordering of the dataset, the algorithm is deterministic, making the results relatively stable between runs on the same data.
So if you run the algorithm with the same parameters twice, it can differ in two ways:
- The cluster labels themselves (meaning cluster 1 can be cluster 2 when run again, I think)
- The non-core samples (which are included here in the labels returned by dbscan())
Even then I agree that the tests for DBSCAN seem pretty weak. It should probably at least test whether the core samples end up in the same clusterings, though you have to match the cluster labels between runs.
I'm not sure that I'd be the best person to undertake a refactoring of the DBSCAN tests, though. If they aren't up to par it might be best addressed in a separate issue.
For a given random state the labels should be deterministic. The only difference here might be numerical precision errors, but they shouldn't be an issue, I'd think.
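The determinism claim is easy to check empirically. A minimal sketch (toy data, assumed parameter values, not from the PR): fitting twice on the same input in the same order should yield identical label vectors, not merely the same cluster count.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated 1-D groups; hypothetical toy data for illustration.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

labels_a = DBSCAN(eps=0.5, min_samples=2).fit(X).labels_
labels_b = DBSCAN(eps=0.5, min_samples=2).fit(X).labels_

# Same input order -> identical labels on both runs.
print((labels_a == labels_b).all())
```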
You can start by making this one better.
Note that that blurb on determinism has been changed in master: http://scikit-learn.org/dev/modules/clustering.html#dbscan
And apologies: no random_state involved.
Oh wow, the new explanation is much stronger (did the implementation change?).
You're right, the labels (as well as the core sample subset) are deterministic - I rewrote the test to validate this in the latest commit. Maybe at some point I can get around to refactoring some of the other tests.
> Oh wow, the new explanation is much stronger (did the implementation change?).
The implementation didn't change. @jbednar, who was frustrated at the previous description's inaccuracy, is to thank.
Updated the new test to hopefully be a bit more rigorous.
LGTM
thx @naoyak
…#8139) * Add DBSCAN support for additional metric params
Fixes #4520

Adds a metric_params argument to DBSCAN (i.e. passing {'p': p} when using 'minkowski' for the metric)