[MRG+1] Add DBSCAN support for additional metric params #8139


Merged: 2 commits into scikit-learn:master, Jan 2, 2017

Conversation

@naoyak (Contributor) commented Jan 1, 2017

Fixes #4520

  • Added support for the additional metric_params argument in DBSCAN (e.g. passing {'p': p} when using 'minkowski' as the metric)
  • Added a test to confirm that this works
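With this change, the Minkowski exponent can be forwarded through metric_params. A minimal usage sketch (the data and thresholds below are illustrative, not taken from the PR):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four points along a diagonal, 1.0 apart in each coordinate.
X = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)

# Euclidean (p=2): consecutive points are sqrt(2) ~ 1.41 apart,
# within eps=1.5, so the chain forms one cluster.
db_p2 = DBSCAN(eps=1.5, min_samples=2, metric='minkowski',
               metric_params={'p': 2}).fit(X)

# Manhattan (p=1): consecutive points are 2.0 apart, beyond eps,
# so no point is core and everything is labelled noise (-1).
db_p1 = DBSCAN(eps=1.5, min_samples=2, metric='minkowski',
               metric_params={'p': 1}).fit(X)

print(db_p2.labels_)
print(db_p1.labels_)
```

The same data clusters or doesn't depending solely on the value passed in metric_params, which is exactly what the new test exercises.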

@naoyak naoyak changed the title Add DBSCAN support for additional metric params [MRG] Add DBSCAN support for additional metric params Jan 1, 2017
@naoyak naoyak force-pushed the dbscan-metric-params branch from f8a83fe to 5cd2357 Compare January 1, 2017 11:40
@jnothman (Member) left a comment

Thanks for the contribution

@@ -50,6 +50,11 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.

metric_params : dict, optional (default = None)
Member

I think "optional" is sufficient when the semantics of the default are not clear.

@@ -184,6 +190,11 @@ class DBSCAN(BaseEstimator, ClusterMixin):
.. versionadded:: 0.17
metric *precomputed* to accept precomputed sparse matrix.

metric_params : dict, optional (default = None)
Member

I think "optional" is sufficient when the semantics of the default are not clear.

# Different eps to other test, because distance is not normalised.
eps = 0.8
min_samples = 10
p = 2
Member

This isn't a very strong test, as p=2 is the default for minkowski in our nearest neighbors implementation. Trying this with p=1 would be more demonstrative. Or both.

Contributor Author

Changing to p=1.


# number of clusters, ignoring noise if present
n_clusters_1 = len(set(labels)) - int(-1 in labels)
assert_equal(n_clusters_1, n_clusters)
Member

I would've thought that the complete set of labels were identical. Why are we using as weak an assertion as this?

Contributor Author

That struck me as odd too until I just read up on DBSCAN and its implementation:

The algorithm is non-deterministic, but the core samples will always belong to the same clusters (although the labels may be different). The non-determinism comes from deciding to which cluster a non-core sample belongs. A non-core sample can have a distance lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned to whichever cluster is generated first, where the order is determined randomly. Other than the ordering of the dataset, the algorithm is deterministic, making the results relatively stable between runs on the same data.

So if you run the algorithm with the same parameters twice, it can differ in two ways:

  • The cluster labels themselves (meaning cluster 1 can be cluster 2 when run again, I think)
  • The non-core samples (which are included here in the labels returned by dbscan())

Even then, I agree that the tests for DBSCAN seem pretty weak. They should probably at least test whether the core samples end up in the same clusters, though you would have to match the cluster labels between runs.
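One way to "match the cluster labels between runs" is to canonicalize each labeling by order of first appearance before comparing, so that permuted label names compare equal. A small stdlib-only sketch (the helper name is ours, not from the PR):

```python
def canonicalize(labels):
    """Relabel clusters by order of first appearance; keep noise (-1) as-is."""
    mapping = {}
    out = []
    for lab in labels:
        if lab == -1:
            out.append(-1)
            continue
        if lab not in mapping:
            mapping[lab] = len(mapping)
        out.append(mapping[lab])
    return out

# Two runs that found the same partition under permuted label names:
run_a = [1, 1, 0, 0, -1, 2]
run_b = [0, 0, 2, 2, -1, 1]
assert canonicalize(run_a) == canonicalize(run_b)  # same clustering
```

This makes a label-based comparison robust even if cluster numbering differed between runs.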

Contributor Author

I'm not sure that I'd be the best person to undertake a refactoring of the DBSCAN tests, though. If they aren't up to par it might be best addressed in a separate issue.

Member

For a given random state the labels should be deterministic. The only difference here might be numerical precision errors, but they shouldn't be an issue, I'd think.

Member

You can start by making this one better.

Member

Note that that blurb on determinism has been changed in master: http://scikit-learn.org/dev/modules/clustering.html#dbscan

Member

And apologies: no random_state involved.

Contributor Author

Oh wow, the new explanation is much stronger (did the implementation change?).

You're right, the labels (as well as the core sample subset) are deterministic - I rewrote the test to validate this in the latest commit. Maybe at some point I can get around to refactoring some of the other tests.
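A stricter check along these lines, run twice on identical input, might be sketched as follows (assuming the merged metric_params support; the data here is arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = rng.rand(50, 2)

# Fit twice with identical parameters and identical row order.
a = DBSCAN(eps=0.3, min_samples=5, metric='minkowski',
           metric_params={'p': 1}).fit(X)
b = DBSCAN(eps=0.3, min_samples=5, metric='minkowski',
           metric_params={'p': 1}).fit(X)

# For the same data, both the labels and the core sample set
# should match exactly, not just the cluster count.
assert np.array_equal(a.labels_, b.labels_)
assert np.array_equal(a.core_sample_indices_, b.core_sample_indices_)
```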

Member

Oh wow, the new explanation is much stronger (did the implementation change?).

The implementation didn't change. @jbednar, who was frustrated at the previous description's inaccuracy, is to thank.

@naoyak (Contributor Author) commented Jan 1, 2017

Updated the new test to hopefully be a bit more rigorous.

@jnothman (Member) left a comment

LGTM

@jnothman jnothman changed the title [MRG] Add DBSCAN support for additional metric params [MRG+1] Add DBSCAN support for additional metric params Jan 1, 2017
@agramfort agramfort merged commit 543b056 into scikit-learn:master Jan 2, 2017
@agramfort (Member)

thx @naoyak

@naoyak naoyak deleted the dbscan-metric-params branch January 3, 2017 20:37
raghavrv pushed a commit to raghavrv/scikit-learn that referenced this pull request Jan 5, 2017
sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017