[MRG+1] Add DBSCAN support for additional metric params #8139
Conversation
f8a83fe to 5cd2357
Thanks for the contribution
@@ -50,6 +50,11 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
        must be square. X may be a sparse matrix, in which case only "nonzero"
        elements may be considered neighbors for DBSCAN.

    metric_params : dict, optional (default = None)
I think "optional" is sufficient when the semantics of the default are not clear.
@@ -184,6 +190,11 @@ class DBSCAN(BaseEstimator, ClusterMixin):
        .. versionadded:: 0.17
           metric *precomputed* to accept precomputed sparse matrix.

    metric_params : dict, optional (default = None)
I think "optional" is sufficient when the semantics of the default are not clear.
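To make the new parameter concrete, here is a minimal sketch of what the docstring addition enables. The toy data and the chosen `eps`/`min_samples` values are illustrative, not taken from the PR:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated 1-D groups; hypothetical toy data for illustration.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Extra keyword arguments for the metric (here Minkowski's ``p``)
# can be forwarded via ``metric_params`` instead of a custom callable.
db = DBSCAN(eps=0.5, min_samples=2, metric='minkowski',
            metric_params={'p': 1}).fit(X)
print(db.labels_)  # two clusters, no noise points
```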
    # Different eps to other test, because distance is not normalised.
    eps = 0.8
    min_samples = 10
    p = 2
This isn't a very strong test, as p=2 is the default for minkowski in our nearest neighbors implementation. Trying p=1 would be more demonstrative. Or both.
Changing to p=1.
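The point of the reviewer's comment can be illustrated with plain distance arithmetic, independent of the code under review: only a non-default `p` reveals whether `metric_params` is actually honored.

```python
# Minkowski distance for a toy pair of 2-D points.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 2))  # 5.0 -- indistinguishable from the Euclidean default
print(minkowski(a, b, 1))  # 7.0 -- differs, so p=1 exercises the new code path
```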
    # number of clusters, ignoring noise if present
    n_clusters_1 = len(set(labels)) - int(-1 in labels)
    assert_equal(n_clusters_1, n_clusters)
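For context on what this assertion checks: it only compares cluster counts, not label assignments. A self-contained sketch with a hypothetical label vector:

```python
# Hypothetical label vector: three clusters plus one noise point.
# In scikit-learn's DBSCAN, noise samples are labelled -1.
labels = [0, 0, 1, 1, -1, 2]

# Count distinct labels, subtracting one if the noise label is present.
n_clusters = len(set(labels)) - int(-1 in labels)
print(n_clusters)  # 3
```

Any relabelling of the clusters (e.g. swapping 0 and 2) passes this check unchanged, which is why the reviewer calls it a weak assertion.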
I would've thought that the complete set of labels was identical. Why are we using as weak an assertion as this?
That struck me as odd too until I just read up on DBSCAN and its implementation:
The algorithm is non-deterministic, but the core samples will always belong to the same clusters (although the labels may be different). The non-determinism comes from deciding to which cluster a non-core sample belongs. A non-core sample can have a distance lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned to whichever cluster is generated first, where the order is determined randomly. Other than the ordering of the dataset, the algorithm is deterministic, making the results relatively stable between runs on the same data.
So if you run the algorithm with the same parameters twice, it can differ in two ways:
- The cluster labels themselves (meaning cluster 1 can be cluster 2 when run again, I think)
- The non-core samples (which are included here in the labels returned by dbscan())
Even then I agree that the tests for DBSCAN seem pretty weak. It should probably at least test whether the core samples end up in the same clusterings, though you have to match the cluster labels between runs.
I'm not sure that I'd be the best person to undertake a refactoring of the DBSCAN tests, though. If they aren't up to par it might be best addressed in a separate issue.
For a given random state the labels should be deterministic. The only difference here might be numerical precision errors, but they shouldn't be an issue, I'd think.
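The determinism claim is easy to check empirically. A minimal sketch (toy data, assumed parameter values, not from the PR): fitting twice on the same input in the same order should yield identical label vectors, not merely the same cluster count.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated 1-D groups; hypothetical toy data for illustration.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

labels_a = DBSCAN(eps=0.5, min_samples=2).fit(X).labels_
labels_b = DBSCAN(eps=0.5, min_samples=2).fit(X).labels_

# Same input order -> identical labels on both runs.
print((labels_a == labels_b).all())
```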
You can start by making this one better.
Note that that blurb on determinism has been changed in master: http://scikit-learn.org/dev/modules/clustering.html#dbscan
And apologies: no random_state involved.
Oh wow, the new explanation is much stronger (did the implementation change?).
You're right, the labels (as well as the core sample subset) are deterministic - I rewrote the test to validate this in the latest commit. Maybe at some point I can get around to refactoring some of the other tests.
> Oh wow, the new explanation is much stronger (did the implementation change?).
The implementation didn't change. @jbednar, who was frustrated at the previous description's inaccuracy, is to thank.
Updated the new test to hopefully be a bit more rigorous.
LGTM
thx @naoyak
…#8139) * Add DBSCAN support for additional metric params
Fixes #4520

Adds a metric_params argument to DBSCAN (i.e. passing {'p': p} when using 'minkowski' for the metric)