Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Clarify DBSCAN eps parameter misunderstanding #13563

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 3, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 17 additions & 1 deletion doc/modules/clustering.rst
Original file line number Diff line number Diff line change
Expand Up @@ -752,6 +752,18 @@ Any core sample is part of a cluster, by definition. Any sample that is not a
core sample, and is at least ``eps`` in distance from any core sample, is
considered an outlier by the algorithm.

While the parameter ``min_samples`` primarily controls how tolerant the
algorithm is towards noise (on noisy and large data sets it may be desiable
to increase this parameter), the parameter ``eps`` is *crucial to choose
appropriately* for the data set and distance function and usually cannot be
left at the default value. It controls the local neighborhood of the points.
When chosen too small, most data will not be clustered at all (and labeled
as ``-1`` for "noise"). When chosen too large, it causes close clusters to
be merged into one cluster, and eventually the entire data set to be returned
as a single cluster. Some heuristics for choosing this parameter have been
discussed in literature, for example based on a knee in the nearest neighbor
distances plot (as discussed in the references below).

In the figure below, the color indicates cluster membership, with large circles
indicating core samples found by the algorithm. Smaller circles are non-core
samples that are still part of a cluster. Moreover, the outliers are indicated
Expand Down Expand Up @@ -793,7 +805,7 @@ by black points below.

This implementation is by default not memory efficient because it constructs
a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot
be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
A couple of mechanisms for getting around this are:

- A sparse radius neighborhood graph (where missing entries are presumed to
Expand All @@ -814,6 +826,10 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996

* "DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
In ACM Transactions on Database Systems (TODS), 42(3), 19.

.. _birch:

Birch
Expand Down
22 changes: 18 additions & 4 deletions sklearn/cluster/dbscan_.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,8 +35,11 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
``metric='precomputed'``.

eps : float, optional
The maximum distance between two samples for them to be considered
as in the same neighborhood.
The maximum distance between two samples for one to be considered
as in the neighborhood of the other. This is not a maximum bound
on the distances of points within a cluster. This is the most
important DBSCAN parameter to choose appropriately for your data set
and distance function.

min_samples : int, optional
The number of samples (or total weight) in a neighborhood for a point
Expand Down Expand Up @@ -128,6 +131,10 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
In: Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.
ACM Transactions on Database Systems (TODS), 42(3), 19.
"""
if not eps > 0.0:
raise ValueError("eps must be positive.")
Expand Down Expand Up @@ -194,8 +201,11 @@ class DBSCAN(BaseEstimator, ClusterMixin):
Parameters
----------
eps : float, optional
The maximum distance between two samples for them to be considered
as in the same neighborhood.
The maximum distance between two samples for one to be considered
as in the neighborhood of the other. This is not a maximum bound
on the distances of points within a cluster. This is the most
important DBSCAN parameter to choose appropriately for your data set
and distance function.

min_samples : int, optional
The number of samples (or total weight) in a neighborhood for a point
Expand Down Expand Up @@ -299,6 +309,10 @@ class DBSCAN(BaseEstimator, ClusterMixin):
Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
In: Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.
ACM Transactions on Database Systems (TODS), 42(3), 19.
"""

def __init__(self, eps=0.5, min_samples=5, metric='euclidean',
Expand Down