Thanks to visit codestin.com
Credit goes to github.com

Skip to content

DBSCAN too slow and consumes too much memory for large datasets: a simple tweak can fix this. #17650

Open
@jenniferjang

Description

@jenniferjang

Describe the workflow you want to enable

Currently, DBSCAN is very slow for large datasets and can use a lot of memory, especially in higher dimensions.

For example, running sklearn.cluster.DBSCAN on a size 745k dataset with 411 features took over 6 days to finish running on a powerful cloud machine, and running on a size ~1M dataset with a reasonable setting of epsilon for that dataset ran out of memory on that machine with over 750GB RAM.

In a very recent work I did with Google Research [1], we found that you can get over 200x speedup and 250x savings in memory based on a minor tweak to the DBSCAN algorithm without sacrificing clustering accuracy.

[1] https://arxiv.org/abs/2006.06743

Describe your proposed solution

The solution is based on a simple modification to the brute force version. Instead of calculating the pairwise distances for all pairs of examples to obtain the nearest-neighbor graph, you can instead only calculate it for a fraction s of the pairs (selected randomly) and clusters based on that subsampled graph. This would make the algorithm feasible for larger datasets, without hurting the clustering quality (in some cases s = 0.001 suffices). The idea is that you don’t actually need the full nearest-neighbor graph to get the desired clustering.

All comparisons in [1] were done against the algorithm=’auto’ setting of sklearn.cluster.DBSCAN.

The only new parameter needed would be 0 < s <= 1, the sampling ratio which would only be applicable for the algorithm=’brute-force’ setting. (Perhaps we might also want a random seed parameter if one wants to be able to have reproducible runs).

Describe alternatives you've considered, if relevant

Additional context

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions