[MRG + 1] Do not shuffle by default for DBSCAN. #4066
Conversation
The generator used to shuffle the samples. Defaults to numpy.random.
The generator used to shuffle the samples, which affects the cluster
numbering and cluster assignments of points that are border points to
more than one cluster. Defaults to not shuffling (None).
This is not the convention used elsewhere; rather, a separate `shuffle` parameter is used.
So should I rename it to "shuffle" then instead of random_state?
No, I mean that, for good or bad, `random_state=None` means use an arbitrary random number generator, while an additional parameter controls whether randomness is used at all! See for instance the `cross_validation` module or `SGD*`.
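A hedged illustration of the convention being referenced, as it appears in `SGDClassifier` (the parameter values here are arbitrary, not from this PR):

```python
from sklearn.linear_model import SGDClassifier

# `shuffle` controls whether randomness is used at all;
# `random_state` only makes the shuffling reproducible
clf = SGDClassifier(shuffle=True, random_state=0)
clf_deterministic = SGDClassifier(shuffle=False)
```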
I'm not sure whether there are concerns about backwards compatibility regarding making shuffle False by default. As you say, it's mostly deterministic (and I had wondered whether it would make sense in the batch-computed approach to work from densest to sparsest core samples). Regarding the complexity issue: do you find the new implementation prohibitively costly for datasets that were fine under the previous implementation? This sort of trade-off seems to me quite common in interpreted numerical processing (where speed is obtained through vectorized, native-code bulk operations), so I wasn't concerned in making that change for the sake of a substantial speed-up (which can be further improved upon, mind you, but only if done in bulk). However, if you have a real concern, we might be able to find a compromise solution that works in batches, but the second-order lookup means the code will be messy. Or we might decide that the previous implementation, albeit somewhat slow, was fine.
It may well be acceptable. I have not benchmarked. How much speedup does vectorization give in `neighbors_model.radius_neighbors`, which is probably the only really costly part? I'd suggest to drop the `random_state` parameter completely, then. People may think that `random_state` has a similar impact as with k-means, but it doesn't matter much. If someone really wants to experiment with shuffled data, they can just shuffle the data prior to running DBSCAN.
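Shuffling prior to clustering would look something like this (a minimal sketch; `X`, `eps` and `min_samples` stand in for the user's data and parameters):

```python
from sklearn.cluster import DBSCAN
from sklearn.utils import shuffle

X_shuffled = shuffle(X, random_state=42)  # shuffle outside the estimator
labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(X_shuffled)
```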
I think that's an interesting proposal, but we would need some kind of deprecation strategy. @robertlayton wdyt about removing randomisation from DBSCAN on the basis that it is deterministic except in rare edge cases?
Not a great lot, it seems, as we move asymptotic. Maybe I should reevaluate those changes in implementation. In the meantime, perhaps your note is apt.
I agree that the algorithm is "mostly" deterministic. However, the trend is to perform shuffling within the classifier rather than outside of it. For that reason, I would recommend leaving the random_state parameter intact, and providing an option
Is random_state=False a convention used elsewhere in the package?
I don't think so, and I don't really like it. Maybe people don't
I think shuffling inside estimators for stochastic algorithms is basically mandatory, as in SGDClassifier.
Let alone False, None and 0.
Note that since the changes I made the other week change what is being
What is the preferred way of warning of the removed parameter in scipy? I do not think we should add another option that does not help the user get better results. It at most changes a few border points; this will not increase the overall performance. Having the option will only make users assume this is another knob to tune. For compatibility, it makes sense to keep the parameter and either silently ignore it, or warn if it is set. Indeed, the changes by @jnothman already changed the shuffling compared to previous versions.
I would warn if shuffle is True or random_state is not None.
@GaelVaroquaux there is currently no
Ok then let's not introduce one, and if anyone sets random_state we raise a deprecation warning and don't shuffle?
Yes. That sounds good to me.
Ping @jnothman and @robertlayton. Do you have an idea?
Not a strong idea, but reasoning roughly: the algorithm calculates core samples depending only on neighborhood density, and assigns distinct labels to connected components of the distance < eps graph among core samples*. It is the non-core samples (which lie in areas of low density relative to the model parameters) that may be within eps of multiple core samples, which need to be > eps from each other in order for there to be label ambiguity. But presumably these points are relatively rare, in that they lie between two areas of sufficiently high density, but are not in one themselves. @kno10's reference to "except for rare border cases" implies this has been more robustly analysed somewhere, and I would be glad for a reference before making any rash decisions.

(*) This makes me now think the implementation can easily be made still faster - i.e. dropping any Python loops - with
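For readers unfamiliar with the connected-components formulation sketched above, here is a minimal illustration. This is not the PR's code; the function name `dbscan_cc` and all details are a hypothetical sketch built on `radius_neighbors_graph` and `scipy.sparse.csgraph.connected_components` (with `include_self=True` so that a point counts toward its own neighborhood, per #4073):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import radius_neighbors_graph

def dbscan_cc(X, eps=0.5, min_samples=5):
    # 0/1 adjacency: an edge joins any two points within eps of each other
    G = radius_neighbors_graph(X, radius=eps, include_self=True)
    core = np.asarray(G.sum(axis=1)).ravel() >= min_samples

    labels = np.full(X.shape[0], -1, dtype=int)  # -1 marks noise
    # clusters: connected components of the eps-graph restricted to cores
    _, core_labels = connected_components(G[core][:, core], directed=False)
    labels[np.flatnonzero(core)] = core_labels

    # border points adopt the label of an arbitrary core neighbour;
    # this is exactly the order-dependent step discussed in this thread
    for i in np.flatnonzero(~core):
        neighbours = G.indices[G.indptr[i]:G.indptr[i + 1]]
        core_neighbours = neighbours[core[neighbours]]
        if core_neighbours.size:
            labels[i] = labels[core_neighbours[0]]
    return core, labels
```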
I have such an implementation at https://github.com/jnothman/scikit-learn/tree/dbscan_vec2 which happens to assign peripheral points to the cluster of the nearest core sample rather than the first in a shuffled order.
Does it lead to computational speed-ups?
+1 |
The original DBSCAN publication specifies "it might happen that some point p belongs to both, C1 and C2. [...] In this case, point p will be assigned to the cluster discovered first. Except from these rare situations, the result of DBSCAN is independent of the order in which the points of the database are visited [...]"
With #4009 merged, the calculation of radius neighbors becomes parallelisable, which means that this can be sped up close to n_cores times. That's certainly something I'll want in my use of DBSCAN, and it is not possible when querying one point at a time (although conceivably we could parallelise over points in the visited sample's neighborhood, to much less gain per overhead). IMO, using connected components means that the code is much easier to read than looking at nested loops and trying to understand their invariants.
But I guess one can get the n_cores speed-up by calculating the complete pairwise distance matrix; the memory usage is much more concerning then.
But I aim to give you benchmarks of the improvement without parallelism on a real dataset I'm using.
Some benchmarks. I should note in advance that a major reason for rewriting the dbscan code is that iterating over rows of a sparse matrix is a lot slower than over a dense matrix. You will see this effect below. I'm comparing a version of the scikit-learn 0.15 implementation (old) against the new implementation (new).

My input is an array of (7737, 100) minhashes that I am comparing with hamming distance. They are weighted to avoid excess work for duplicate hashes. Note that this setup can't test the effect on sparse matrices. This is obviously not a very large dataset, but it is a realistic start to get some idea of what would be the best way to implement this.

Experiment 1: what I actually want to do

    dbscan_.dbscan(sketch_array, eps=.3, min_samples=20, sample_weight=np.array(weights), metric='hamming')

old: 55.2 s, new: 30.5 s

Experiment 2: the same, but with a precomputed distance matrix (which takes 13 s to compute)

    dbscan_.dbscan(dist, eps=.3, min_samples=20, sample_weight=np.array(weights), metric='precomputed')

old: 762 ms, new: 4.14 s ... clearly there's something a bit odd happening here that should be checked out. But this does show that the 25 s gain above comes from not querying one sample at a time.

Experiment 3: use Euclidean distance, even though it's nonsense over this dataset, because it has a fast implementation and works for sparse input

    dbscan_.dbscan(sketch_array, eps=1e6, min_samples=20, sample_weight=np.array(weights), metric='euclidean')

old: 10.1 s, new: 8.5 s

Experiment 4: the same with sparse input

    dbscan_.dbscan(sparse.csr_matrix(sketch_array), eps=1e6, min_samples=20, sample_weight=np.array(weights), metric='euclidean')

old: 3 min 18 s, new: 24.2 s

In summary, the main benefit of the new approach(es) is not extracting and querying individual rows from the input, as well as having a much more succinct implementation. The main disadvantages are extra memory usage and less direct comparability between the algorithm in the paper and the code. Clearly, the row extraction is very costly for sparse input. An alternative would be to special-case sparse input and compute the distance matrix first, or to suggest the use of 'precomputed' where memory allows it. I'm happy to revert much of #3994 and find another way to handle these slow cases if that is deemed appropriate, and better for uses where memory is an issue.
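For reference, the `dist` used in Experiment 2 would presumably be obtained along these lines (a sketch; `sketch_array` is the (7737, 100) minhash array described above):

```python
from sklearn.metrics import pairwise_distances

# the ~13 s precomputation step referred to in Experiment 2
dist = pairwise_distances(sketch_array, metric='hamming')
```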
@jnothman The current version in this branch still computes all neighborhoods in one pass via:

Since finding the neighbors is 99% of the cost in my experience, I do see potential for speedup there; and the code remains easy to map back to what is published as DBSCAN for a new reader. The iteration

The patch proposed in this branch contains:
Discussing #3994 now:
I realise you've made no substantive changes to the implementation. But you've highlighted a critique of #3994, which is why we're discussing it here. I've not benchmarked the #3994 code presently. I don't think the efficiency of the concatenation is an issue, but it can be benchmarked if we get there. The only question is whether the memory trade-offs that your note highlights are worthwhile. I now suspect they are not, but that we might make it easier for a user to request that the matrix be precomputed (either only those neighbors within eps, or all pairs, which seems a much faster operation) rather than iterate through the dataset itself.
On my test data sets (10k and 50k coordinates from Twitter, but "misusing" Euclidean distance), current head was not slower with a precomputed distance matrix, and 5x faster than 0.15.2. However, I was able to shave off another 20-40% with a different vectorization approach, which is in my patch-2 branch. Any ideas to further improve this version, before I do a pull request?
(Travis CI build failure is due to also including a fix from #4073: min_pts does include the query point in DBSCAN - it is in the database, and thus returned by a range query).
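To illustrate the #4073 point in isolation (a hedged sketch, not code from this PR): a range query around a point that is itself in the database returns that point, so it counts toward `min_samples`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [5.0]])
nn = NearestNeighbors(radius=0.5).fit(X)
# query around the database point X[0]
neighbours = nn.radius_neighbors(X[:1], return_distance=False)[0]
# neighbours contains index 0 (the query point itself) as well as index 1,
# so X[0] contributes to its own neighborhood count
```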
@@ -89,15 +95,15 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    """
    if not eps > 0.0:
        raise ValueError("eps must be positive.")
    if random_state is not None:
This should be a deprecation warning and should say that it will be removed in 0.18, I think.
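Something along these lines, presumably (the exact message wording here is assumed, not taken from the patch):

```python
import warnings

if random_state is not None:
    warnings.warn("The parameter random_state is deprecated and will be "
                  "removed in 0.18, as DBSCAN no longer shuffles the input.",
                  DeprecationWarning)
```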
This comment needs to be addressed before merging.
Indeed this has not been addressed yet.
I think I'd like to propose reverting to the previous implementation (or some cleaned up variant thereof with
Basically, I think you're right that departing from the linear memory requirements for no great speed gains is a Bad Thing, given that passing a precomputed distance matrix is an option where memory permits.
Have we found it's a lot slower given precomputed input? If not, then the changes provide no real efficiency advantage.
Okay, given movements and discussions elsewhere, I think we shouldn't revert anything. Yes, we should probably add a note about higher space complexity than the traditional algorithm. And perhaps @larsmans as a DBSCAN user has some input on turning shuffling off by default.
I haven't seen it matter on real-world data yet, and I doubt it will. I have noticed that the batch distance computations can be problematic though, with machines locking up and all the assorted nastiness if the parameters are not set properly.
You mean if the radius is unreasonably big for the data? Maybe we should have an option in
Are we better off doing something that doesn't require batch computation, but allows the user to pass in a precomputed
I'm not sure I'm up to date on the DBSCAN reimplementation discussion. Is this PR still relevant? Or do we want to refactor anyhow?
This PR is still relevant, I think.
This PR still applies to current head. It may be best to merge the "remove shuffling, add warning" patch early if we want to eventually remove it altogether, even if a redesign will eventually happen. But I lost track of what was the latest/fastest version of DBSCAN without reengineering everything... my fastest pure-python version was e48ade5
Ok, so if we just merge the "remove shuffle, add warning" patch, could you please rebase? And it looks like Travis was not happy.
random_state : numpy.RandomState, optional
    The generator used to shuffle the samples. Defaults to numpy.random.
random_state: numpy.RandomState, optional
    Not supported (DBSCAN does not use random initialization).
Should probably just say "ignored"
#4151 was merged, #4157 is awaiting review.
This makes little difference, and original DBSCAN did not shuffle. Warn if `random_state` is used. As is, `random_state` encourages users to experiment with different randomization, as you would do with k-means. But in contrast to k-means, the output of DBSCAN is deterministic except for cluster enumeration and "rare" cases where a point is on the border of two clusters at the same time. As this affects single points only, the measurable performance difference will be close to zero. Also, incorporate fix for minpts including the query point.
I have rebased the patch. It already incorporated a fix for #4073 (DBSCAN includes the query point when counting neighbors), but not the updated unit test, which I cherry-picked from #4073. Travis CI is failing with "Unable to connect to www.rabbitmq.com:http:", which is down for me, too (but not my fault).
LGTM
@@ -89,15 +96,16 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    """
    if not eps > 0.0:
        raise ValueError("eps must be positive.")
    if random_state is not None:
        warnings.warn("The parameter random_state is ignored " +
style: there is no need for the `+` sign here.
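That is, relying on implicit concatenation of adjacent string literals; a sketch of the fixed call (the message tail is assumed for illustration):

```python
import warnings

# adjacent string literals are concatenated at compile time; no `+` needed
warnings.warn("The parameter random_state is ignored "
              "(DBSCAN does not use random initialization).")
```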
Alright, this looks good to me as well. I will fix the style / deprecation warning issues when merging. Let's move the discussions on the algorithm considerations (space complexity, speed with sparse data, option to precompute distances, Cython version) on to dedicated PRs and/or issues.
I rebased, fixed deprecation messages, added a what's new entry and pushed to master. Thanks everyone.
Hi, I just bumped into this coming from here: #5275. A real-world example of where the higher memory complexity seems to matter is GPS traces.
Shuffling is not necessary; the effect on the result is usually nonexistent (except for permuted cluster numbering). DBSCAN is mostly deterministic except for rare border cases.
Add a note about the increased memory complexity of this implementation compared to original DBSCAN.