
[MRG + 1] Do not shuffle by default for DBSCAN. #4066


Closed
wants to merge 2 commits into from

Conversation

kno10
Contributor

@kno10 kno10 commented Jan 8, 2015

Shuffling is not necessary; the effect on the result is usually nonexistent (apart from permuted cluster numbering), since DBSCAN is mostly deterministic except for rare border cases.

Add a note about the increased memory complexity of this implementation compared to the original DBSCAN.

The generator used to shuffle the samples. Defaults to numpy.random.
The generator used to shuffle the samples, which affects the cluster
numbering and cluster assignments of points that are border points to
more than one cluster. Defaults to not shuffling (None).
Member

This is not the convention used elsewhere; rather, a separate shuffle parameter is used.

Contributor Author

So should I rename it to "shuffle" then instead of random_state?

Member

No; I mean that, for better or worse, random_state=None means "use an arbitrary random number generator", while a separate parameter controls whether randomness is used at all! See for instance the cross_validation module or SGD*.
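
For illustration, that convention looks like this in SGDClassifier (a sketch, not code from this PR):

from sklearn.linear_model import SGDClassifier

# `shuffle` decides *whether* the training data is shuffled each epoch;
# `random_state` only controls *which* random numbers are drawn when it is.
clf_reproducible = SGDClassifier(shuffle=True, random_state=42)
clf_no_shuffle = SGDClassifier(shuffle=False)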

@jnothman
Member

jnothman commented Jan 8, 2015

I'm not sure whether there are concerns about backwards compatibility regarding making shuffle False by default. As you say, it's mostly deterministic (and I had wondered whether it would make sense in the batch-computed approach to work from densest to sparsest core samples).

Regarding the complexity issue: do you find the new implementation prohibitively costly for datasets that were fine under the previous implementation? This sort of trade-off seems quite common to me in interpreted numerical processing (where speed is obtained through vectorized, native-code bulk operations), so I wasn't concerned about making that change for the sake of a substantial speed-up (which can be improved upon further, mind you, but only if done in bulk).

However if you have a real concern, we might be able to find a compromise solution that works in batches, but the second-order lookup means the code will be messy. Or we might decide that the previous implementation, albeit somewhat slow, was fine.

@kno10
Contributor Author

kno10 commented Jan 8, 2015

It may well be acceptable. I have not benchmarked. How much speedup does vectorization give in neighbors_model.radius_neighbors, which is probably the only really costly part?

I'd suggest dropping the random_state parameter completely, then. People may think that random_state has an impact similar to what it has in k-means, but here it matters very little. If someone really wants to experiment with shuffled data, they can just shuffle the data prior to running DBSCAN.
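
For example (a minimal sketch; X stands for an assumed (n_samples, n_features) array, not data from this PR):

from sklearn.cluster import DBSCAN
from sklearn.utils import shuffle

# Shuffle the data yourself if you want to check order (in)sensitivity.
X_shuffled = shuffle(X, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_shuffled)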

@jnothman
Member

jnothman commented Jan 8, 2015

I think that's an interesting proposal, but we would need some kind of deprecation strategy. @robertlayton wdyt about removing randomisation from DBSCAN on the basis that it is deterministic except in rare edge cases?

How much speedup does vectorization give in neighbors_model.radius_neighbors, which is probably the only really costly part?

Not a great deal asymptotically, it seems. Maybe I should re-evaluate those implementation changes. In the meantime, perhaps your note is apt.

@robertlayton
Member

I agree that the algorithm is "mostly" deterministic. However, the trend is to perform shuffling within the classifier rather than outside of it. For that reason, I would recommend leaving the random_state parameter intact, and providing an option random_state=False, which doesn't shuffle. I don't mind if it is False by default; just emit a warning if called without setting random_state, for one version or so?

@jnothman
Member

jnothman commented Jan 8, 2015

Is random_state=False a convention used elsewhere in the package?


@GaelVaroquaux
Member

Is random_state=False a convention used elsewhere in the package?

I don't think so, and I don't really like it. Maybe people don't understand the difference between False and None well.

@amueller
Member

amueller commented Jan 8, 2015

I think shuffling inside estimators for stochastic algorithms is basically mandatory, as in SGDClassifier.
Here it seems not so important.
I am against random_state=False. The two options are

  1. Deprecate shuffling: that is, don't shuffle, warn if random_state is not None, and later remove random_state.
  2. Make shuffling optional: that is, introduce a boolean shuffle=False (or =True?).

@jnothman
Member

jnothman commented Jan 9, 2015

Let alone False, None and 0.


@jnothman
Member

jnothman commented Jan 9, 2015

Note that since the changes I made the other week change what is being shuffled (core samples only), there are no greater backwards compatibility issues in making shuffle off by default.


@kno10
Contributor Author

kno10 commented Jan 9, 2015

What is the preferred way of warning about a removed parameter in scikit-learn?
The latest version of the patch silently ignores the random_state parameter.

I do not think we should add another option that does not help the user get better results. It at most changes a few border points, which will not improve overall performance. Having the option will only make users assume this is another knob to tune. For compatibility, it makes sense to keep the parameter and either silently ignore it or warn if it is set.

Indeed, the changes by @jnothman already changed the shuffling compared to previous versions.

@GaelVaroquaux
Member

What is the preferred way of warning about a removed parameter in scikit-learn?
The latest version of the patch silently ignores the random_state parameter.

I would warn if shuffle is True or random_state is not None.

@amueller
Member

amueller commented Jan 9, 2015

@GaelVaroquaux there is currently no shuffle parameter ;)

OK, then let's not introduce one, and if anyone sets random_state we raise a deprecation warning and don't shuffle?
I have to say I can't judge the impact of shuffling in this algorithm, so someone who is more familiar should confirm that this doesn't usually change results, even if the data is ordered in some way.

@GaelVaroquaux
Member

OK, then let's not introduce one, and if anyone sets random_state we raise a deprecation warning and don't shuffle?

Yes. That sounds good to me.

I have to say I can't judge the impact of shuffling in this algorithm, so someone who is more familiar should confirm that this doesn't usually change results, even if the data is ordered in some way.

Ping @jnothman and @robertlayton. Do you have an idea?

@jnothman
Member

I have to say I can't judge the impact of shuffling in this algorithm, so someone who is more familiar should confirm that this doesn't usually change results, even if the data is ordered in some way.

Ping @jnothman and @robertlayton. Do you have an idea?

Not a strong idea, but reasoning roughly: the algorithm identifies core samples based only on neighborhood density, and assigns distinct labels to connected components of the distance < eps graph among core samples*. It is the non-core samples (i.e., those in areas of low density relative to the model parameters) that may be within eps of multiple core samples, which need to be > eps from each other in order for there to be label ambiguity. But presumably these points are relatively rare, in that they lie between two areas of sufficiently high density but are not in one themselves. @kno10's reference to "except for rare border cases" implies this has been analysed more robustly somewhere, and I would be glad for a reference before making any rash decisions.

(*) This now makes me think the implementation can easily be made still faster - i.e. dropping any Python loops - with scipy.sparse.csgraph.connected_components.
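
Roughly along these lines (a sketch of the idea only, using today's API rather than the PR's code; X, eps and min_samples are assumed inputs):

import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import radius_neighbors_graph

# Build the eps-radius graph, mark core samples by neighborhood size
# (counting the point itself), and label core samples via connected components.
A = radius_neighbors_graph(X, radius=eps, include_self=True)
is_core = np.asarray(A.sum(axis=1)).ravel() >= min_samples
core_graph = A[is_core][:, is_core]
n_clusters, core_labels = connected_components(core_graph, directed=False)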

@jnothman
Member

(*) This now makes me think the implementation can easily be made still faster - i.e. dropping any Python loops - with scipy.sparse.csgraph.connected_components.

I have such an implementation at https://github.com/jnothman/scikit-learn/tree/dbscan_vec2 which happens to assign peripheral points to the cluster of the nearest core sample rather than the first in a shuffled order.

@GaelVaroquaux
Member

I have such an implementation at https://github.com/jnothman/scikit-learn/tree/dbscan_vec2

Does it lead to computational speed ups?

@kno10
Contributor Author

kno10 commented Jan 10, 2015

  1. Border points: IIRC this was discussed in the literature. With minPts=2 they cannot occur at all, and they get more frequent with increasing minPts. Collisions aren't necessarily more frequent at higher minPts, because there will be more noise. The non-deterministic case can sometimes happen when a cluster is splitting into two but is still "almost density-connected".
  2. A connected-components based approach is possible (if done on the core points only, it does yield the same result - the authors did not choose the name "density connected" by chance).
    However, I'm concerned that the implementation now moves away from what was published as DBSCAN. There are many, many variations of this algorithm. But if it ends up being a different algorithm, then it should probably use a different name and attribution...
  3. Performance: usually 95% of the time is spent finding the neighbors. So all possible improvements from delegating the computations to C/Cython/Fortran instead of interpreted Python (technically, it's not really "vectorization" anymore) are probably limited to the remaining 5%.
    Personally, I would prefer a more literal implementation of the original algorithm, unless the performance savings are very much measurable.

@GaelVaroquaux
Member

 Personally, I would prefer a more literal implementation of the original algorithm, unless the performance savings are very much measurable.

+1

@kno10
Contributor Author

kno10 commented Jan 10, 2015

The original DBSCAN publication specifies "it might happen that some point p belongs to both, C1 and C2. [...] In this case, point p will be assigned to the cluster discovered first. Except from these rare situations, the result of DBSCAN is independent of the order in which the points of the database are visited [...]"
So if you want an "exact" DBSCAN implementation, objects should be processed in the order of the database and should not be shuffled randomly.
Given that the points where the result is non-deterministic are rare, they will not have a measurable impact on the evaluation performance of the algorithm.

@jnothman
Member

Personally, I would prefer a more literal implementation of the original algorithm, unless the performance savings are very much measurable.

With #4009 merged, the calculation of radius neighbors becomes parallelisable, which means this can be sped up by close to a factor of n_cores. That's certainly something I'll want in my own use of DBSCAN, and it is not possible when querying one point at a time (although conceivably we could parallelise over points in the visited sample's neighborhood, to much less gain per overhead).

IMO, using connected components means the code is much easier to read than looking at nested loops and trying to understand their invariants.

@jnothman
Member

I guess one can also get the n_cores speed-up by calculating the complete pairwise distance matrix, but then the memory usage is much more concerning.

@jnothman
Member

But I aim to give you benchmarks of the improvement without parallelism on a real dataset I'm using.

@jnothman
Member

Some benchmarks. I should note in advance that a major reason for rewriting the dbscan code is that iterating over rows of a sparse matrix is a lot slower than over a dense matrix. You will see this effect below.

I'm comparing a version of the scikit-learn 0.15 dbscan with sample weight and sparse support with that at https://github.com/jnothman/scikit-learn/tree/dbscan_vec2. Neither shuffles and both should give the same results.

My input is an array of (7737, 100) minhashes that I am comparing with hamming distance. They are weighted to avoid excess work for duplicate hashes. Being dense, this on its own can't test the effect on sparse matrices. This is obviously not a very large dataset, but it is a realistic start to get some idea of the best way to implement this.

Experiment 1: what I actually want to do

dbscan_.dbscan(sketch_array, eps=.3, min_samples=20, sample_weight=np.array(weights), metric='hamming')

old: 55.2 s, new: 30.5s

Experiment 2: the same, but with precomputed distance matrix (which takes 13s to compute)

dbscan_.dbscan(dist, eps=.3, min_samples=20, sample_weight=np.array(weights), metric='precomputed')

old: 762 ms, new: 4.14 s ... clearly there's something a bit odd happening here that should be checked out. But this does show that the 25s gain above comes from not querying each sample at a time.

Experiment 3: use Euclidean distance, even though it's nonsense over this dataset, because it has a fast implementation and works for sparse

dbscan_.dbscan(sketch_array, eps=1e6, min_samples=20, sample_weight=np.array(weights), metric='euclidean')

old: 10.1s, new: 8.5s

Experiment 4: same with sparse input

dbscan_.dbscan(sparse.csr_matrix(sketch_array), eps=1e6, min_samples=20, sample_weight=np.array(weights), metric='euclidean')

old: 3 min 18 s, new: 24.2 s

In summary, the main benefits of the new approach(es) are not having to extract and query individual rows from the input, as well as a much more succinct implementation. The main disadvantages are extra memory usage and less direct comparability between the algorithm in the paper and the code.

Clearly, the row extraction is very costly for sparse input. An alternative would be to special-case sparse input and compute the distance matrix first, or to suggest the use of 'precomputed' where memory allows it. I'm happy to revert much of #3994 and find another way to handle these slow cases if that is deemed appropriate, and better for uses where memory is an issue.
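
For reference, the 'precomputed' route looks roughly like this (a sketch; sketch_array stands for the minhash data above, and it assumes the full pairwise matrix fits in memory):

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Compute the full distance matrix once, then cluster on it directly.
D = pairwise_distances(sketch_array, metric='hamming')
labels = DBSCAN(eps=0.3, min_samples=20, metric='precomputed').fit_predict(D)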

@kno10
Contributor Author

kno10 commented Jan 11, 2015

@jnothman The current version in this branch still computes all neighborhoods in one pass via:
neighborhoods = neighbors_model.radius_neighbors(X, eps)
Thus, it should benefit from any parallelism introduced in NearestNeighbors, without using connected_components.

Since finding the neighbors is 99% of the cost in my experience, I do see potential for speedup there, and the code still remains easy for a new reader to map back to what is published as DBSCAN. The iteration for index in core_samples: skips non-core points, but this is also still easy to understand coming from DBSCAN. I also didn't change your bulk operations in the neighborhood loop, although I'm not convinced they will give a speedup.

The patch proposed in this branch contains:

  • replace random_state functionality with warnings.warn().
  • update the documentation of random_state to reflect this change.
  • update the "Notes" __doc__ section to document the memory difference relative to published DBSCAN.
    Other than that, it is what you already committed to sklearn ([MRG+1] DBSCAN: faster, weighted samples, and sparse input #3994).

Discussing #3994 now:
Your benchmarks IMHO do not support the "vectorization" changes inside the for index loop. Given that the old version takes 55.2 s without a distance matrix and 0.762 s (+13 s) with one, we should A) suggest using a distance matrix when memory is sufficient (and maybe fall back to an O(n)-memory approach otherwise), and B) assume that finding the neighbors takes 54.5 s, of which your patch was only able to shave off 25 s. :-( It is to be expected that NearestNeighbors.radius_neighbors is 2x slower than the distance matrix, but the savings inside the while(len(candidates) > 0) loop may be negative in the worst case. Your code assumes that
np.concatenate(np.take(neighborhoods, candidates, axis=0).tolist()) is highly efficient. What if it isn't?
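
If it helps, that concatenation can be micro-benchmarked in isolation along these lines (a sketch; the sizes are invented, not taken from the benchmark data above):

import numpy as np
from timeit import timeit

# Build a ragged object array of neighborhoods and time the bulk lookup.
rng = np.random.RandomState(0)
neighborhoods = np.empty(10000, dtype=object)
for i in range(10000):
    neighborhoods[i] = rng.randint(0, 10000, size=rng.randint(5, 50))
candidates = rng.randint(0, 10000, size=200)

t = timeit(lambda: np.concatenate(
    np.take(neighborhoods, candidates, axis=0).tolist()), number=1000)
print("%.1f microseconds per call" % (t * 1000))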

@jnothman
Member

I realise you've made no substantive changes to the implementation. But you've highlighted a critique of #3994, which is why we're discussing it here. I've not benchmarked the #3994 code at present. I don't think the efficiency of the concatenation is an issue, but it can be benchmarked if we get there.

The only question is whether the memory trade-offs that your note highlights are worthwhile. I now suspect they are not, but that we might make it easier for a user to request that the matrix be precomputed (either only those neighbors within eps, or all pairs, which seems a much faster operation) rather than iterate through the dataset itself.

@kno10
Contributor Author

kno10 commented Jan 11, 2015

On my test data sets (10k and 50k coordinates from Twitter, albeit "misusing" Euclidean distance), the current head was not slower with a precomputed distance matrix, and 5x faster than 0.15.2. However, I was able to shave off another 20-40% with a different vectorization approach, which is in my patch-2 branch.
On the 50k data set, run times were 11 s (0.15.2), 2.1 s (HEAD), 2.0 s (patch-1) and 1.2 s (patch-2); the distance matrix fails with a MemoryError.
Performance benchmarks vary a lot with the parameters. In particular, iterating only over core points pays off best when there are only a few core points.

Any ideas to further improve this version, before I do a pull request?

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 635ab93 on kno10:patch-1 into cdae4a4 on scikit-learn:master.

@kno10
Copy link
Contributor Author

kno10 commented Jan 14, 2015

(The Travis CI build failure is due to also including a fix from #4073: min_pts does include the query point in DBSCAN - it is in the database, and is thus returned by a range query.)

@amueller amueller added this to the 0.16 milestone Jan 16, 2015
@@ -89,15 +95,15 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
"""
if not eps > 0.0:
raise ValueError("eps must be positive.")
if random_state is not None:
Member

This should be a deprecation warning and should say that it will be removed in 0.18, I think.
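
i.e. something along these lines (a sketch of the suggestion, not the merged code; the exact wording is illustrative):

import warnings

if random_state is not None:
    warnings.warn("random_state is deprecated, has no effect and will be "
                  "removed in 0.18.", DeprecationWarning)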

Member

This comment needs to be addressed before merging.

Member

Indeed this has not been addressed yet.

@jnothman
Member

@jnothman I'm not sure what you are proposing.

I think I'd like to propose reverting to the previous implementation (or some cleaned-up variant thereof with sample_weight support), accepting sparse matrices as input but recommending that their distances be passed in as precomputed if memory constraints permit.

Basically, I think you're right that departing from the linear memory requirements for no great speed gains is a Bad Thing, given that passing a precomputed distance matrix is an option where memory permits.

@jnothman
Member

revert #3994 to only use O(n) memory (at the cost of being a lot slower)

Have we found it's a lot slower given precomputed input? If not, then the changes provide no real efficiency advantage.

@jnothman
Member

jnothman commented Feb 7, 2015

Okay, given movements and discussions elsewhere, I think we shouldn't revert anything. Yes, we should probably add a note about higher space complexity than the traditional algorithm. And perhaps @larsmans as a DBSCAN user has some input on turning shuffling off by default.

@larsmans
Member

larsmans commented Feb 7, 2015

I haven't seen it matter on real-world data yet, and I doubt it will. I have noticed that the batch distance computations can be problematic though, with machines locking up and all the assorted nastiness if the parameters are not set properly.

@jnothman
Member

jnothman commented Feb 8, 2015

if the parameters are not set properly.

You mean if the radius is unreasonably big for the data? Maybe we should have an option in BinaryTree to raise an error if there are too many neighbors for any particular query... That won't help for brute force, though.

Are we better off doing something that doesn't require batch computation, but allows the user to pass in a precomputed radius_neighbors_graph?
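
For instance (a sketch using today's API rather than what existed at the time; X and the radius/eps values are assumed):

from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

# A sparse radius-neighbors graph avoids the dense n*n matrix and can be fed
# to DBSCAN as a precomputed sparse distance matrix; eps must not exceed the
# radius used to build the graph.
G = radius_neighbors_graph(X, radius=0.3, mode='distance')
labels = DBSCAN(eps=0.3, min_samples=20, metric='precomputed').fit_predict(G)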

@amueller
Member

amueller commented Mar 3, 2015

I'm not sure I'm up to date on the DBSCAN reimplementation discussion. Is this PR still relevant? Or do we want to refactor anyhow?

@jnothman
Member

jnothman commented Mar 3, 2015

This PR is still relevant, I think.


@kno10
Contributor Author

kno10 commented Mar 3, 2015

This PR still applies to current head.

It may be best to merge the "remove shuffling, add warning" patch early if we want to eventually remove it altogether, even if a redesign will eventually happen.

But I lost track of which was the latest/fastest version of DBSCAN that doesn't re-engineer everything... my fastest pure-Python version was e48ade5.
I remember a Cython rewrite being discussed; that is when I stopped following.

@amueller
Member

amueller commented Mar 3, 2015

OK, so if we should just merge the "remove shuffle, add warning" change, could you please rebase? It also looks like Travis was not happy.

random_state : numpy.RandomState, optional
The generator used to shuffle the samples. Defaults to numpy.random.
random_state: numpy.RandomState, optional
Not supported (DBSCAN does not use random initialization).
Member

Should probably just say "ignored"

@jnothman
Member

jnothman commented Mar 3, 2015

#4151 was merged, #4157 is awaiting review.


This makes little difference, and original DBSCAN did not shuffle.
Warn if `random_state` is used.

As it stands, `random_state` encourages users to experiment with
different randomization, as one would with k-means. But in contrast
to k-means, the output of DBSCAN is deterministic except for cluster
enumeration and "rare" cases where a point is on the border of two
clusters at the same time. As this affects single points only, the
measurable performance difference will be close to zero.

Also, incorporate fix for minpts including the query point.
@kno10
Contributor Author

kno10 commented Mar 4, 2015

I have rebased the patch. It already incorporated a fix for #4073 (DBSCAN includes the query point when counting neighbors) but not the updated unit test, which I cherry-picked from #4073.

Travis CI is failing with "Unable to connect to www.rabbitmq.com:http:", which is down for me, too (but not my fault).

@amueller amueller changed the title Do not shuffle by default for DBSCAN. [MRG + 1] Do not shuffle by default for DBSCAN. Mar 4, 2015
@amueller
Member

amueller commented Mar 4, 2015

LGTM

@@ -89,15 +96,16 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
"""
if not eps > 0.0:
raise ValueError("eps must be positive.")
if random_state is not None:
warnings.warn("The parameter random_state is ignored " +
Member

style: there is no need for the + sign here.
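
i.e. adjacent string literals are concatenated automatically, so the call can simply be written as (message wording illustrative, not the merged text):

warnings.warn("The parameter random_state is ignored "
              "and will be removed in a future release.")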

@ogrisel
Member

ogrisel commented Mar 5, 2015

Alright, this looks good to me as well. I will fix the style / deprecation-warning issues when merging. Let's move the discussions of the algorithmic considerations (space complexity, speed with sparse data, option to precompute distances, Cython version) to dedicated PRs and/or issues.

@amueller
Member

amueller commented Mar 5, 2015

Thanks @ogrisel and @kno10 :)

@ogrisel
Member

ogrisel commented Mar 5, 2015

I rebased, fixed the deprecation messages, added a what's new entry and pushed to master. Thanks everyone.

@ogrisel ogrisel closed this Mar 5, 2015
@kno10 kno10 deleted the patch-1 branch March 5, 2015 22:45
@cstich

cstich commented Oct 12, 2015

Hi, I just bumped into this coming from #5275; a real-world example where the higher memory complexity seems to matter is GPS traces.
