[MRG+1] LSHForest: sparse support and vectorised _find_longest_prefix_match #3991


Merged
merged 8 commits into scikit-learn:master from jnothman:lshforest_improvements on Dec 29, 2014

Conversation

jnothman
Member

This adds sparse matrix support to LSHForest, vectorises calls to hasher.transform, and vectorises _find_longest_prefix_match over queries. These seem to speed things up a little, but the benchmark script does not provide very stable timings for me.

Some other query operations cannot be easily vectorised, such as gathering the set of candidates per query (which differ in cardinality). Unvectorised operations make sparse matrix calculations particularly inefficient (because extracting a single row is not especially cheap).
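
For illustration, here is a minimal numpy sketch of computing, for a whole batch of query hashes at once, the length of the longest prefix each query shares with any entry of a sorted hash array. It is not the code in this PR (which vectorises the existing binary search over queries); the names longest_prefix_match, tree and bin_queries are illustrative, and the sketch relies on the fact that in a sorted array the best-matching entry is adjacent to the query's insertion point.

    import numpy as np

    def longest_prefix_match(tree, bin_queries, hash_size):
        # `tree` is a sorted 1-D array of integers encoding hash_size-bit hashes;
        # `bin_queries` holds one such integer per query.
        tree = np.asarray(tree)
        bin_queries = np.asarray(bin_queries)
        # One vectorised binary search gives every query's insertion point.
        pos = np.searchsorted(tree, bin_queries)
        # The entry sharing the longest prefix with a query is adjacent to its
        # insertion point, so only two comparisons per query are needed.
        left = tree[np.maximum(pos - 1, 0)]
        right = tree[np.minimum(pos, len(tree) - 1)]

        def common_prefix_length(a, b):
            # Count leading bits on which a and b agree, scanning from the most
            # significant of the hash_size bits downwards.
            xor = np.bitwise_xor(a, b)
            length = np.zeros(xor.shape, dtype=np.intp)
            still_equal = np.ones(xor.shape, dtype=bool)
            for d in range(hash_size - 1, -1, -1):
                still_equal &= ((xor >> d) & 1) == 0
                length += still_equal
            return length

        return np.maximum(common_prefix_length(left, bin_queries),
                          common_prefix_length(right, bin_queries))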

@jnothman
Member Author

Hmmm... it seems I've broken something in fdc1158 (of course, the commit labelled STY). It's non-essential anyway. For the moment I'll revert it.

@jnothman jnothman force-pushed the lshforest_improvements branch from fd61994 to ed5c63f on December 22, 2014 20:49
@jnothman
Member Author

And there seems to be some substantial speed-up from vectorising _find_longest_prefix_match.

@jnothman jnothman changed the title from "LSHForest: sparse support and vectorised _find_longest_prefix_match" to "[MRG] LSHForest: sparse support and vectorised _find_longest_prefix_match" on Dec 22, 2014
@jnothman
Member Author

On master, the benchmark output (using the script as updated here, averaged over 10 evaluation loops) includes:

Sample size: 1000000
------------------------
LSHF parameters: n_estimators = 3, n_candidates = 50
Average time for lshf neighbor queries: 0.024s
LSHF parameters: n_estimators = 5, n_candidates = 70
Average time for lshf neighbor queries: 0.046s
LSHF parameters: n_estimators = 10, n_candidates = 100
Average time for lshf neighbor queries: 0.081s

Now the corresponding timings are:

Average time for lshf neighbor queries: 0.016s
Average time for lshf neighbor queries: 0.025s
Average time for lshf neighbor queries: 0.046s
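
A minimal sketch of how such average query timings can be gathered (not the benchmark script referred to above; the dataset size and parameters here are illustrative, and LSHForest only exists in older scikit-learn releases):

    import time

    import numpy as np
    from sklearn.neighbors import LSHForest  # removed in scikit-learn 0.21

    rng = np.random.RandomState(42)
    X = rng.rand(10000, 100)          # far smaller than the 1,000,000-sample run
    queries = rng.rand(100, 100)

    for n_estimators, n_candidates in [(3, 50), (5, 70), (10, 100)]:
        lshf = LSHForest(n_estimators=n_estimators, n_candidates=n_candidates,
                         random_state=42).fit(X)
        timings = []
        for _ in range(10):           # 10 evaluation loops, as above
            t0 = time.time()
            lshf.kneighbors(queries, n_neighbors=10)
            timings.append(time.time() - t0)
        print("Average time for lshf neighbor queries: %0.3fs"
              % np.mean(timings))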

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 91c2386 on jnothman:lshforest_improvements into 2d6f1c3 on scikit-learn:master.

@jnothman
Member Author

I've played around with vectorising distance calculations (see jnothman@bb244c7) and, at least in the dense case where the numbers of samples and candidates get big, it is slower than the baseline. It might be faster if queries were batched.

@jnothman
Member Author

@maheshakya, you might want to look at this.

@ogrisel
Member

ogrisel commented Dec 24, 2014

What does STY stand for?

@ogrisel
Member

ogrisel commented Dec 24, 2014

@daniel-vainsencher you might be interested in this as well.

            res = mid
        else:
            hi = mid
    hi = np.empty_like(query, dtype=int)
Member

I think it's better to use dtype=np.intp for index arrays.

Member Author

Okay. I actually wasn't sure what the policy for dtype=int is.
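
For context, np.intp is the pointer-sized integer type that numpy itself uses for indexing, whereas dtype=int maps to the default integer type and can be 32-bit on some platforms (e.g. 64-bit Windows). A tiny illustration:

    import numpy as np

    data = np.arange(20.0)
    # Index arrays are conventionally created with the pointer-sized integer
    # type so they match what numpy's indexing machinery expects.
    idx = np.empty(5, dtype=np.intp)
    idx[:] = [0, 2, 4, 6, 8]
    print(data[idx])      # fancy indexing with an intp array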

@ogrisel
Member

ogrisel commented Dec 24, 2014

LGTM! Thanks for the optim & sparse support!

@ogrisel ogrisel changed the title from "[MRG] LSHForest: sparse support and vectorised _find_longest_prefix_match" to "[MRG+1] LSHForest: sparse support and vectorised _find_longest_prefix_match" on Dec 24, 2014
@daniel-vainsencher

Hi everyone, lost track for a bit.

_find_longest_prefix_match, while part of the canon (the LSHF paper), can probably be simplified away eventually.

I don't think that assuming queries are batched is particularly likely to help much with LSHF... did that get significant speedup?

            max_depth = max_depth - 1
            candidate_set.update(candidates)
Contributor

I think now this only requires candidate_set.update(self.original_indices_[i][start:stop]) in the loop, if the candidates list is to be dropped.

Contributor

And candidate_set can go just by the name candidates.

Contributor

But again, it's still worth comparing the costs of extending a list several times versus updating a set.

Member Author

candidate_set gives the sense of distinct elements, as opposed to the total (duplicated) count that min_candidates needs to be compared to.
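
A small illustration of the two bookkeeping choices discussed here, with made-up index ranges: the list length gives the total (duplicated) candidate count that min_candidates is compared to, while the set holds only distinct candidates.

    # Hypothetical index ranges from two trees; not values from the PR.
    ranges = [(0, 5), (3, 8)]

    candidates = []         # keeps duplicates; its length is the total count
    candidate_set = set()   # keeps distinct indices only

    for start, stop in ranges:
        block = list(range(start, stop))
        candidates.extend(block)
        candidate_set.update(block)

    print(len(candidates), len(candidate_set))   # 10 versus 8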

@maheshakya
Contributor

@jnothman thanks for adding sparse support. I too have tried vectorizing distance calculations as you've mentioned, but didn't get any speed-up. And as Daniel said, we cannot always expect batched queries in applications.

@jnothman
Member Author

I don't think that assuming queries are batched is particularly likely to help much with LSHF... did that get significant speedup?

Most of the gain shown in #3991 (comment) is from vectorizing _find_longest_prefix_match, as far as I can tell. I realise it may disappear in the future, but Cython need not be our first resort. Sparse matrices are slow to iterate over rows, so the benefit there from other vectorisation (transform in batch; distance calculation) may be greater, but I've not yet updated the benchmark script to report sparse performance.

And as Daniel said we cannot always expect batched queries in applications.

Why not? Queries are batched in {K,Radius}Neighbors{Classifier,Regressor} and in DBSCAN after #3994. In general, the mode of operation in numpy/scipy (and matlab etc) is that efficiency will come through batched operations, and if you cannot exploit batches, you suffer some overhead. So if you can write code that benefits from batching, you do.
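
A sketch of the batched pattern referred to here, with illustrative sizes (LSHForest is only available in older scikit-learn releases):

    import numpy as np
    from sklearn.neighbors import LSHForest

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 32)
    queries = rng.rand(50, 32)

    lshf = LSHForest(random_state=0).fit(X)

    # One batched call over all 50 queries...
    dist, ind = lshf.kneighbors(queries, n_neighbors=5)

    # ...instead of paying the per-call overhead once per query:
    # for q in queries:
    #     lshf.kneighbors(q.reshape(1, -1), n_neighbors=5)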

@jnothman
Member Author

What does STY stand for?

It's Style... I've seen it somewhere else... it's a less confusing version of COSMIT, but I'm not sure why I used it.

@jnothman
Member Author

I'm not certain candidates (for the purpose of deciding if min_candidates is met) is being calculated correctly in the "synchronous ascending" phase.

@maheshakya, your clarification would be welcome.

Let's say we have a single tree. Currently candidates will be populated first with the candidates descending from max_depth matching bits. Then those descending from max_depth - 1 will be appended. Should this really be duplicating those elements found in the first iteration? Or should the duplicates in candidates only be due to different trees?

@jnothman
Member Author

Apart from that and the scipy.sparse.rand(..., random_state) issue, comments are addressed.

@maheshakya
Contributor

Duplication of elements between iterations of max_depth is always happening. If I write down the sync-ascending phase:

    while (x > 0 and (|P| < cl or |distinct(P)| < m)) {
        for (i = 1; i <= l; i++) {
            if (x[i] == x) {
                P = P ∪ Descendants(s[i])
                s[i] = Parent(s[i])
                x[i] = x[i] - 1
            }
        }
        x = x - 1
    }

Here P is candidates and x is max_depth. In each iteration over max_depth, all descendants of a particular node are added to the candidates list. When max_depth <- max_depth - 1, the descendants of the parent of the earlier node are added. Since descendants of a child are also descendants of its parent, those elements are added again to the list of considered candidates.
So I think, to represent this total number of candidates (with duplications), maintaining n_candidates is better than extending a list many times.

Duplication due to different trees is also possible.
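
A plain-Python rendering of the pseudo-code above makes the duplication explicit. It is a sketch, not the scikit-learn implementation: descendants and parent are passed in as callables standing for the tree operations, and the names mirror the pseudo-code (P becomes candidates, x becomes max_depth).

    def synchronous_ascend(nodes, depths, max_depth, min_total, min_distinct,
                           descendants, parent):
        # `nodes` holds the current node of each of the l trees; `depths` the
        # number of query bits matched at that node.
        candidates = []
        while max_depth > 0 and (len(candidates) < min_total
                                 or len(set(candidates)) < min_distinct):
            for i in range(len(nodes)):
                if depths[i] == max_depth:
                    # All descendants of the current node, duplicates included:
                    # they are re-added when the parent is expanded later.
                    candidates.extend(descendants(nodes[i]))
                    nodes[i] = parent(nodes[i])
                    depths[i] -= 1
            max_depth -= 1
        return candidates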

@@ -319,7 +343,7 @@ def fit(self, X):
Returns self.
Contributor

Can you add a word about support for sparse (CSR) matrix in X?

Contributor

and in kneighbors and radius_neighbors as well.

Member Author

Sure.

Member Author

I hope my changes are adequate.

@jnothman
Member Author

Duplication of elements between iterations of max_depth is always happening

Yes, I've looked at that algorithm. But I find the notation confusing, as ∪ usually denotes set union, not concatenation. I'm happy with what we've got anyway.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 0e0e545 on jnothman:lshforest_improvements into 2d6f1c3 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Dec 29, 2014

@maheshakya @daniel-vainsencher any further comments?

@daniel-vainsencher

When I said that _find_longest_prefix_match may disappear, I mean that I have a different way of doing the whole query that I think will be faster and much simpler. But since that is so far untested, tweaking the current algorithms is not bad.

The topic of when exactly to eliminate duplication is a bit tricky; I don't have any obviously good advice. So, not much to contribute at this time...

@jnothman
Member Author

When I said that _find_longest_prefix_match may disappear, I mean that I have a different way of doing the whole query that I think will be faster and much simpler

Do you mean a technique other than LSHForest with sorted arrays? Or a faster implementation of the latter? I know there are faster ways to implement it using Cython, but I'd rather see what we can get while staying as native as possible.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 4735caa on jnothman:lshforest_improvements into 2d6f1c3 on scikit-learn:master.

@daniel-vainsencher

Hi Joel,

I have a bunch of related ideas on how to speed this up while essentially retaining the data structure, but changing the algorithms (not just micro-optimizations).

Maheshakya, Robert and I wanted to explore some of them, and try to get the fastest ANN in the west, in Python! (and maybe publish it).

If you feel like trying one out (described for a single index, for simplicity):

Stop treating the sorted arrays as trees.

Use the same data structure, but binary search for the whole query (not just a prefix) to find the location (denote it l) the query would have in the array. Then take the min_candidates entries directly before and after l as the initial set of candidates.

Disadvantage: you now have 2x as many candidates as you wanted; if your query was X01111, then up to half might be X10000 and thus have a much higher Hamming distance than you were aiming for. And calculating true distances for bogus candidates is expensive.

Solution 1 (of n): before using the true (say, Euclidean) distance, take the best min_candidates by Hamming distance to the binary query.

Advantages:

  • Just one (max two) binary searches per tree.
  • Add all candidates to the list at once.
  • Simple (to start with)

Anyway, this is very much experimental, so I'd open a separate PR.

Daniel
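
A minimal sketch of the scheme described above, for a single sorted array of integer-encoded hashes; the names (local_candidates, sorted_hashes, original_indices) are illustrative and not from this PR:

    import numpy as np

    def local_candidates(sorted_hashes, original_indices, query_hash,
                         min_candidates):
        # One binary search locates where the query hash would be inserted.
        pos = np.searchsorted(sorted_hashes, query_hash)
        # Take min_candidates entries on either side of that location.
        lo = max(pos - min_candidates, 0)
        hi = min(pos + min_candidates, len(sorted_hashes))
        return original_indices[lo:hi]    # up to 2 * min_candidates indices

    # Illustrative usage with random 16-bit hashes:
    rng = np.random.RandomState(0)
    hashes = rng.randint(0, 2 ** 16, size=1000)
    order = np.argsort(hashes)
    candidates = local_candidates(hashes[order], order, query_hash=12345,
                                  min_candidates=50)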


@ogrisel
Member

ogrisel commented Dec 29, 2014

Alright. Let's merge this PR and then explore experimental algorithmic improvements in another PR (or maybe even in a 3rd-party project if it's very different from the published method).

ogrisel added a commit that referenced this pull request Dec 29, 2014
[MRG+1] LSHForest: sparse support and vectorised _find_longest_prefix_match
@ogrisel ogrisel merged commit 67ca4ef into scikit-learn:master Dec 29, 2014
@ogrisel
Member

ogrisel commented Dec 29, 2014

Thanks @jnothman!

@ogrisel
Member

ogrisel commented Dec 29, 2014

Disadvantage: you now have 2x as many candidates as you wanted;

@daniel-vainsencher why not just collect min_samples / 2 before and after then? I must have missed something.

@jnothman
Member Author

@daniel-vainsencher, I've seen similar approaches used in applied literature (don't ask me where) that just take a quantity of context. It may be fairer to take min_candidates context, then take any further that have the same size prefix overlap (or do we rely on tree redundancy for that?). But now you've got me wondering whether that's actually equivalent to what we're doing...

@daniel-vainsencher

Again, consider a query X01111: the entries before it will have codes like X01111, X01110, X01101 etc, and that is fine. The entries after it will have X10000, X10001, X10010 etc, which are quite far in Hamming distance, and therefore the corresponding points are far in expected "true" distance. What this means is that under the proposed scheme, up to half of the candidates found will be somewhat bad ones. This isn't necessarily terrible: the Hamming "damage" on that half is n with probability 2^-n for random queries... but the worst case is that half your candidates are useless.

By taking 2x min_candidates, you get precision that is always as good as we got from LSHF with min_candidates; then the issue is "merely" to get back the wasted time. This is a significant issue: the distance calculations in high dimension are often the most expensive part!

Hence I gave the simplest of many candidate routes to deal with the issue. Since I have never implemented these ideas, I don't know which routes will work well, but calculating Hamming distances doesn't get more expensive with data dimension, so I'm optimistic.

Daniel
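
A sketch of the "Solution 1" above: rank the gathered candidates by Hamming distance between their binary hashes and the query's hash, and keep only the best min_candidates before computing true distances. Names and sizes are illustrative.

    import numpy as np

    def hamming_prefilter(candidate_hashes, query_hash, min_candidates,
                          hash_size):
        # Differing bits between each candidate hash and the query hash.
        xor = np.bitwise_xor(candidate_hashes, query_hash)
        # Population count over the hash_size bits of each XOR value.
        bits = (xor[:, np.newaxis] >> np.arange(hash_size)) & 1
        hamming = bits.sum(axis=1)
        # Keep the indices of the min_candidates closest hashes.
        k = min(min_candidates, len(candidate_hashes))
        return np.argsort(hamming)[:k]

    # Illustrative usage with random 16-bit hashes:
    rng = np.random.RandomState(0)
    cand = rng.randint(0, 2 ** 16, size=200)
    keep = hamming_prefilter(cand, query_hash=12345, min_candidates=50,
                             hash_size=16)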


@jnothman
Member Author

Sorry, I mean with the min_hash_size check as well. I guess it remains a problem in that one direction will be searched and the other blocked.

@daniel-vainsencher

@jnothman, I didn't understand exactly what you meant, but there are many ways to choose which candidates to take.

My main points are:

  • To make the most of the prefixes and sortedness, we need only one binary search, and then can constrain ourselves to a local area around it.
  • After having done that, actual Hamming distance is a truer approximation of "true" distance than shared prefix length.


@jnothman
Member Author

To make the most of the prefixes and sortedness, we need only one binary search, and then can constrain ourselves to a local area around it.

Of course. The series of binary searches is obviously unnecessary, even in the forest-over-sorted-arrays framework. But as far as I can determine we're not going to get a much faster implementation within the native Python / numpy framework.

After having done that, actual hamming distance is a truer approximation of "true" distance than shared prefix length

Yes, and I think that there should be an option to calculate these distances rather than the exact metric, as in the last point at #3988.

I don't think it would be wise to venture into new algorithm territory for this implementation, though; only new efficiency strategies. In terms of the state of the LSHForest implementation, I think the priorities are:

  1. Support Euclidean approximation and make it default, ideally before the next scikit-learn release.
  2. Ensure that LSHForest is competitive or faster than exact nearest neighbors for some KNeighborClassification task or similar.
  3. Make it flexible to user metrics and hashers.
  4. Optimise it more.

So this PR was attempting to work towards the modest goal of 2.
