[MRG+1] LSHForest: sparse support and vectorised _find_longest_prefix_match #3991


Merged
merged 8 commits into scikit-learn:master from jnothman:lshforest_improvements on Dec 29, 2014

Conversation

jnothman
Member

This adds sparse matrix support to LSHForest, vectorises calls to hasher.transform, and vectorises _find_longest_prefix_match over queries. These seem to speed things up a little, but the benchmark script does not provide very stable timings for me.

Some other query operations cannot be easily vectorised, such as gathering the set of candidates per query (which differ in cardinality). Unvectorised operations make sparse matrix calculations particularly inefficient (because extracting a single row is not especially cheap).
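
For illustration, here is a minimal numpy sketch of computing, for a whole batch of query hashes at once, the length of the longest prefix each query shares with any entry of a sorted hash array. It is not the code in this PR (which vectorises the existing binary search over queries); the names longest_prefix_match, tree and bin_queries are illustrative, and the sketch relies on the fact that in a sorted array the best-matching entry is adjacent to the query's insertion point.

    import numpy as np

    def longest_prefix_match(tree, bin_queries, hash_size):
        # `tree` is a sorted 1-D array of integers encoding hash_size-bit hashes;
        # `bin_queries` holds one such integer per query.
        tree = np.asarray(tree)
        bin_queries = np.asarray(bin_queries)
        # One vectorised binary search gives every query's insertion point.
        pos = np.searchsorted(tree, bin_queries)
        # The entry sharing the longest prefix with a query is adjacent to its
        # insertion point, so only two comparisons per query are needed.
        left = tree[np.maximum(pos - 1, 0)]
        right = tree[np.minimum(pos, len(tree) - 1)]

        def common_prefix_length(a, b):
            # Count leading bits on which a and b agree, scanning from the most
            # significant of the hash_size bits downwards.
            xor = np.bitwise_xor(a, b)
            length = np.zeros(xor.shape, dtype=np.intp)
            still_equal = np.ones(xor.shape, dtype=bool)
            for d in range(hash_size - 1, -1, -1):
                still_equal &= ((xor >> d) & 1) == 0
                length += still_equal
            return length

        return np.maximum(common_prefix_length(left, bin_queries),
                          common_prefix_length(right, bin_queries))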

@jnothman
Member Author

Hmmm... it seems I've broken something in fdc1158 (of course, the commit labelled STY). It's non-essential anyway. For the moment I'll revert it.

@jnothman jnothman force-pushed the lshforest_improvements branch from fd61994 to ed5c63f on December 22, 2014 20:49
@jnothman
Member Author

And there seems to be some substantial speed-up from vectorising _find_longest_prefix_match.

@jnothman jnothman changed the title from "LSHForest: sparse support and vectorised _find_longest_prefix_match" to "[MRG] LSHForest: sparse support and vectorised _find_longest_prefix_match" on Dec 22, 2014
@jnothman
Member Author

On master, the benchmark output (using the script as updated here, averaged over 10 evaluation loops) includes:

Sample size: 1000000
------------------------
LSHF parameters: n_estimators = 3, n_candidates = 50
Average time for lshf neighbor queries: 0.024s
LSHF parameters: n_estimators = 5, n_candidates = 70
Average time for lshf neighbor queries: 0.046s
LSHF parameters: n_estimators = 10, n_candidates = 100
Average time for lshf neighbor queries: 0.081s

Now the corresponding timings are:

Average time for lshf neighbor queries: 0.016s
Average time for lshf neighbor queries: 0.025s
Average time for lshf neighbor queries: 0.046s
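
A minimal sketch of how such average query timings can be gathered (not the benchmark script referred to above; the dataset size and parameters here are illustrative, and LSHForest only exists in older scikit-learn releases):

    import time

    import numpy as np
    from sklearn.neighbors import LSHForest  # removed in scikit-learn 0.21

    rng = np.random.RandomState(42)
    X = rng.rand(10000, 100)          # far smaller than the 1,000,000-sample run
    queries = rng.rand(100, 100)

    for n_estimators, n_candidates in [(3, 50), (5, 70), (10, 100)]:
        lshf = LSHForest(n_estimators=n_estimators, n_candidates=n_candidates,
                         random_state=42).fit(X)
        timings = []
        for _ in range(10):           # 10 evaluation loops, as above
            t0 = time.time()
            lshf.kneighbors(queries, n_neighbors=10)
            timings.append(time.time() - t0)
        print("Average time for lshf neighbor queries: %0.3fs"
              % np.mean(timings))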

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 91c2386 on jnothman:lshforest_improvements into 2d6f1c3 on scikit-learn:master.

@jnothman
Member Author

I've played around with vectorising distance calculations (see jnothman@bb244c7) and, at least in the dense case where the numbers of samples and candidates get big, it is slower than the baseline. It might be faster if queries were batched.

@jnothman
Member Author

@maheshakya, you might want to look at this.

@ogrisel
Member

ogrisel commented Dec 24, 2014

What does STY stand for?

@ogrisel
Member

ogrisel commented Dec 24, 2014

@daniel-vainsencher you might be interested in this as well.

            res = mid
        else:
            hi = mid
    hi = np.empty_like(query, dtype=int)
Member

I think it's better to use dtype=np.intp for index arrays.

Member Author

Okay. I actually wasn't sure what the policy for dtype=int is.
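
For context, np.intp is the pointer-sized integer type that numpy itself uses for indexing, whereas dtype=int maps to the default integer type and can be 32-bit on some platforms (e.g. 64-bit Windows). A tiny illustration:

    import numpy as np

    data = np.arange(20.0)
    # Index arrays are conventionally created with the pointer-sized integer
    # type so they match what numpy's indexing machinery expects.
    idx = np.empty(5, dtype=np.intp)
    idx[:] = [0, 2, 4, 6, 8]
    print(data[idx])      # fancy indexing with an intp array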

@ogrisel
Member

ogrisel commented Dec 24, 2014

LGTM! Thanks for the optim & sparse support!

@ogrisel ogrisel changed the title from "[MRG] LSHForest: sparse support and vectorised _find_longest_prefix_match" to "[MRG+1] LSHForest: sparse support and vectorised _find_longest_prefix_match" on Dec 24, 2014
@daniel-vainsencher

Hi everyone, lost track for a bit.

_find_longest_prefix_match, while part of the canon (the LSHF paper), can probably be simplified away eventually.

I don't think that assuming queries are batched is particularly likely to help much with LSHF... did that get significant speedup?

            max_depth = max_depth - 1
            candidate_set.update(candidates)
Contributor

I think now this only requires candidate_set.update(self.original_indices_[i][start:stop]) in the loop, if the candidates list is to be dropped.

Contributor

And candidate_set can go just by the name candidates.

Contributor

But again, it's still worth comparing the costs of extending a list several times versus updating a set.

Member Author

candidate_set gives the sense of distinct elements, as opposed to the total (duplicated) count that min_candidates needs to be compared to.
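
A small illustration of the two bookkeeping choices discussed here, with made-up index ranges: the list length gives the total (duplicated) candidate count that min_candidates is compared to, while the set holds only distinct candidates.

    # Hypothetical index ranges from two trees; not values from the PR.
    ranges = [(0, 5), (3, 8)]

    candidates = []         # keeps duplicates; its length is the total count
    candidate_set = set()   # keeps distinct indices only

    for start, stop in ranges:
        block = list(range(start, stop))
        candidates.extend(block)
        candidate_set.update(block)

    print(len(candidates), len(candidate_set))   # 10 versus 8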

@maheshakya
Contributor

@jnothman thanks for adding sparse support. I too have tried vectorizing distance calculations as you've mentioned, but didn't get any speed-up. And as Daniel said, we cannot always expect batched queries in applications.

@jnothman
Member Author

I don't think that assuming queries are batched is particularly likely to help much with LSHF... did that get significant speedup?

Most of the gain shown in #3991 (comment) is from vectorizing _find_longest_prefix_match, as far as I can tell. I realise it may disappear in the future, but Cython need not be our first resort. Sparse matrices are slow to iterate over rows, so the benefit there from other vectorisation (transform in batch; distance calculation) may be greater, but I've not yet updated the benchmark script to report sparse performance.

And as Daniel said we cannot always expect batched queries in applications.

Why not? Queries are batched in {K,Radius}Neighbors{Classifier,Regressor} and in DBSCAN after #3994. In general, the mode of operation in numpy/scipy (and matlab etc) is that efficiency will come through batched operations, and if you cannot exploit batches, you suffer some overhead. So if you can write code that benefits from batching, you do.
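
A sketch of the batched pattern referred to here, with illustrative sizes (LSHForest is only available in older scikit-learn releases):

    import numpy as np
    from sklearn.neighbors import LSHForest

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 32)
    queries = rng.rand(50, 32)

    lshf = LSHForest(random_state=0).fit(X)

    # One batched call over all 50 queries...
    dist, ind = lshf.kneighbors(queries, n_neighbors=5)

    # ...instead of paying the per-call overhead once per query:
    # for q in queries:
    #     lshf.kneighbors(q.reshape(1, -1), n_neighbors=5)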

@jnothman
Member Author

What does STY stand for?

It's Style... I've seen it somewhere else... it's a less confusing version of COSMIT, but I'm not sure why I used it.

@jnothman
Member Author

I'm not certain candidates (for the purpose of deciding if min_candidates is met) is being calculated correctly in the "synchronous ascending" phase.

@maheshakya, your clarification would be welcome.

Let's say we have a single tree. Currently candidates will be populated first with the candidates descending from max_depth matching bits. Then those descending from max_depth - 1 will be appended. Should this really be duplicating those elements found in the first iteration? Or should the duplicates in candidates only be due to different trees?

@jnothman
Member Author

Apart from that and the scipy.sparse.rand(..., random_state) issue, comments are addressed.

@maheshakya
Contributor

Duplication of elements between iterations of max_depth is always happening. If I write down the sync-ascending phase:

    while (x > 0 and (|P| < cl or |distinct(P)| < m)) {
        for (i = 1; i <= l; i++) {
            if (x[i] == x) {
                P = P ∪ Descendants(s[i])
                s[i] = Parent(s[i])
                x[i] = x[i] - 1
            }
        }
        x = x - 1
    }

Here P is candidates and x is max_depth. In each iteration over max_depth, all descendants of a particular node are added to the candidates list. When max_depth <- max_depth - 1, the descendants of the parent of the earlier node are added. Since descendants of a child are also descendants of its parent, those elements are added again to the list of considered candidates.
So I think, to represent this total number of candidates (with duplications), maintaining n_candidates is better than extending a list many times.

Duplication due to different trees is also possible.
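
A plain-Python rendering of the pseudo-code above makes the duplication explicit. It is a sketch, not the scikit-learn implementation: descendants and parent are passed in as callables standing for the tree operations, and the names mirror the pseudo-code (P becomes candidates, x becomes max_depth).

    def synchronous_ascend(nodes, depths, max_depth, min_total, min_distinct,
                           descendants, parent):
        # `nodes` holds the current node of each of the l trees; `depths` the
        # number of query bits matched at that node.
        candidates = []
        while max_depth > 0 and (len(candidates) < min_total
                                 or len(set(candidates)) < min_distinct):
            for i in range(len(nodes)):
                if depths[i] == max_depth:
                    # All descendants of the current node, duplicates included:
                    # they are re-added when the parent is expanded later.
                    candidates.extend(descendants(nodes[i]))
                    nodes[i] = parent(nodes[i])
                    depths[i] -= 1
            max_depth -= 1
        return candidates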

@@ -319,7 +343,7 @@ def fit(self, X):
Returns self.
Contributor

Can you add a word about support for sparse (CSR) matrix in X?

Contributor

and in kneighbors and radius_neighbors as well.

Member Author

Sure.

Member Author

I hope my changes are adequate.

@jnothman
Member Author

Duplication of elements between iterations of max_depth is always happening

Yes, I've looked at that algorithm. But I find the notation confusing, as ∪ usually denotes set union, not concatenation. I'm happy with what we've got anyway.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 0e0e545 on jnothman:lshforest_improvements into 2d6f1c3 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Dec 29, 2014

@maheshakya @daniel-vainsencher any further comments?

@daniel-vainsencher

When I said that _find_longest_prefix_match may disappear, I mean that I have a different way of doing the whole query that I think will be faster and much simpler. But since that is so far untested, tweaking the current algorithms is not bad.

The topic of when exactly to eliminate duplication is a bit tricky; I don't have any obviously good advice. So, not much to contribute at this time...

@jnothman
Member Author

When I said that _find_longest_prefix_match may disappear, I mean that I have a different way of doing the whole query that I think will be faster and much simpler

Do you mean a technique other than LSHForest with sorted arrays? Or a faster implementation of the latter? I know there are faster ways to implement it using Cython, but I'd rather see what we can get while staying as native as possible.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 4735caa on jnothman:lshforest_improvements into 2d6f1c3 on scikit-learn:master.

@daniel-vainsencher

Hi Joel,

I have a bunch of related ideas on how to speed this up while essentially retaining the data structure, but changing the algorithms (not just micro-optimizations).

Maheshakya, Robert and I wanted to explore some of them, and try to get the fastest ANN in the west, in Python! (and maybe publish it).

If you feel like trying one out (described for a single index, for simplicity):

Stop treating the sorted arrays as trees.

Use the same data structure, but binary search for the whole query (not just a prefix) to find the location (denote it l) the query would have in the array. Then take the min_candidates entries directly before and after l as the initial set of candidates.

Disadvantage: you now have 2x as many candidates as you wanted; if your query was X01111, then up to half might be X10000 and thus have a much higher Hamming distance than you were aiming for. And calculating true distances for bogus candidates is expensive.

Solution 1 (of n): before using the true (say, Euclidean) distance, take the best min_candidates by Hamming distance to the binary query.

Advantages:

  • Just one (max two) binary searches per tree.
  • Add all candidates to the list at once.
  • Simple (to start with)

Anyway, this is very much experimental, so I'd open a separate PR.

Daniel
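
A minimal sketch of the scheme described above, for a single sorted array of integer-encoded hashes; the names (local_candidates, sorted_hashes, original_indices) are illustrative and not from this PR:

    import numpy as np

    def local_candidates(sorted_hashes, original_indices, query_hash,
                         min_candidates):
        # One binary search locates where the query hash would be inserted.
        pos = np.searchsorted(sorted_hashes, query_hash)
        # Take min_candidates entries on either side of that location.
        lo = max(pos - min_candidates, 0)
        hi = min(pos + min_candidates, len(sorted_hashes))
        return original_indices[lo:hi]    # up to 2 * min_candidates indices

    # Illustrative usage with random 16-bit hashes:
    rng = np.random.RandomState(0)
    hashes = rng.randint(0, 2 ** 16, size=1000)
    order = np.argsort(hashes)
    candidates = local_candidates(hashes[order], order, query_hash=12345,
                                  min_candidates=50)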


@ogrisel
Member

ogrisel commented Dec 29, 2014

Alright. Let's merge this PR and then explore experimental algorithmic improvements in another PR (or maybe even in a 3rd-party project if it's very different from the published method).

ogrisel added a commit that referenced this pull request Dec 29, 2014
[MRG+1] LSHForest: sparse support and vectorised _find_longest_prefix_match
@ogrisel ogrisel merged commit 67ca4ef into scikit-learn:master Dec 29, 2014
@ogrisel
Member

ogrisel commented Dec 29, 2014

Thanks @jnothman!

@ogrisel
Member

ogrisel commented Dec 29, 2014

Disadvantage: you now have 2x as many candidates as you wanted;

@daniel-vainsencher why not just collect min_samples / 2 before and after then? I must have missed something.

@jnothman
Member Author

@daniel-vainsencher, I've seen similar approaches used in applied literature (don't ask me where) that just take a quantity of context. It may be fairer to take min_candidates context, then take any further that have the same size prefix overlap (or do we rely on tree redundancy for that?). But now you've got me wondering whether that's actually equivalent to what we're doing...

@daniel-vainsencher

Again, consider a query X01111: the entries before it will have codes like X01111, X01110, X01101 etc, and that is fine. The entries after it will have X10000, X10001, X10010 etc, which are quite far in Hamming distance, and therefore the corresponding points are far in expected "true" distance. What this means is that under the proposed scheme, up to half of the candidates found will be somewhat bad ones. This isn't necessarily terrible: the Hamming "damage" on that half is n with probability 2^-n for random queries... but the worst case is that half your candidates are useless.

By taking 2x min_candidates, you get precision that is always as good as we got from LSHF with min_candidates; then the issue is "merely" to get back the wasted time. This is a significant issue: the distance calculations in high dimension are often the most expensive part!

Hence I gave the simplest of many candidate routes to deal with the issue. Since I have never implemented these ideas, I don't know which routes will work well, but calculating Hamming distances doesn't get more expensive with data dimension, so I'm optimistic.

Daniel
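
A sketch of the "Solution 1" above: rank the gathered candidates by Hamming distance between their binary hashes and the query's hash, and keep only the best min_candidates before computing true distances. Names and sizes are illustrative.

    import numpy as np

    def hamming_prefilter(candidate_hashes, query_hash, min_candidates,
                          hash_size):
        # Differing bits between each candidate hash and the query hash.
        xor = np.bitwise_xor(candidate_hashes, query_hash)
        # Population count over the hash_size bits of each XOR value.
        bits = (xor[:, np.newaxis] >> np.arange(hash_size)) & 1
        hamming = bits.sum(axis=1)
        # Keep the indices of the min_candidates closest hashes.
        k = min(min_candidates, len(candidate_hashes))
        return np.argsort(hamming)[:k]

    # Illustrative usage with random 16-bit hashes:
    rng = np.random.RandomState(0)
    cand = rng.randint(0, 2 ** 16, size=200)
    keep = hamming_prefilter(cand, query_hash=12345, min_candidates=50,
                             hash_size=16)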


@jnothman
Member Author

Sorry, I mean with the min_hash_size check as well. I guess it remains a problem in that one direction will be searched and the other blocked.

@daniel-vainsencher

@jnothman, I didn't understand exactly what you meant, but there are many ways to choose which candidates to take.

My main points are:

  • To make the most of the prefixes and sortedness, we need only one binary search, and then can constrain ourselves to a local area around it.
  • After having done that, actual Hamming distance is a truer approximation of "true" distance than shared prefix length.


@jnothman
Member Author

To make the most of the prefixes and sortedness, we need only one binary search, and then can constrain ourselves to a local area around it.

Of course. The series of binary searches is obviously unnecessary, even in the forest-over-sorted-arrays framework. But as far as I can determine we're not going to get a much faster implementation within the native Python / numpy framework.

After having done that, actual hamming distance is a truer approximation of "true" distance than shared prefix length

Yes, and I think that there should be an option to calculate these distances rather than the exact metric, as in the last point at #3988.

I don't think it would be wise to venture into new algorithm territory for this implementation, though; only new efficiency strategies. In terms of the state of the LSHForest implementation, I think the priorities are:

  1. Support Euclidean approximation and make it default, ideally before the next scikit-learn release.
  2. Ensure that LSHForest is competitive or faster than exact nearest neighbors for some KNeighborClassification task or similar.
  3. Make it flexible to user metrics and hashers.
  4. Optimise it more.

So this PR was attempting to work towards the modest goal of 2.
