[MRG+1] LSHForest: sparse support and vectorised _find_longest_prefix_match #3991
Conversation
Hmmm... it seems I've broken something in fdc1158 (of course, the commit labelled …

Compare fd61994 to ed5c63f
And there seems to be some substantial speed-up in vectorising …

At master, the benchmark (using the script as updated here, averaged over 10 evaluation loops) output includes …

Now the corresponding timings are: …
I've played around with vectorising distance calculations (see jnothman@bb244c7), and at least in the dense case where the number of samples and candidates gets big, it is slower than the baseline. It might be faster if queries were batched.
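For illustration only (this is not the code in jnothman@bb244c7, and all names here are hypothetical), vectorising the distance computation over a batch of queries might look like:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 8)        # hypothetical indexed points
queries = rng.rand(5, 8)    # a batch of queries
candidates = np.arange(20)  # hypothetical shared candidate indices

# Baseline: one distance computation per query in a Python loop.
loop_dists = np.array(
    [np.sqrt(((X[candidates] - q) ** 2).sum(axis=1)) for q in queries]
)

# Vectorised over the whole batch via broadcasting:
# (n_queries, 1, n_features) - (1, n_candidates, n_features)
diff = queries[:, np.newaxis, :] - X[candidates][np.newaxis, :, :]
batch_dists = np.sqrt((diff ** 2).sum(axis=-1))

assert np.allclose(loop_dists, batch_dists)
```

As the comment above notes, whether this wins depends on batch and candidate sizes; in the dense, large-candidate regime the broadcasting temporaries can make it slower than the baseline.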
@maheshakya, you might want to look at this.

What does STY stand for?

@daniel-vainsencher you might be interested in this as well.
res = mid
else:
    hi = mid

hi = np.empty_like(query, dtype=int)
I think it's better to use dtype=np.intp for index arrays.
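A minimal sketch of the suggestion (variable names hypothetical):

```python
import numpy as np

query = np.random.rand(4)  # stand-in for the per-query search state

# np.intp is the platform's native indexing dtype (pointer-sized), so
# arrays used for indexing avoid implicit casts inside NumPy; plain
# `int` maps to a fixed C integer type whose width can differ across
# platforms.
hi = np.empty_like(query, dtype=np.intp)
hi[:] = query.shape[0]  # e.g. initial upper bounds for a binary search
```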
Okay. I actually wasn't sure what the policy for dtype=int is.
LGTM! Thanks for the optim & sparse support!
Hi everyone, lost track for a bit. _find_longest_prefix_match, while part of the canon (the LSHF paper), can probably be simplified away eventually. I don't think that assuming queries are batched is particularly likely to help much with LSHF... did that get significant speedup?
max_depth = max_depth - 1
candidate_set.update(candidates)
I think now this only requires candidate_set.update(self.original_indices_[i][start:stop]) in the loop, if candidates is to be dropped.
And candidate_set can go just by the name candidates.
But again, it's still worth comparing the costs of extending a list several times versus updating a set.
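A quick way to run that comparison (toy data; relative timings will depend heavily on the duplication rate of the candidate stream):

```python
from timeit import timeit

# Toy candidate chunks, as would come from each tree / iteration.
chunks = [list(range(i, i + 50)) for i in range(0, 5000, 50)]

def via_list():
    # Extend a list per chunk, deduplicate once at the end.
    out = []
    for c in chunks:
        out.extend(c)
    return set(out)

def via_set():
    # Update a set per chunk; deduplication happens incrementally.
    out = set()
    for c in chunks:
        out.update(c)
    return out

assert via_list() == via_set()

# No winner asserted here: measure on realistic candidate streams.
t_list = timeit(via_list, number=100)
t_set = timeit(via_set, number=100)
```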
candidate_set gives the sense of distinct, as opposed to the number that min_candidates needs to be compared to.
@jnothman thanks for adding sparse support. I too have tried vectorizing distance calculations as you've mentioned, but didn't get any speed-up. And as Daniel said, we cannot always expect batched queries in applications.
Most of the gain shown in #3991 (comment) is from vectorizing …

Why not? Queries are batched in …
It's "Style"... I've seen it somewhere else; it's a less confusing version of COSMIT, but I'm not sure why I used it.
I'm not certain @maheshakya, your clarification would be welcome. Let's say we have a single tree. Currently …
Apart from that and the …
Duplication of elements between iterations of …

Where … Duplication due to different trees is also possible.
@@ -319,7 +343,7 @@ def fit(self, X):
Returns self.
Can you add a word about support for sparse (CSR) matrix in X?
and in kneighbors and radius_neighbors as well.
Sure.
I hope my changes are adequate.
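For the docs, the point is that X may be a scipy.sparse CSR matrix. A small sketch of why per-row access is the expensive part (toy data, illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0., 1., 0.],
                         [2., 0., 3.]]))

# Whole-matrix (vectorised) operations on CSR are cheap per element...
norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()

# ...but extracting a single row allocates a fresh 1xN sparse matrix,
# so per-query (row-at-a-time) code paths carry real overhead.
row = X.getrow(0)
```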
Yes, I've looked at that algorithm. But I find the …
@maheshakya @daniel-vainsencher any further comments?
When I said that _find_longest_prefix_match may disappear, I meant that I have a different way of doing the whole query that I think will be faster and much simpler. But since that is so far untested, tweaking the current algorithms is not bad. The topic of when exactly to eliminate duplication is a bit tricky; I don't have any obviously good advice. So, not much to contribute at this time...
Do you mean a technique other than …
Hi Joel,

I have a bunch of related ideas on how to speed this up while … Maheshakya, Robert and I wanted to explore some of them, and try to get …

If you feel like trying one out (described for a single index, for …):

- Stop treating the sorted arrays as trees.
- Use the same data structure, binary search for the whole query (not just …).

Disadvantage: you now have 2x as many candidates as you wanted; if your …

Advantages: …
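Sketched very roughly, the "binary search for the whole query" idea might look like this (all names hypothetical; a toy stand-in for the proposal, not the PR's code):

```python
import numpy as np

rng = np.random.RandomState(0)
# Sorted array of integer-packed hashes for one index.
hashes = np.sort(rng.randint(0, 2 ** 16, size=1000))

def candidates_around(query_hash, min_candidates):
    # One binary search for the full query hash, instead of a series
    # of prefix searches.
    pos = np.searchsorted(hashes, query_hash)
    # Take min_candidates on each side: 2x as many candidates as
    # requested, in exchange for skipping the prefix-depth search.
    lo = max(pos - min_candidates, 0)
    hi = min(pos + min_candidates, len(hashes))
    return np.arange(lo, hi)

cand = candidates_around(2 ** 15, min_candidates=10)
```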
Anyway, this is very much experimental, so I'd open a separate PR.

Daniel

On 12/29/2014 02:42 PM, jnothman wrote: …
Alright. Let's merge this PR then, and explore experimental algorithmic improvements in other PRs (or maybe even in a 3rd-party project if it's very different from the published method).
Thanks @jnothman!
@daniel-vainsencher why not just collect …
@daniel-vainsencher, I've seen similar approaches used in applied literature (don't ask me where) that just take a quantity of context. It may be fairer to take …
Again, consider a query X01111; then the entries before it will have …

By taking 2x min_candidates, you get precision that is always as good as …

Hence I gave the simplest of many candidate routes to deal with the …

Daniel

On 12/29/2014 10:37 PM, Olivier Grisel wrote: …
Sorry, I mean with the …
@jnothman, I didn't understand exactly what you meant, but there are …

My main points are: …

On 12/29/2014 11:00 PM, jnothman wrote: …
Of course. The series of binary searches is obviously unnecessary, even in the forest-over-sorted-arrays framework. But as far as I can determine we're not going to get a much faster implementation within the native Python / numpy framework.
Yes, and I think that there should be an option to calculate these distances rather than the exact metric, as in the last point at #3988. I don't think it would be wise to venture into new-algorithm territory for this implementation, though; only new efficiency strategies. In terms of the state of the LSHForest implementation, I think the priorities are: …

So this PR was attempting to work towards the modest goal of 2.
This adds sparse matrix support to LSHForest, vectorises calls to hasher.transform, and vectorises _find_longest_prefix_match over queries. These seem to speed things up a little, but the benchmark script does not provide very stable timings for me.

Some other query operations cannot be easily vectorised, such as gathering the set of candidates per query (which differ in cardinality). Unvectorised operations make sparse matrix calculations particularly inefficient (because extracting a single row is not especially cheap).
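As a toy illustration of what "vectorising _find_longest_prefix_match over queries" means (string hashes and all names here are hypothetical, not the actual scikit-learn implementation), a single binary search on prefix depth can be run for all queries at once:

```python
import numpy as np

def longest_prefix_lengths(sorted_hashes, queries, hash_size):
    # For each query, binary-search (vectorised over all queries at
    # once) for the largest depth d such that some stored hash starts
    # with query[:d].
    n = len(sorted_hashes)
    lo = np.zeros(len(queries), dtype=np.intp)  # depth 0 always matches
    hi = np.full(len(queries), hash_size, dtype=np.intp)
    while np.any(lo < hi):
        mid = (lo + hi + 1) // 2
        prefixes = np.array([q[:m] for q, m in zip(queries, mid)])
        pos = np.searchsorted(sorted_hashes, prefixes)
        safe = np.minimum(pos, n - 1)
        found = (pos < n) & np.array(
            [sorted_hashes[i].startswith(p)
             for i, p in zip(safe, prefixes)]
        )
        lo = np.where(found, mid, lo)
        hi = np.where(found, hi, mid - 1)
    return lo

hashes = np.sort(np.array(["0010", "0111", "1100", "1101"]))
queries = np.array(["0110", "1111"])
depths = longest_prefix_lengths(hashes, queries, hash_size=4)
assert list(depths) == [3, 2]
```

As the description notes, the remaining per-query work (gathering variable-cardinality candidate sets) resists this kind of batching.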