ENH: optimizing np.searchsorted and adding benchmarks #30517
+118
−30
Hi! I noticed that `np.searchsorted` can be optimized. This PR optimizes the C++ `binsearch` implementation used by `np.searchsorted` for u64 and adds benchmarks (I haven't reworked the other `binsearch` implementations yet).
The main idea is to express the binary search in terms of a range `[base, base + length]` where the length halves on each iteration. This means each key only needs a single `base` pointer for intermediate computations. The PR uses the `ret` array to store these intermediate values, so no extra memory is needed (`base` eventually becomes the result after the last iteration). The lengths depend only on the initial length of the array, which allows us to batch each intermediate step of the algorithm across all keys: every key is processed against the same sequence of lengths.
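As a rough illustration of the range-halving scheme described above (a hedged Python sketch, not the PR's actual C++ code; `searchsorted_left_sketch` is a hypothetical name):

```python
import numpy as np

def searchsorted_left_sketch(arr, key):
    # Sketch of the range-halving scheme (not the PR's C++ code).
    # The search range is [base, base + length); the only per-key
    # state is `base`, and the length halves (rounding up) each step.
    base, length = 0, len(arr)
    while length > 1:
        half = length // 2
        if arr[base + half - 1] < key:  # key belongs to the upper part
            base += half
        length -= half  # i.e. length = ceil(length / 2)
    # resolve the final single-element range
    if len(arr) > 0 and arr[base] < key:
        base += 1
    return base
```

Note that the loop runs a number of times that depends only on the initial `length`, never on the key, which is what makes batching across keys possible.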
This ends up being faster for two reasons:
A numpy vectorized version
We could also implement this algorithm in Python by just relying on numpy array broadcasting:
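A minimal sketch of what such a broadcast-based version might look like (hypothetical illustrative code, not necessarily the exact version benchmarked; `searchsorted_broadcast` is an assumed name):

```python
import numpy as np

def searchsorted_broadcast(arr, keys):
    # Every key walks through the same sequence of lengths, so each
    # halving step becomes a single vectorized operation over the
    # whole `base` array, which holds the per-key result at the end.
    keys = np.asarray(keys)
    base = np.zeros(keys.shape, dtype=np.intp)
    length = len(arr)
    while length > 1:
        half = length // 2
        # advance base wherever the key falls in the upper part
        base += np.where(arr[base + half - 1] < keys, half, 0)
        length -= half  # i.e. length = ceil(length / 2)
    if len(arr) > 0:
        base += arr[base] < keys
    return base
```

No per-key state beyond `base` is needed, so memory traffic stays proportional to the number of keys.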
A quick benchmark shows this approach to be significantly faster than the current implementation when querying multiple keys:
That is 15x faster (numpy beats numpy!), although it does not beat the C++ implementation from the PR:
Drawbacks
The main drawback of this approach is that we do `length = ceil(length / 2)` on each iteration. The PR algorithm always performs exactly `ceil(log(length))` iterations, whereas the current implementation might require only `floor(log(length))` in some cases. This extra overhead most likely explains why the PR's implementation is ~40ns slower on the single-key & big-array benchmarks.
Results