MAINT: do not use copyswap in where internals #23770
Conversation
The latest push no longer has a slowdown in the fast path:
I would think that is just random. That code shouldn't add 4us overhead, I would buy up to 100ns if you are insisting. I don't care about removing it, but the iterator construct will add much more overhead anyway.
It might be nice to clean up that needs_api. But I think consolidating the two dtype transfer functions into one is useful. There is no swapping. You can add NPY_ITER_ALIGNED if you don't like using unaligned access (it will force buffering if the operand is unaligned, but ensures you can assume alignment).
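To illustrate the unaligned case being discussed (this example is mine, not from the PR): a misaligned operand can be constructed with an offset view into a byte buffer, and np.where must still handle it; NPY_ITER_ALIGNED would make the iterator buffer such an operand instead.

```python
import numpy as np

# Build a deliberately misaligned float64 view: offset=1 breaks the
# 8-byte alignment of the underlying buffer on typical platforms.
buf = bytearray(1 + 10 * 8)
a = np.frombuffer(buf, dtype=np.float64, count=10, offset=1)
a[:] = np.arange(10.0)

# np.where handles the misaligned operand; with NPY_ITER_ALIGNED the
# iterator would buffer it rather than read unaligned memory directly.
out = np.where(a > 4, a, 0.0)
```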
        &y_cast_info, &y_transfer_flags) != NPY_SUCCEED) {
    goto fail;
}

NPY_BEGIN_THREADS_NDITER(iter);
Hmmm, maybe we can make this precise here:
transfer_flags = PyArrayMethod_COMBINED_FLAGS(x_transfer_flags, y_transfer_flags);
transfer_flags = PyArrayMethod_COMBINED_FLAGS(
transfer_flags, PyArrayNpyIter_GetTransferFlags(iter));
if (!(cast_flags & NPY_METH_REQUIRES_PYAPI)) {
NPY_BEGIN_THREADS_THRESHOLDED(NpyIter_GetIterSize(iter));
}
in principle we should do this here (and probably in a few other places).
        y_is_aligned, ystrides[0], ystrides[1], dty, common_dt, 0,
        &y_cast_info, &y_transfer_flags) != NPY_SUCCEED) {
    goto fail;
}
Theoretically there is a point to this if one of the operands is not aligned, but the iterator probably even ensures that.
We really only need a single transfer function here, I think. The old code is unnecessarily complex: the iterator ensures that swapping cannot possibly be necessary. (That may not always be ideal, but it is what the iterator does.)
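The "swapping cannot possibly be necessary" point can be seen from Python (my example, not from the PR): even with byte-swapped inputs, the iterator casts both operands to the common native-order dtype before the where inner loop runs, so the inner loop never sees swapped data.

```python
import numpy as np

# One big-endian and one little-endian operand: the iterator casts both
# to the common native-order int32, so the where inner loop itself never
# has to byte-swap anything.
x = np.array([1, 2, 3], dtype='>i4')
y = np.array([10, 20, 30], dtype='<i4')
out = np.where([True, False, True], x, y)
```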
Thanks for the suggestions, especially the tip to use …
Here are the asv timings on my laptop with the latest push:
Thanks for taking the trouble to simplify the code! Unfortunately, the strides part is (technically) simplified a bit too much.
You can try that test (or just add it anyway), but I don't want you to waste time trying to find a test that fails reliably...
npy_intp cstride = strides[1];
npy_intp xstride = strides[2];
npy_intp ystride = strides[3];
Argh, sorry, I had not noticed this. The strides pointer above is good; the iterator keeps using that. But we cannot do this part, because the iterator may mutate the values it points to.
(I doubt this can happen for this iterator, and it would require broadcasting at the very least. I will also go on record that I suspect that the mechanism for why it changes, using the buffers differently, is probably a lot of complexity for almost no or even negative gain.)
I have ideas on what kind of test would be necessary, something like:
a = np.ones((100, 10000), dtype="f4")
b = np.ones(50000)[::5] # non-itemsize stride
because it doesn't cast, but it's larger than the buffer size along the last dimension and the normal stride is larger than the buffer stride.
(The wrong dtype for a is there to force a buffer/cast on that operand, so that growing the loop isn't trivial; dunno if it matters.)
But, you can get lost for a week in that logic, so no worries either way, the above is just a guess...
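Filling in that sketch as a runnable test: this is only a guess at exercising the buffered, mismatched-stride path, exactly as hedged above, and it is not verified to actually hit it. The condition array here is my own addition.

```python
import numpy as np

# a has a smaller dtype than the common one, forcing a cast (and thus
# buffering) on that operand; b is a strided view whose element stride
# (40 bytes) differs from its itemsize (8 bytes).
a = np.ones((100, 10000), dtype="f4")
b = np.ones(50000)[::5]            # float64, non-itemsize stride
cond = np.zeros((100, 10000), dtype=bool)
cond[:, ::2] = True

out = np.where(cond, a, b)
```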
PyArray_Descr **dts = NpyIter_GetDescrArray(iter);
PyArray_Descr *dtx = dts[2];
PyArray_Descr *dty = dts[3];
npy_intp itemsize = dts[0]->elsize;
Suggested change:
-PyArray_Descr **dts = NpyIter_GetDescrArray(iter);
-PyArray_Descr *dtx = dts[2];
-PyArray_Descr *dty = dts[3];
-npy_intp itemsize = dts[0]->elsize;
+npy_intp itemsize = common_dt->elsize;
Use common_dt everywhere. The iterator enforces that after all (if it didn't, we would actually have to do the cast below; to be fair, when we don't use that "trivial copy" fast path, that could even be faster, but that is a different thing from this PR).
NPY_ARRAYMETHOD_FLAGS transfer_flags = 0;

npy_intp transfer_strides[2] = {xstride, itemsize};
Suggested change:
-npy_intp transfer_strides[2] = {xstride, itemsize};
+npy_intp transfer_strides[2] = {itemsize, itemsize};
We copy only a single item anyway, so it doesn't matter.
Applied your suggestions. I tried a little bit to get a failing test, but I think the iterator is already enforcing that the strides are the same before we get to the where inner loop. In any case, your suggestions make things clearer.
Can you please change the stride array unpacking back, so that it is inside the …
Ah, sorry, I didn't appreciate that that was the issue.
int swap = PyDataType_ISBYTESWAPPED(common_dt);
int native = (swap == 0) && !needs_api;
Suggested change:
-int swap = PyDataType_ISBYTESWAPPED(common_dt);
-int native = (swap == 0) && !needs_api;
+int has_ref = PyDataType_REFCHK(common_dt);
Oh, let's just delete it; this is meaningless. It can never be swapped.
The dtype is forced to be the same, swapping is impossible but even if it was possible, it wouldn't matter here.
Thanks, LGTM. I took the liberty of pushing a (tiny) follow-up commit to clean things up a bit (no need for the swap check, and we use refchk everywhere else for the trivial copy paths, so it's nice to do that explicitly here too, I think).
(will probably just merge tomorrow, unless anyone beats me to it)
Could one of you rerun the relevant benchmarks on the final version?
if (cast_info.func(
        &cast_info.context, args, &one,
        transfer_strides, cast_info.auxdata) < 0) {
    goto fail;
Isn't there a path where has_ref is false but we get here because of a strange itemsize, and then we call this function without the GIL (after NPY_BEGIN_THREADS_THRESHOLDED)? (Also cast_info below.)
If we go here, we account for whether or not the GIL is needed (e.g. the GIL should be released for the string dtype prototype). That is done via the transfer_flags.
The only thing we assume is that if !has_ref is true, then a value of that dtype can be copied via a memcpy (or really a pointer assignment). At this point the function call never does a cast, only a copy.
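The has_ref distinction is observable from Python (my example, not from the PR): object arrays hold references, so the copy path must manage refcounts rather than do a raw memcpy, and the selected objects come through by reference rather than as copies.

```python
import numpy as np

# Object arrays hold PyObject* references; np.where copies the
# references (with refcounting), so identity is preserved.
x = np.array([{'x': 1}, {'y': 2}], dtype=object)
y = np.array([None, None], dtype=object)
out = np.where([True, False], x, y)
```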
We didn't really touch the core, so I think any fluctuations will be random, but maybe @ngoldbaum you can re-run it?
I only see a change in the new benchmark I added:
This was with Python 3.11 on my laptop with no special configuration for benchmarking.
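Roughly what the benchmark measures can be sketched with timeit (this is an illustrative stand-in of mine, not the actual asv benchmark code; sizes and iteration counts are arbitrary):

```python
import numpy as np
from timeit import timeit

cond = np.random.rand(100_000) > 0.5

# Object path: goes through the refcounted copy path.
x_obj = np.ones(100_000, dtype=object)
y_obj = np.zeros(100_000, dtype=object)
t_obj = timeit(lambda: np.where(cond, x_obj, y_obj), number=20)

# Native float64 path: trivial-copy fast path.
x_f8 = np.ones(100_000)
y_f8 = np.zeros(100_000)
t_f8 = timeit(lambda: np.where(cond, x_f8, y_f8), number=20)
```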
I guess the slowdown makes sense since there is now a function call.
Thanks @ngoldbaum, @seberg
This makes it possible to do e.g. np.where(condition, string_array, other_string_array) for StringDType. Also added a new benchmark for the code path I modified. It ends up being 33% slower for object arrays and 10% slower for the non-object case:
Not sure about the last change, I think that might be noise?