MAINT: do not use copyswap in where internals #23770


Merged: 7 commits into numpy:main on May 18, 2023

Conversation

@ngoldbaum (Member) commented May 16, 2023

This makes it possible to do e.g. np.where(condition, string_array, other_string_array) for StringDType.
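For illustration, a minimal sketch of the call this enables, using NumPy's built-in unicode dtype as a stand-in for the (then-experimental) StringDType; where() already worked for the unicode dtype, but the same call shape now works for dtypes without a copyswap implementation:

    import numpy as np

    # Sketch: selecting elementwise between two string arrays.
    cond = np.array([True, False, True])
    x = np.array(["apple", "banana", "cherry"])
    y = np.array(["kiwi", "mango", "plum"])
    print(np.where(cond, x, y))  # ['apple' 'mango' 'cherry']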

I also added a new benchmark for the code path I modified. It ends up being 33% slower for object arrays and 10% slower in the non-object case:

       before           after         ratio
     [9d08632f]       [f8b1a3e4]
     <rm-copyswap-in-where~1>       <rm-copyswap-in-where>
+      82.8±0.8μs        110±0.8μs     1.33  bench_function_base.Where.time_2_object
+      14.2±0.2μs       15.6±0.2μs     1.10  bench_function_base.Where.time_2
-      75.5±0.7μs       71.5±0.6μs     0.95  bench_function_base.Where.time_interleaved_zeros_x8

I'm not sure about the last change; I think that might be noise?

@ngoldbaum (Member, Author)

The latest push no longer has a slowdown in the fast path:

       before           after         ratio
     [9d08632f]       [46cf47d1]
     <rm-copyswap-in-where~2>       <rm-copyswap-in-where>
+      80.9±0.5μs          110±2μs     1.36  bench_function_base.Where.time_2_object
-        78.2±6μs       73.7±0.8μs     0.94  bench_function_base.Where.time_all_ones
-        16.0±1μs       14.7±0.2μs     0.92  bench_function_base.Where.time_2_broadcast

@seberg (Member) left a comment

> The latest push no longer has a slowdown in the fast path:

I would think that is just random noise. That code shouldn't add 4µs of overhead; I would buy up to 100ns if you insist. I don't care about removing it, but the iterator construction will add much more overhead anyway.

It might be nice to clean up that needs_api, but I think consolidating the two dtype transfer functions into one is useful. There is no swapping. You can add NPY_ITER_ALIGNED if you don't like using unaligned access (it will force buffering if an operand is unaligned, but ensures you can assume alignment).
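As an aside, unaligned operands are easy to construct from Python; a minimal sketch of what NPY_ITER_ALIGNED would force the iterator to buffer (the odd byte offset makes the view unaligned on typical allocators):

    import numpy as np

    # Sketch: float64 values read from an odd byte offset give an
    # unaligned view; with NPY_ITER_ALIGNED the iterator would buffer
    # this operand so the inner loop can assume alignment.
    buf = bytearray(8 * 4 + 1)
    unaligned = np.frombuffer(buf, dtype=np.float64, count=4, offset=1)
    print(unaligned.flags["ALIGNED"])  # False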

        &y_cast_info, &y_transfer_flags) != NPY_SUCCEED) {
    goto fail;
}

NPY_BEGIN_THREADS_NDITER(iter);
@seberg (Member)

Hmmm, maybe we can make this precise here:

transfer_flags = PyArrayMethod_COMBINED_FLAGS(x_transfer_flags, y_transfer_flags);
transfer_flags = PyArrayMethod_COMBINED_FLAGS(
        transfer_flags, PyArrayNpyIter_GetTransferFlags(iter));
if (!(cast_flags & NPY_METH_REQUIRES_PYAPI)) {
    NPY_BEGIN_THREADS_THRESHOLDED(NpyIter_GetIterSize(iter));
}

In principle we should do this here (and probably in a few other places).

        y_is_aligned, ystrides[0], ystrides[1], dty, common_dt, 0,
        &y_cast_info, &y_transfer_flags) != NPY_SUCCEED) {
    goto fail;
}
@seberg (Member)

Theoretically a fair point if one of them is not aligned, but the iterator probably ensures that as well.

We really only need a single transfer function here, I think. The old code is unnecessarily complex: the iterator ensures that swapping cannot possibly be necessary. (That may not always be ideal, but it is what the iterator does.)
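To make the swapping point concrete, a hedged sketch from the Python side: the iterator casts operands to the common, native-order dtype before the inner loop runs, so where() behaves identically for byte-swapped and native inputs:

    import numpy as np

    # Sketch: a byte-swapped operand behaves exactly like the native
    # one, because the cast to the common dtype handles byte order;
    # the where inner loop itself never needs to swap.
    x = np.arange(5.0)
    x_swapped = x.astype(x.dtype.newbyteorder())
    cond = x > 2
    assert (np.where(cond, x_swapped, -x) == np.where(cond, x, -x)).all()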

@ngoldbaum (Member, Author)

Thanks for the suggestions, especially the tip to use NPY_ITER_ALIGNED; it didn't occur to me to use the iterator to ensure the casts are exactly the same.

@ngoldbaum (Member, Author)

Here are the asv timings on my laptop with the latest push:

+        79.1±2μs          106±6μs     1.34  bench_function_base.Where.time_2_object
+      11.6±0.2μs       12.4±0.1μs     1.07  bench_function_base.Where.time_1
+     13.9±0.09μs      14.7±0.07μs     1.06  bench_function_base.Where.time_2

@seberg (Member) left a comment

Thanks for taking the trouble to simplify the code! Unfortunately, the strides part is (technically) simplifying it too much.

You can try that test (or just add it anyway), but I don't want to waste time trying to find a test that fails reliably...

Comment on lines 3440 to 3442
npy_intp cstride = strides[1];
npy_intp xstride = strides[2];
npy_intp ystride = strides[3];
@seberg (Member)

Argg, sorry, I had not noticed this. The strides pointer above is good; the iterator keeps using that. But we cannot do this part, because the iterator may mutate the values.

(I doubt this can happen for this iterator, and it would require broadcasting at the very least. I will also go on record that I suspect the mechanism for why it changes, using the buffers differently, is probably a lot of complexity for almost no, or even negative, gain.)

@seberg (Member)

I have ideas about what kind of test would be necessary, something like:

a = np.ones((100, 10000), dtype="f4")
b = np.ones(50000)[::5]  # non-itemsize stride

because it doesn't cast, but it's larger than the buffer size along the last dimension, and the normal stride is larger than the buffer stride.
(The wrong dtype for a is to force a buffer/cast on that operand, so that growing the loop isn't trivial; dunno if it matters.)

But you can get lost for a week in that logic, so no worries either way; the above is just a guess...
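For concreteness, a hedged sketch of how those two arrays might be driven through np.where (the condition array and the assertion are invented for illustration, not taken from the discussion):

    import numpy as np

    # Hedged sketch of the test described above: `a` is float32 to
    # force a cast on that operand, `b` has a non-itemsize stride, and
    # the last axis is longer than the iterator's buffer.
    a = np.ones((100, 10000), dtype="f4")
    b = np.ones(50000)[::5]     # non-itemsize stride, shape (10000,)
    cond = np.zeros((100, 10000), dtype=bool)
    cond[:, ::2] = True
    res = np.where(cond, a, b)  # exercises the buffered inner loop
    assert res.shape == (100, 10000) and res.dtype == np.float64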

Comment on lines 3434 to 3437
PyArray_Descr **dts = NpyIter_GetDescrArray(iter);
PyArray_Descr *dtx = dts[2];
PyArray_Descr *dty = dts[3];
npy_intp itemsize = dts[0]->elsize;
@seberg (Member)

Suggested change:

- PyArray_Descr **dts = NpyIter_GetDescrArray(iter);
- PyArray_Descr *dtx = dts[2];
- PyArray_Descr *dty = dts[3];
- npy_intp itemsize = dts[0]->elsize;
+ npy_intp itemsize = common_dt->elsize;

Use common_dt everywhere. The iterator enforces that, after all (if it didn't, we would actually have to do the cast below; to be fair, when we don't use the "trivial copy" fast path, that could even be faster, but that is a different thing from this PR).
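Seen from Python, the enforcement being relied on is just dtype promotion: where() casts both value operands to their common dtype up front, e.g.:

    import numpy as np

    # Sketch: float32 and float64 operands promote to a common
    # float64, which is why the inner loop can take a single itemsize
    # from common_dt rather than per-operand descriptors.
    x = np.ones(3, dtype="f4")
    y = np.ones(3, dtype="f8")
    print(np.where([True, False, True], x, y).dtype)  # float64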


NPY_ARRAYMETHOD_FLAGS transfer_flags = 0;

npy_intp transfer_strides[2] = {xstride, itemsize};
@seberg (Member)

Suggested change:

- npy_intp transfer_strides[2] = {xstride, itemsize};
+ npy_intp transfer_strides[2] = {itemsize, itemsize};

We copy only a single item anyway, so it doesn't matter.

@ngoldbaum (Member, Author)

Applied your suggestions. I tried a bit to get a failing test, but I think the iterator already enforces that the strides are the same before we get to the where inner loop. In any case, your suggestions make things clearer.

@seberg (Member) commented May 17, 2023

Can you please change the stride array unpacking back, so that it is inside the iternext() loop? I understand that finding an example is either too hard or even impossible, but the strides are not guaranteed to be fixed by the iterator (there is a reason a GetFixedStridesArray function exists).
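One easy-to-see reason strides are not constants of the operands alone, sketched from Python (the C-level point is stronger: buffering may change the strides between iternext() calls):

    import numpy as np

    # Sketch: a broadcast operand is iterated with stride 0; the
    # stride an inner loop sees depends on broadcasting and on how the
    # iterator sets up each buffered chunk, so it must be re-read
    # inside the iteration loop.
    b = np.broadcast_to(np.float64(1.0), (4,))
    print(b.strides)  # (0,)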

@ngoldbaum (Member, Author)

> Can you please change the stride array unpacking back, so that it is inside the iternext() loop?

Ah, sorry, I didn't appreciate that was the issue.

Comment on lines 3435 to 3437

int swap = PyDataType_ISBYTESWAPPED(common_dt);
int native = (swap == 0) && !needs_api;
@seberg (Member)

Suggested change:

- int swap = PyDataType_ISBYTESWAPPED(common_dt);
- int native = (swap == 0) && !needs_api;
+ int has_ref = PyDataType_REFCHK(common_dt);

Oh, let's just delete it; this check is meaningless. It can never be swapped.

The dtype is forced to be the same, so swapping is impossible; but even if it were possible, it wouldn't matter here.
@seberg (Member) left a comment

Thanks, LGTM. I took the liberty of pushing a (tiny) follow-up commit to clean things up a bit (no need for the swap check, and we use refchk everywhere else for the trivial copy paths, so I think it's nice to do that explicitly here too).

(will probably just merge tomorrow, unless anyone beats me to it)

@mattip (Member) commented May 17, 2023

Could one of you rerun the relevant benchmarks on the final version?

if (cast_info.func(
        &cast_info.context, args, &one,
        transfer_strides, cast_info.auxdata) < 0) {
    goto fail;
Member

Isn't there a path where has_ref is false but we get here because of a strange itemsize, and we then call this function without the GIL (after NPY_BEGIN_THREADS_THRESHOLDED)? (Also cast_info below.)

Member

If we get here, we account for whether or not the GIL is needed (e.g. the GIL should be released for the string dtype prototype). That is done via the transfer_flags.

The only thing we assume is that if !has_ref is true, then a value of that dtype can be copied via memcpy (or really pointer assignment). At this point the function call never does a cast, only a copy.
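For context, a hedged sketch from the Python side of the distinction has_ref draws: object arrays hold references, so they skip the trivial memcpy path and go through the transfer function (the slower path the new time_2_object benchmark measures):

    import numpy as np

    # Sketch: an object array stores PyObject pointers, so copying its
    # elements means managing reference counts rather than doing a
    # plain memcpy; where() still works, just via the transfer path.
    cond = np.array([True, False])
    x = np.array([{"a": 1}, {"b": 2}], dtype=object)
    y = np.array([None, None], dtype=object)
    print(np.where(cond, x, y))  # [{'a': 1} None]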

@seberg (Member) commented May 17, 2023

> Could one of you rerun the relevant benchmarks on the final version?

We didn't really touch the core, so I think any fluctuations will be random, but maybe @ngoldbaum you can re-run it?

@ngoldbaum (Member, Author) commented May 17, 2023

> Could one of you rerun the relevant benchmarks on the final version?

I only see a change in the new benchmark I added:


       before           after         ratio
     [9d08632f]       [01a251ba]
     <rm-copyswap-in-where~6>       <rm-copyswap-in-where>
+        81.3±1μs          106±3μs     1.31  bench_function_base.Where.time_2_object

This was with Python 3.11 on my laptop, with no special configuration for benchmarking.

@mattip (Member) commented May 18, 2023

I guess the slowdown makes sense, since there is now a function call.

@mattip merged commit cebb7a6 into numpy:main on May 18, 2023
@mattip (Member) commented May 18, 2023

Thanks @ngoldbaum, @seberg
