MAINT,ENH: Rewrite scalar math logic #21188


Merged — 18 commits merged into numpy:main from scalar-math-rewrite on May 7, 2022

Conversation

seberg
Member

@seberg seberg commented Mar 13, 2022

This redoes the scalar math logic to take more care with subclasses, and, most importantly, introduces logic to defer to the other operand if self can be cast to it safely (i.e. the other type is the correct promotion).
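A hypothetical plain-Python sketch of that deferral rule (the names, the dtype strings, and the "safe cast" table are illustrative, not NumPy's actual internals): when `self`'s type casts safely to `other`'s type, return `NotImplemented` so Python invokes `other`'s reflected method instead.

```python
# Toy "can cast safely" table for a few fixed-width int kinds (assumption).
SAFE_CASTS = {("int16", "int32"), ("int16", "int64"), ("int32", "int64")}

def scalar_add(self_kind, other_kind, a, b):
    if (self_kind, other_kind) in SAFE_CASTS:
        return NotImplemented  # defer: `other` already has the promoted type
    return a + b               # otherwise handle the operation here

print(scalar_add("int16", "int64", 1, 2))  # NotImplemented -> Python defers
print(scalar_add("int64", "int16", 1, 2))  # 3 -> handled directly
```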

This makes things much faster and more reliable, since we now defer to ufuncs indirectly much less often. In particular, integer overflows are reported more reliably.

Another major point is that this reorganizes the coercion of Python int, float, complex (and bool). This should help a bit with switching to "weak" Python scalars.

Further, it contains a commit converting all macros to inline functions and moving the floating point overflow flag handling to a return value. Checking floating point flags is not terribly slow in absolute terms, but it is significant on the scale of the integer operations here (~30% on my computer).
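An illustrative Python model of that return-value pattern (all names and the int8 wrapping below are assumptions for illustration, not NumPy's actual C helpers): instead of performing the arithmetic and then querying the slow floating point status flags, each helper reports errors through its return value.

```python
NPY_FPE_OVERFLOW = 1 << 0  # illustrative flag constant

INT8_MIN, INT8_MAX = -128, 127

def int8_add(a, b):
    res = a + b
    if res < INT8_MIN or res > INT8_MAX:
        # wrap like C two's-complement arithmetic and flag the overflow
        res = (res - INT8_MIN) % 256 + INT8_MIN
        return res, NPY_FPE_OVERFLOW
    return res, 0  # no error flag; nothing global to check afterwards

print(int8_add(100, 27))  # (127, 0)
print(int8_add(100, 28))  # (-128, 1)
```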

It should fix some bugs around subclasses, though subclassing scalars will remain fairly prone to issues (similar to subclassing arrays, I guess). Complicated subclass cases may still end up in the generic (array) path, but we now catch a few more ahead of time.


This is the 2-3 approach I have been thinking about. Another one would be to have (more or less) a single function dealing with any scalar inputs. The current approach does "inline" the casting logic here. I do not like that, but it seemed somewhat straightforward – it would be nice to create npy_cast_@from@_to_@to@ functions (or similar) to use here more generically.

The alternative would be to build the logic on the existing "cast" functionality, but that should be slower, and while the current approach is verbose (and the macros are ugly), I am not sure the alternative would actually end up much better.

PyArray_ScalarFromObject seemed useful only for the old scalar paths, so I added a deprecation: neither of these functions was even documented, and both would probably need some work to transition well to the new DTypes.
PyArray_CastScalarDirect may also be a deprecation target, but it is still used in at least one place, so we should likely just fix that use.


Benchmarks:

       before           after         ratio
     [732ed25e]       [3b60effc]
     <main>           <scalar-math-rewrite>
-        861±10ns          819±6ns     0.95  bench_scalar.ScalarMath.time_abs('int32')
-         862±3ns          819±5ns     0.95  bench_scalar.ScalarMath.time_abs('float16')
-        929±10ns         882±10ns     0.95  bench_scalar.ScalarMath.time_addition('complex128')
-     1.05±0.02μs          996±4ns     0.95  bench_scalar.ScalarMath.time_addition('float16')
-         859±4ns          813±3ns     0.95  bench_scalar.ScalarMath.time_abs('int16')
-         890±8ns          841±2ns     0.94  bench_scalar.ScalarMath.time_abs('complex64')
-         924±4ns          871±4ns     0.94  bench_scalar.ScalarMath.time_multiplication('float64')
-         957±2ns        895±0.8ns     0.93  bench_scalar.ScalarMath.time_abs('complex256')
-        932±10ns        871±0.4ns     0.93  bench_scalar.ScalarMath.time_addition('float64')
-         922±5ns        855±0.8ns     0.93  bench_scalar.ScalarMath.time_abs('longfloat')
-        893±10ns          826±7ns     0.93  bench_scalar.ScalarMath.time_abs('complex128')
-     1.09±0.07μs          987±8ns     0.90  bench_scalar.ScalarMath.time_multiplication('float16')
-        687±10ns          532±2ns     0.77  bench_scalar.ScalarMath.time_add_int32_other('int32')
-     4.63±0.05μs       3.58±0.1μs     0.77  bench_scalar.ScalarMath.time_addition_pyint('int16')
-     5.58±0.06μs      4.24±0.07μs     0.76  bench_scalar.ScalarMath.time_addition_pyint('float32')
-     5.62±0.08μs       4.22±0.1μs     0.75  bench_scalar.ScalarMath.time_addition_pyint('complex64')
-     1.03±0.01μs          755±9ns     0.73  bench_scalar.ScalarMath.time_add_int32_other('int16')
-     5.76±0.03μs       4.21±0.1μs     0.73  bench_scalar.ScalarMath.time_addition_pyint('float16')
-         922±2ns          662±2ns     0.72  bench_scalar.ScalarMath.time_addition('int16')
-         921±3ns        653±0.5ns     0.71  bench_scalar.ScalarMath.time_multiplication('int64')
-         929±1ns          656±6ns     0.71  bench_scalar.ScalarMath.time_addition('int64')
-         916±8ns          645±4ns     0.70  bench_scalar.ScalarMath.time_multiplication('int16')
-        953±60ns          657±5ns     0.69  bench_scalar.ScalarMath.time_multiplication('int32')
-         934±8ns        642±0.9ns     0.69  bench_scalar.ScalarMath.time_addition('int32')
-     3.54±0.01μs         1.76±0μs     0.50  bench_scalar.ScalarMath.time_power_of_two('longfloat')
-     3.41±0.04μs      1.59±0.01μs     0.47  bench_scalar.ScalarMath.time_power_of_two('float64')
-      3.70±0.1μs         1.65±0μs     0.45  bench_scalar.ScalarMath.time_power_of_two('complex256')
-     2.36±0.05μs         1.04±0μs     0.44  bench_scalar.ScalarMath.time_power_of_two('int64')
-     2.83±0.05μs         1.19±0μs     0.42  bench_scalar.ScalarMath.time_addition_pyint('longfloat')
-     2.87±0.05μs         1.20±0μs     0.42  bench_scalar.ScalarMath.time_addition_pyint('complex256')
-     3.17±0.08μs         1.25±0μs     0.39  bench_scalar.ScalarMath.time_power_of_two('complex128')
-     2.64±0.02μs          992±5ns     0.38  bench_scalar.ScalarMath.time_addition_pyint('complex128')
-     2.17±0.04μs          807±2ns     0.37  bench_scalar.ScalarMath.time_addition_pyint('int64')
-     2.75±0.02μs          993±5ns     0.36  bench_scalar.ScalarMath.time_addition_pyint('float64')
-      14.1±0.2μs      1.24±0.03μs     0.09  bench_scalar.ScalarMath.time_add_int32_other('longfloat')
-      14.6±0.3μs      1.23±0.01μs     0.08  bench_scalar.ScalarMath.time_add_int32_other('complex256')
-      14.1±0.2μs      1.10±0.01μs     0.08  bench_scalar.ScalarMath.time_add_int32_other('float64')
-      14.5±0.1μs      1.11±0.01μs     0.08  bench_scalar.ScalarMath.time_add_int32_other('complex128')
-     13.9±0.07μs          973±6ns     0.07  bench_scalar.ScalarMath.time_add_int32_other('int64')

(Some corner cases may be significantly slower, mainly certain scalar + 0d_array ops, but I am not sure I want to worry about those much.)


This PR became quite big; I may split it up. At this time it relies on gh-21178. It is currently missing at least some additional tests for the subclass behavior, and I would like to check code coverage on that front as well.

@seberg seberg marked this pull request as draft March 13, 2022 02:05
@seberg seberg force-pushed the scalar-math-rewrite branch 2 times, most recently from 55c2247 to 5c5e2d8 Compare March 14, 2022 15:57
@seberg
Member Author

seberg commented Mar 14, 2022

Hmmmpf, somehow going to static inline functions makes clang optimize some floating point errors away for some multiplications.

EDIT: My bad, I bet it really needs to be force-inlined, due to the complex values being structs passed by value.

@seberg
Member Author

seberg commented Mar 18, 2022

Puh... Added (tricky) tests and fixed some other bugs:

  • float16 division was just plain crashing
  • small integer division promotes to float64 in the ufunc machinery, but went to float32 here (float32 may well have been what division did at some point)

I did not fix the fact that complex comparisons behave differently for arrays and scalars (ufuncs currently use a NaN-aware order?!). It is still a behavior change, since the scalar path will be used a bit more often, but...
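To illustrate that difference in plain Python (as an assumption about what "NaN-aware" means here, not NumPy's actual comparison loops): IEEE comparisons make every comparison involving NaN false, whereas a NaN-aware, sort-style order places NaN consistently, e.g. after all other values.

```python
import math

def ieee_gt(a, b):
    return a > b  # IEEE semantics: False whenever either operand is NaN

def nan_aware_gt(a, b):
    # sort-style total order that treats NaN as larger than everything
    if math.isnan(a):
        return not math.isnan(b)
    if math.isnan(b):
        return False
    return a > b

print(ieee_gt(math.nan, 1.0))       # False
print(nan_aware_gt(math.nan, 1.0))  # True
```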

Otherwise, these are the scalar-math changes as discussed, @mattip. It is still based on gh-21178, though.

(outp)->real = (in1r*rat + in1i)*scl; \
(outp)->imag = (in1i*rat - in1r)*scl; \
} \
} while(0)
Member Author

Just to note, since this felt a bit more complex with the branches, I just decided to use the ufunc-loop (and include that) directly. But happy to go the other way again.

@charris charris closed this Apr 6, 2022
@charris charris reopened this Apr 6, 2022
@seberg
Member Author

seberg commented Apr 21, 2022

We should reactivate this, but I am not sure if the test failures were real, so I am closing/reopening. (I guess there may have been something about complex numbers or so.)

@seberg seberg closed this Apr 21, 2022
@seberg seberg reopened this Apr 21, 2022
@seberg seberg force-pushed the scalar-math-rewrite branch 3 times, most recently from b99a043 to 053b933 Compare April 24, 2022 19:33
seberg added 11 commits April 24, 2022 21:47
This commit tries to redo the scalar math logic to take some more
care about subclasses, but most of all introduce logic to defer
to the `other` if `self` can be cast to it safely (i.e. it is
the correct promotion).

This makes things much faster and more reliable, since we now
defer to `ufuncs` indirectly much less often.
This ensures that integer overflows are reported more reliably.

Another major point about it, is that this reorganizes the coercion
of Python int, float, complex (and bool).
This should help a bit with switching to "weak" Python scalars.

This may just be a first step on a longer path...
This significantly speeds up pure integer operations since checking
floating point errors is somewhat slow.
It seems to slow down some clongdouble math a tiny bit, maybe because
the functions don't always get inlined fully.
The function was only used for the scalarmath and is not even
documented, so schedule it for deprecation.
The assert doesn't make sense for richcompare, since it has no
notion of forward.
It is also problematic in the other cases because, e.g., complex
remainder defers (since complex remainder is undefined).  So
`complex % bool` will ask bool, which will then try to defer to
`complex` (even though it is not a forward op).

That is correct, since both should defer to tell Python that the
operation is undefined.
For some reason, some of the clang/macOS CI is failing in the
multiply operation because it fails to report floating point
overflow errors.
Maybe using `NPY_FINLINE` is enough to keep the old behaviour,
since this did not seem to have been a problem when the function
was a macro.
IIRC, these have to be inlined, because otherwise we pass structs
by value, which does not always work correctly.
... also undefine any complex floor division, because it is
also not defined.
Note that the errors for some of these may be a bit less instructive,
but for now I assume that is OK.
However, note that complex comparisons currently do NOT agree,
but the problem is that we should possibly consider changing the
ufunc rather than the scalar, so not changing it in this PR.
These are complicated, and modifications could probably be allowed
here.  The complexities arise not just from the asymmetric behaviour
of Python binary operators, but also because we additionally have
our own logic for deferring sometimes (for arrays).
That is, we may coerce the other object to an array when it is
an "unknown" object.

This may assume that subclasses of our scalars are always valid
"arrays" already (so they never need to be coerced explicitly).
That should be a sound assumption, I think?
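The operator asymmetry mentioned in this commit message can be seen in plain Python: the left operand's `__add__` is normally tried first, but the right operand gets priority when its type is a subclass of the left's type and overrides the reflected method.

```python
class Base:
    def __add__(self, other):
        return "Base.__add__"
    def __radd__(self, other):
        return "Base.__radd__"

class Sub(Base):
    # overriding the reflected method gives the subclass first shot
    def __radd__(self, other):
        return "Sub.__radd__"

print(Base() + Base())  # left operand handles it via __add__
print(Base() + Sub())   # subclass on the right wins via __radd__
```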
PyPy does not seem to replace tp_richcompare in the "C wrapper"
object for subclasses defining `__le__`, etc.
That is understandable, but means that we cannot (easily) figure
out that the subclass should be preferred.
In general, this is a bit of a best-effort attempt anyway, and this
is probably simply OK.  Hopefully, subclassing is rare and
comparing two _different_ subclasses even more so.
@seberg seberg force-pushed the scalar-math-rewrite branch from babb657 to 3b60eff Compare April 28, 2022 11:25
I somewhat think the change for subclass logic is so esoteric that I am
not even sure a release note is useful...
@seberg
Member Author

seberg commented Apr 28, 2022

Btw. when it comes to the floating point error issues on clang... it turns out the problem was moving the fpe_clear_floatstatus_barrier calls lower. Somehow, using our "barrier" seems not to have been enough to ensure the correct ordering.

Considering that it sometimes seemed correct, I am even wondering whether FPEs can be checked out-of-band and whether it is possible to create race conditions... But that is more of a curiosity; I am confident that the fix of moving the FPE checking is reasonable.

…asses

This just seems complicated: PyPy doesn't fully allow it, and CPython's
`int` class does not even attempt it...
So, just remove most of this logic (although, this keeps the "is subclass"
information around, even if that is effectively unused).
@seberg
Member Author

seberg commented May 4, 2022

Ping @mattip, not sure how to proceed. Do you want to have another look pair programming or can we go ahead?
Or maybe @ganesh-k13 can also have a look, since he had once started looking at similar things.

@ganesh-k13
Member

Thanks for the ping. I'll try to understand this a bit better; I was quite lost the first time around :)

*out = a / b;
return NPY_FPE_OVERFLOW;
Member

Help me out: when is `a < 0 && a == -a` true for bytes?

Member Author

@seberg seberg May 6, 2022

Oh, abs(np.int8(-128)) == -128.
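This can be reproduced without NumPy using `ctypes` to mimic two's-complement int8 arithmetic: the minimum value has no positive counterpart, so negating it wraps back to itself (mirroring what `abs(np.int8(-128))` does in C).

```python
import ctypes

a = -128
neg_a = ctypes.c_int8(-a).value  # 128 does not fit in int8; wraps to -128
print(a < 0 and a == neg_a)      # True: the `a < 0 && a == -a` case
```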

ret = OTHER_IS_UNKNOWN_OBJECT;
}
Py_DECREF(dtype);
return ret;
Member

I am fine with this big table for now. At some point when we deprecate value based casting for scalars we can rethink this.

Member Author

Right, created the issue for that!

scalar_res = op(scalar1, scalar2)
assert_array_equal(scalar_res, res)


Member

Yay for hypothesis-based tests

@mattip
Member

mattip commented May 6, 2022

LGTM. The refactoring looks OK, and the speedups are great.

I assume the benchmarks at the top of the PR are still valid?

Some corner cases may be significantly slower, mainly certain scalar + 0d_array ops ...

Could you add some benchmarks for these and quantify the slowdown?

@seberg
Member Author

seberg commented May 6, 2022

The big slowdown is that for:

arr = np.array(1.)
scalar = np.float64(1.)

when the array can be cast to the scalar:

%timeit scalar + arr
# 287 ns ± 0.303 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This used to be much faster (the scalar path kicked in), whereas:

%timeit arr + scalar
# 1.67 µs ± 9.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

And for now they both go through the array path, so this is a huge slowdown in that very specific case. The scalar + arr case there is still about 1/3 slower than scalar + scalar, since we used to convert the 0-D array to a scalar and only then continue.

I have added a benchmark for it.

@mattip
Member

mattip commented May 6, 2022

So this is a huge slowdown in that very specific case

Sorry, I am not following. What was the performance of the new benchmark before the PR, and what is it afterwards? How does the time for the new time_add_int32arr_and_other benchmarks compare to time_add_int32_other?

@seberg
Member Author

seberg commented May 6, 2022

OK, here is the rerun benchmark:

       before           after         ratio
     [4c3958b9]       [c0e71bda]
     <main>           <scalar-math-rewrite>
+     1.28±0.01μs       10.5±0.1μs     8.23  bench_scalar.ScalarMath.time_add_other_and_int32arr('complex128')
+     1.31±0.01μs       9.93±0.3μs     7.58  bench_scalar.ScalarMath.time_add_other_and_int32arr('int64')
+     1.30±0.01μs       9.81±0.2μs     7.54  bench_scalar.ScalarMath.time_add_other_and_int32arr('float64')
+     1.01±0.01μs      7.26±0.08μs     7.21  bench_scalar.ScalarMath.time_add_other_and_int32arr('int32')
+        1.42±0μs       9.85±0.3μs     6.96  bench_scalar.ScalarMath.time_add_other_and_int32arr('complex256')
+     1.45±0.07μs       9.72±0.2μs     6.70  bench_scalar.ScalarMath.time_add_other_and_int32arr('longfloat')
+      11.3±0.2μs       11.9±0.2μs     1.05  bench_scalar.ScalarMath.time_add_int32arr_and_other('complex64')
-     1.06±0.02μs      1.01±0.01μs     0.95  bench_scalar.ScalarMath.time_multiplication('float16')
-      16.1±0.1μs       15.2±0.1μs     0.95  bench_scalar.ScalarMath.time_add_int32_other('float32')
-      12.6±0.3μs       11.9±0.2μs     0.94  bench_scalar.ScalarMath.time_add_other_and_int32arr('float16')
-        1.08±0μs      1.01±0.01μs     0.94  bench_scalar.ScalarMath.time_addition('complex256')
-      16.2±0.2μs       15.3±0.1μs     0.94  bench_scalar.ScalarMath.time_add_int32_other('complex64')
-        943±10ns          887±3ns     0.94  bench_scalar.ScalarMath.time_addition('float32')
-     1.07±0.01μs         1.00±0μs     0.94  bench_scalar.ScalarMath.time_addition('float16')
-         883±4ns          830±7ns     0.94  bench_scalar.ScalarMath.time_abs('complex64')
-        950±10ns          888±3ns     0.93  bench_scalar.ScalarMath.time_multiplication('float64')
-        968±30ns          903±6ns     0.93  bench_scalar.ScalarMath.time_addition('complex64')
-         888±5ns          827±1ns     0.93  bench_scalar.ScalarMath.time_abs('complex128')
-      10.1±0.1μs       9.42±0.2μs     0.93  bench_scalar.ScalarMath.time_add_other_and_int32arr('int16')
-        883±40ns          819±3ns     0.93  bench_scalar.ScalarMath.time_abs('float64')
-         887±4ns         819±10ns     0.92  bench_scalar.ScalarMath.time_abs('float16')
-        881±10ns          812±6ns     0.92  bench_scalar.ScalarMath.time_abs('int64')
-        905±40ns         808±10ns     0.89  bench_scalar.ScalarMath.time_abs('int32')
-        994±30ns          887±5ns     0.89  bench_scalar.ScalarMath.time_abs('complex256')
-     1.02±0.06μs          896±8ns     0.88  bench_scalar.ScalarMath.time_addition('complex128')
-        983±30ns          859±5ns     0.87  bench_scalar.ScalarMath.time_abs('longfloat')
-        722±40ns          559±5ns     0.77  bench_scalar.ScalarMath.time_add_int32_other('int32')
-        1000±2ns        764±0.9ns     0.76  bench_scalar.ScalarMath.time_add_int32_other('int16')
-      4.54±0.1μs      3.40±0.02μs     0.75  bench_scalar.ScalarMath.time_addition_pyint('int32')
-      5.56±0.1μs      4.03±0.01μs     0.72  bench_scalar.ScalarMath.time_addition_pyint('complex64')
-        946±80ns          679±6ns     0.72  bench_scalar.ScalarMath.time_addition('int64')
-     5.60±0.06μs      4.01±0.03μs     0.72  bench_scalar.ScalarMath.time_addition_pyint('float16')
-        939±20ns          673±8ns     0.72  bench_scalar.ScalarMath.time_multiplication('int64')
-     5.71±0.06μs      4.06±0.06μs     0.71  bench_scalar.ScalarMath.time_addition_pyint('float32')
-      4.75±0.3μs      3.38±0.02μs     0.71  bench_scalar.ScalarMath.time_addition_pyint('int16')
-        938±60ns          665±7ns     0.71  bench_scalar.ScalarMath.time_addition('int32')
-         941±6ns          664±5ns     0.71  bench_scalar.ScalarMath.time_addition('int16')
-        941±10ns          654±7ns     0.70  bench_scalar.ScalarMath.time_multiplication('int32')
-        973±20ns         668±10ns     0.69  bench_scalar.ScalarMath.time_multiplication('int16')
-     3.52±0.06μs      1.78±0.04μs     0.51  bench_scalar.ScalarMath.time_power_of_two('longfloat')
-     3.44±0.02μs      1.69±0.03μs     0.49  bench_scalar.ScalarMath.time_power_of_two('complex256')
-     3.36±0.01μs      1.58±0.01μs     0.47  bench_scalar.ScalarMath.time_power_of_two('float64')
-        2.34±0μs         1.04±0μs     0.44  bench_scalar.ScalarMath.time_power_of_two('int64')
-     2.86±0.05μs      1.20±0.01μs     0.42  bench_scalar.ScalarMath.time_addition_pyint('longfloat')
-     3.12±0.05μs         1.28±0μs     0.41  bench_scalar.ScalarMath.time_power_of_two('complex128')
-     3.03±0.08μs      1.21±0.01μs     0.40  bench_scalar.ScalarMath.time_addition_pyint('complex256')
-     2.19±0.02μs          851±6ns     0.39  bench_scalar.ScalarMath.time_addition_pyint('int64')
-     2.64±0.03μs      1.02±0.01μs     0.39  bench_scalar.ScalarMath.time_addition_pyint('float64')
-     2.80±0.07μs         1.01±0μs     0.36  bench_scalar.ScalarMath.time_addition_pyint('complex128')
-      13.7±0.3μs      1.22±0.01μs     0.09  bench_scalar.ScalarMath.time_add_int32_other('complex256')
-      13.7±0.2μs      1.09±0.02μs     0.08  bench_scalar.ScalarMath.time_add_int32_other('complex128')
-      13.8±0.2μs      1.08±0.02μs     0.08  bench_scalar.ScalarMath.time_add_int32_other('float64')
-      13.3±0.1μs          977±1ns     0.07  bench_scalar.ScalarMath.time_add_int32_other('int64')
-        18.4±5μs      1.21±0.01μs     0.07  bench_scalar.ScalarMath.time_add_int32_other('longfloat')

The time_add_int32arr_and_other benchmarks do not really show up, because they did not change (they were always slow). The time_add_other_and_int32arr benchmarks partially show up, because they used to be handled in the scalar path. But now scalar + arr0d is as slow as arr0d + scalar, while before that was not (always) the case.

That slowdown is massive, of course, since the array path is much, much slower than the scalar path (especially for scalar inputs, unfortunately). But I somewhat doubt that scalar + 0d_array is such a common operation that aligning scalar + array with array + scalar is something to worry about.

@mattip
Member

mattip commented May 6, 2022

Let's put this in since it does clean up the error handling and makes some parts of the code cleaner. We may need to back it out if it turns out 0d arrays are used more than we think, but I agree scalars are likely to be much more common. I will wait a little while to see if anyone else has an opinion.

@seberg
Member Author

seberg commented May 6, 2022

We may need to back it out if it turns out 0d arrays are used more than we think, but I agree scalars are likely to be much more common.

Yeah, or re-add special cases for 0-D arrays. Although, in that case it would be nice to at least also make arr + scalar fast, not just scalar + arr. The asymmetry adds to my feeling of not wanting to worry much about it.

@mattip mattip merged commit 37cb0f8 into numpy:main May 7, 2022
@mattip
Member

mattip commented May 7, 2022

Thanks @seberg

@seberg seberg deleted the scalar-math-rewrite branch May 11, 2022 17:17
seberg pushed a commit that referenced this pull request Jun 13, 2022
Checks the condition `a == NPY_MIN_@NAME@` to determine whether an overflow error has occurred for the np.int8 type. See #21289 and #21188 (comment) for reference.

This also adds integer overflow error handling to the `-scalar` paths and "activates" a test for the unsigned versions.
A few tests are skipped because they were buggy (they never ran); these paths require follow-ups to fix.